Consuming Keystone CADF Events From RabbitMQ

This started with a simple requirement: “I’d like to know when users or projects are added or removed and who did the action.”

As it turns out there’s no great way to do this. Sure you can log it when a user is deleted:

"DELETE /v2.0/users/702b12ec7f0e4f7d93945eebb95705e1 HTTP/1.1" 204 - "-" "python-keystoneclient"

The only problem is that ‘702b12ec7f0e4f7d93945eebb95705e1’ is meaningless without the DB entry which is now conveniently gone.

But if you had an async way to get events from Keystone, you could solve this yourself. That was my idea with my Keystone CADF Event Logger tool. Before we dive into the tool, some quick background on CADF events. You can read the DMTF mumbo-jumbo at the link in the previous sentence, but just know, Keystone CADF events log anything interesting that happens in Keystone. They also tell you who did it, from where they did it, and when they did it. All important things for auditing. (This article from Steve Martinelli has some more great background)
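
To give a rough idea of the shape of these events, here’s roughly what the payload of a user-delete event looks like. The field names come from the CADF spec; the IDs and values here are illustrative and the exact layout varies a bit by release:

{
    "typeURI": "http://schemas.dmtf.org/cloud/audit/1.0/event",
    "eventType": "activity",
    "action": "deleted.user",
    "outcome": "success",
    "eventTime": "2015-09-18T16:02:13.196172+0000",
    "initiator": {
        "typeURI": "service/security/account/user",
        "id": "c9f76d3c31e142af9291de2935bde98a",
        "host": {"address": "10.1.2.3", "agent": "python-keystoneclient"}
    },
    "target": {
        "typeURI": "data/security/user",
        "id": "702b12ec7f0e4f7d93945eebb95705e1"
    },
    "observer": {"typeURI": "service/security", "id": "b0fdb4f2bc15403b88d85a7c083e9e09"}
}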

So how does this solve my problem? CADF events still just log ids, not names. My solution was a simple rabbit-consuming async daemon that cached user and project names locally and used them to do lookups. Here’s an example of what it does:

Logs user auth events

Note that V2 doesn’t log much info on these, although that is fixed in Liberty I believe.

INFO 2015-09-24 15:09:27.172 USER AUTH: success: nova
INFO 2015-09-24 15:09:27.524 USER AUTH: success: icinga
INFO 2015-09-24 15:09:27.800 USER AUTH: success: neutron
INFO 2015-09-24 15:09:27.800 USER AUTH: failure: neutron

Logs user/project CRUD events

Note again that V2 issues in Kilo leave us with less than full info.

USER CREATED: success: user ffflll at 2015-09-18 16:00:10.426372 by unknown (unknown) (project: unknown (unknown)).
USER DELETED: success: user ffflll at 2015-09-18 16:02:13.196172 by unknown (unknown) (project: unknown (unknown)).

Figures it out when rabbit goes away

INFO 2015-11-11 20:46:59.325 Connecting to
ERROR 2015-11-11 22:16:59.514 Socket Error: 104
WARNING 2015-11-11 22:16:59.515 Socket closed when connection was open
WARNING 2015-11-11 22:16:59.515 Disconnected from RabbitMQ at (0): Not specified
WARNING 2015-11-11 22:16:59.516 Connection closed, reopening in 5 seconds: (0) Not specified


This requires that Keystone is configured to talk to rabbit and emit CADF events. The previously referenced blog from Steve Martinelli has good info on this. Here’s what I set:
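
Roughly speaking, the relevant Kilo-era keystone.conf settings look something like this (option names and sections vary a bit between releases, and the RabbitMQ host and credentials are placeholders):

[DEFAULT]
# Emit notifications in CADF format rather than the basic format.
notification_format = cadf
# Send notifications over the message bus instead of just logging them.
notification_driver = messaging
rpc_backend = rabbit

[oslo_messaging_rabbit]
rabbit_host = rabbit.example.com
rabbit_userid = keystone
rabbit_password = secret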


This code also assumes that /var/log/keystone_cadf is there and writable. I set this up with puppet in my environment.

You should ensure Keystone is talking to RabbitMQ and has created the queues and exchanges before trying the program.


I designed this to run in a docker container, which explains the overly full requirements.txt; you can probably get away with the requirements.txt.ORIG. After you build it (python ./setup.py build && python ./setup.py install), just run it by passing in creds for Keystone and for RabbitMQ. You can also use environment variables, which is how I ran it in my docker container.

source openrc
keystone-cadf-logger --rabbit_user rabbit --rabbit-pass pass1 --rabbit-host


So what issues exist with this? First, some small ones. The code that parses the events is horrible and I hate it, but it worked. You can probably improve it. Second, the big issue. In our environment this code introduced a circular dependency between our control nodes, where rabbit runs, and our keystone nodes, which now need to talk to rabbit. For this reason, we ended up not deploying this code, even though I had all the puppet and docker portions working. If you don’t have this issue, then this code will work well for you. I also don’t have much operating experience with this; it might set all your disks on fire and blow up in spectacular fashion. I planned to deploy it to our dev environment and tweak things as needed. So if you operate it, do it cautiously.


If you are interested in more event types, just change the on_message code. You might also want to change the action that happens. Right now it just logs, but how about emailing the team anytime a user is removed or noting it in your team chat.


This code consists of a few parts and I hope at least some of it is useful to someone. It was fun to write and I was a bit disappointed that we couldn’t fully use it, but I hope that something in here, even if it’s just the async rabbit code, might be useful to you. But what about our requirement? Well, we’ll probably still log CADF events locally on the Keystone node and consume them, or we might write a pipeline filter that does something similar; whatever we decide, I will update on this site. So please pull the code and play with it!

Github Link


Keystone Token Revocations Cripple Validation Performance

Having keystone token revocation events cripples token validation performance. If you’ve been following any of the mailing list posts on this topic, then you already know this since it’s been discussed (here) and (here). In this post I explore the actual impact and discuss what you can do about it.

What Are Revocations?

A token is revoked for any number of reasons, but basically when it’s revoked, it’s invalid. Here are some of the reasons that revocation events will be generated:

  • The token is intentionally invalidated via the API
  • A user is deleted
  • A user has a role removed
  • A user is removed from a project
  • A user logs out of Horizon
  • A user switches projects in Horizon

Of these events the last two are by far the most common reasons that revocation events are being generated in your cloud.

How Are Revocation Events Used?

How this works varies some based on the token type, but let’s assume that a token comes in that is non-expired. We know that either from decrypting it (Fernet) or from looking it up in the DB (UUID). But before Keystone can bless the token, it needs to check the revocation table to ensure that the token is still valid. So it loads the table called revocation_event and takes a peek. Also, when it does this load, Keystone does a little house-keeping and removes any revocation events for tokens that are already expired. The time a revocation event lives is the same as the token: it does not make sense to have a 3-hour-old revocation event when the longest a token can live is 1 hour. The unfortunate thing with this algorithm is that it locks the table, slowing down other revocations even more, and if it takes too long, it leads to deadlocks and failed API calls.
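
In SQL terms, that housekeeping pass boils down to something like the statement below. The real code goes through SQLAlchemy and the exact cutoff depends on your token lifetime (one hour is assumed here), so treat this as a sketch of the idea rather than the literal query:

mysql -u root keystone -e "DELETE FROM revocation_event WHERE revoked_at < (NOW() - INTERVAL 1 HOUR);"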

Why Should You Care About Token Validation?

Keystone token validation underlies every single API call that OpenStack makes. If keystone token validation is slow, everything is slow. Validation takes place, for example, when you make a nova call: nova has to be sure that the token is okay before performing the action.
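
If you want a concrete picture, a validation is just an extra HTTP round trip to Keystone. With the v2 admin API it looks roughly like this (the hostname is a placeholder, and in real life keystonemiddleware does this for the service rather than curl):

# Ask Keystone (admin endpoint) whether $USER_TOKEN is still valid, using a service/admin token.
curl -s -H "X-Auth-Token: $ADMIN_TOKEN" "http://keystone.example.com:35357/v2.0/tokens/$USER_TOKEN"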


If you want to see the experimental setup, skip below, but most of you will want the numbers first!

The chart below shows two runs of the benchmark, which checks concurrent token validations. You will see that as soon as you have revocation events, performance falls significantly. There are two lines on the chart. The first line, blue, is our current packaged version of Keystone, which is Kilo++/Liberty. The second line, in red, shows the performance of a version of Liberty from July 17 with this patch applied. The hope with the patched code is that smarter use of deletes would improve performance; it does not in a measurable way. It may however reduce deadlocks, but I am unable to validate that since my environment is not under any real load.

Benchmark Results

Note: Do not put too much stock into the fact that the red line starts slower than the blue; instead, focus on the shape of the curve. There are too many possible variables in my testing (like what my hypervisor is doing and all the other changes between versions) to compare them apples to apples.

Experimental Setup

For the experimental setup all the systems are guests running in our production cloud built using vagrant-openstack and our standard puppet automation code. The nodes are as follows:

  • 3 keystone nodes
  • 1 haproxy load balancer
  • a puppet master, which also runs the benchmarks

The nodes are running Ubuntu and a version of Keystone from master from May 2015. They are using Fernet tokens that expire after two hours. MySQL is set up as a 3-node Galera cluster that preferentially uses one node. The systems were not otherwise busy or doing much else.

The test itself tries to do 20 validations at once, up to 4000 of them. It talks to the load balancer, which is set up to do round-robin connections.
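
Conceptually, each run boils down to hammering the validation endpoint with ab, something like the command below. This is only a sketch of what the benchmark script described in the next section wraps (the hostnames are placeholders, and the real script also handles getting the tokens):

# 4000 token validations, 20 concurrent, against the load balancer.
ab -n 4000 -c 20 -H "X-Auth-Token: $ADMIN_TOKEN" "http://lb.example.com:35357/v2.0/tokens/$USER_TOKEN"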

Given all the variables here, I don’t expect you to replicate these numbers; rather, they should be viewed relative to each other.

Running the Benchmark

For the benchmark code, I used a modified version of Dolph’s benchmark experiment. The modified code is here (note that the detection of whether ab is installed is broken; feel free to send me a fix).


./ [Keystone Node or LB] [admin_password]

Generating Revoked Tokens

Here’s my kinda hacky script to generate and revoke tokens; it could be better if it just used curl for both. Usage is to pass in the number of tokens to create and then revoke as arg1, and a valid token that you’ve previously generated as arg2.

echo "getting & revoking $1 tokens"
for i in $(eval echo "{1..$1}")
TOKEN=`keystone token-get | grep id | grep -v tenant_id | grep -v user_id | awk '{ print $4 }'`
curl -X DELETE -i -H "X-Auth-Token: $2" "${OS_AUTH_URL}/tokens/${TOKEN}"
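
If you’d rather drop the keystone CLI entirely, a curl-only version would look roughly like this (the script name in the usage comment and the JSON-parsing one-liner are just illustrative):

#!/bin/bash
# Usage: ./revoke_tokens.sh <number of tokens> <valid admin token>
echo "getting & revoking $1 tokens"
for i in $(eval echo "{1..$1}"); do
    # Get a fresh token with curl instead of the keystone CLI...
    TOKEN=$(curl -s -X POST "${OS_AUTH_URL}/tokens" \
        -H "Content-Type: application/json" \
        -d '{"auth": {"tenantName": "'"$OS_TENANT_NAME"'", "passwordCredentials": {"username": "'"$OS_USERNAME"'", "password": "'"$OS_PASSWORD"'"}}}' \
        | python -c 'import sys, json; print(json.load(sys.stdin)["access"]["token"]["id"])')
    # ...then immediately revoke it.
    curl -s -X DELETE -H "X-Auth-Token: $2" "${OS_AUTH_URL}/tokens/${TOKEN}"
done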


Here are a few ideas I’d recommend. First, get a baseline of how many revocations you have on a regular basis; this should mainly be from people signing out of Horizon or switching projects in Horizon. For us it’s about 20-30. This is how you check:

mysql -u root keystone -e "select count(id) from revocation_event;"

Once you get a normal number, I’d recommend putting a threshold check into Icinga.
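
A minimal Nagios/Icinga-style check for that threshold could look something like this (the warn/crit values and DB credentials are placeholders; tune them to your baseline):

#!/bin/bash
# check_revocation_events <warn> <crit> -- alert when the table grows past your baseline.
WARN=${1:-100}
CRIT=${2:-500}
COUNT=$(mysql -u root keystone -Nse "select count(id) from revocation_event;")
if [ "$COUNT" -ge "$CRIT" ]; then
    echo "CRITICAL: $COUNT revocation events"
    exit 2
elif [ "$COUNT" -ge "$WARN" ]; then
    echo "WARNING: $COUNT revocation events"
    exit 1
fi
echo "OK: $COUNT revocation events"
exit 0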

Watch your testing too; we have some regression tests that create users, roles, etc., and generate about 500 revocation events.

If you have a spike of events, and you’re not worried about rogue users, you can simply truncate the table.

mysql -u root keystone -e "truncate table revocation_event;"

This has security implications so make sure you know what you are doing.

Another idea is writing a no-op driver for revocations; this essentially disables the feature and again has security implications.

Finally, I’d recommend enabling caching for revocation events. You still get the same curve, but you’ll start out at a higher performance level.
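
Enabling it is a keystone.conf change along these lines (a sketch for a Kilo-era config; the memcached location is a placeholder and exact option names vary by release):

[revoke]
# Cache revocation events so every validation doesn't hit the table.
caching = true

[cache]
enabled = true
backend = dogpile.cache.memcached
backend_argument = url:127.0.0.1:11211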


Fernet Tokens in Prod

This post is a follow-up to my previous post about Fernet Tokens which you may want to read first.

Last night we upgraded our production OpenStack to a new version of keystone off of master from a couple weeks ago and at the same time switched on Fernet tokens. This is after we let the change soak in our dev and staging environments for a couple weeks. We used this time to assess performance, look for issues, and figure out our key rotation strategy.

The Upgrade

All of our upgrade process is run via ansible. We cherry-pick the change which includes pointing to the repo with the new keystone along with enabling the Fernet tokens and then let ansible drive puppet to upgrade and switch providers. During the process, we go down to a single keystone node because it simplifies the active/active database setup when running migrations. So when this node is upgraded we take a short outage as the package is installed and then the migrations run. This took about 16 seconds.

Once this is done, the other OpenStack services start freaking out. Because we’ve not upgraded to Kilo yet, our version of Keystone middleware is too dumb to request a new token when the old one stops working. So this means we have to restart services that talk to Keystone. We ended up re-using our “rabbit node died, reboot OpenStack” script and added glance to the list since restarting it is fairly harmless even though it doesn’t talk to rabbit. Due to how the timing works, we don’t start this script until puppet is completely done upgrading the single keystone node, so while the script to restart services is quick, it doesn’t start for about 90 seconds after Keystone is ready. This means that we have an API outage of 1-2 minutes. For us, this is not a big deal, our customers are sensitive to “hey I can’t get to my VM” way more than a few minutes of API outage, especially one that’s during a scheduled maintenance window. This could be optimized down substantially if I manually ran the restarts instead of waiting on the full puppet run (that upgrades keystone) to finish.

Once the first node is done we run a full validation suite of V2 and V3 keystone tests. This is the point at which we can decide to go back if needed. The test suite for us took about 2 minutes.

Once we have one node upgraded, OpenStack is rebooted, and validation passes, we then deploy the new package and token provider to the rest of the nodes and they rejoin the cluster one by one. We started in the opposite region so we’d get an endpoint up in the other DC quickly. This is driven by another ansible job that runs puppet and does the nodes one by one.

All in all we finished in about 30 minutes, most of that time was sitting around. We then stayed an extra 30 to do a full set of OpenStack regression tests and everything was okay.

At the end I also truncated the token table to get back all the disk space it was using.

Key Rotation

We are not using any of the built-in Keystone Fernet key rotation mechanisms. This is because we already have a way to get code and config onto all our nodes and did not want to run the tooling on a keystone node directly. If you do this, then you inadvertently declare one node a master and have to write special code to handle this master node in puppet or ansible (or whatever you are using). Instead we decided to store the keys in eyaml in our hiera config. I wrote a simple python script that decrypts the eyaml and then generates and rotates the keys. Then I will take the output and propose it into our review system. Reviewing eyaml encrypted keys is somewhat useless, but the human step is there to prevent something dumb from happening. For now we’re only using 3 keys; since our tokens last 2 hours, we can’t do two rotations in under two hours. The reviewer would know the last time a rotation was done and the last time one was deployed. Since we don’t deploy anywhere near a two hour window, this should be okay. Eventually we’ll have Jenkins do this work rather than me. We don’t have any firm plans right now on how often we’ll do the key rotation, probably weekly though.

To answer a question that’s come up: there is no outage when you rotate keys. I’ve done five or six rotations, including a few in the same day, without any issues.


I will be doing a full post later on about performance once I have more numbers, but the results so far are that token generation is much faster, while validation is a bit slower. Even if it were about the same, the number of problems and database sync issues that not storing tokens in the DB solves makes the switch worthwhile. We’re also going to (finally) switch to WSGI and I think that will further enhance performance.


Today one of my colleagues bought a bottle of Fernet-Branca for us. All I can say is that I highly recommend not doing a shot of it. Switching token providers is way less painful. (Video of said shot is here)


Fernet Tokens for Fun & Profit

I’ve been digging into Fernet tokens this past week and getting ready to switch us over to using them. This is the first in a series of blog posts I plan on writing about them. This one will mainly be background on why we’re switching and what we hope to gain. The next post will cover rolling them out which will probably be in a few weeks. For now we’re running these keys in our dev environments for more testing while we focus resources on Kilo upgrades.

What are Fernet Tokens?
How do you explain Fernet tokens? Rather than some lengthy treatise on mathematical and identity management theory, just know this: Fernet tokens use shared private keys to avoid having to store or replicate tokens in your database. This makes them super fast, reduces load on your database, and solves replication lag between data centers and nodes within a data center. If your manager asks you, “they’re faster, smaller, and reduce load on the DB” should suffice. Dolph Mathews has a good write-up on how much faster they are here. You can also dive into the different token formats for comparison in another of his posts, here.
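
For reference, on a Kilo-era Keystone the switch itself is just a couple of keystone.conf settings along these lines (a sketch; the provider path changed in later releases, and the key repository and key count shown here are just the commonly used defaults):

[token]
provider = keystone.token.providers.fernet.Provider

[fernet_tokens]
key_repository = /etc/keystone/fernet-keys/
max_active_keys = 3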

What Issues will this solve for us?

Now about the DB replication issues… I cannot tell you how much stuff we had to do to deal with database and replication issues with UUID tokens; here are a few samples:

  • custom cron job to reap expired tokens
  • force db transactions to a master, despite us being active/active so tokens would be there when asked for
  • hacks to our cross-region icinga checks to allow the tokens to replicate, literally sleep(3)

We’ve even had a service accidentally DOS us by requesting so many tokens the DB couldn’t keep up and keystone ran out of DB threads. Hopefully all this is solved by Fernet tokens.

Will switching cause an outage?

Switching token providers will cause an outage. All the old tokens you’ve issued become 100% useless, so prep accordingly. I will give some updates in the next blog post on how long this took and what issues we saw when we did it.

Do I need to be on Kilo/Liberty?

  • Horizon – you need a newish copy of django_openstack_auth which I think is in Liberty
  • Keystone – you need to be on Kilo
  • python-keystonemiddleware – it’s best to have at least 1.1.0. If you have 1.0, you MUST restart all OpenStack services after switching tokens
  • Everything Else – Shouldn’t matter!

A note on python-keystonemiddleware. In 1.0.0, if a service (say Nova) can’t use its token for some reason, it won’t try to get a new one until the old one expires. So if you switch to Fernet tokens you have to restart all OpenStack services that talk to Keystone or they will not work. We already have some ansible to do this, mainly in response to RabbitMQ issues, but it works here too.

How do I get Keys onto the boxes?

All keystone nodes in your cluster need to have the same keys. Fortunately there is the concept of rotation, so there’s no outage when switching keys: there’s always a key that’s “up next” or “on-deck,” so that when you’re rotating you switch to a key that’s already on every box. Now, as for getting the keys there: I’m going to use puppet to deploy keys that I store in hiera and rotate with a jenkins job, but there are other ways, like a shared FS or rsync. More details on my method in a later blog post, once I know it works!

How does key rotation work?

What Fernet Rotation Looks Like

If you read through the information on Fernet tokens, key rotation is by far the most confusing part. I’ve sat down with pen and paper and now think I get it, so allow me to explain. I’m going to use a 3-key example here; the keys are named with numbers. I highly encourage you to set up a throwaway Keystone box and use keystone-manage fernet_rotate if you don’t follow this.

You need to know 4 rules about how these keys work first:

  1. The highest numbered key is the current signing key.
  2. The 0 key is the key that will become the next signing key.
  3. All other keys are old keys; they’ve been used in the past and there might be old tokens out there still signed with them, depending on your expiration schedule.
  4. New keys are always created as key 0.

Starting position, per the rules above.

  • 0 – this is the on-deck key; after the next rotation, it’s primary.
  • 1 – this is the old key; it used to be primary, and it’s still here in case any old tokens are still signed with it. Next rotation it gets deleted.
  • 2 – this is the current primary key that’s used for signing.

Now we do a Rotation…

  • 0 becomes 3
  • 1 gets deleted
  • 2 stays 2
  • a new key becomes 0
A Fernet Rotation in Action
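
Here’s a rough shell sketch of what a rotation does, just to make the rules above concrete. It assumes the keys live as numbered files in /etc/keystone/fernet-keys (the usual key_repository); in real life keystone-manage fernet_rotate, or whatever tooling you build around it, does this for you and also handles ownership and permissions:

#!/bin/bash
# Illustrative only -- mirrors the rotation rules described above.
KEY_DIR=${KEY_DIR:-/etc/keystone/fernet-keys}
MAX_ACTIVE_KEYS=3

# The staged key (0) is promoted: it takes the next highest number,
# and the highest-numbered key is always the current signing key.
HIGHEST=$(ls "$KEY_DIR" | sort -n | tail -1)
mv "$KEY_DIR/0" "$KEY_DIR/$((HIGHEST + 1))"

# A brand new key is always created as 0 (the next on-deck key).
# A Fernet key is 32 random bytes, urlsafe-base64 encoded.
head -c 32 /dev/urandom | base64 | tr '+/' '-_' > "$KEY_DIR/0"

# Old keys beyond max_active_keys are dropped, lowest numbers first
# (in the 3-key example above, this is where key 1 gets deleted).
while [ "$(ls "$KEY_DIR" | wc -l)" -gt "$MAX_ACTIVE_KEYS" ]; do
    OLDEST=$(ls "$KEY_DIR" | sort -n | grep -v '^0$' | head -1)
    rm "$KEY_DIR/$OLDEST"
done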

So How does this work?

Let’s pretend we have a few tokens since this is a running OpenStack cluster. All tokens before the rotation above are signed with 2. We do the rotation; now new tokens are signed with 3. When a token comes in, Keystone tries both 3 and 2 to decode the token, and either should work. At this point we CANNOT rotate again until no more active tokens are signed with 2, because 2 is going to be deleted! This means you need to have more keys if you plan on rotating more frequently or have a long token expiration time. We’re going to rotate roughly weekly, and we have a 2 hour token timeout, so 3 keys is plenty.


If you think you get this, try this homework problem. Assume that you have max_active_keys set to 5 and that you have 5 keys: 0, 4, 5, 6, 7.

  • Which is the current signing key?
  • Which is on-deck or the next key to be used? What will its number be after the rotation?
  • Which key will be deleted on next rotation?
  • What happens if a token comes in signed with key 5?
  • What happens if a token comes in signed with key 3?

Other Sources

I gathered a lot of this info from trying stuff but also a lot from blog posts. I’ve referenced two above, but I also want to recommend Lance Bragstad’s blog. Note, Lance’s blog is the only blog in the world where you can read about quinoa recipes and shotgun shot patterns.


What I Hope to Get From the OpenStack Vancouver Summit

Reproduced from content I wrote for SuperUser.

Matt Fischer, principal engineer at Time Warner Cable, shares survival tips for the Summit: backup plans, a beer list and a talk that he promises will be better than “Cats.”

May 11, 2015

The Vancouver summit is about a week away and so it’s time to start my prep work for the summit. This first means making a list of talks I want to go to. It also means making a list of any people I want to meet in person to talk to or people I owe a beer (or three) to. Finally, it means a list of things I want to accomplish at design sessions. If you’ve never been to a Summit before, what you get out of it really depends on how well you dig into the schedule and do some advanced planning.

So with that in mind, here’s my thinking.

Planning for Talks

I have some main focus areas for talks, things I want to come away from the summit knowing more about. These include things I’m the “owner” of in the Time Warner Cable OpenStack cloud, but also things I’m just curious about, so in no particular order:

  • Operations – I always want to know how to do things differently or better, so you’ll see me around these rooms a lot. I’m specifically interested in Upgrades, CI/CD, and integrating new features into our cloud.
  • Heat – I’m specifically interested in application catalog capabilities.
  • Designate – I’ve worked some on this in the past few months.
  • Neutron – I always want and need to know more about Neutron, even though it’s not my focus area.

Now with talks you can simply find the ones you like, and if you create an account and sign in, you can add them to your personalized schedule. This is what I’d recommend to help plan your day. But there are a few more tricks that I’d recommend you use.

Review your schedule every morning. If you’re in Vancouver with a team you can divide up talks if you have time conflicts. You may also find that you’ve changed your mind or the schedule has changed, hence the morning review.

Have a backup talk. Sometimes talks are full or maybe you go and it’s not for you. You always need a backup talk.

Make a list of talks you want to watch later. All the talks end up on the OpenStack Foundation’s Youtube channel. Make a list of talks that you didn’t or won’t get to and watch them later. This does not work well for hands-on sessions, so I always opt to go to them.

If you can’t pick a talk by subject, pick by speaker. The speaker makes the talk sometimes more than the subject.

Don’t be afraid to have free time. The summit can be grueling. Leave a space or two in your schedule and visit the vendors or go take a nap.

Planning for People

OpenStack is community driven and the community is made of people. Take time to say “hi” to the people you’ve talked to on IRC or mailing lists.

Take the time to thank someone who fixed a bug for you or better yet buy them a beer. You cannot overstate the value of having a beer with someone you’ve only previously met online. I cannot emphasize this enough.

Who are the lucky ones on my beer list this year? You’ll have to wait and see, hopefully I’m on yours!

Planning for Design Sessions

For the first time, OpenStack Puppet will be a real OpenStack project and so while previously we’ve had an hour to discuss stuff, this time we will have a full day for design work. Through lots of work over the past year, I’ve become core in OpenStack Puppet, and I hope to spend a good part of my day Tuesday participating in live discussions and work sessions. We have lots of stuff to discuss, the largest item which is dear to me, is when the master branch can drop support for the old stable release. If you’ve been active in other projects, you may have similar issues like this that need closure. These are usually easier to figure out in a room rather than in an IRC meeting. However, many of the design sessions are already planned so it may be too late to get something on the agenda, but it’s not too late to attend and participate. I’d recommend making a list of the things you want to cover and seeing how that lines up with the design session schedule. Please note that the Design Summit and the OpenStack Summit use different schedules hosted on different pages.

Planning for Parties

The OpenStack Summits have a large after-hours social aspect. These are valuable to attend just for camaraderie — you don’t have to drink beer to go and have fun. I generally go to as many of these as possible, they are usually pretty great.

You’re Invited!

I hope to see everyone in Vancouver, and would like to especially invite you to see some of my talks. The descriptions are pretty good, but I thought I’d say a few things about what you could expect to get out of each of them, here they are in chronological order with links so you can add them to your schedule.

Building Clouds with OpenStack Puppet Modules If you’re curious about how companies use Puppet to deploy OpenStack or about how our community works, you should attend this talk. You will get a couple different views and ideas on using Puppet from Mike Dorman and me, and hear about the community from Emilien Macchi. Monday, May 18 • 4:40 p.m. – 5:20 p.m.

A CI/CD Alternative to Push and Pray for OpenStack There are lots of CI talks this year, but I promise you will learn something new at this one. Clayton O’Neill and I will cover lots of topics and tools and you will see how we use these tools to get code and config from concept to production. Tuesday, May 19 • 12:05 p.m. – 12:45 p.m.

Getting DNSaaS to Production with Designate Have your customers been asking for DNSaaS (DNS as-a-Service)? Do you plan on having several people working on it full time as core Designate developers or would you rather just get it deployed with the minimum of pain? If the latter, then this is the talk for you. Clayton and I will cover what work is required, what work we did, and what to watch out for. One special thing that we will cover is how to write your own (or use our) Designate Sink which lets you automatically create records every time a new floating IP is assigned. Wednesday, May 20 • 9:50 a.m. – 10:30 a.m.

Real World Experiences with Upgrading OpenStack at Time Warner Cable There are also lots of Upgrade talks at this summit. In this one, Clayton and I will be telling a story of what happened when we upgraded to Juno. Even though I wasn’t smart enough to put “beer” in the title of my upgrade talk, maybe you can learn some lessons or get some ideas from us. You’ll laugh, you’ll cry, it will be better than “Cats.” Thursday, May 21 • 2:20 p.m. – 3:00 p.m.


Scale & Maturity: Thoughts on the OpenStack Mid-Cycle Operators Meetup

A re-post of an article I wrote last week for SuperUser.

A couple weeks back I attended the OpenStack Operators Mid-Cycle Meetup in Philadelphia.

One of the best parts of these meetups is that the agenda is all formed in public and everyone gets a say in it. Some of the things I was most concerned about made the list (RabbitMQ, packaging), some did not get on the main agenda (database maintenance, Puppet), but many side groups formed during the event and these topics were covered at lunch conversations and in the lobby.

The interesting part of this summit was hearing other operators’ problems and solutions. This was more focused, yet with a larger audience, than the previous sessions in Paris. I think a real sense of camaraderie and support for shared solutions was achieved.

Puppet OpenStack Discussion at Lunch

As I was listening to people discuss their issues with OpenStack and how others had solved them, I realized that OpenStack operators have different issues at different scales and maturity levels. When I think about scale and maturity, it’s not necessarily about the number of nodes or the number of customers; it’s more about the number of resources you have, the number of services you provide, the maturity of your processes (such as deployment), and to some extent how many of your problems you’ve solved with automation.

Our team started at a small scale. We had four people and the goal was to stand up OpenStack. With four people, you are limited in scope and have to judiciously focus your resources. As our team grew and we worked through forming our deployment and automation processes, we were able to spend more time on improving our service offerings and adding more services. We can also go back and clean up the technical debt which you accumulate as you build an OpenStack deployment. Before these tools and processes are fully in place (and they are never perfect), making changes can take away valuable time.

For example, when an operator finds a bug, it takes a lot of resources and time to get a fix for that bug into production. This includes debugging, filing a bug, fixing the code, pushing a fix, begging for reviews, doing a cherry-pick, begging for more reviews, waiting for or building a package, deploying the new code. Many operators stop around the “filing a bug” step of this process.

Medium-sized operators or ones with more mature processes will sometimes work on a fix (depending on the issue), and may or may not have systems in place to allow them to hold the patch locally and build local packages. On the other hand, larger operators who have good systems in place can give the issue to their team. They may have 10 people working on it, some of them core members. They have a full continuous integration/automation team that has solved package builds and deployments for them.

Our goal has always been to increase our scale of services, not only via more resources but through automation and creating tools/processes that allow us to offer services for the same amount of resource investment. The main reason for this is that our customers don’t care about Keystone or Neutron; these are just building blocks for them. They really want services like Domain Name System (DNS), load-balancing-as-a-service (LBaaS), firewall-as-a-service (FWaaS) and database-as-a-service (DBaaS). But until the processes and tools are solid for the core components, it’s hard to find time to work on those, because while customers may not know what Keystone is, they sure care when it doesn’t work.

So how does any of this relate to the conference, besides the fact that I was daydreaming about it in the lobby? What is clear to me after the sessions is that we have some specific areas where we’re going to work on process improvements and tooling.

My top three are:

  1. RabbitMQ monitoring and tooling
  2. Speeding up and clarifying our development & deployment process
  3. Investigating alternatives to heavy-weight OS packages for deploying OpenStack code

When we revisit this again in six months at the next Mid-Cycle, I suspect that number two will remain on the list, probably forever, since you can always make this process better. I’m certain we’ll have a new number one, and I’m pretty hopeful about the options for number three.

What will these investments get us? Hopefully more time for second-level service offerings and happier customers.


Using Keystone’s LDAP Connection Pools to Speed Up OpenStack

If you use LDAP with Keystone in Juno you can give your implementation a turbo-boost by using LDAP connection pools. Connection pooling is a simple idea. Instead of bringing up and tearing down a connection every time you talk to LDAP, you just reuse an existing one. This feature is widely used in OpenStack when talking to mysql and adding it here really makes sense.

After enabling this feature, using the default settings, I got a 3x-5x speed-up when getting tokens as a LDAP authenticated user.

Using the LDAP Connection Pools

One of the good things about this feature is that it’s well documented (here). Setting this up is easy. The tl;dr is that you can just enable two fields and then use the defaults and they seem to work pretty well.

First, turn the feature on; nothing else works without this master switch:

# Enable LDAP connection pooling. (boolean value)
use_pool = true

Then if you want to use pools for user authentication, add this one:

# Enable LDAP connection pooling for end user authentication. If use_pool
# is disabled, then this setting is meaningless and is not used at all.
# (boolean value)
use_auth_pool = true

Experimental Setup

For my experiment I used a virtual keystone node that we run on top of our cloud, pointing at a corporate AD box using ldaps. Using an LDAP user, I requested 500 UUID tokens in a row. We use a special hybrid driver that uses the user creds to bind against ldap and ensure that the user/pass combo is valid. I also changed my OS_AUTH_URL to point directly at localhost to avoid hitting the load balancer. Finally, I’m using eventlet (keystone-all) vs apache2 to run Keystone. According to the Keystone PTL, Morgan Fainberg, “under apache I’d expect less benefit.” If you’re not using eventlet, ldaps, or my hybrid driver you might get different results, but I’d still expect it to be faster.

Here’s my basic test script:

export OS_USERNAME=admin
export OS_PASSWORD=password
export OS_TENANT_NAME=admin
export OS_REGION_NAME='dev02'
export OS_AUTH_STRATEGY=keystone
export OS_AUTH_URL=http://localhost:5000/v2.0/
echo "getting $1 tokens"
for i in $(eval echo "{1..$1}"); do
curl -s -X POST http://localhost:5000/v2.0/tokens \
-H "Content-Type: application/json" \
-d '{"auth": {"tenantName": "'"$OS_TENANT_NAME"'", "passwordCredentials": {"username": "'"$OS_USERNAME"'", "password": "'"$OS_PASSWORD"'"}}}' > /dev/null
done


Using the default config, it took 7 minutes, 25 seconds to get the tokens.

getting 500 tokens
real 7m25.527s
user 0m2.312s
sys 0m1.557s

I then enabled use_pool and use_auth_pool and restarted keystone; the results were quite a bit faster, a 5x speed-up. Wow.

getting 500 tokens
real 1m25.774s
user 0m2.302s
sys 0m1.539s

I ran this several times and the results were all within a few seconds of each other.

I also tried this test using the keystone CLI and the results were closer to 3.5x faster, still a respectable number.

Watching the Connections

I have a simple command so I can see how many connections are being used:

watch -n1 "netstat -an p tcp | grep :3269"

Using this simple code I can see it bounce between 0 and 1 without connection pools.

Using the defaults but with connection pools enabled, the number of connections was a solid 4. Several minutes after the test ran, they died off and the count went back to 0.

At first I wasn’t sure why I didn’t get more than 4, and raising the pool counts did not change this value. It turns out this is because I have 4 workers on this node.

Tokens Are Fundamental

The coolest part of this is that this change speeds everything up. Since you need a token to do anything, I re-ran the test but just had it run nova list, cinder list, and glance image-list 50 times using the clients. Without the pooling, it took 316 seconds but with the pooling it took 231 seconds.
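
That re-run was nothing fancy; it was roughly a loop like the one below, run under time (a sketch using the old per-project CLIs from that era):

# 50 iterations of three read-only calls, each of which triggers a token validation.
for i in {1..50}; do
    nova list > /dev/null
    cinder list > /dev/null
    glance image-list > /dev/null
done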


There are lots of ways to improve the performance of OpenStack, but this one is simple and easy to set up. The puppet code to configure this is in progress now. Once it lands, I plan to move this to dev and then to staging and prod in our environments. If I learn any other interesting things there, I’ll update the post.


What Do Operators Do All Day?

When I was a kid, Richard Scarry’s book, “What Do People Do All Day” was one of my favorites. I saw this book at my parents’ house and I was thinking about trying to categorize everything I’ve worked on in the past few months, so the result of that thinking is this post: “What Do Operators Do All Day?”

Being an operator means that you need, by necessity, to be a jack of all trades, unless you’re at a very large provider. And so, over the past 6 months, I’ve worked on almost every piece of our cloud, and in almost all cases I learned something new and grew my skillset (which is my favorite part of working on OpenStack).

Collecting My Data

Over the last 6 months or so, I’ve resolved 106 JIRA issues, and looking back at these provides a decent picture of where I spend my tracked work time. I’ve also done upstream reviews and commits, for which stackalytics will provide good details. Using this information, I’ll present where I do my work in order from most time to least.

Puppet Automation

I spend most of my time these days working on puppet modules or configuring services with puppet. Some of this work includes:

  • Fixing/configuring/enabling new features in services like Keystone/Nova/etc
  • Upgrading our puppet branches from Icehouse to master
  • Configuring build server & infra (cobbler, puppet, package repos etc)
  • Configuring/deploying Icinga, or writing new checks
  • Refactoring and cleanup, like moving all our keystone roles/users to YAML so that they’re simpler to add


A close second is Ansible automation. We use ansible to manage our internode dependencies and also to drive our deployments. One example of what we’d use ansible for is to upgrade mysql, one node at a time, managing state between nodes while doing so. Over the past six months I’ve written ansible jobs to:

  • Deploy a new hand-built version of ovs
  • Perform a live upgrade of mysql from 5.5 to 5.6
  • Upgrade openstack services from I to J or J to K
  • Improve our weekly deployment process


Some of these tasks don’t show up in Jira, but they do take a good amount of my time.

  • Travel/training: Openstack conf, RabbitMQ training, etc
  • Planning: sprint planning, feature planning, expansion planning, etc
  • Mentoring and on-boarding: we’ve grown a lot and this one cannot be overstated. I do about 5-10 code reviews per day even when I’m not answering questions
  • Working on Ubuntu packaging for openstack, we roll (some) of our own
  • MySQL/Galera DBA-esque work

On Call/Issues

Every few months, I do an on-call rotation for a week; these can be quiet or not, depending on the shape of our monitoring and cloud. Whether it’s a good or bad on-call is usually of our own doing, however. Even when not on call, I deal with issues; although we do our best, we occasionally have problems. When you get enough nodes, you’ll get failures. They could be hardware failures, kernel issues, or even simply software failures; we’ve had them all. I could do an entire post on the issues we see here, but the ones that stick out to me as focus areas for software are ovs, mysql, and rabbitmq. Those are probably the three most complex and most important pieces of our software stack, and so they get lots of my attention.


I think that the community is one of the best things about OpenStack, so I spend what little time I have left here. I participate in IRC and mailing list discussions as a part of the Operators and Puppet-OpenStack community. I also do reviews and submit fixes, primarily for puppet-openstack but also for OpenStack itself. Although my commits to OpenStack itself have slowed, I’ve earned my 3rd ATC for Vancouver and I think it’s important to participate in this process.


One of my first concerns that I expressed when interviewing for this job was that we’d have OpenStack set up in a year and then we’d be done. That has been far from the truth. In reality, the life of an OpenStack operator is always interesting. There’s no shortage of things to fix, things to improve, and things to learn, and that’s why I love it. Although each release of OpenStack generally makes things easier and more robust, they also always add more features along the edges to keep us busy.

Does this list match what everyone else spends their time on? Let me know in the comments.


My Key Learning from Paris

I learned a bunch of things in Paris: French food is amazing, always upgrade OVS, but the most important thing I learned in Paris?

All operators have the same problems.

Every operator session was a revelation for me. As it turns out, I’m not the only one writing Icinga check jobs, Ansible scripts, trying to figure out how to push out updates, or fighting with ovs. These sessions not only provided validation to let me know that we’re not doing stuff totally wrong, but also allowed everyone to share solutions. Below are some of the themes that I found particularly interesting or relevant to this premise.


One topic in which operators have a common interest was upgrades. Only a few of the operators have any real experience with this, but it’s been a pain point before for many. Some have resorted to fork-lift style “bring up a new cluster” upgrades, which is not nice for customers and requires extra hardware. How do we solve this? It seems that some projects have upgrade guides, but finding documentation on a holistic upgrade is difficult, especially gathering information on ordering (Cinder before Nova, for a contrived example). Issues specifically called out in the session were config changes (deprecations and new additions) and database migrations (including rollback). Rollback is especially worrying as it’s not well tested. There was no solution for this except a resolve to share information on upgrading via the Operators Mailing List.

Etherpad for Upgrades

CI/CD & Packaging

Another issue that operators face after an OpenStack deployment is how to get fixes and new code out to the nodes. An upstream bug fix might take a few days, plus a week for a backport, plus a month for the distro to pick it up. This means that even if the operator fixes it themselves, they still have a delay in getting it released. During this delay you might be impacting customers who may not be that patient. The solution for many is using a custom CI/CD system that builds “packages” of some sort, whether distro packages or custom-built venvs. It was interesting to hear that people have a myriad of solutions here. We use a toolchain quite similar to the upstream OpenStack tool chain that outputs Ubuntu packages. However, even with this method we still rely on dependencies and libraries provided by the Ubuntu Cloud Archive, as much as possible anyway.

There were a bunch of talks on this subject, not just operator talks; here are links to the ones I attended:

  • CI/CD Pipeline to Deploy and Maintain an OpenStack Iaas Cloud (HP)
  • CI/CD in Practice (Comcast)
  • Building the RackStack (Rackspace)
If you know other ones that I should go back and watch, please comment here.

Automation (Puppet)

We use Puppet to configure and manage our Openstack deployment, so this was an area of interest to me. It was great to see a 100% full room with everyone focused on improving how Openstack is configured with puppet. From discussions on new check jobs to a conversation about how to better handle HA, it was great to see the real sense of community around Puppet.


The second puppet session was more of a “how are you solving X” and “how could it be better”. This was also a great session with some interesting notes.

Towards a Project?

Finally, one of the more interesting things happened later in the week when Michael Chapman and Dan Bode grabbed me in the lobby and wanted me to preview an email that was about to go out. In brief, they were proposing an operations project. This was born of the realization that I’d also made: we’re all solving the same issues. This email led to an impromptu meeting of about 30 people in the lobby of the Meridien hotel and the beginnings of an Operations Project.

Birth of a New Project

It was not the ideal venue, but we still had a great discussion on a few topics. The first was, can we have a framework that will allow us to share code? We all agreed that this project’s purpose is not to bless specific tools (for example Icinga) but instead to allow us to share Icinga scripts alongside other monitoring tools. Although this had been tried before with a github repo, it had only a couple of contributions; hopefully this will improve. We then dove into all the different tools that people use for packaging and whether or not any could be adapted as general purpose. Having a good tool for Ubuntu/Debian packages seemed to be something in great demand. Adding debian support for Anvil seemed to be something worth investigating further.

Other topics of discussion included:

  • log aggregation tools and filters
  • ops dashboarding tools
  • Ansible playbooks
You can see the Etherpad link below for what options for each were discussed, but the best idea is to catch up on the mailing list threads, which are still ongoing, and add anything you have to share.


Operations Wiki

A Path Towards Contributing (via Commits) in OpenStack

For 2014, I had a goal of doing 12 contributions (commits) to OpenStack. I set this goal because I wanted to learn more about the different components of OpenStack and I wanted to contribute back to the project. In order to do this, I continued a pattern that I started when I first began working on Ubuntu, one which I consider to be a great way to come up to speed on and contribute to OpenStack. Sometime around May I stopped counting contributions and consider my goal to have been met(*). Since I consider that a success, I’d like to share my process with you here and hopefully it can inspire someone else.

Step 1: Learn About the Project
The first step was to learn more about the project. Reading is interesting, but diving in and using it is better, so I started with DevStack. This should be everyone’s first step. When you encounter something you don’t understand, go find a blog post or OpenStack Summit video about the subject.
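
If you’re starting that first DevStack, a minimal local.conf is enough to get going (the passwords are obviously placeholders):

[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD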

Step 2: Contributing with Bug Triage
Once I had a basic understanding of the parts of OpenStack, I started looking at bugs. Joining BugSquad and BugControl was the first thing I did when I did community work on Ubuntu, and doing triage for OpenStack is also a good first step. Start with this wiki page, especially if you’re new to Launchpad. By triaging bugs you can get an idea where the issues are and get a second-level understanding of the projects. A bug report might cause you to ask yourself, why is neutron talking to nova in this way? It could be something that was not obvious from playing with DevStack. During this process you can mark bug dupes, ask follow-up questions, and perform other triage work. For details on OpenStack bug triage work, read more here. The best part about Bug Triage for new committers? There is an unlimited and never-ending supply of bugs. So which to pick? I like to focus on areas that I have a basic understanding of and interest in. Pick a few projects that make sense to you and look there.

Step 3: Finding a Bug
The real mission during bug triage is digging for a golden nugget: your first bug. This is more difficult than you’d think. You need to find a bug that you can fix, something relatively simple, because your goal in fixing your first bug is to learn the process. During your bug triage work you may have seen some bugs tagged with “low-hanging-fruit”; these are issues that have been identified as ones that would be simple to fix, a good first choice. If you don’t see an obvious bug tagged thusly, I’d recommend starting with the python-*client projects. I find that this code is easier to understand, easier to test, and has more unclaimed issues. You can even just pull the source for one you like and look for FIXME notes. Another idea is unit tests; all projects love more tests. And finally, the docs can always use help.

Step 4: Working on the Bug
After finding your bug, get to work on fixing it. Before doing so you have some pre-work to do, like creating a launchpad account and signing the developer agreement. I’m not going to list all these steps, but it’s not too difficult. The next steps are fairly obvious: testing, reviewing, fixing, etc. A brief interlude on testing: I do all my testing during this process against DevStack or in DevStack. I have an up-to-date DevStack VM with me at all times and a git repo with my config changes for it. I’d recommend you do the same. Note, when you make your DevStack VM, give it enough RAM and disk; 2GB RAM + 4GB swap and a 20GB disk should be enough. As for code reviews, keep in mind that you will likely have to go multiple rounds on your code reviews. Most of my changes go at least 3-4 rounds, some more than 10, but don’t lose hope, it will eventually land and it will be awesome!

Anyway, that’s my process for getting to a first commit with OpenStack. Hopefully if you’re just getting started this method will work for you. Good luck finding your bug and doing your first commit! If you have some other ideas on any of the stuff I covered, please comment below.

* – I’m including StackForge commits in my count.