Keystone Token Performance: Liberty vs Mitaka

A number of performance improvements were made in Keystone Mitaka, including caching the catalog, which should make token creation faster according to the Keystone developers. In this blog post, I will test this assertion.

My setup is specific to how I run Keystone; you may be using different token formats, different backends, different web servers, and a different load balancer architecture. The point here is just to compare Mitaka and Liberty in my setup.

Keystone Setup

I’m running a 3 node Keystone cluster on virtual machines in my OpenStack cloud. The nodes are fronted by another virtual machine running haproxy, and the cluster uses round-robin load balancing. I request tokens from a separate virtual machine via the VIP provided by haproxy. The keystone nodes have 2 VCPUs + 4G RAM.

Keystone runs inside a Docker container under uwsgi, which is configured with 2 static threads.

  • The Mitaka code is based on stable/mitaka from March 22, 2016.
  • The Liberty code is based on stable/liberty from March 16, 2016.

Note: I retested with branches from April 17 and April 15 respectively; the results were the same.

Keystone is configured to use Fernet tokens and the mysql backend.

I did not rebuild the machines; the Mitaka runs are based on nodes upgraded from Liberty to Mitaka.

Experimental Setup

I am doing 20 benchmark runs against each setup, delaying 120 seconds in between each run. The goal is to even out performance variation caused by the fact that these are virtual machines running in a cloud. The tests run as follows:

  • Create 200 tokens serially
  • Validate 200 tokens serially
  • Create 1000 tokens concurrently (20 at once)
  • Validate 500 tokens concurrently (20 at once)

The code for running these benchmarks, which I borrowed from Dolph Mathews and made a bit easier to use, is available on github. Patches welcome.
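For a rough idea of what the concurrent validation step looks like under the hood, here is a minimal sketch using ApacheBench (ab) against the Keystone v3 validation API, assuming the openstack CLI is available. The VIP hostname is a placeholder, and the real benchmark script does more than this:

# Grab a token to validate and an admin token to validate it with.
ADMIN_TOKEN=$(openstack token issue -f value -c id)
SUBJECT_TOKEN=$(openstack token issue -f value -c id)

# Validate it 500 times, 20 at a time, and report requests per second.
ab -n 500 -c 20 \
  -H "X-Auth-Token: ${ADMIN_TOKEN}" \
  -H "X-Subject-Token: ${SUBJECT_TOKEN}" \
  http://keystone-vip.example.com:5000/v3/auth/tokens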

Results

Is Mitaka faster? Answer: No.

Something is amiss in Mitaka Fernet token performance and there is a serious degradation here.

The charts tell the story. Each chart below shows how many requests per second can be handled; concurrent validation is the most concerning, because it is the standard model of what a cloud is doing: dozens of API calls being made at once to tens of services, each of which wants to validate a token.

Liberty vs Mitaka: No Caching

So you can see that concurrent validation is much slower. Let’s also compare with memcache enabled:

Liberty vs Mitaka with Caching

Let’s look at the raw data, which is even more damning than the charts suggest given their scale:

Data

Notes

I spent some time thinking about why this might be slower, and I see one clue: the traffic to memcache (shown using the stats command) in Mitaka is 3-4x what it is in Liberty. Perhaps Keystone is caching too much or too often? I don’t really know, but it is an interesting difference.
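If you want to poke at this yourself, the memcached counters can be pulled with the stats command over netcat; the get/set totals are the interesting part (host, port, and the -q flag may need tweaking for your netcat version):

# Dump memcached counters and keep just the get/set totals.
echo stats | nc -q 1 localhost 11211 | egrep "cmd_get|cmd_set|get_hits|get_misses"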

I’m hopeful that this gets fixed or looked at in Newton and backported to Mitaka.

Possible sources of error:

  • These are VMs, could have noisy neighbors. Mitigation: Run it a whole lot. Re-run on new VMs. Run at night.
  • Revocation events. Mitigation: Check and clear revocation table entries before running perf tests.

I’d really like someone else to reproduce this, especially with a different benchmarking tool, to confirm this data. If you do, please let me know.

Keystone Token Revocations Cripple Validation Performance

Having Keystone token revocation events cripples token validation performance. If you’ve been following any of the mailing list posts on this topic, then you already know this, since it’s been discussed (here) and (here). In this post I explore the actual impact and discuss what you can do about it.

What Are Revocations?

A token is revoked for any number of reasons, but basically when it’s revoked, it’s invalid. Here are some of the reasons that revocation events will be generated:

  • The token is intentionally invalidated via the API
  • A user is deleted
  • A user has a role removed
  • A user is removed from a project
  • A user logs out of Horizon
  • A user switches projects in Horizon

Of these, the last two are by far the most common reasons revocation events get generated in your cloud.

How Are Revocation Events Used?

How this works varies a bit based on the token type, but let’s assume a token comes in that is not expired. We know that either from decrypting it (Fernet) or from looking it up in the DB (UUID). But before Keystone can bless the token, it needs to check the revocation table to ensure that the token is still valid. So it loads the table called revocation_event and takes a peek. While it does this load, Keystone also does a little house-keeping and removes any revocation events for tokens that have already expired. A revocation event lives as long as the token it applies to; it does not make sense to keep a 3 hour old revocation event when the longest a token can live is 1 hour. The unfortunate thing with this algorithm is that it locks the table, slowing down other revocations even more, and if it takes too long, it leads to deadlocks and failed API calls.
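To make the house-keeping step concrete, it is roughly equivalent to running something like the following on every load (an illustration only, not the actual query Keystone issues; the 2 hour interval stands in for your token lifetime):

mysql -u root keystone -e "delete from revocation_event where revoked_at < date_sub(now(), interval 2 hour);"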

Why Should You Care About Token Validation?

Keystone token validation underlies every single API call that OpenStack makes. If Keystone token validation is slow, everything is slow. Validation takes place, for example, when you make a Nova call: Nova has to be sure that the token is okay before performing the action.

Results

If you want to see the experimental setup, skip below, but most of you will want the numbers first!

The chart below shows two runs of the benchmark, which checks concurrent token validations. You will see that as soon as you have revocation events, performance falls significantly. There are two lines on the chart. The first line, in blue, is our current packaged version of Keystone, which is Kilo++/Liberty. The second line, in red, shows the performance of a version of Liberty from July 17 with this patch applied. The hope with the patched code was that smarter use of deletes would improve performance; it does not, at least not measurably. It may reduce deadlocks, but I am unable to validate that since my environment is not under any real load.

Benchmark Results

Note: Do not put too much stock into the fact that the red line starts slower than the blue; instead focus on the shape of the curve. There are too many possible variables in my testing (like what my hypervisor is doing and all the other changes between versions) to compare them apples to apples.

Experimental Setup

For the experimental setup, all the systems are guests running in our production cloud, built using vagrant-openstack and our standard puppet automation code. The nodes are as follows:

  • 3 keystone nodes
  • 1 haproxy load balancer
  • a puppet master, which also runs the benchmarks

The nodes run Ubuntu and a version of Keystone from master as of May 2015. They use Fernet tokens that expire after two hours. MySQL is set up as a 3 node Galera cluster that preferentially uses one node. The systems were not otherwise busy or doing much else.

The test itself does 20 validations at a time, up to 4000 in total. It talks to the load balancer, which is set up to do round-robin connections.

Given all the variables here, I don’t expect you to replicate these numbers; they should instead be viewed relative to each other.

Running the Benchmark

For the benchmark code, I used a modified version of Dolph’s benchmark experiment. The modified code is here (note that the detection of whether ab is installed is broken; feel free to send me a fix).

Usage:

./benchmark.sh [Keystone Node or LB] [admin_password]

Generating Revoked Tokens

Here’s my kinda hacky script to generate and revoke tokens; it could be better if it just used curl for both. Pass the number of tokens to create and then revoke as the first argument, and a valid token you’ve previously generated as the second.

echo "getting & revoking $1 tokens"
for i in $(eval echo "{1..$1}")
do
TOKEN=`keystone token-get | grep id | grep -v tenant_id | grep -v user_id | awk '{ print $4 }'`
curl -X DELETE -i -H "X-Auth-Token: $2" "${OS_AUTH_URL}/tokens/${TOKEN}"
done

Solutions?

Here are a few ideas I’d recommend. First, get a baseline of how many revocations you have on a regular basis; these should mainly be from people signing out of Horizon or switching projects in Horizon. For us it’s about 20-30. This is how you check:

mysql -u root keystone -e "select count(id) from revocation_event;"

Once you get a normal number, I’d recommend putting a threshold check into Icinga.
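Here is a sketch of what such a check could look like as a Nagios/Icinga-style plugin; the thresholds and DB credentials are made up, so adjust them to your baseline:

#!/bin/bash
# Alert when the revocation_event table grows well past its normal size.
WARN=100
CRIT=500
COUNT=$(mysql -u root keystone -N -s -e "select count(id) from revocation_event;")
if [ "$COUNT" -ge "$CRIT" ]; then
  echo "CRITICAL: $COUNT revocation events"; exit 2
elif [ "$COUNT" -ge "$WARN" ]; then
  echo "WARNING: $COUNT revocation events"; exit 1
else
  echo "OK: $COUNT revocation events"; exit 0
fi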

Watch your testing too; we have some regression tests that create users, roles, etc., and generate about 500 revocation events.

If you have a spike of events, and you’re not worried about rogue users, you can simply truncate the table.

mysql -u root keystone -e "truncate table revocation_event;"

This has security implications so make sure you know what you are doing.

Another idea is writing a no-op driver for revocations; this essentially disables the feature and again has security implications.

Finally, I’d recommend enabling caching for revocation events. You still get the same curve, but you’ll start out at a higher performance level.
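For reference, turning that on looks roughly like this in keystone.conf; the option names here are from memory and vary a bit between releases, so check the sample config for your version:

[cache]
enabled = true
backend = dogpile.cache.memcached
memcache_servers = 127.0.0.1:11211

[revoke]
caching = true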

Fernet Tokens in Prod

This post is a follow-up to my previous post about Fernet Tokens which you may want to read first.

Last night we upgraded our production OpenStack to a new version of keystone off of master from a couple weeks ago and at the same time switched on Fernet tokens. This is after we let the change soak in our dev and staging environments for a couple weeks. We used this time to assess performance, look for issues, and figure out our key rotation strategy.

The Upgrade

All of our upgrade process runs via Ansible. We cherry-pick the change, which points to the repo with the new Keystone and enables Fernet tokens, and then let Ansible drive Puppet to upgrade and switch providers. During the process, we go down to a single Keystone node because it simplifies the active/active database setup when running migrations. When this node is upgraded we take a short outage while the package is installed and the migrations run; this took about 16 seconds.

Once this is done, the other OpenStack services start freaking out. Because we’ve not upgraded to Kilo yet, our version of Keystone middleware is too dumb to request a new token when the old one stops working. This means we have to restart services that talk to Keystone. We ended up re-using our “rabbit node died, reboot OpenStack” script and added glance to the list, since restarting it is fairly harmless even though it doesn’t talk to rabbit. Due to how the timing works, we don’t start this script until puppet is completely done upgrading the single keystone node, so while the script to restart services is quick, it doesn’t start for about 90 seconds after Keystone is ready. This means we have an API outage of 1-2 minutes. For us this is not a big deal; our customers are far more sensitive to “hey, I can’t get to my VM” than to a few minutes of API outage, especially one during a scheduled maintenance window. This could be optimized down substantially if I manually ran the restarts instead of waiting on the full puppet run (that upgrades keystone) to finish.
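The restart script itself is nothing special; stripped to its essence it is roughly the loop below, where the service names are examples from an Ubuntu install and will differ per deployment:

# Bounce the services that cache Keystone tokens so they fetch fresh ones.
for svc in nova-api glance-api glance-registry neutron-server cinder-api heat-api; do
  service "$svc" restart
done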

Once the first node is done we run a full validation suite of V2 and V3 keystone tests. This is the point at which we can decide to go back if needed. The test suite for us took about 2 minutes.

Once we have one node upgraded, OpenStack is rebooted, and validation passes, we deploy the new package and token provider to the rest of the nodes, and they rejoin the cluster one by one. We started in the opposite region so we’d get an endpoint up in the other DC quickly. This is driven by another ansible job that runs puppet on the nodes one by one.

All in all we finished in about 30 minutes, and most of that time was spent sitting around. We then stayed an extra 30 minutes to do a full set of OpenStack regression tests, and everything was okay.

At the end I also truncated the token table to get back all the disk space it was using.

Key Rotation

We are not using any of the built-in Keystone Fernet key rotation mechanisms. This is because we already have a way to get code and config onto all our nodes and did not want to run the tooling on a Keystone node directly. If you do that, you inadvertently declare one node a master and have to write special code to handle this master node in puppet or ansible (or whatever you are using). Instead we decided to store the keys in eyaml in our hiera config. I wrote a simple python script that decrypts the eyaml and then generates and rotates the keys. I then take the output and propose it into our review system. Reviewing eyaml-encrypted keys is somewhat useless, but the human step is there to prevent something dumb from happening. For now we’re only using 3 keys; since our tokens last 2 hours, we can’t do two rotations in under two hours. The reviewer would know the last time a rotation was done and the last time one was deployed. Since we don’t deploy anywhere near a two hour window, this should be okay. Eventually we’ll have Jenkins do this work rather than me. We don’t have any firm plans right now on how often we’ll do the key rotation, probably weekly though.
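Generating a key for the eyaml store doesn’t need Keystone at all; a Fernet key is just 32 random bytes, url-safe base64 encoded, which mirrors what keystone-manage fernet_setup produces. This one-liner is a sketch of that step, not our actual rotation script:

# Generate a new Fernet key suitable for the key repository or an eyaml store.
python -c 'import base64, os; print(base64.urlsafe_b64encode(os.urandom(32)))'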

To answer a question that’s come up: there is no outage when you rotate keys. I’ve done five or six rotations, including a few in the same day, without any issues.

Performance

I will be doing a full post later on about performance once I have more numbers, but the results so far are that token generation is much faster, while validation is a bit slower. Even if it were about the same, the number of problems and database sync issues that not storing tokens in the DB solves makes the switch worthwhile. We’re also going to (finally) switch to WSGI, and I think that will further improve performance.

Aftermath

Today one of my colleagues bought a bottle of Fernet-Branca for us. All I can say is that I highly recommend not doing a shot of it. Switching token providers is way less painful. (Video of said shot is here)

Fernet Tokens for Fun & Profit

I’ve been digging into Fernet tokens this past week and getting ready to switch us over to using them. This is the first in a series of blog posts I plan on writing about them. This one is mainly background on why we’re switching and what we hope to gain. The next post will cover rolling them out, probably in a few weeks. For now we’re running Fernet tokens in our dev environments for more testing while we focus resources on Kilo upgrades.

What are Fernet Tokens?

How do you explain Fernet tokens? Rather than a lengthy treatise on mathematics and identity-management theory, just know this: Fernet tokens use shared private keys to avoid having to store or replicate tokens in your database. This makes them super fast, reduces load on your database, and solves replication lag between data centers and between nodes within a data center. If your manager asks, “they’re faster, smaller, and reduce load on the DB” should suffice. Dolph Mathews has a good write-up on how much faster they are here. You can also dive into the different token formats for comparison in another of his posts, here.

What Issues will this solve for us?

Now, about the DB replication issues… I cannot tell you how much stuff we had to do to deal with database and replication issues with UUID tokens. Here are a few examples:

  • custom cron job to reap expired tokens
  • force DB transactions to a master, despite us being active/active, so tokens would be there when asked for
  • hacks to our cross-region icinga checks to allow time for tokens to replicate, literally a sleep(3)

We’ve even had a service accidentally DOS us by requesting so many tokens the DB couldn’t keep up and keystone ran out of DB threads. Hopefully all this is solved by Fernet tokens.

Will switching cause an outage?

Switching token providers will cause an outage: all the old tokens you’ve issued become 100% useless, so prep accordingly. I will give an update in the next blog post on how long the outage was and what issues we saw when we did it.

Do I need to be on Kilo/Liberty?

  • Horizon – you need a newish copy of django_openstack_auth which I think is in Liberty
  • Keystone – you need to be on Kilo
  • python-keystonemiddleware – it’s best to have at least 1.1.0. If you have 1.0, you MUST restart all OpenStack services after switching tokens
  • Everything Else – Shouldn’t matter!

A note on python-keystonemiddleware: in 1.0.0, if a service (say Nova) can’t use its token for some reason, it won’t try to get a new one until the old one expires. So if you switch to Fernet, you have to restart all OpenStack services that talk to Keystone or they will not work. We already have some ansible to do this, mainly in response to RabbitMQ issues, but it works here too.

How do I get Keys onto the boxes?

All Keystone nodes in your cluster need to have the same keys. Fortunately there is the concept of rotation, so there’s no outage when switching keys; there’s always a key that’s “up next” or “on-deck”, so when you rotate you switch to a key that’s already on every box. As for getting the keys there: I’m going to use puppet to deploy keys that I store in hiera and rotate with a Jenkins job, but there are other ways, like a shared FS or rsync. More details on my method in a later blog post, once I know it works!

How does key rotation work?

What Fernet Rotation Looks Like

If you read through the information on Fernet tokens, key rotation is by far the most confusing part. I’ve sat down with pen and paper and now think I get it, so allow me to explain. I’m going to use a 3 key example here; the keys are named with numbers. I highly encourage you to set up a throwaway Keystone box and use keystone-manage fernet_rotate if you don’t follow this.
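On that throwaway box, watching the key repository before and after a rotation makes the rules below much easier to internalize. This uses the default repository path; adjust the user/group flags to your install:

# Create an initial key repository, then look at the key files.
keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
ls /etc/keystone/fernet-keys/

# Rotate and look again to see which keys were promoted, created, or deleted.
keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
ls /etc/keystone/fernet-keys/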

You need to know 4 rules about how these keys work first:

  1. The highest numbered key is the current signing key.
  2. The 0 key is the key that will become the next signing key.
  3. All other keys are old keys; they’ve been used in the past and there might be old tokens out there still signed with them, depending on your expiration schedule.
  4. New keys are always created as key 0.

Starting position, per the rules above.

  • 0 – this is the on-deck key; after the next rotation, it becomes primary.
  • 1 – this is the old key; it used to be primary, and it’s still here in case any old tokens are still signed with it. Next rotation it gets deleted.
  • 2 – this is the current primary key that’s used for signing.

Now we do a Rotation…

  • 0 becomes 3
  • 1 gets deleted
  • 2 stays 2
  • a new key becomes 0

A Fernet Rotation in Action

So How does this work?

Let’s pretend we have a few tokens, since this is a running OpenStack cluster. All tokens issued before the rotation above are signed with 2. We do the rotation; now new tokens are signed with 3. When a token comes in, Keystone tries both 3 and 2 to decode it, and either should work. At this point we CANNOT rotate again until no active tokens are still signed with 2, because 2 is going to be deleted! This means you need more keys if you plan on rotating more frequently or have a long token expiration time. We’re going to rotate roughly weekly, and we have a 2 hour token timeout, so 3 keys is plenty.

Homework

If you think you get this, try this as a homework problem. Assume that you have max_active_keys set to 5 and that you have 5 keys: 0, 4, 5, 6, 7.

  • Which is the current signing key?
  • Which is on-deck or the next key to be used? What will its number be after the rotation?
  • Which key will be deleted on next rotation?
  • What happens if a token comes in signed with key 5?
  • What happens if a token comes in signed with key 3?

Other Sources

I gathered a lot of this info from trying stuff, but also a lot from blog posts. I’ve referenced two above, but I also want to recommend Lance Bragstad’s blog. Note: Lance’s blog is the only blog in the world where you can read about both quinoa recipes and shotgun shot patterns.
