I’ve been digging into Fernet tokens this past week and getting ready to switch us over to using them. This is the first in a series of blog posts I plan on writing about them. This one will mainly be background on why we’re switching and what we hope to gain. The next post, probably a few weeks out, will cover rolling them out. For now we’re running Fernet tokens in our dev environments for more testing while we focus resources on Kilo upgrades.
What are Fernet Tokens?
How do you explain Fernet tokens? Rather than some lengthy treatise on mathematical and identity management theory, just know this: Fernet tokens use shared secret keys to avoid having to store or replicate tokens in your database. This makes them super fast, reduces load on your database, and sidesteps replication lag between data centers and between nodes within a data center. If your manager asks, “they’re fast, small, and reduce load on the db” should suffice. Dolph Mathews has a good write-up on how much faster they are here. You can also dive into the different token formats for comparison on another of his posts, here.
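To make the “no database” part concrete, here is a toy sketch in Python. It is not the real Fernet format (real Fernet also encrypts the payload, and Keystone packs its own fields into it). It only signs a payload with HMAC, so that any node holding the same key can validate a token without a lookup. The key, the function names, and the payload fields are all made up for illustration.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical shared secret; the idea is that every node holds the same key.
SHARED_KEY = b"not-a-real-key-just-for-illustration"


def issue_token(user_id, ttl_seconds, now=None):
    """Pack the payload and its HMAC into the token itself; nothing is stored."""
    now = time.time() if now is None else now
    payload = json.dumps({"user": user_id, "expires": now + ttl_seconds}).encode()
    signature = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(signature + payload).decode()


def validate_token(token, now=None):
    """Return the payload if the signature matches and the token is unexpired."""
    now = time.time() if now is None else now
    try:
        raw = base64.urlsafe_b64decode(token.encode())
    except ValueError:
        return None
    # SHA-256 HMAC digests are 32 bytes; the rest is the JSON payload.
    signature, payload = raw[:32], raw[32:]
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    data = json.loads(payload)
    return data if data["expires"] > now else None
```

Validation here is pure computation on the token plus the key, which is why there is nothing to replicate and nothing to reap.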
What Issues will this solve for us?
Now about the DB replication issues… I cannot tell you how much stuff we had to do to deal with database and replication issues with UUID tokens. Here are a few examples:
- custom cron job to reap expired tokens
- forcing DB transactions to a single master, despite us being active/active, so tokens would be there when asked for
- hacks to our cross-region icinga checks to allow the tokens to replicate, literally sleep(3)
We’ve even had a service accidentally DoS us by requesting so many tokens that the DB couldn’t keep up and keystone ran out of DB threads. Hopefully all of this is solved by Fernet tokens.
Will this cause an outage to switch to?
Switching token providers will cause an outage. All the old tokens you’ve issued become 100% useless the moment you switch, so prep accordingly. I will give some updates in the next blog post on how long the outage was and what issues we saw when we did it.
Do I need to be on Kilo/Liberty?
- Horizon – you need a newish copy of django_openstack_auth which I think is in Liberty
- Keystone – you need to be on Kilo
- python-keystonemiddleware – it’s best to have at least 1.1.0. If you have 1.0, you MUST restart all OpenStack services after switching tokens
- Everything Else – Shouldn’t matter!
A note on python-keystonemiddleware: in 1.0.0, if a service (say Nova) can’t use its token for some reason, it won’t try to get a new one until the old one expires. So if you switch to Fernet you have to restart all OpenStack services that talk to Keystone or they will not work. We already have some ansible to do this, mainly in response to RabbitMQ issues, but it works here too.
How do I get Keys onto the boxes?
All keystone nodes in your cluster need to have the same keys. Fortunately there is the concept of rotation, so there’s no outage when switching keys: there’s always a key that’s “up next” or “on-deck”, so when you rotate you switch to a key that’s already on every box. As for getting the keys there, I’m going to use puppet to deploy keys that I store in hiera and rotate with a jenkins job, but there are other ways, like a shared FS or rsync. More details on my method in a later blog post, once I know it works!
How does key rotation work?
If you read through the information on Fernet tokens, key rotation is by far the most confusing part. I’ve sat down with pen and paper and now think I get it, so allow me to explain. I’m going to use a 3 key example here; the keys are named with numbers. I highly encourage you to set up a throwaway Keystone box and use keystone-manage fernet_rotate if you don’t follow this.
You need to know 4 rules about how these keys work first:
- The highest numbered key is the current signing key.
- The 0 key is the key that will become the next signing key.
- All other keys are old keys. They’ve been used in the past and there might still be tokens out there signed with them, depending on your expiration schedule.
- New keys are always created as key 0.
Starting position, per the rules above.
- 0 – this is the on-deck key, after the next rotation, it’s primary.
- 1 – this is the old key, it used to be primary, and it’s still here in case any old tokens are still signed with it. Next rotation it gets deleted.
- 2 – this is the current primary key that’s used for signing.
Now we do a Rotation…
- 0 becomes 3
- 1 gets deleted
- 2 stays 2
- a new key becomes 0
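If you’d rather poke at this in code than on paper, here is a small Python model of the rules above. It is my reading of the rules in this post, not Keystone’s actual implementation, and the function names are made up. It only tracks key numbers, not key contents.

```python
def rotate(keys, max_active_keys=3):
    """Return the key numbers present after one rotation.

    `keys` is a collection of integer key names, with 0 as the on-deck key.
    """
    keys = set(keys)
    if 0 not in keys:
        raise ValueError("expected an on-deck key named 0")
    # The on-deck key 0 is promoted to the new highest number (new primary).
    keys.remove(0)
    keys.add(max(keys, default=0) + 1)
    # A freshly generated key takes over the on-deck slot.
    keys.add(0)
    # Drop the oldest old keys until we are back within the limit.
    while len(keys) > max_active_keys:
        keys.remove(min(k for k in keys if k != 0))
    return sorted(keys)


def primary(keys):
    """The highest numbered key is the one currently used for signing."""
    return max(keys)


keys = rotate([0, 1, 2])
print(keys, "primary:", primary(keys))  # [0, 2, 3] primary: 3
```

Calling rotate() a few times in a row with different max_active_keys values is a quick way to see which key falls off the end each time.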
So How does this work?
Let’s pretend we have a few tokens, since this is a running OpenStack cluster. All tokens issued before the rotation above are signed with 2. We do the rotation, and now new tokens are signed with 3. When a token comes in, Keystone tries both 3 and 2 to decode it, and either should work. At this point we CANNOT rotate again until no active tokens are signed with 2, because the next rotation deletes 2! This means you need more keys if you plan on rotating more frequently or have a long token expiration time. We’re going to rotate roughly weekly, and we have a 2 hour token timeout, so 3 keys is plenty.
If you think you get this, try this homework problem. Assume that you have max_active_keys set to 5 and that you have 5 keys: 0, 4, 5, 6, 7.
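The “how many keys do I need” question can be turned into a bit of arithmetic. The sketch below is a rule of thumb that follows from the rules above, not something pulled from Keystone itself: you need one signing key, one on-deck key, and enough old keys that a key is not deleted while a token signed with it can still be alive. The function name is made up.

```python
import math


def keys_needed(token_lifetime_hours, rotation_interval_hours):
    """Estimate the smallest safe max_active_keys for a given schedule."""
    # With N old-key slots, a key survives N rotations after it stops being
    # the signing key, so N rotation intervals must cover the token lifetime.
    old_keys = math.ceil(token_lifetime_hours / rotation_interval_hours)
    # Plus the current signing key and the on-deck key (0).
    return old_keys + 2


print(keys_needed(2, 7 * 24))  # 3
print(keys_needed(24, 6))      # 6
```

The first call is our setup: a 2 hour token timeout with weekly rotation. The second shows how quickly the number grows if you rotate more often than tokens expire.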
- Which is the current signing key?
- Which is on-deck or the next key to be used? What will its number be after the rotation?
- Which key will be deleted on next rotation?
- What happens if a token comes in signed with key 5?
- What happens if a token comes in signed with key 3?
I gathered a lot of this info from trying stuff but also a lot from blog posts. I’ve referenced two above, but I also want to recommend Lance Bragstad’s blog. Note, Lance’s blog is the only blog in the world where you can read about quinoa recipes and shotgun shot patterns.