This post is a follow-up to my previous post about Fernet Tokens, which you may want to read first.
Last night we upgraded our production OpenStack to a new version of keystone off of master from a couple weeks ago and at the same time switched on Fernet tokens. This is after we let the change soak in our dev and staging environments for a couple weeks. We used this time to assess performance, look for issues, and figure out our key rotation strategy.
All of our upgrade process is run via ansible. We cherry-pick the change which includes pointing to the repo with the new keystone along with enabling the Fernet tokens and then let ansible drive puppet to upgrade and switch providers. During the process, we go down to a single keystone node because it simplifies the active/active database setup when running migrations. So when this node is upgraded we take a short outage as the package is installed and then the migrations run. This took about 16 seconds.
Once this is done, the other OpenStack services start freaking out. Because we’ve not upgraded to Kilo yet, our version of Keystone middleware is too dumb to request a new token when the old one stops working, so we have to restart the services that talk to Keystone. We ended up re-using our “rabbit node died, reboot OpenStack” script and added glance to the list, since restarting it is fairly harmless even though it doesn’t talk to rabbit. Due to how the timing works, we don’t start this script until puppet is completely done upgrading the single keystone node, so while the script to restart services is quick, it doesn’t start for about 90 seconds after Keystone is ready. This means that we have an API outage of 1-2 minutes. For us, this is not a big deal: our customers are way more sensitive to “hey I can’t get to my VM” than to a few minutes of API outage, especially one that’s during a scheduled maintenance window. This could be optimized down substantially if I manually ran the restarts instead of waiting on the full puppet run (that upgrades keystone) to finish.
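For illustration only, here is a minimal sketch of what a "restart everything that talks to Keystone" helper could look like. This is not our actual script: the service names in `SERVICES` are assumptions, and `restart_commands` and `restart_all` are hypothetical helpers. It defaults to a dry run so you can see what it would do first.

```python
import subprocess
from typing import List

# Assumed service names; substitute whatever talks to Keystone in your deployment.
SERVICES = ["nova-api", "neutron-server", "cinder-api", "glance-api"]


def restart_commands(services: List[str]) -> List[List[str]]:
    """Build one `service <name> restart` command per service."""
    return [["service", name, "restart"] for name in services]


def restart_all(services: List[str], dry_run: bool = True) -> None:
    """Print the restart commands, or run them when dry_run is False."""
    for cmd in restart_commands(services):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first restart that fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    restart_all(SERVICES)
```

Kicking off something like this as soon as Keystone answers again, rather than at the end of the puppet run, is where the 90 seconds could be won back.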
Once the first node is done we run a full validation suite of V2 and V3 keystone tests. This is the point at which we can decide to go back if needed. The test suite for us took about 2 minutes.
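Our validation suite is much bigger than this, but the most basic V3 check is "can I get a token and does it validate". A rough sketch of that check using only the standard library is below. The function names, the URL, and the credentials are placeholders I made up for the example; the request body and the `X-Subject-Token` / `X-Auth-Token` headers follow the Keystone V3 API.

```python
import json
import urllib.request


def v3_password_auth_body(username: str, password: str, domain_id: str = "default") -> dict:
    """Build the JSON body for an unscoped V3 password authentication."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"id": domain_id},
                        "password": password,
                    }
                },
            }
        }
    }


def request_token(keystone_url: str, username: str, password: str) -> str:
    """POST /v3/auth/tokens and return the token from X-Subject-Token."""
    body = json.dumps(v3_password_auth_body(username, password)).encode("utf-8")
    req = urllib.request.Request(
        keystone_url.rstrip("/") + "/v3/auth/tokens",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.headers["X-Subject-Token"]


def validate_token(keystone_url: str, token: str) -> bool:
    """GET /v3/auth/tokens for the token; urlopen raises HTTPError if it is rejected."""
    req = urllib.request.Request(
        keystone_url.rstrip("/") + "/v3/auth/tokens",
        headers={"X-Auth-Token": token, "X-Subject-Token": token},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status == 200
```

Usage would be along the lines of `validate_token(url, request_token(url, "demo", "secret"))` against the freshly upgraded node, before letting the rest of the cluster follow.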
Once we have one node upgraded, OpenStack is rebooted, and validation passes, we then deploy the new package and token provider to the rest of the nodes and they rejoin the cluster one by one. We started in the opposite region so we’d get an endpoint up in the other DC quickly. This is driven by another ansible job that runs puppet and does the nodes one by one.
All in all we finished in about 30 minutes, and most of that time was spent sitting around. We then stayed an extra 30 minutes to run a full set of OpenStack regression tests and everything was okay.
At the end I also truncated the token table to get back all the disk space it was using.
We are not using any of the built-in Keystone Fernet key rotation mechanisms. This is because we already have a way to get code and config onto all our nodes and did not want to run the tooling on a keystone node directly. If you do this, then you inadvertently declare one node a master and have to write special code to handle this master node in puppet or ansible (or whatever you are using). Instead we decided to store the keys in eyaml in our hiera config. I wrote a simple python script that decrypts the eyaml and then generates and rotates the keys. I then take the output and propose it into our review system. Reviewing eyaml encrypted keys is somewhat useless, but the human step is there to prevent something dumb from happening. For now we’re only using 3 keys. Since our tokens last 2 hours, that means we can’t do two rotations in under two hours: a token issued with the current primary key still validates after one rotation, because that key is kept as a secondary, but a second rotation drops the key while the token may still be live. The reviewer would know the last time a rotation was done and the last time one was deployed. Since we don’t deploy anywhere near a two hour window, this should be okay. Eventually we’ll have Jenkins do this work rather than me. We don’t have any firm plans right now on how often we’ll do the key rotation, probably weekly though.
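To make the rotation logic concrete, here is a minimal sketch of the idea. It is not my actual script (that one also handles the eyaml decryption and encryption, which is left out here). It assumes the same layout Keystone's own tooling uses: key `0` is the staged key, the highest index is the primary, and everything in between is a secondary. The names `generate_key`, `rotate`, and `safe_to_rotate` are made up for the example.

```python
import base64
import os
from datetime import datetime, timedelta
from typing import Dict


def generate_key() -> str:
    """Return a new Fernet key: 32 random bytes, URL-safe base64 encoded."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


def rotate(keys: Dict[int, str], max_active_keys: int = 3) -> Dict[int, str]:
    """Promote the staged key (index 0) to primary and stage a fresh key.

    The oldest secondary keys are dropped until at most max_active_keys remain.
    """
    if 0 not in keys:
        raise ValueError("no staged key (index 0) to promote")
    new_primary_index = max(keys) + 1
    rotated = {index: key for index, key in keys.items() if index != 0}
    rotated[new_primary_index] = keys[0]  # staged key becomes the primary
    rotated[0] = generate_key()           # new staged key
    while len(rotated) > max_active_keys:
        oldest = min(index for index in rotated if index != 0)
        del rotated[oldest]               # purge the oldest secondary
    return rotated


def safe_to_rotate(last_rotation: datetime, now: datetime,
                   token_lifetime: timedelta = timedelta(hours=2)) -> bool:
    """With 3 keys, rotate only once a full token lifetime has passed."""
    return now - last_rotation >= token_lifetime


if __name__ == "__main__":
    keys = {0: generate_key(), 1: generate_key()}
    for index, key in sorted(rotate(keys).items()):
        print(index, key)
```

Running `rotate` twice on the same key set shows why the two hour rule matters: the key that was primary before the first rotation survives it as a secondary, and is gone after the second.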
To answer a question that’s come up: there is no outage when you rotate keys. I’ve done five or six rotations, including a few in the same day, without any issues.
I will be doing a full post later on about performance once I have more numbers, but the results so far are that token generation is much faster, while validation is a bit slower. Even if performance were about the same, the number of problems and database sync issues that not storing tokens in the DB solves makes Fernet tokens worthwhile. We’re also going to (finally) switch to WSGI and I think that will further enhance performance.
Today one of my colleagues bought a bottle of Fernet-Branca for us. All I can say is that I highly recommend not doing a shot of it. Switching token providers is way less painful. (Video of said shot is here)