Tag Archives: keystone

Token Revocation Performance Improvements in Keystone Ocata

The awesome Keystone team has been working hard in Ocata to improve overall keystone performance when token revocations are present. If you’ve not read my previous post on why this is an issue, you should start there. Here’s the tl;dr background: when any token revocations are present in your database, token validation performance suffers, and suffers greatly. Token validations are at the heart of your cloud: every single OpenStack API call requires a token validation. Revocations happen when a user or project is deleted, a token revoke API call is made, or, until recently, when someone logged out of Horizon.

So here’s the simplified version of this path: revoked tokens slow down token validation, which slows down all OpenStack API calls; ergo, revoked tokens slow down your OpenStack APIs.

Here is what this looks like in Liberty. Can you see when our regression tests run and generate revocations?

Can you tell in Cinder when we have revoked tokens?

Fortunately the team focused on fixing this in Ocata and the good news is that it seemed to work. In Ocata (currently on master) there is now no longer a correlation between revoked tokens and token validation performance.

Experimental Setup

The experimental setup is the same as my previous post, except with different software. The nodes are running keystone in docker containers with uwsgi, using stable/newton as of Nov 12 2016. The Ocata round is using master as of commit 498d700c. Both tests are using Fernet tokens with caching.

Results

Validations Per Second as a Function of Number of Revocations

The first chart shows the number of token validations that can be completed per second. For this test more is better; it means more validations get pushed through and the test completes faster.

As you can see, we no longer have the exponential decay, which is good. Now the rate is steady and we will not see the spike in timings after our regression tests run. You may also notice that the base rate is a bit slower in Ocata. If you never have any token revokes this may be concerning, but this timing is still pretty fast. As I said before, I was doing 20 threads at a time; if this were raised to 50 or 100 the rate would be much higher. In other words, this is not a performance ceiling you are seeing, just a comparison of Newton to Ocata under the same conditions.

99th Percentile Time to Complete a Validation

This chart examines the same data in a different way: it shows the time in milliseconds within which 99% of the token validations complete. In this chart, lower is better. You can see a linear progression in the amount of time to complete a token validation. In Newton, by the time you have 1000 revocations, validating a token goes from 99ms to 1300ms.


More Fixes to Come

This work is great news for predictable keystone token performance. I won’t have to tell anyone to go truncate the revocation_event table when things get slow, and we shouldn’t have graph spikes anymore. There is also more work to come: the Keystone team is working on more fixes and improvements in this area. You can track the progress of that here: https://review.openstack.org/#/q/project:openstack/keystone+branch:master+topic:bug/1524030


New PCI-DSS Features in Keystone Newton

Keystone Newton offers up some new PCI-DSS features which will help secure your Keystone deployment. I recently built a Newton dev environment (we’re running Mitaka still for the next few weeks) and tried them out. All of these features are disabled by default (commented out in the config file) and need to be enabled in order to use them.

Before diving into this post you may want to read the docs on the new features which are available here. These new features are not mentioned in the Newton release notes or config guides (bug). However, you can look at the generated config files, which explain more about the settings, here. Also note that if you have a custom identity driver or you use the ldap identity driver, some of these features will not work (the password expiry/complexity stuff should still work); you need to use the standard sql identity driver. Finally, when you consider my conclusions, please keep in mind that I’m not a PCI-DSS expert, but I do know how to deploy and run OpenStack clouds.

And now the features…

Account Lockout

The first feature we will cover is lock-out. If a user fails N login attempts they will be disabled. They will be disabled for the specified duration (in seconds) and if that is not set they will be disabled indefinitely. Let’s try this one out:

Modify your config file and set:

lockout_failure_attempts=3
lockout_duration=1800

Then bounce keystone.

[DEV] root@dev01-keystone-001:~# service keystone restart
...

Next create a user and give them a role in a project:

[DEV] root@dev01-build-001:~# openstack user create --password secretpass bob
+---------------------+----------------------------------+
| Field               | Value                            |
+---------------------+----------------------------------+
| domain_id           | default                          |
| enabled             | True                             |
| id                  | 630b22fe7f814feeb5a498dc124d814c |
| name                | bob                              |
| password_expires_at | None                             |
+---------------------+----------------------------------+
[DEV] root@dev01-build-001:~# openstack role add --user bob --project admin _member_

Now let’s have bob use the wrong password a few times and get locked out:

[DEV] root@dev01-build-001:~# export OS_PASSWORD=wrongpassword
[DEV] root@dev01-build-001:~# openstack token issue
The request you have made requires authentication. (HTTP 401) (Request-ID: req-077dc218-65dd-44d0-ba23-2020abb125c3)
[DEV] root@dev01-build-001:~# openstack token issue
The request you have made requires authentication. (HTTP 401) (Request-ID: req-abf8e31f-a846-4ba3-9b08-8a60fcc48b5c)
[DEV] root@dev01-build-001:~# openstack token issue
The request you have made requires authentication. (HTTP 401) (Request-ID: req-e19a2d8b-4bb7-4ce2-bd1a-590facf2e149)
[DEV] root@dev01-build-001:~# openstack token issue
The account is locked for user: 630b22fe7f814feeb5a498dc124d814c (HTTP 401) (Request-ID: req-1b891922-01f0-485e-91f2-52bb1b6c3f79)

So bob is now locked out. One thing that surprised me is that bob is not in fact disabled if you look at his user object (openstack user show bob); he is just locked out. The lockout is a separate entry in the local_user table. In order to unlock bob, you need to wait for the lock to expire or manually change the database.

mysql> select * from local_user where name="bob";
+----+----------------------------------+-----------+------+-------------------+---------------------+
| id | user_id                          | domain_id | name | failed_auth_count | failed_auth_at      |
+----+----------------------------------+-----------+------+-------------------+---------------------+
| 22 | 630b22fe7f814feeb5a498dc124d814c | default   | bob  |                 3 | 2016-11-09 16:21:41 |
+----+----------------------------------+-----------+------+-------------------+---------------------+
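
If you can’t wait out the lockout_duration, a manual reset is something along these lines (a sketch; I’m assuming resetting failed_auth_count is all the lockout check looks at, so test it in dev first):

mysql -u root keystone -e "update local_user set failed_auth_count=0, failed_auth_at=NULL where name='bob';"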

Interestingly once locked you can’t change your password either:

[DEV] root@dev01-build-001:~# openstack user password set
The account is locked for user: 630b22fe7f814feeb5a498dc124d814c (HTTP 401) (Request-ID: req-d6433510-615f-48af-b2ea-b4fd3b8407c7)

DOSing Your Cloud for Fun & Profit

To be honest this feature scares me because of the way it can DOS your cloud. The first way: what if someone (or some tool) misconfigures a service account for, say, nova? Nova tries to do its job and starts trying to get tokens. After it fails 3 times, you’ve now essentially bricked nova for 30 minutes. Here’s a worse way: what if I just pretend I’m nova by setting OS_USERNAME and trying to get a token?

[DEV] root@dev01-build-001:~# export OS_TENANT_NAME=services
[DEV] root@dev01-build-001:~# export OS_USERNAME=nova
[DEV] root@dev01-build-001:~# export OS_PASSWORD=evil_hax0r_dos_nova
[DEV] root@dev01-build-001:~# openstack token issue
The request you have made requires authentication. (HTTP 401) (Request-ID: req-47390107-236b-4317-b97a-b5e811af87b2)
[DEV] root@dev01-build-001:~# openstack token issue
The request you have made requires authentication. (HTTP 401) (Request-ID: req-99f88665-d653-4e73-8e03-1b897e14440f)
[DEV] root@dev01-build-001:~# openstack token issue
The request you have made requires authentication. (HTTP 401) (Request-ID: req-7ca5878a-f633-4f23-9b67-fdb5bec13dcf)
[DEV] root@dev01-build-001:~# openstack token issue
The account is locked for user: 8935e905d8ab416d96509022fa6c00d1 (HTTP 401) (Request-ID: req-cbff7497-88c2-45ce-bf9e-14bfee9d1e78)

Nova is now broken. Repeat for all services or even better lock the admin account or lock the account of the guy next to you who chews too loudly at lunch…

So with that in mind, here’s some things to consider before enabling this:

  • The failed_auth_count resets anytime there is a successful login.
  • There’s no method I saw for an admin to unlock the user without hacking the database.
  • When the user is locked they can’t change their password.
  • This feature can DOS your cloud.

Inactive Users

The next feature added is the ability to disable users who have not logged in within the past N days. To enable it, set the option and restart keystone; I will choose 1 day for this example.

disable_user_account_days_inactive = 1

Once enabled Keystone will begin to track last login with day granularity in the user table.

+----------------------------------+---------+---------------------+----------------+
| id                               | enabled | created_at          | last_active_at |
+----------------------------------+---------+---------------------+----------------+
| 630b22fe7f814feeb5a498dc124d814c |       1 | 2016-11-09 15:51:35 | 2016-11-09     |
+----------------------------------+---------+---------------------+----------------+

Since this has day granularity only, you will end up with an effective timeout of less than the number of days you put in. In other words, although bob logged in above at 3pm, he will get expired at 12:01AM the next day, not 3pm the next day. (Also, don't set this to 1 day; this is just an experiment.) I was too impatient to wait a whole day, so I just hacked the DB so it looked like bob hadn't logged in since yesterday. Just like the failed-logins feature, the account ends up disabled.
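
For reference, the DB hack was along these lines (the date is made up; last_active_at is the column shown in the table above):

mysql -u root keystone -e "update user set last_active_at='2016-11-08' where id='630b22fe7f814feeb5a498dc124d814c';"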

[DEV] root@dev01-build-001:~# openstack token issue
The account is disabled for user: 630b22fe7f814feeb5a498dc124d814c (HTTP 401) (Request-ID: req-8ce84bd3-911e-4bab-a972-664ff4ff03f1)

Consider that if you have both this and the failed-logins feature enabled, you could end up with a user who is essentially disabled twice: first for failing logins and then for not having successfully logged in.

Password Expiration

With Keystone Newton, all user objects returned by the v3 APIs include a password expiry field by default. The field is null unless this feature is enabled, however. Here’s what it looks like:

{"user": {"password_expires_at": null, "links": {"self": "http://dev01-admin.os.cloud.twc.net:35357/v3/users/3607079988c140a3a49311a2c6b75f86"}, "enabled": true, "email": "root@localhost", "id": "3607079988c140a3a49311a2c6b75f86", "domain_id": "default", "name": "admin"}}

To enable this feature, we set “password_expires_days = 1” and bounce keystone, but we also need to change the existing password or make a new user. This setting is not retroactive on existing accounts and passwords.

So to make this take full effect we will reset bob’s password.

[DEV] root@dev01-build-001:~# openstack user password set
Current Password:
New Password:
Repeat New Password:

And then look at his user object:

[DEV] root@dev01-build-001:~# openstack user show bob
+---------------------+----------------------------------+
| Field               | Value                            |
+---------------------+----------------------------------+
| domain_id           | default                          |
| enabled             | True                             |
| id                  | 630b22fe7f814feeb5a498dc124d814c |
| name                | bob                              |
| password_expires_at | 2016-11-10T17:28:49.000000       |
+---------------------+----------------------------------+

Note that unlike the inactive user settings this stores data with second-level granularity.

Interestingly, when the password was updated Keystone kept the old one in the DB and just expired it, so in the DB bob has two passwords: one valid, one expired. This is new behavior.

mysql> select * from password where local_user_id=22;
+----+---------------+-------------------------------------------------------------------------------------------------------------------------+---------------------+--------------+---------------------+
| id | local_user_id | password                                                                                                                | expires_at          | self_service | created_at          |
+----+---------------+-------------------------------------------------------------------------------------------------------------------------+---------------------+--------------+---------------------+
| 23 |            22 | $6$rounds=10000$eif6VK1cIZshn4v.$4B5zaiGTiTc7BFD5OAP0uro0HAdNzh0SttL6Lt3CjYKDt8Esvt./y3rTlQS7XTqVhhVJvpvpxb7UDeeATZVxn1 | 2016-11-09 17:28:50 |            0 | 2016-11-09 15:51:35 |
| 24 |            22 | $6$rounds=10000$anOfjEPsi92JXxO2$HySHiUPI6JI4wmWoRJkMc7X4lvOgFbXu.AXTSBbAuYwYUUEjbcg4xhqkrjlAKFXhM0Mbvb/J0pzQXv1uq65mD. | 2016-11-10 17:28:49 |            1 | 2016-11-09 17:28:50 |
+----+---------------+-------------------------------------------------------------------------------------------------------------------------+---------------------+--------------+---------------------+

Now let’s try to log in using the new password we set:

[DEV] root@dev01-build-001:~# export OS_PASSWORD=newsecretpass
[DEV] root@dev01-build-001:~# openstack token issue
The password is expired and needs to be reset by an administrator for user: 630b22fe7f814feeb5a498dc124d814c (HTTP 401) (Request-ID: req-ef525f73-d6bf-4f07-b70f-34e7d1ae29c2)

This is a clear error message that the user will hopefully see and take action on. Note that the message says an admin has to do the reset; bob cannot reset his own password once it expires.

There is also a related setting here called password_expires_ignore_user_ids which exempts certain user ids from password expiry requirements. You would probably set this for your service accounts. Note that this takes ids, not names, which may make your deployment automation more complex since ids vary across environments.

Password Complexity & Uniqueness

One of the most requested features was enforcing password complexity for users, and with Keystone Newton it is now available. There are two settings here: the first is a regex for complexity and the second is a user-readable description of that regex, which is used in error messages. Let’s set both.

As an admin, I’ve decided that all passwords must have “BeatTexas” in them because I’m a WVU fan and they play on Saturday. So let’s set the settings, bounce keystone, and try it:

password_regex = ^.*BeatTexas.*$
password_regex_description = "passwords must contain the string BeatTexas"

Now let’s have bob try to set his password:

[DEV] root@dev01-build-001:~# openstack user password set --original-password newsecretpass --password HookEmHorns
Password validation error: The password does not meet the requirements: passwords must contain the string BeatTexas (HTTP 400) (Request-ID: req-9bfcf550-e822-4053-b566-f08f3b52e26e)

UT fan denied! But it works if you follow my rules:

[DEV] root@dev01-build-001:~# openstack user password set --original-password newsecretpass --password BeatTexas100-0

Password Uniqueness/Frequency of Change

Password uniqueness works along the same lines as the complexity rules; it will keep you from setting your password to something that was previously used. If you recall from above, Keystone stores old passwords even when this setting is not enabled. If you have a lot of users who change passwords frequently, you may want to prune the password table at some point.

unique_last_password_count = 5

You can also limit how often a password can be changed; this is set in days. Be careful that this value is less than the password expiry period.

minimum_password_age = 1
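
For reference, here is everything from this post gathered in one place, the way I believe it lands in keystone.conf (in Newton these options live under the [security_compliance] section, if I’m reading the generated sample config right; the values below are just the toy values from my experiments, not recommendations):

[security_compliance]
lockout_failure_attempts = 3
lockout_duration = 1800
disable_user_account_days_inactive = 1
password_expires_days = 1
password_expires_ignore_user_ids = <your service account ids>
password_regex = ^.*BeatTexas.*$
password_regex_description = "passwords must contain the string BeatTexas"
unique_last_password_count = 5
minimum_password_age = 1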

User Messaging and Tooling

One thing that is unclear here is how the user will know that they are disabled and what they should do about it, especially if they don’t use the CLI. Once disabled, the user cannot tell if it is from bad logins or from inactivity. Will Horizon show any of this? Answer: yes it will: https://review.openstack.org/#/c/370973/

Additionally, as an operator you will probably also want a report of people who are disabled or about to be disabled in the next week so you can email them. There will need to be Horizon changes, user education, and tooling around these features.
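
As a starting point for that kind of report, something like this would list users whose passwords expire within the next week (a sketch against the password and local_user tables shown earlier; verify the join against your schema before relying on it):

mysql -u root keystone -e "select lu.name, p.expires_at from local_user lu join password p on p.local_user_id = lu.id where p.expires_at between now() and date_add(now(), interval 7 day);"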

Summary

Keystone has added some long-requested PCI-DSS features in Newton. I’m happy with the direction these are moving but you should use caution and consider tooling and user messaging before enabling them. Look for more improvements on these in the future.


Keystone Token Performance: Liberty vs Mitaka

A number of performance improvements were made in Keystone Mitaka, including caching the catalog, which should make token creation faster according to the Keystone developers. In this blog post, I will test this assertion.

My setup is specific to how I run keystone; you may be using different token formats, different backends, different web servers, and a different load balancer architecture. The point here is just to test Mitaka vs Liberty in my setup.

Keystone Setup

I’m running a 3 node Keystone cluster on virtual machines running in my OpenStack cloud. The nodes are fronted by another virtual machine running haproxy. The keystone cluster is using round-robin load balancing. I am requesting the tokens from a third virtual machine via the VIP provided by haproxy. The keystone nodes have 2 VCPUs + 4G RAM.

Keystone is running inside a docker container, which runs uwsgi. uwsgi has 2 static threads.

  • The Mitaka code is based on stable/mitaka from March 22, 2016.
  • The Liberty code is based on stable/liberty from March 16, 2016.

Note: I retested again with branches from April 17 and April 15 respectively, results were the same.

Keystone is configured to use Fernet tokens and the mysql backend.

I did not rebuild the machines; the Mitaka runs are based on nodes upgraded from Liberty to Mitaka.

Experimental Setup

I am doing 20 benchmark runs against each setup, delaying 120 seconds in between each run. The goal here is to even out performance changes based on the fact that these are virtual machines running in a cloud. The tests run as follows:

  • Create 200 tokens serially
  • Validate 200 tokens serially
  • Create 1000 tokens concurrently (20 at once)
  • Validate 500 tokens concurrently (20 at once)

The code for running these benchmarks, which I borrowed from Dolph Mathews and made a bit easier to use, is available on github. Patches welcome.

Results

Is Mitaka faster? Answer: No.

Something is amiss in Mitaka Fernet token performance and there is a serious degradation here.

The charts tell the story. Each chart below shows how many requests per second can be handled, and concurrent validation is the most concerning because this is the standard model of what a cloud is doing: dozens of API calls being made at once to tens of services, and each one wants to validate a token.

Liberty vs Mitaka: No Caching

So you can see that concurrent validation is much slower. Let’s also compare with memcache enabled:

Liberty vs Mitaka with Caching

Let’s look at raw data which is more damning due to the scale on the charts:

Data

Notes

I spent some time thinking about why this might be slower and I see one clue: the traffic to memcache (shown using the stats command) in Mitaka is 3-4x what it is in Liberty. Perhaps Keystone is caching too much or too often? I don’t really know, but that is an interesting difference.
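
If you want to watch the memcache counters yourself, something like this works (assuming memcached is on the default localhost:11211; nc flags vary by distro):

# memcached-tool ships with the memcached package
memcached-tool localhost:11211 stats | grep -E "cmd_get|cmd_set|get_hits|get_misses"
# or use the raw stats command
echo stats | nc -q 1 localhost 11211 | grep -E "cmd_get|cmd_set"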

I’m hopeful that this gets fixed or looked at in Newton and backported to Mitaka.

Possible sources of error:

  • These are VMs, could have noisy neighbors. Mitigation: Run it a whole lot. Re-run on new VMs. Run at night.
  • Revocation events. Mitigation: Check and clear revocation table entries before running perf tests.

I’d really like someone else to reproduce this, especially using a different benchmarking tool to confirm this data. If you did, please let me know.


Consuming Keystone CADF Events From RabbitMQ

This started with a simple requirement: “I’d like to know when users or projects are added or removed and who did the action”

As it turns out there’s no great way to do this. Sure you can log it when a user is deleted:

"DELETE /v2.0/users/702b12ec7f0e4f7d93945eebb95705e1 HTTP/1.1" 204 - "-" "python-keystoneclient"

The only problem is that ‘702b12ec7f0e4f7d93945eebb95705e1’ is meaningless without the DB entry which is now conveniently gone.

But if you had an async way to get events from Keystone, you could solve this yourself. That was my idea with my Keystone CADF Event Logger tool. Before we dive into the tool, some quick background on CADF events. You can read the DMTF mumbo-jumbo at the link in the previous sentence, but just know, Keystone CADF events log anything interesting that happens in Keystone. They also tell you who did it, from where they did it, and when they did it. All important things for auditing. (This article from Steve Martinelli has some more great background)

So how does this solve my problem? CADF events still just log ids, not names. My solution was a simple rabbit-consuming async daemon that caches user and project names locally and uses them to do lookups. Here’s an example of what it does:

Logs user auth events

Note that V2 doesn’t log much info on these, although that is fixed in Liberty I believe.

INFO 2015-09-24 15:09:27.172 USER AUTH: success: nova
INFO 2015-09-24 15:09:27.524 USER AUTH: success: icinga
INFO 2015-09-24 15:09:27.800 USER AUTH: success: neutron
INFO 2015-09-24 15:09:27.800 USER AUTH: failure: neutron

Logs user/project CRUD events

Note again that V2 issues with Kilo leave us with less than full info.

USER CREATED: success: user ffflll at 2015-09-18 16:00:10.426372 by unknown (unknown) (project: unknown (unknown)).
USER DELETED: success: user ffflll at 2015-09-18 16:02:13.196172 by unknown (unknown) (project: unknown (unknown)).

Figures it out when rabbit goes away

INFO 2015-11-11 20:46:59.325 Connecting to 1.2.3.4:5672
ERROR 2015-11-11 22:16:59.514 Socket Error: 104
WARNING 2015-11-11 22:16:59.515 Socket closed when connection was open
WARNING 2015-11-11 22:16:59.515 Disconnected from RabbitMQ at top-secret-internal-company-url.com:5672 (0): Not specified
WARNING 2015-11-11 22:16:59.516 Connection closed, reopening in 5 seconds: (0) Not specified

Setup

This requires that Keystone is configured to talk to rabbit and emit CADF events. The previously referenced blog from Steve Martinelli has good info on this. Here’s what I set:

notification_format=cadf
notification_driver=messaging
notification_topics=keystone_to_cadf_logger

This code also assumes that /var/log/keystone_cadf exists and is writable. I set this up with puppet in my environment.

You should ensure Keystone is talking to Rabbit and has made the queues and exchanges before trying the program.
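
A quick sanity check for that looks something like this (the queue and exchange names will vary with your notification_topics setting, so treat the greps as examples):

rabbitmqctl list_exchanges | grep keystone
rabbitmqctl list_queues | grep cadf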

Usage

I designed this to run in a docker container, which explains the overly full requirements.txt; you can probably get away with the requirements.txt.ORIG. After you build it (python ./setup.py build && python ./setup.py install), just run it by passing in creds for Keystone and for RabbitMQ. You can also use environment variables, which is how I ran it in my docker container.

source openrc
keystone-cadf-logger --rabbit_user rabbit --rabbit-pass pass1 --rabbit-host dev-lb.twc.net

Issues

So what issues exist with this? First some small ones. The code that parses the events is horrible and I hate it, but it worked. You can probably improve it. Second, the big issue. In our environment this code introduced a circular dependency between our control nodes, where rabbit runs, and our keystone nodes, which now need to talk to rabbit. For this reason we ended up not deploying this code, even though I had all the puppet and docker portions working. If you don’t have this issue, then this code will work well for you. I also don’t have much operating experience with it; it might set all your disks on fire and blow up in spectacular fashion. I planned to deploy it to our dev environment and tweak things as needed. So if you operate it, do so cautiously.

Tweaks

If you are interested in more event types, just change the on_message code. You might also want to change the action that happens. Right now it just logs, but how about emailing the team anytime a user is removed or noting it in your team chat.

Conclusion

This code consists of a few parts and I hope at least some of it is useful to someone. It was fun to write and I was a bit disappointed that we couldn’t fully use it, but I hope that something in here, even if it’s just the async rabbit code, might be useful to you. But what about our requirement? Well, we’ll probably still log CADF events locally on the Keystone node and consume them, or we might write a pipeline filter that does something similar; whatever we decide, I will update on this site. So please pull the code and play with it!

Github Link


Keystone Token Revocations Cripple Validation Performance

Having keystone token revocation events cripples token validation performance. If you’ve been following any of the mailing lists posts on this topic, then you already know this since it’s been discussed (here) and (here). In this post I explore the actual impact and discuss what you can do about it.

What Are Revocations?

A token is revoked for any number of reasons, but basically when it’s revoked, it’s invalid. Here are some of the reasons that revocation events will be generated:

  • The token is intentionally invalidated via the API
  • A user is deleted
  • A user has a role removed
  • A user is removed from a project
  • A user logs out of Horizon
  • A user switches projects in Horizon

Of these events the last two are by far the most common reasons that revocation events are being generated in your cloud.

How Are Revocation Events Used?

How this works varies some based on the token type, but let's assume that a token comes in that is non-expired. We know that either from decrypting it (Fernet) or from looking it up in the DB (UUID). But before Keystone can bless the token, it needs to check the revocation table to ensure that the token is still valid. So it loads the table called revocation_event and takes a peek. Also, when it does this load, Keystone does a little house-keeping and removes any revocation events for tokens that have already expired. A revocation event lives as long as the token it applies to; it does not make sense to keep a 3-hour-old revocation event when the longest a token can live is 1 hour. The unfortunate thing with this algorithm is that it locks the table, slowing down other operations even more, and if it takes too long it leads to deadlocks and failed API calls.

Why Should You Care About Token Validation?

Keystone token validation underlies every single API call that OpenStack makes. If keystone token validation is slow, everything is slow. Validation takes place for example when you make a nova call, nova has to be sure that the token is okay first before performing the action.

Results

If you want to see the experimental setup, skip below, but most of you will want the numbers first!

The chart below shows two runs of the benchmark which checks concurrent token validations. You will see that as soon as you have revocation events, performance falls significantly. There are two lines on the chart. The first line, in blue, is our current packaged version of Keystone, which is Kilo++/Liberty. The second line, in red, shows the performance of a version of Liberty from July 17 with this patch applied. The hope with the patched code was that smarter use of deletes would improve performance; it does not, in any measurable way. It may however reduce deadlocks, but I am unable to validate that since my environment is not under any real load.

Benchmark Results

Note: Do not put too much stock into the fact that the red line starts slower than the blue; instead focus on the shape of the curve. There are too many possible variables in my testing (like what my hypervisor is doing and all the other changes between versions) to compare them apples to apples.

Experimental Setup

For the experimental setup all the systems are guests running in our production cloud built using vagrant-openstack and our standard puppet automation code. The nodes are as follows:

  • 3 keystone nodes
  • 1 haproxy load balancer
  • a puppet master, which also runs the benchmarks

The nodes are running Ubuntu and a version of Keystone from master from May 2015. They are using Fernet tokens that expire after two hours. mysql is setup as a 3 node Galera cluster that preferentially uses one node. The systems were not otherwise busy or doing much else.

The test itself does 20 validations at once, up to 4000 in total. It talks to the load balancer, which is set up to do round-robin connections.

Given all the variables here, I don’t expect you to replicate these numbers; rather, view them relative to each other.

Running the Benchmark

For the benchmark code, I used a modified version of Dolph’s benchmark experiment. The modified code is here (note that the detection of whether ab is installed is broken; feel free to send me a fix).

Usage:

./benchmark.sh [Keystone Node or LB] [admin_password]

Generating Revoked Tokens

Here’s my kinda hacky script to generate and revoke tokens; it could be better if it just used curl for both. Usage is to pass in the number of tokens to create and then revoke as arg1, and a valid token you’ve previously generated as arg2.

echo "getting & revoking $1 tokens"
for i in $(eval echo "{1..$1}")
do
TOKEN=`keystone token-get | grep id | grep -v tenant_id | grep -v user_id | awk '{ print $4 }'`
curl -X DELETE -i -H "X-Auth-Token: $2" "${OS_AUTH_URL}/tokens/${TOKEN}"
done

Solutions?

Here are a few ideas I’d recommend. First, get a baseline of how many revocations you have on a regular basis; this should mainly be from people signing out of Horizon or switching projects in Horizon. For us it’s about 20-30. This is how you check:

mysql -u root keystone -e "select count(id) from revocation_event;"

Once you get a normal number, I’d recommend putting a threshold check into Icinga.
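
A minimal sketch of such a check (the thresholds are made up; adjust them to your baseline and wire it into Icinga however you normally do):

#!/bin/bash
# warn/crit on the number of rows in keystone's revocation_event table
WARN=${1:-100}
CRIT=${2:-500}
COUNT=$(mysql -u root keystone -NBe "select count(id) from revocation_event;")
if [ "$COUNT" -ge "$CRIT" ]; then
  echo "CRITICAL: $COUNT revocation events"
  exit 2
elif [ "$COUNT" -ge "$WARN" ]; then
  echo "WARNING: $COUNT revocation events"
  exit 1
else
  echo "OK: $COUNT revocation events"
  exit 0
fi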

Watch your testing too; we have some regression tests that create users, roles, etc. and generate about 500 revocation events.

If you have a spike of events, and you’re not worried about rogue users, you can simply truncate the table.

mysql -u root keystone -e "truncate table revocation_event;"

This has security implications so make sure you know what you are doing.

Another idea is writing a no-op driver for revocations, this essentially disables the feature and again has security implications.

Finally, I’d recommend enabling caching for revocation events. You still get the same curve, but you’ll start out at a higher performance value.
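
The relevant keystone.conf bits look roughly like this (option names from memory, so check the generated sample config for your release; revocation caching may already default to on):

[cache]
enabled=true
backend=dogpile.cache.memcached
memcache_servers=127.0.0.1:11211

[revoke]
caching=true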


Fernet Tokens in Prod

This post is a follow-up to my previous post about Fernet Tokens which you may want to read first.

Last night we upgraded our production OpenStack to a new version of keystone off of master from a couple weeks ago and at the same time switched on Fernet tokens. This is after we let the change soak in our dev and staging environments for a couple weeks. We used this time to assess performance, look for issues, and figure out our key rotation strategy.

The Upgrade

All of our upgrade process is run via ansible. We cherry-pick the change which includes pointing to the repo with the new keystone along with enabling the Fernet tokens and then let ansible drive puppet to upgrade and switch providers. During the process, we go down to a single keystone node because it simplifies the active/active database setup when running migrations. So when this node is upgraded we take a short outage as the package is installed and then the migrations run. This took about 16 seconds.

Once this is done, the other OpenStack services start freaking out. Because we’ve not upgraded to Kilo yet, our version of Keystone middleware is too dumb to request a new token when the old one stops working. So this means we have to restart services that talk to Keystone. We ended up re-using our “rabbit node died, reboot OpenStack” script and added glance to the list since restarting it is fairly harmless even though it doesn’t talk to rabbit. Due to how the timing works, we don’t start this script until puppet is completely done upgrading the single keystone node, so while the script to restart services is quick, it doesn’t start for about 90 seconds after Keystone is ready. This means that we have an API outage of 1-2 minutes. For us, this is not a big deal, our customers are sensitive to “hey I can’t get to my VM” way more than a few minutes of API outage, especially one that’s during a scheduled maintenance window. This could be optimized down substantially if I manually ran the restarts instead of waiting on the full puppet run (that upgrades keystone) to finish.

Once the first node is done we run a full validation suite of V2 and V3 keystone tests. This is the point at which we can decide to go back if needed. The test suite for us took about 2 minutes.

Once we have one node upgraded, OpenStack is rebooted, and validation passes, we then deploy the new package and token provider to the rest of the nodes and they rejoin the cluster one by one. We started in the opposite region so we’d get an endpoint up in the other DC quickly. This is driven by another ansible job that runs puppet and does the nodes one by one.

All in all we finished in about 30 minutes, most of that time was sitting around. We then stayed an extra 30 to do a full set of OpenStack regression tests and everything was okay.

At the end I also truncated the token table to get back all the disk space it was using.
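
That part is a one-liner if you were storing UUID tokens in the token table; as with any truncate, make sure nothing still depends on the rows:

mysql -u root keystone -e "truncate table token;"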

Key Rotation

We are not using any of the built-in Keystone Fernet key rotation mechanisms. This is because we already have a way to get code and config onto all our nodes and did not want to run the tooling on a keystone node directly. If you do this, you inadvertently declare one node a master and have to write special code to handle this master node in puppet or ansible (or whatever you are using). Instead we decided to store the keys in eyaml in our hiera config. I wrote a simple python script that decrypts the eyaml and then generates and rotates the keys. I then take the output and propose it into our review system. Reviewing eyaml-encrypted keys is somewhat useless, but the human step is there to prevent something dumb from happening. For now we’re only using 3 keys; since our tokens last 2 hours, we can’t do two rotations in under two hours. The reviewer would know the last time a rotation was done and the last time one was deployed. Since we don’t deploy anywhere near a two-hour window, this should be okay. Eventually we’ll have Jenkins do this work rather than me. We don’t have any firm plans right now on how often we’ll do the key rotation, probably weekly though.

To answer a question that’s come up, there is no outage when you rotate keys, I’ve done five or six rotations including a few in the same day, without any issues.

Performance

I will be doing a full post later about performance once I have more numbers, but the results so far are that token generation is much faster, while validation is a bit slower. Even if it were about the same, the number of problems and database sync issues that not storing tokens in the DB solves makes them worthwhile. We’re also going to (finally) switch to WSGI and I think that will further enhance performance.

Aftermath

Today one of my colleagues bought a bottle of Fernet-Branca for us. All I can say is that I highly recommend not doing a shot of it. Switching token providers is way less painful. (Video of said shot is here)


Fernet Tokens for Fun & Profit

I’ve been digging into Fernet tokens this past week and getting ready to switch us over to using them. This is the first in a series of blog posts I plan on writing about them. This one will mainly be background on why we’re switching and what we hope to gain. The next post will cover rolling them out, which will probably be in a few weeks. For now we’re running Fernet tokens in our dev environments for more testing while we focus resources on Kilo upgrades.

What are Fernet Tokens?

How do you explain Fernet tokens? Rather than some lengthy treatise on math and identity-management theory, just know this: Fernet tokens use shared private keys to avoid having to store or replicate tokens in your database. This makes them super fast, reduces load on your database, and solves replication lag between data centers and nodes within a data center. If your manager asks, “they’re faster, smaller, and reduce load on the db” should suffice. Dolph Mathews has a good write-up on how much faster they are here. You can also dive into the different token formats for comparison in another of his posts, here.

What Issues will this solve for us?

Now about the DB replication issues… I cannot tell you how much stuff we had to do to deal with database and replication issues with UUID tokens; here are a few samples:

  • custom cron job to reap expired tokens
  • force db transactions to a master, despite us being active/active so tokens would be there when asked for
  • hacks to our cross-region icinga checks to allow the tokens to replicate, literally sleep(3)

We’ve even had a service accidentally DOS us by requesting so many tokens the DB couldn’t keep up and keystone ran out of DB threads. Hopefully all this is solved by Fernet tokens.

Will switching cause an outage?

Switching token providers will cause an outage. All the old tokens you’ve issued are now 100% useless, so prep accordingly. I will give some updates in the next blog post on how long this was and what issues we saw when we did it.

Do I need to be on Kilo/Liberty?

  • Horizon – you need a newish copy of django_openstack_auth which I think is in Liberty
  • Keystone – you need to be on Kilo
  • python-keystonemiddleware – it’s best to have at least 1.1.0. If you have 1.0, you MUST restart all OpenStack services after switching tokens
  • Everything Else – Shouldn’t matter!

A note on python-keystonemiddleware: in 1.0.0, if a service (say Nova) can’t use its token for some reason, it won’t try to get a new one until the old one expires. So if you switch to Fernet tokens you have to restart all OpenStack services that talk to Keystone or they will not work. We already have some ansible to do this, mainly in response to RabbitMQ issues, but it works here too.

How do I get Keys onto the boxes?

All keystone nodes in your cluster need to have the same keys. Fortunately there is the concept of rotation, so there’s no outage when switching keys; there’s always a key that’s “up next” or “on-deck” so that when you’re rotating you switch to a key that’s already on every box. Now, as for getting the keys there: I’m going to use puppet to deploy keys that I store in hiera and rotate with a jenkins job, but there are other ways, like a shared FS or rsync. More details on my method, once I know it works, in a later blog post!

How does key rotation work?

What Fernet Rotation Looks Like

If you read through the information on Fernet tokens, key rotation is by far the most confusing part. I’ve sat down with pen and paper and now think I get it, so allow me to explain. I’m going to use a 3-key example here; the keys are named with numbers. I highly encourage you to set up a throwaway Keystone box and use keystone-manage fernet_rotate if you don’t follow this.

You need to know 4 rules about how these keys work first:

  1. The highest numbered key is the current signing key.
  2. The 0 key is the key that will become the next signing key.
  3. All other keys are old keys; they’ve been used in the past and there might be old tokens out there still signed with them, depending on your expiration schedule
  4. New keys are always created as key 0.

Starting position, per the rules above.

  • 0 – this is the on-deck key; after the next rotation, it becomes primary.
  • 1 – this is the old key; it used to be primary, and it’s still here in case any old tokens are still signed with it. Next rotation it gets deleted.
  • 2 – this is the current primary key that’s used for signing.

Now we do a Rotation…

  • 0 becomes 3
  • 1 gets deleted
  • 2 stays 2
  • a new key becomes 0

A Fernet Rotation in Action
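
Here’s roughly what a rotation looks like on disk, assuming the default key repository (/etc/keystone/fernet-keys) and max_active_keys left at 3; the hostname is just my throwaway box:

root@keystone-test:~# ls /etc/keystone/fernet-keys/
0  1  2
root@keystone-test:~# keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
root@keystone-test:~# ls /etc/keystone/fernet-keys/
0  2  3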

So How does this work?

Let’s pretend we have a few tokens since this is a running OpenStack cluster. All tokens issued before the rotation above are signed with 2. We do the rotation; now new tokens are signed with 3. When a token comes in, Keystone tries both 3 and 2 to decode the token, and either should work. At this point we CANNOT rotate again until no more active tokens are signed with 2, because 2 is going to be deleted! This means you need more keys if you plan on rotating more frequently or have a long token expiration time. We’re going to rotate roughly weekly, and we have a 2-hour token timeout, so 3 keys is plenty.

Homework

If you think you get this, try this as a homework problem. Assume that you have max_active_keys set to 5 and that you have 5 keys: 0, 4, 5, 6, 7.

  • Which is the current signing key?
  • Which is on-deck or the next key to be used? What will its number be after the rotation?
  • Which key will be deleted on next rotation?
  • What happens if a token comes in signed with key 5?
  • What happens if a token comes in signed with key 3?

Other Sources

I gathered a lot of this info from trying stuff but also a lot from blog posts. I’ve referenced two above, but I also want to recommend Lance Bragstad’s blog. Note, Lance’s blog is the only blog in the world where you can read about quinoa recipes and shotgun shot patterns.


Using Keystone’s LDAP Connection Pools to Speed Up OpenStack

If you use LDAP with Keystone in Juno you can give your implementation a turbo-boost by using LDAP connection pools. Connection pooling is a simple idea. Instead of bringing up and tearing down a connection every time you talk to LDAP, you just reuse an existing one. This feature is widely used in OpenStack when talking to mysql and adding it here really makes sense.

After enabling this feature, using the default settings, I got a 3x-5x speed-up when getting tokens as an LDAP-authenticated user.

Using the LDAP Connection Pools

One of the good things about this feature is that it’s well documented (here). Setting this up is easy. The tl;dr is that you can just enable two fields and then use the defaults and they seem to work pretty well.

First turn the feature on; nothing else works without this master switch:

# Enable LDAP connection pooling. (boolean value)
use_pool=true

Then if you want to use pools for user authentication, add this one:

# Enable LDAP connection pooling for end user authentication. If use_pool
# is disabled, then this setting is meaningless and is not used at all.
# (boolean value)
use_auth_pool=true
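
There are also several tuning knobs if the defaults don’t suit you. These are the ones I remember from the generated sample config, shown with what I believe are the defaults (double-check the names and values for your release):

# additional [ldap] pool tuning options
pool_size=10
pool_retry_max=3
pool_retry_delay=0.1
pool_connection_timeout=-1
pool_connection_lifetime=600
auth_pool_size=100
auth_pool_connection_lifetime=60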

Experimental Setup

For my experiment I used a virtual keystone node that we run on top of our cloud, pointing at a corporate AD box using ldaps. Using an LDAP user, I requested 500 UUID tokens in a row. We use a special hybrid driver that uses the user creds to bind against ldap and ensure that the user/pass combo is valid. I also changed my OS_AUTH_URL to point directly at localhost to avoid hitting the load balancer. Finally, I’m using eventlet (keystone-all) rather than apache2 to run Keystone. According to the Keystone PTL, Morgan Fainberg, “under apache I’d expect less benefit”. If you’re not using eventlet, ldaps, or my hybrid driver you might get different results, but I’d still expect it to be faster.

Here’s my basic test script:

export OS_TENANT_NAME=admin
export OS_USERNAME=LDAPUSER
export OS_PASSWORD=password
export OS_TENANT_NAME=admin
export OS_REGION_NAME='dev02'
export OS_AUTH_STRATEGY=keystone
export OS_AUTH_URL=http://localhost:5000/v2.0/
echo "getting $1 tokens"
for i in $(eval echo "{1..$1}")
do
curl -s -X POST http://localhost:5000/v2.0/tokens \
-H "Content-Type: application/json" \
-d '{"auth": {"tenantName": "'"$OS_TENANT_NAME"'", "passwordCredentials": {"username": "'"$OS_USERNAME"'", "password": "'"$OS_PASSWORD"'"}}}' > /dev/null
done

Results

Using the default config, it took 7 mins, 25s to get the tokens.

getting 500 tokens
real 7m25.527s
user 0m2.312s
sys 0m1.557s

I then enabled use_pool and use_auth_pool and restarted keystone; the results were quite a bit faster, a 5x speed-up. Wow.

getting 500 tokens
real 1m25.774s
user 0m2.302s
sys 0m1.539s

I ran this several times and the results were all within a few seconds of each other.

I also tried this test using the keystone CLI and the results were closer to 3.5x faster, still a respectable number.

Watching the Connections

I have a simple command so I can see how many connections are being used:

watch -n1 "netstat -ant | grep :3269"

Using this simple command I can see the connection count bounce between 0 and 1 without connection pools.

Using the defaults but with connection pools enabled, the number of connections was a solid 4. Several minutes after the test ran, they died off and the count went to 0.

At first I wasn’t sure why I didn’t get more than 4, and raising the pool counts did not change this value; it turns out this is because I have 4 workers on this node.

Tokens Are Fundamental

The coolest part of this is that the change speeds everything up. Since you need a token to do anything, I re-ran the test but just had it run nova list, cinder list, and glance image-list 50 times using the clients. Without pooling it took 316 seconds; with pooling it took 231 seconds.
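
The client loop was nothing fancy, roughly this (with an openrc already sourced):

for i in $(seq 1 50)
do
  nova list > /dev/null
  cinder list > /dev/null
  glance image-list > /dev/null
done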

Plans

There are lots of ways to improve the performance of OpenStack, but this one is simple and easy to set up. The puppet code to configure this is in progress now. Once it lands, I plan to move this to dev and then to staging and prod in our environments. If I learn any other interesting things there, I’ll update the post.


Quick Update on Stacked Auth

A quick update on Stacked/Hybrid auth for Keystone. I spent some time this week updating the driver for Icehouse. That branch is available here and includes packaging info. I’ve been too lazy to make a non-packaging copy of the branch, but it should be easy to do. During my debugging of the new driver, I realized one thing: even though this driver does a bind using the credentials of the user you are trying to authenticate, you still need a service account to bind with if you are using subtree search. The reason is that the LDAP code does an initial bind to map your user id to a dn so that it can then do the “real” bind. If you see any issues with this code please let me know!


Keystone: Using Stacked Auth (SQL & LDAP)

Most large companies have an Active Directory or LDAP system to manage identity, and integrating this with OpenStack makes sense for some obvious reasons, like not having multiple passwords and automatically de-authenticating users who leave the company. However, doing this in practice can be a bit tricky. OpenStack needs some basic service accounts. Then your users probably want their own service accounts, since they don’t want to stick their AD password in a Jenkins config. So multiply the number of teams that may use your OpenStack by 4 or 5 and you have a ballpark estimate of the number of service accounts you need. Then take that number to your AD team and be sure to run it by your security guys too. This is the response you might get.

Note: All security guys wear cloaks.

The way you can tell how a conversation with a security guy is going is to count the number of facepalms they do.

In this setup you probably also do not have write access to your LDAP box. This means that you need a way to track role assignment outside of LDAP, using the SQL back-end (a subject I’ve previously covered).

So this leaves us with an interesting problem, fortunately one with a few solutions. The one we chose was to use a “stacked” or “hybrid” auth in Keystone: one where service accounts live in Keystone’s SQL backend, and if users fail to authenticate there they fall back to LDAP. For role assignment we will store everything in Keystone’s SQL back-end. The good news is that Ionuț Arțăriși from SUSE has already written a driver that does exactly what you need. His driver for Identity and Assignment is available here on github and has a nice README to get you started. I did have a few issues getting it to work, which I will cover below, hopefully saving you some time in setting this up.

The way Ionut’s driver works is pretty simple: try the username and password first locally against SQL; if that fails, try to use the combo to bind to LDAP. This saves the time of having to bind with a service account and search, so it’s simpler and quicker. You will need to be able to at least bind RO with your standard AD creds for this to work though.

From this point forward, you probably will only have the issues I have if you have an AD/LDAP server with a large number of users. If you’re in a lab environment, just following the README in Ionut’s driver is all you need. For everyone else, read on…

Successes and Problems

After configuring the LDAP driver, I switched over to the hybrid identity driver and restarted keystone. The authentication module seemed to be working great. I did my first test by doing some basic auth and then running a user-list. The auth worked, although my user lacked any roles that allowed him to do a user list. When I set the SERVICE_TOKEN/ENDPOINT and re-ran the user list, well, that was interesting. Since I had enabled the secret LDAP debug option (hopefully soon to be non-secret), I noticed that it was doing A TON of queries, and I eventually hit the server max before setting page_size=1000. Below is an animation that shows what it was like running keystone user-list:

This isn’t really animated, but is still an accurate representation of how slow it was

It did eventually finish, but it took 5 minutes and 25 seconds, and ain’t nobody got time for that. One simple and obvious fix is to just not pull LDAP users into the list; there are too many to manage anyway. So I made the change, and now it only shows local accounts. This was a simple change to the bottom of the driver: just return SQL users and don’t query LDAP.

root@mfisch:~# keystone user-list
+----------------------------------+------------+---------+--------------------+
| id | name | enabled | email |
+----------------------------------+------------+---------+--------------------+
| e0ede62ebcb64e489a41f4a9b18cd63c | admin | True | root@localhost |
| f9d76feb7081463f973eeda82c918547 | ceilometer | True | root@localhost |
| 176e232176c74164a42e2f773695671b | cinder | True | cinder@localhost |
| fc8007f82b7542e88ca28fab5eda1b5c | glance | True | glance@localhost |
| 635b4022aafc4f5488c048109c7b1665 | heat | True | heat@localhost |
| b3175fc7b62748679b4c958fe89fbdf0 | heat-cfn | True | heat-cfn@localhost |
| 4873cd322d0d4582908002b619e3940a | neutron | True | neutron@localhost |
| 37ef623c60474bf5a384a2047719acf4 | nova | True | nova@localhost |
| ff07bd356b8d4959be4326a43c297db0 | swift | True | swift@localhost |
+----------------------------------+------------+---------+--------------------+

After making this change, I went to add the admin role to my AD user and found I had a second problem: it seemed to rely on my user appearing in the user list before it would assign them a role.

root@mfisch:~# keystone user-role-add --user-id MyADUser --role-id admin --tenant-id 6031675b7618458b8bb85b92857532b06
No user with a name or ID of 'MyADUser' exists.

I didn’t really want to wait five minutes to assign or list a role, so this needed fixing.

Solutions

It was late on Friday afternoon when I started toying with some solutions for the role issue. The first solution is to add OpenStack users into a special ou so that you can limit the scope of the search. This is done via the user_filter setting in keystone.conf. If this is easy for your organization, it’s a simple fix. It also allows tools like Horizon to show all your users. If you do this, you can obviously undo the change to not return LDAP users.
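
For example, something like this in the [ldap] section of keystone.conf (the filter itself is made up; use whatever group or ou your AD team gives you):

user_filter = (memberOf=cn=openstack-users,ou=groups,dc=example,dc=com)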

The second idea that came to mind was that every time an AD user authenticated, I could just make a local copy of that user, without a password and set up in a way that they’d be unable to authenticate locally. I did manage to get this code working, and it’s available on github if you ever have a use for it, but it has a few issues. The most obvious one is that when users leave the company there’s no corresponding clean-up of their records. The second is that it doesn’t allow for any changes to things like names or emails once they’ve authenticated. In general it’s bad to have two copies of this kind of info anyway. It is not well tested and it’s not very clean in its current state. With this change, I had this output (truncated):

root@mfisch:~# keystone user-list
+----------------------------------+------------+---------+--------------------+
| id | name | enabled | email |
+----------------------------------+------------+---------+--------------------+
| MyADUser | MyADUser | True | my corporate email |
| e0ede62ebcb64e489a41f4a9b18cd63c | admin | True | root@localhost |
| f9d76feb7081463f973eeda82c918547 | ceilometer | True | root@localhost |
...
+----------------------------------+------------+---------+--------------------+

Due to the concerns raised above, we decided to drop this idea.

My feelings on this solution…

The final idea, which came after some discussions, was to figure out why the heck a user list was needed to add a role. I dug into the client code, hacked a fix, and then asked on IRC. It turns out I wasn’t the first person to notice it. The bug (#1189933) is that when the user ID is not a UUID, the client thinks you must be passing in a name and does a search. The fix for this is to pull a copy of python-keystoneclient that’s 0.4 or newer. We had 0.3.2 in our repo. If you upgrade this package to the latest (0.7x) you will need a copy of python-six that’s 1.4.0 or newer. This should all be cake if you’re using the Ubuntu Cloud Archive.

Final State

In the final state of things, I have taken the code from Ionut and:

  1. disabled listing of LDAP users in the driver
  2. upgraded python-six and python-keystoneclient
  3. packaged and installed Ionut’s drivers (as soon as I figure out the license for them, I’ll publish the packaging branch)
  4. reconfigured keystone via puppet

So far it’s all working great, with just one caveat: if users leave your organization they will still have roles and tenants defined in Keystone’s SQL. You will probably need to define a process for cleaning those up occasionally. As long as the users are deactivated in AD/LDAP, though, the role/tenant mappings for them should be relatively harmless cruft.
