After running a cloud for 2+ years our OpenStack API databases are full of cruft. Deleted instances, deleted networks, deleted volumes, they all are still in the databases. OpenStack has no periodic clean-up for this stuff, it’s left up to you. This is partly because there’s no unified way to do it and also because each operator has different requirements on how long to retain data. Over the past few weeks I’ve been cleaning up our records and would like to share what I’ve found.
Warning: This is an advanced operation. Before doing this: You should backup everything. You should test it in a dev environment. You should assess the impact to your APIs. I did all of those before attempting any of this, including importing prod data into a test OpenStack environment to assess the impact there.
Each service has it’s own database, and depending on how that code is written and usage, some services stored more data than others. Here’s the order in terms of largest to smallest of our databases, on disk size (which we will discuss later).
Pre-cleaning DB sizes
So with this in mind I started digging into how to clean this stuff up. Here’s what I found. I’m noting in here what release we’re on for each because the tooling may be different or broken for other releases.
Heat – Mitaka
Heat was the first one I did, mainly because if Heat blows up, I can probably still keep my job. Heat has a great DB cleanup tool and it works very well. Heat lets you say “purge deleted records > X days/months/etc old”. When I did this heat had so much junk that I “walked it in”, starting with 365 days, then 250, etc etc. Heat developers win the gold medal here for best DB clean-up tool.
heat-manage purge_deleted -g days 30
Keystone – All Versions
Guess what? Keystone doesn’t keep ANY deleted junk in it’s database, once it’s gone, it’s gone. This can actually be an issue when you find a 2 year old instance that has a userid you can’t track down, but that’s how it is. So as long as you’re not storing tokens in here, you’re good. We’re using Fernet tokens, so no issues here.
Cinder – Liberty
Cinder’s DB cleanup tool is broken in Liberty. It is supposed to be fixed in Mitaka, but we’re not running Mitaka. Hoping to try this after we upgrade. We have a lot of volume cruft laying around.
Glance – Liberty
Glance has no cleanup tool at all that I can find. So I wrote one, but we ended up not using it. Why? Well because it seems that Glance can and will report deleted images via the V2 API and I could never quite convince myself that we’d not break stuff by doing a cleanup. Anyone know otherwise?
Here’s my code to do the cleanup, be careful with it! Like Heat you should probably “walk this in” by changing “1 MONTH” to “1 YEAR” or “6 MONTHS”. These deletes will lock the tables which will hang up API calls while they are running, plan appropriately. Note if you look on the internet you might find other versions that disable foreign key constraints, don’t do that.
mysql -u root glance -e "DELETE FROM image_tags WHERE image_id in\ (SELECT images.id FROM images WHERE images.status='deleted'\ AND images.deleted_at <DATE_SUB(NOW(),INTERVAL 1 MONTH));" mysql -u root glance -e "DELETE FROM image_properties WHERE image_id in\ (SELECT images.id FROM images WHERE images.status='deleted'\ AND images.deleted_at <DATE_SUB(NOW(),INTERVAL 1 MONTH));" mysql -u root glance -e "DELETE FROM image_members WHERE image_id in\ (SELECT images.id FROM images WHERE images.status='deleted'\ AND images.deleted_at <DATE_SUB(NOW(),INTERVAL 1 MONTH));" mysql -u root glance -e "DELETE FROM image_locations WHERE image_id in\ (SELECT images.id FROM images WHERE images.status='deleted'\ AND images.deleted_at <DATE_SUB(NOW(),INTERVAL 1 MONTH));" mysql -u root glance -e "DELETE FROM images WHERE images.status='deleted'\ AND images.deleted_at <DATE_SUB(NOW(),INTERVAL 1 MONTH);"
Nova – Liberty
Like Heat, Nova also has a clean-up tool, and also like Heat, Nova has a huge database. Unlike Heat, Nova’s clean-up tool is more limited. The only thing you can tell it is “don’t delete more than this many rows”. Actually Nova calls it’s tool “archiving” because it doesn’t delete records, it moves them to shadow tables. So even if you use this, you need to go back and truncate all the shadow tables.
Also near as I can tell nova just tries to archive up to the max records paying not attention to any database constraints, so when you use it, you will get warnings. These appear safe to be ignore. Also the Nova archive (in Liberty) doesn’t tell you anything (I think this is fixed in Mitaka), so figuring out when you are done is some interesting guess work. Basically I just re-ran it over and over and compared the sizes of the shadow tables, when they stop changing we’re done.
Also one quick note, when this finishes and you run du, you’re going to find out that you are now using more disk space. That’s because you just did a bunch of inserts into the shadow tables.
Like everything else, walk this in.
$ nova-manage db archive_deleted_rows 50000 2016-08-30 21:49:01.404 1 WARNING nova.db.sqlalchemy.api [req-f329a277-4fe2-45d6-ba3a-51f93827ed2f - - - - -] IntegrityError detected when archiving table aggregate_metadata 2016-08-30 21:49:11.900 1 WARNING nova.db.sqlalchemy.api [req-f329a277-4fe2-45d6-ba3a-51f93827ed2f - - - - -] IntegrityError detected when archiving table instances
How Much Disk Will I Get Back
Surprise, you get nothing back! That’s because the disk space is already allocated. If this important to you, then you will need to OPTIMIZE the tables. This ends up doing a full recreate (depending on what DB you are using) and this WILL lock your tables and hang API calls. Be very careful when doing this. How much size can you save? Well for Heat it was about 6-7x smaller, 1.1G to 170M, the gain in Nova was more like 30%. Glance was also about 8x but I was took chicken to take that past our dev environment because of the API.
This is a question you should ask yourself before attempting this. Some of these operations are risky, but it’s also going to hurt your performance if you let these grow without bound. Some of you may want to just do the cleanup and skip the optimization steps. If you do the optimizations I’d recommend you know how long it takes for each table/service. If you can export your prod data onto a test node that will give you a better idea.
- Horizon just stores ephemeral session data so it’s pretty clean.
- Neutron – the DB is small so I’ve not looked into it, anyone tried it? Comments welcome.
- Designate – we’re on an ancient version (Juno) so any experiments here will happen on something more modern