A re-post of an article I wrote last week for SuperUser.
A couple weeks back I attended the OpenStack Operators Mid-Cycle Meetup in Philadelphia.
One of the best parts of these meetups is that the agenda is all formed in public and everyone gets a say in it. Some of the things I was most concerned about made the list (RabbitMQ, packaging), some did not get on the main agenda (database maintenance, Puppet), but many side groups formed during the event and these topics were covered at lunch conversations and in the lobby.
The interesting part of this summit was hearing from other operators problems and solutions. This was more focused, yet with a larger audience, than the previous sessions in Paris. I think a real sense of camaraderie and support for shared solutions was achieved.
As I was listening to people discuss their issues with OpenStack and how others had solved them, I realized that OpenStack operators have different issues at different scales and maturity levels. When I think about scale and maturity, it’s not necessarily about number of nodes or the number of customers, its more about the number of resources you have, the number of services you provide, the maturity of your processes (such as deployment), and to a some extent how many of your problems you’ve solved with automation.
Our team started at a small scale. We had four people and the goal was to stand-up OpenStack. With four people, you are limited in scope and have to judiciously focus your resources. As our team grew and we worked through forming our deployment and automation processes, we’re able to spend more time on improving our service offerings and adding more services. We can also go back and clean up technical debt which you accumulate as you build an OpenStack deployment. Before these tools and processes are fully in place (and they are never perfect), making changes can take away valuable time.
For example, when an operator finds a bug, it takes a lot of resources and time to get a fix for that bug into production. This includes debugging, filing a bug, fixing the code, pushing a fix, begging for reviews, doing a cherry-pick, begging for more reviews, waiting for or building a package, deploying the new code. Many operators stop around the “filing a bug” step of this process.
Medium-sized operators or ones with more mature processes will sometimes work on a fix (depending on the issue), and may or may not have systems in place to allow them to hold the patch locally and build local packages. On the other hand, larger operators who have good systems in place can give the issue to the Team. They may have 10 people working on it, some of them core members. They have a full continuous integration/automation team that has solved package builds and deployments for them.
Our goal has always been to increase our scale of services, not only via more resources but through automation and creating tools/processes that allow us to offer services for the same amount of resource investment. The main reason for this that is our customers don’t care about Keystone or Neutron, these are just building blocks for them, they really want services like Domain Name System (DNS,) load-balancing-as-a-service (LBaaS), firewall-as-a-service (FWaaS) and database-as-a-service (DBaaS). But until the processes and tools are solid for the core components, it’s hard to find time to work on those, because while customers may not know what Keystone is, they sure care when it doesn’t work.
So how does any of this relate to the conference besides I was daydreaming about it in the lobby? What is clear to me after the sessions is that we have some specific areas where we’re going to work on process improvements and tooling.
My top three are:
- RabbitMQ monitoring and tooling
- Speeding up and clarifying our development & deployment process
- Investigating alternatives to heavy-weight OS packages for deploying OpenStack code
When we revisit this again in six months at the next Mid-Cycle, I suspect that number two will remain on the list, probably forever, since you can always make this process better. I’m certain we’ll have a new number one, and I’m pretty hopeful about the options for number three.
What will these investments get us? Hopefully more time for second-level service offerings and happier customers.