
Scale & Maturity: Thoughts on the OpenStack Mid-Cycle Operators Meetup

A re-post of an article I wrote last week for SuperUser.

A couple weeks back I attended the OpenStack Operators Mid-Cycle Meetup in Philadelphia.

One of the best parts of these meetups is that the agenda is formed entirely in public and everyone gets a say in it. Some of the things I was most concerned about made the list (RabbitMQ, packaging), some did not get on the main agenda (database maintenance, Puppet), but many side groups formed during the event, and these topics were covered in lunch conversations and in the lobby.

The interesting part of this meetup was hearing other operators' problems and solutions. It was more focused, yet with a larger audience, than the previous sessions in Paris. I think a real sense of camaraderie and support for shared solutions was achieved.

[Photo: Puppet OpenStack discussion at lunch]

As I was listening to people discuss their issues with OpenStack and how others had solved them, I realized that OpenStack operators have different issues at different scales and maturity levels. When I think about scale and maturity, it's not necessarily about the number of nodes or the number of customers; it's more about the number of resources you have, the number of services you provide, the maturity of your processes (such as deployment), and to some extent how many of your problems you've solved with automation.

Our team started at a small scale. We had four people and the goal was to stand up OpenStack. With four people, you are limited in scope and have to focus your resources judiciously. As our team grew and we worked through forming our deployment and automation processes, we became able to spend more time on improving our service offerings and adding more services. We can also go back and clean up the technical debt that accumulates as you build an OpenStack deployment. Before these tools and processes are fully in place (and they are never perfect), making changes can take away valuable time.

For example, when an operator finds a bug, it takes a lot of resources and time to get a fix for that bug into production. This includes debugging, filing a bug, fixing the code, pushing a fix, begging for reviews, doing a cherry-pick, begging for more reviews, waiting for or building a package, and deploying the new code. Many operators stop around the “filing a bug” step of this process.

Medium-sized operators, or ones with more mature processes, will sometimes work on a fix (depending on the issue), and may or may not have systems in place that allow them to hold the patch locally and build local packages. On the other hand, larger operators who have good systems in place can hand the issue to their team. They may have 10 people working on it, some of them core reviewers, and a full continuous integration/automation team that has solved package builds and deployments for them.

Our goal has always been to increase our scale of services, not only via more resources but through automation and by creating tools and processes that allow us to offer more services for the same amount of resource investment. The main reason for this is that our customers don't care about Keystone or Neutron; those are just building blocks to them. What they really want are services like Domain Name System (DNS), load-balancing-as-a-service (LBaaS), firewall-as-a-service (FWaaS), and database-as-a-service (DBaaS). But until the processes and tools are solid for the core components, it's hard to find time to work on those, because while customers may not know what Keystone is, they sure care when it doesn't work.

So how does any of this relate to the conference, besides the fact that I was daydreaming about it in the lobby? What is clear to me after the sessions is that we have some specific areas where we're going to work on process improvements and tooling.

My top three are:

  1. RabbitMQ monitoring and tooling
  2. Speeding up and clarifying our development & deployment process
  3. Investigating alternatives to heavy-weight OS packages for deploying OpenStack code

When we revisit this in six months at the next Mid-Cycle, I suspect that number two will remain on the list, probably forever, since you can always make this process better. I'm certain we'll have a new number one, and I'm pretty hopeful about the options for number three.

What will these investments get us? Hopefully more time for second-level service offerings and happier customers.


What Do Operators Do All Day?

When I was a kid, Richard Scarry's book "What Do People Do All Day?" was one of my favorites. I saw the book at my parents' house, and it got me thinking about trying to categorize everything I've worked on in the past few months. The result of that thinking is this post: "What Do Operators Do All Day?"

Being an operator means that, by necessity, you're a jack of all trades, unless you're at a very large provider. And so, over the past 6 months, I've worked on almost every piece of our cloud, and in almost every case I've learned something new and grown my skillset (which is my favorite part of working on OpenStack).

Collecting My Data

Over the last 6 months or so, I've resolved 106 JIRA issues, and looking back at these provides a decent picture of where I spend my tracked work time. I've also done upstream reviews and commits, for which Stackalytics provides good details. Using this information, I'll present where I spend my time, in order from most to least.

Puppet Automation

I spend most of my time these days working on Puppet modules or configuring services with Puppet. Some of this work includes:

  • Fixing/configuring/enabling new features in services like Keystone, Nova, etc.
  • Upgrading our Puppet branches from Icehouse to master
  • Configuring build servers & infra (Cobbler, Puppet, package repos, etc.)
  • Configuring/deploying Icinga, or writing new checks
  • Refactoring and cleanup, like moving all our Keystone roles/users to YAML so that they're simpler to add (a sketch of what that data can look like follows this list)
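
On that last point, here is a minimal sketch of the kind of Hiera-style data I mean. The key names and structure here are hypothetical, not our actual schema, but the idea is that adding a role or user becomes a one-line data change rather than a Puppet code change:

    # Hypothetical Hiera data; key names and layout are illustrative only.
    keystone_roles:
      - admin
      - member
      - observer

    keystone_users:
      svc-monitoring:
        email: monitoring@example.com
        tenant: services
        roles:
          - observer
      jsmith:
        email: jsmith@example.com
        tenant: dev
        roles:
          - member

The Puppet side then just iterates over this data to create the corresponding resources, so reviewing a new user or role is a trivial diff.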

Ansible

A close second is Ansible automation. We use Ansible to manage our inter-node dependencies and also to drive our deployments. One example of what we'd use Ansible for is upgrading MySQL, one node at a time, managing state between nodes while doing so (a sketch of that pattern follows the list below). Over the past six months I've written Ansible jobs to:

  • Deploy a new hand-built version of OVS
  • Perform a live upgrade of MySQL from 5.5 to 5.6
  • Upgrade OpenStack services from Icehouse to Juno, or Juno to Kilo
  • Improve our weekly deployment process
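
As an illustration of that one-node-at-a-time pattern, here is a trimmed-down sketch of what the MySQL upgrade play could look like. The group, package, and helper-script names are made up for the example, and the real job has considerably more state checking between steps:

    # Illustrative sketch only: group, package, and helper names are assumptions.
    - hosts: galera_nodes
      become: true
      serial: 1                 # upgrade one node at a time so the cluster stays up
      tasks:
        - name: Drain this node from the load balancer (hypothetical helper)
          command: /usr/local/bin/lb-drain {{ inventory_hostname }}

        - name: Stop MySQL on this node
          service:
            name: mysql
            state: stopped

        - name: Upgrade the MySQL packages
          apt:
            name: mysql-wsrep-5.6
            state: latest
            update_cache: yes

        - name: Start MySQL and let it rejoin the cluster
          service:
            name: mysql
            state: started

        - name: Wait until the node reports Synced before moving on
          command: mysql -NBe "SHOW STATUS LIKE 'wsrep_local_state_comment'"
          register: wsrep_state
          until: "'Synced' in wsrep_state.stdout"
          retries: 30
          delay: 10

        - name: Put the node back into the load balancer (hypothetical helper)
          command: /usr/local/bin/lb-undrain {{ inventory_hostname }}

The key piece is serial: 1 combined with a health check, which keeps the play from touching the next node until the current one is back in the cluster.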

Misc

Some of these tasks don't show up in JIRA, but they do take a good amount of my time.

  • Travel/training: the OpenStack conference, RabbitMQ training, etc.
  • Planning: sprint planning, feature planning, expansion planning, etc
  • Mentoring and on-boarding: we've grown a lot, and the importance of this one cannot be overstated. I do about 5-10 code reviews per day even when I'm not answering questions
  • Working on Ubuntu packaging for OpenStack; we roll (some of) our own packages
  • MySQL/Galera DBA-esque work

On Call/Issues

Every few months, I do an on-call rotation for a week. These can be quiet or not, depending on the shape of our monitoring and our cloud; whether it's a good or bad on-call week is usually of our own doing, however. Even when I'm not on call, I deal with issues: although we do our best, we occasionally have problems. When you get enough nodes, you'll get failures. They could be hardware failures, kernel issues, or simply software failures; we've had them all. I could do an entire post on the issues we see, but the ones that stick out to me as focus areas on the software side are OVS, MySQL, and RabbitMQ. Those are probably the three most complex and most important pieces of our software stack, and so they get lots of my attention.

Upstream

I think the community is one of the best things about OpenStack, so I spend what little time I have left here. I participate in IRC and mailing list discussions as part of the Operators and Puppet OpenStack communities. I also do reviews and submit fixes, primarily for puppet-openstack but also for OpenStack itself. Although my commits to OpenStack itself have slowed, I've earned my 3rd ATC for Vancouver, and I think it's important to participate in this process.

Summary

One of the first concerns I expressed when interviewing for this job was that we'd have OpenStack set up in a year and then we'd be done. That has been far from the truth. In reality, the life of an OpenStack operator is always interesting. There's no shortage of things to fix, things to improve, and things to learn, and that's why I love it. Although each release of OpenStack generally makes things easier and more robust, it also adds more features along the edges to keep us busy.

Does this list match what everyone else spends their time on? Let me know in the comments.
