Is OpenStack’s mission broken?

tl;dr:

  1. Betteridge’s law applies.
  2. Ease of development is self inflicted and not mission creep.
  3. Ease of use is self inflicted and not mission creep.
  4. Ease of operations is self inflicted and not mission creep.
  5. I have concrete suggestions for 2/3/4 but to avoid writing a whole book I’m just going to tackle (2) today.

Warning: this is a little ranty. Its not aimed at any individual, its just crystalised out after a couple of years focused on different things, and was seeded by Jay Pipes when he recently put a strawman up about two related discussions that we haven’t really had as a community:

  1. What should the scope of OpenStack’s mission be?
  2. a technical proposal for ‘mulligan’, a narrowly defined new mission. And yes, I know that OpenStack has incredible velocity. Just imagine what it could be if the issues I describe didn’t exist.

So is it the mission?

I think OpenStack has lots of “issues”, to use the technical term, across, well, everything. I don’t think the mission is even slightly related to the problems though.

The mission has ultimately just brought a huge number of folk together with the idea that they might produce a thing that can act like a cloud.

This has been done before: organisations like AWS, Microsoft, Google and smaller players like Digital Ocean and Rackspace (before OpenStack).

I reject the idea that having such a big, hairy, inclusive mission is a problem.

We can be more rigorous about that though: if a smaller mission would structurally prevent a given issue, then it’s the mission that is the problem. Otherwise, it’s not.

I do think the mission is somewhat ridiculous, but there’s a phrase in some companies:a companies mission defines what it doesn’t do, not what it does.

And I think the current OpenStack mission does that quite well: there are two basic filters that can be applied, and unless at least one matches, it’s out of scope for OpenStack.

  • Can you get $thing from a Public Cloud?
  • Do you uniquely need $thing to run a Cloud?

And yes, there are a billion things in the grey cloud around the edge.

Know what else has this problem? Linux. Well over ~3/5th of its code is in that grey edge. 170M of core, 130M of architectures, 530M in drivers. X86 + arm is 50M of that 130M of architectures.

Linux’s response has been dramatically different to ours though. They have a single conceptual project being built, with enormous configurability in how it’s deployed. We’ve decided that we’re building a billion different things under the same umbrella, and that comes down to a cultural norm.

Cultural norms and silos

Concretely, Swift and Nova: the two original projects, have never conceptually regarded themselves as one project.

Should they?

I honestly don’t know :). But by not being one project (with enormous configurability in now it’s deployed), we set a cultural expectation in OpenStack, that variation in workload implied a new project and new codebase.

Every split out takes years to accomplish – both the literal ones like Glance, and the moral ones like Neutron.

The lines for the split-outs are drawn inconsistently.

To illustrate this, ask yourself: what manages a node in an OpenStack cloud? What’s the component that is responsible for working with the machines actual resources, reporting usages, reporting back to service discovery, healthchecks, liveness etc?

In a clean slate architecture you might design a single agent, and then make it extensible/modular. OpenStack has many separate agents, one per siloed team.

Similarly the scheduling problem for net/disk/compute: there is an enormous vertical stack of cloud-APIs that can be built on a solid base, many of which OpenStack has in its portfolio. But that stack is not being built on a common scheduler – and can’t be because the cultural norm is to split things out, not to actually figure out how to maintain things more effectively without moving the code around.

Some things really are better off as separate projects – and I’m not talking monorepo vs repo-per-project, thats really only about the ability to do some changes atomically. A reusable library like oslo.config is only reusable by being a separate project. oslo.db though, exists solely because we have many separate projects that all look like ‘REST on one side database on the other’. That is a concrete problem: high deployment overheads, redundant information in some places, inappropriate transaction boundaries in others. The objects work – passing structured objects around and centralising the DB access – makes things a lot better, but its broken into vertical silos much too early.

Our domain specific services include huge amounts of generic, common problem space code: persistence, placement, access control…

Cultural norms and agility

Back in the dawn of OpenStack, there were some very very strong personalities. Codebases got totally overhauled and replaced without code review. Distrust got baked in as another cultural norm. Code review became a control point. It’s extraordinarily common to spend weeks or months getting patches through.

In some of the most effective teams I’ve worked in code review is optional. Trust and iterate is the norm there: bypassing code review is a thing that needs to be justified, but code review is not how quality is delivered. Quality is delivered by continual improvement, rather than by the quality of any one individual commit.

A related thing is being super risk averse around what lands in master (more on that below). Some very very very clever folk have written very clever code to facilitate this combination of siloed projects + trying super hard not to let regressions into master. This is very hard to deliver – and in fact we stepped back from being an absolute-approach there, about 4 years ago, to a model where we try very hard to prevent it just within a small set of connected projects.

OpenStack has a deeply split personality. Many folk want to build a downloadable cloud construction kit (e.g. Ubuntu). Many more want to build a downloadable cloud product (direct release users). And many wanted (are there still public clouds running master?) to be able to use master directly with confidence. This last use case is a major driver for wanting master to be regression free…

Agility requires the ability to react to new information in a short timeframe. Doing CD (continuous deployment) requires a pipeline that starts with code review and ends with deployed code. OpenStack doesn’t do that. There’s a huge discontinuity between upstream and actual deployments, and effectively none of developers of any part of OpenStack upstream are doing operations day to day. Those that do – at Rackspace, previously at HP (where I was working when I was full time on OpenStack), and I’m going to presume at OVH and other public clouds – are having to separate out their operations work from their upstream changes.

Every initiative in a project will miss some details that have to be figured out later – thats the nature of all but the most exactly software development processes, and those processes are hugely expensive. (Formal methods just to start with). OpenStack copes with that by running huge planning cycles – 3-6 months apart.

Commits-as-control-points + long planning cycles + many developers not operating what they build => reaction to new information happens at a glacial scale.

To illustrate this, consider request tracing. 8 years ago Google released the Dapper whitepaper, Twitter wrote Zipkin and open sourced it, and we’re now at the point where distributed tracing is de rigeur – it’s one of the standard things a service operator will expect for any system. We spent years dealing with pushback from developers in service teams that didn’t understand the benefits of the proposed analogous system for OpenStack. Rackspace wrote their own and patched it in as part of their productionisation of master. Then we also got to have a debate about whether OpenStack should have one such system, or a plugin interface to allow Rackspace to not change. [Sidebar: Rackers, I love you and :heart: your company, but that drove me up the wall! I wish we’d managed to just join forces and get everyone to at least bring a damn tracing interface in for everything].

Test reliability

With TripleO we had the idea that we’d run a cloud based on master, provide feedback on what didn’t work, and create a virtuous circle. I think that that was ultimately flawed because the existing silos (e.g. of Nova, or Glance) were not extended into owning those components within TripleO: TripleO was just another deployer, rather than part of the core feedback cycle.

More generally, we had a team of people (TripleO) running other people’s code (all of OpenStack and commit rights were hard to get in other projects) with no SLA around that code.

I didn’t think of this that way at the time, for all that we understood that that was what we are doing, but that structure is actually structurally fragile: it’s the very antithesis of agile. When something broke it could stay broken for weeks, simply because the folk responsible for the break are not accountable for the non-brokenness of the system. (I’m not whinging about the teams we worked with – people did care, but caring and being accountable are fundamentally different things).

There is another place with that pattern: devstack. Devstack is a code base that exists to deploy all the other openstack components. It’s the purest essence of ‘run other people’s code with no SLA’, and devstack is the engine for pre-merge testing and pre-review testing in OpenStack.

I now believe that to be a key problem for OpenStack. Monty loves to talk about how many clouds OpenStack deploys daily in testing. Every one of those tests is running some number of components (typically the dependency graph for the service under test) which have not changed and are not written by the author, from scratch. And then of course the actual service being tested.

Thats structurally fragile: it’s running 5 or 10 times as much code as is relevant to the test being conducted. And the people able to fix any problems in those dependencies don’t feel the friction at the same time, in the same way, as their users do. (This isn’t a critique of the people, it’s just maths).

I’ll probably write more about this in detail later, as it ties into a larger discussion about testing and deployment of microservices, or testing in production. But imagine if we got rid of devstack for review and merge testing. It has several other use cases of course – ‘give me an OpenStack to hack on’ is an important, discrete test case, and folk probably care that that works. For simplicity I’m going to ignore that for now.

So, if we don’t use devstack, how do we deploy a cloud for pre-merge testing.

We don’t. We don’t need to. What we need to do is deploy the changed code into a cloud whose other components are expected to be compatible with that code. Devstack did this by taking a given branch of a bunch of components and bringing them up from scratch. Instead, we run a production grade, monitored and alerted deployment of all the components. Possibly we run many such deployments, for configurations that cannot coexist (e.g. different federation modes in keystone?). The people answering the pages for those alerts could be the service developers, or it could be an operations team with escalation back to the developers as-needed (to filter noise like ‘oh, cloud $X has just had an outage’). But ultimately the developers would be directly accountable in some realtime fashion.

Then the test workflow becomes:

  1. Build the code under test. (e.g. clean VM, pip install, whatever)
  2. Deploy that code into the existing cluster as a new shard
  3. Exercise it as desired
  4. Tear it down

Let’s use nova-compute as an example.

  1. pip install
  2. Run nova-compute reporting to an existing API server with some custom label on the hypervisor to allow targeting workloads to it
  3. Deploy a VM targeted it
  4. tear it down

I’m sure this raises lots of omg-we-can’t-do-that-because-technical-reason-X-about-what-we-do-today.

That’s fine, but for the purposes of this discussion, consider the destination – not the path.

If we did this:

  • Individual test runs could use substantially less resources
  • And perform substantially less work
  • Which implies better performance
  • Failures due to other components than the service under test would be a thing of the past (when you’re on the hook for your service running reliably, you engineer it to do that)

I think this post is long enough, so let me recap briefly. If there is interest out there I can drill into what sort of changes would be needed to transition to such a system, the suggestions I have for ease of use and ease of operations, and I think I’m also ready to provide some discussion about what the architecture of OpenStack should be.

Recap: why is development hard

Cultural problem #1: silos rather than collaboration in place. Moving the code rather than working with others.

Cultural problem #2: excessive entry controls. Make each commit right rather than trend up wards with a low-latency high change rate.

Cultural problem #3: developer feedback cycle is measured in weeks (optimistically), or years (realistically).

Technical problem #1: excessive code executed in tests: 80% of test activity is not testing the code under test.

Technical problem #2: our testing is optimised for new-cloud deployments: as our userbase grows upgrades become the common use case and testing should match that.

Advertisements

One thought on “Is OpenStack’s mission broken?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s