OpenStack and ease of development

In my last post, about cultural norms in OpenStack, I said that ease of development was a self inflicted issue. This was somewhat contentious 🙂 I’ve had some interested expressed a deeper dive. In that post I articulated three cultural problems and two technical ones.

What does success for developers look like?

I think independent of the scope of OpenStack, the experience for developers should have roughly the same features:

  1. global reasoning for changes should be rarely needed, (or put another way, the architecture should make it possible to think about changes without trying to consider all of OpenStack and still get high quality results). (this helps new developers make good decisions)
  2. the component being worked on should build quickly (keep local development cycles brisk)
  3. have comprehensive local unit tests (keep local development effective; low rate of defects escaping to functional/integration tests)
  4. be able to utilise project resources to perform adhoc exploration, integration, functional and scale tests (this allows developers to have sensibly sized development machines, while still ensuring what they build works in a system representative of our users).
  5. the lead time from getting the change finished locally, to the developer no longer needing to shepard the change through the system should be low think about it, should be low (I won’t scare people by saying what I think it should be 🙂 . this feature keeps cognitive load on developers from becoming a burden)
  6. failures after review should be a) localised, b) rare enough that the overhead of corrective action is tolerable and c) recovery should take place within a small number of hours at most (this keeps the project as a whole healthy and that means individual developers will be rarely impacted by failures from other developers changes)

We already do ok on a number of these things: the above is not a gap analysis.

Sidebar – Accelerate

About now I feel I have to mention Accelerate, a book that is the result of detailed research into software delivery performance – and its follow-up report the DORA 2018 state of devops report. The Puppet state-of-devops report is useful as well, though they focus on different aspects – ones that are less generalisable to open source development in my view. And interestingly, particularly around team choice, they seem to have reached entirely different conclusions around team choice :).

The particularly interesting thing for me is that this is academic grade research, showing causation and tying that back to specific practices: this gives us a solid basis for planning changes, rather than speculation that something will work.

These reports and research are looking into software delivery – which for OpenStack spans organisations: we build, then users deploy. So its not entirely clear that things generalise, nor is it clear how one might implement all the predictive practices because of that.

For instance, while Continuous Integration is something we can imagine doing in OpenStack (sorry folks, preflight testing and CI are really very very different things);

Continuous Deployment would be a much more ambitious undertaking. Imagine it though: commit through to deployed on users clouds in a matter of hours. Wouldn’t that be something. Chrome and Firefox are two open source projects that have been evolving in this direction for some time, and we could well study them to learn what they have found to work and not work.

All that said, the construct – the metrics – that predict software delivery performance are:

  1. Release frequency
  2. Mean time to recovery
  3. Lead time (commit to value consumable)

There’s a separate construct (the Westrum organisational culture construct) for culture, and they also measured the effect on e.g. implementing Continuous Delivery on those metrics.

I highly recommend reading the book – perhaps start with the 2018 report for a taste, but the book has much more detail.

Where are the gaps

I haven’t looked particularly closely at the coupling in OpenStack recently, so for 1) I think folk actually landing changes should assess this. My sense it that we’re ok on this, but not great. In particular, anytime there is a big cross project, lots of involved commits, lots of sequencing – thats something that needed global reasoning.

For 2), most of our stuff is in Python today, so build times aren’t a big issue.

For 3), we’re in pretty decent shape unit test wise, though they tend to be very slow (minutes or more to run), and I worry about  skew between mocks and actual servers.

For 4) we do allow utilisation of project resources via gerrit pre-review tests and pre-merge tests, but there’s no provision for adhoc utilisation (that I know of), and as I described in my last post, I think we could get a lot more leverage out of the cloud resources if we had the ability to wire components under test into an existing, scaled, cloud.

For 5) I’d need to do some more detailed visualisation, or add a feature to stackalytics, but the sense from folk I speak too is that lead times are still enormous. I suspect there are two, or even three, distributions hiding in there (e.g. one for regular devs, and one for infrequent/new) – but we can gather data on this. One important aspect is whether we should measure from ‘code committed( in dev branch) to merged to master’, or ‘code committed to delivered’. Its my view that measuring to delivery is critical, if we truely want to be driving benefits to our users. There is a corner case where those two things converge – trunk based development – but that is particularly challenging for open source projects. For instance, http://stackalytics.com/report/reviews/nova/open shows under the ‘Change requests waiting for reviewers since the last vote or mark’ an average age time of 144 days, with a max age time of 709 days: thats 2 years, 4 releases. Thats measuring time to git; if we measure time to delivered, then we need to add the time that changes sit in git before being included in a release – up to 6 months, though the adhoc releases many project are doing now is a great help. The stats shown though aren’t particularly useful – a) reviews that have merged already are not included in the stats and b) there’s not enough information to  start reasoning about why they have the age they do.

For 6) our changes at the moment, recovery is burdened by the slow merging process – the minimum time to recovery is the sum of the unavoidable steps in the merge / delivery process. Failure frequency (things breaking after the merge completes / is released) is fairly low, but we’re not particularly good at blast radius management – the all-or-nothing nature of change rollout today means there is no mitigation when things go wrong.

So I think there are significant gaps with room to improve on three things there:

  1. More efficient test/adhoc project resource utilisation
  2. Lead times
  3. Blast radius

Smarter testing

I covered this in my previous post in moderate detail, but its worth drilling in further at this point. I don’t think there is a silver bullet here; the necessary machinery to test a new database engine version with an existing cloud is very different in detail to that required to test a new nova-compute build. Lets consider just being able to test a new nova-compute with an existing cloud. Essentially we want to wire in a new shard of nova-compute. Fortunately nova-compute is intrinsically sharded: thats its very model of operation.

blog-testing.png

Though its not strictly relevant here consider that other components (like the DB) have no sharding mechanism in place today, so wiring in a new shard for that would be “tricky”.

The details may have changed since I last dug deep, but from memory nova-compute needs access to the the message bus to communicate with the rest of nova, access to glance and the swift or other store that images are in, and obviously nova-compute needs appropriate local resources to run whatever compute workload it is going to serve out.

So wiring that in from a test node to an existing cloud seems pretty simple. We probably don’t want the services listening unsecured on the internet, so we’ll need a credential distribution system (e.g. vault), and automation to look those up and wire in the nova-compute instance with appropriate credentials.

There may be trust issues: are all components equally privileged in the system? This also shows up as a bug risk – how much damage could a broken but not malicious nova-compute do?

Harder cases – DDL

One common harder case is DDL – schema changes at the DB layer. I don’t have a good canned answer here, but roughly speaking in the context of tests we need to be able to:

  1. Try applying the DDL across the whole DB
  2. Run the code that works with the DB with the modified schema
  3. Be able to do that for many different patches

Right now we machinery to do 1) against a static copy of various cloud’s DBs. 2) and 3) are almost at cross purposes: it may be necessary to serialise those tests: they are fewer than other code changes. One possible implementation would be to use an expand-contract SQL server migration strategy to expand to a new server, run the DDL, verify the cloud metrics don’t regress, then migrate back using the source servers schema (and ignoring missing columns [because if they’ve been dropped in the new schema, then code is already not querying them].

Another possibility, given that these changes are rarer, is not to optimise the testing of them.

Harder cases – exotic components

Power machines, ESXi hypervisors, and other not generally-available hypervisors would all be good to expose to developers – make it possible for them to verify changes to the code that interacts with them – in real time. Ideally with more access than the current hands-off gerrit-test-job only approach.

Lead times

Today, I’m going to treat ‘in a release’ as delivered. I’m picking this definition because:

  • We can choose to make more releases
  • We don’t need to build consensus or whole new delivery stacks to try and get customers upgraded
  • We can always come back and defined delivered with more scope later

Lean methodology provides a number of tools for analysing lead times – it has been used successfully in many organisations; sufficiently robust and consistent in its results that Accelerate even cites adopting lean management practices as being predictive for performance. And then there is the whole what-does-delivered mean.

And yes, we are not a company, we are many volunteers, but that merely adds corner cases – most of our volunteers are given tasks to work on w/in OpenStack, and have the time to work with an effective SDLC and change management process.

As I mentioned above, without some more detailed modelling, its hard to say for sure what leads to the high lead times; but there are some things we can identify easily enough…

  1. We don’t treat each commit as a release. We do say that trunk should never be broken, but we’re not sure enough of our execution to actually tag each commit as a release and publish for consumption.
    1. Consider what we would need to solve to do this.
  2. We aren’t practicing CI. In particular:
    1. Merges (required to repair things that snuck in) often take much more than 10 minutes
    2. We’re not integrating the work-in-progress from developers early enough to avoid reintegration costs.
  3. We’re not practicing trunk based development: every outstanding patch chain is a branch, just in a different representation, and our branch lifetime clearly exceeds a day… and we have a large stabilisation period during the development cycle.
  4. Reviews – needs a deeper analysis to say if this is or isn’t a driver. I suspect it is, because nothing I hear or see shows this to have changed in any fundamental way.
  5. We don’t work in small batches: 6 month cycles is huge batches.
  6. We’re pretty poor at enabling team experimentation. I think this is due to layering: for example, we have N different API servers, so if one team wants to experiment, they create customer confusion due to yet-another-API idiom. If we had just one API server, changes to that would be happening from just one team, gaining much better integration and discussion characteristics. (For an example of having just one API server in a distributed system, consider k8s, which has just one primary API server – the kubelet API is not really customer facing.)
  7. We don’t manage work in progress well: this may not seem important, but its a Lean foundational practice. Think of it as a combination of not exceeding your bandwidth, and minimising context switches.

So what should we do to drive lead times down?

I propose setting a vision: 95% of patches that are either maintenance or part of an agreed current feature merge (or are completely rejected) the same day that they are uploaded to gerrit. (Patches that are for some completely random thing may obviously require considerable more effort to reason about).

Then work back from that: what do we need to have in place to do that safely.
Yes its hard. Thats more of a reason to do it.

Delivering that will require better safety ropes (e.g. clearer contracts for components, better linting (maybe mypy), more willingness to roll forward, consistent review latency (this is more about scheduling than how many reviews any one person does).

The benefits could be immense though: if OpenStack is a juggernaut today, consider what it could be if we could respond nimbly to new user demands.

Blast radius containment

So this is about things like making releases and deployments much more robust to mistakes. For instance, imagine if every server could run in a shadow mode – where it receives traffic, operates on it, but marks any external operations it does as not-real. Then if it blows up we can detect that without destablising a running version. (And the long running supported test cloud would give a perfect place to do this). So rollouts rather than being atomic, become a series of small steps. The simplest form is just taking a stateless scale-out service and running 2 builds in parallel. Thats better than a binary old/new. Canary builds, rolling upgrades similarly.

Now, since we defined ‘delivered’ as in a release, not ‘in use’, maybe we should ignore that operational blast radius and instead limit ourselves to the development side.

Even here is a lot more sophistication that we can add: consider that for libraries our ‘fleet’ is basically every developer. Pinning all those dependencies like we do is a good step. What if we actually could deliver updates to 1% of our devs, then 10%, then all?

So we could have a pipeline:

  1. Unit test a consumer, raise its version for 1% of consumers.
  2. Watch for failures, raise the % until 100%

This would require a metrics channel (opt-in!), and some way of signalling the versions to choose from to development environments.

We could use multiple branches as another mechanism: if everyone works off of trunk, we optimise trunk merges to be no more than (say) 20 minutes, and code self promotes to a tested branch, then release branch over a couple of hours. Failures would generate a proposed rollback straight into gerrit.

Wrapup

There’s a high cost of change in OpenStack – I don’t mean individual code changes, I mean changing e.g. policies, languages, architecture – lots of code, and thousands of affected people. A result of a high cost of change is a high risk of change: if a change makes things worse, it can take as long to back it out as it took to bring it in.

I’ll freely admit that I’m partly off in architecture-astronaut land here: there’s a huge gap of detail between what I’m describing and what would be needed to make it happen.

I have confidence in the community though, if we can just pull some vision together about what we want, we have the people and knowledge to execute on it.

Leave a comment