OpenStack and ease of development

In my last post, about cultural norms in OpenStack, I said that ease of development was a self-inflicted issue. This was somewhat contentious 🙂 and I’ve had some interest expressed in a deeper dive. In that post I articulated three cultural problems and two technical ones.

What does success for developers look like?

I think independent of the scope of OpenStack, the experience for developers should have roughly the same features:

  1. global reasoning for changes should rarely be needed (or put another way, the architecture should make it possible to think about changes without trying to consider all of OpenStack and still get high quality results; this helps new developers make good decisions)
  2. the component being worked on should build quickly (keep local development cycles brisk)
  3. have comprehensive local unit tests (keep local development effective; low rate of defects escaping to functional/integration tests)
  4. be able to utilise project resources to perform ad hoc exploration, integration, functional and scale tests (this allows developers to have sensibly sized development machines, while still ensuring what they build works in a system representative of what our users run).
  5. the lead time from getting the change finished locally to the developer no longer needing to shepherd the change through the system should be low (I won’t scare people by saying what I think it should be 🙂 – this keeps cognitive load on developers from becoming a burden)
  6. failures after review should be a) localised, b) rare enough that the overhead of corrective action is tolerable and c) recovered from within a small number of hours at most (this keeps the project as a whole healthy, and means individual developers will rarely be impacted by failures from other developers’ changes)

We already do ok on a number of these things: the above is not a gap analysis.

Sidebar – Accelerate

About now I feel I have to mention Accelerate, a book that is the result of detailed research into software delivery performance – and its follow-up report, the DORA 2018 State of DevOps report. The Puppet State of DevOps report is useful as well, though it focuses on different aspects – ones that are less generalisable to open source development in my view. Interestingly, it seems to have reached entirely different conclusions around team choice :).

The particularly interesting thing for me is that this is academic grade research, showing causation and tying that back to specific practices: this gives us a solid basis for planning changes, rather than speculation that something will work.

These reports and research are looking into software delivery – which for OpenStack spans organisations: we build, then users deploy. So it’s not entirely clear that the findings generalise, nor is it clear how one might implement all the predictive practices because of that.

For instance, while Continuous Integration is something we can imagine doing in OpenStack (sorry folks, pre-flight testing and CI are really very, very different things), Continuous Deployment would be a much more ambitious undertaking. Imagine it though: commit through to deployed on users’ clouds in a matter of hours. Wouldn’t that be something. Chrome and Firefox are two open source projects that have been evolving in this direction for some time, and we could well study them to learn what they have found to work and not work.

All that said, the construct – the metrics – that predicts software delivery performance comprises:

  1. Release frequency
  2. Mean time to recovery
  3. Lead time (commit to value consumable)

There’s a separate construct (the Westrum organisational culture construct) for culture, and they also measured the effect of e.g. implementing Continuous Delivery on those metrics.

I highly recommend reading the book – perhaps start with the 2018 report for a taste, but the book has much more detail.

Where are the gaps?

I haven’t looked particularly closely at the coupling in OpenStack recently, so for 1) I think folk actually landing changes should assess this. My sense is that we’re ok on this, but not great. In particular, any time there is a big cross-project effort – lots of involved commits, lots of sequencing – that’s something that needed global reasoning.

For 2), most of our stuff is in Python today, so build times aren’t a big issue.

For 3), we’re in pretty decent shape unit-test-wise, though they tend to be very slow (minutes or more to run), and I worry about skew between mocks and actual servers.

For 4) we do allow utilisation of project resources via gerrit pre-review tests and pre-merge tests, but there’s no provision for ad hoc utilisation (that I know of), and as I described in my last post, I think we could get a lot more leverage out of the cloud resources if we had the ability to wire components under test into an existing, scaled cloud.

For 5) I’d need to do some more detailed visualisation, or add a feature to stackalytics, but the sense from folk I speak to is that lead times are still enormous. I suspect there are two, or even three, distributions hiding in there (e.g. one for regular devs, and one for infrequent/new contributors) – but we can gather data on this. One important aspect is whether we should measure from ‘code committed (in dev branch) to merged to master’, or ‘code committed to delivered’. It’s my view that measuring to delivery is critical, if we truly want to be driving benefits to our users. There is a corner case where those two things converge – trunk-based development – but that is particularly challenging for open source projects. For instance, http://stackalytics.com/report/reviews/nova/open shows, under ‘Change requests waiting for reviewers since the last vote or mark’, an average age of 144 days, with a max age of 709 days: that’s nearly 2 years – 4 releases. That’s measuring time to git; if we measure time to delivered, then we need to add the time that changes sit in git before being included in a release – up to 6 months, though the ad hoc releases many projects are doing now are a great help. The stats shown aren’t particularly useful though – a) reviews that have already merged are not included, and b) there’s not enough information to start reasoning about why they have the age they do.
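
As a rough starting point for that data gathering, something like the following could pull recently merged changes from Gerrit’s REST API and summarise upload-to-merge lead times. This is a minimal sketch: the host, query and sample size are illustrative only, and it measures upload-to-merge rather than commit-to-delivered.

import json
from datetime import datetime

import requests

GERRIT = 'https://review.openstack.org'
QUERY = 'project:openstack/nova status:merged'

def parse_ts(ts):
    # Gerrit timestamps look like '2018-07-10 12:34:56.000000000'
    return datetime.strptime(ts[:19], '%Y-%m-%d %H:%M:%S')

resp = requests.get(GERRIT + '/changes/', params={'q': QUERY, 'n': 300})
# Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; strip that line.
changes = json.loads(resp.text.split('\n', 1)[1])

lead_days = sorted(
    (parse_ts(c['submitted']) - parse_ts(c['created'])).days
    for c in changes if 'submitted' in c)
print('median days from upload to merge:', lead_days[len(lead_days) // 2])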

For 6), at the moment recovery is burdened by the slow merging process – the minimum time to recovery is the sum of the unavoidable steps in the merge / delivery process. Failure frequency (things breaking after the merge completes / is released) is fairly low, but we’re not particularly good at blast radius management – the all-or-nothing nature of change rollout today means there is no mitigation when things go wrong.

So I think there are significant gaps, with room to improve, on three things:

  1. More efficient test/adhoc project resource utilisation
  2. Lead times
  3. Blast radius

Smarter testing

I covered this in my previous post in moderate detail, but it’s worth drilling in further at this point. I don’t think there is a silver bullet here; the machinery necessary to test a new database engine version with an existing cloud is very different in detail to that required to test a new nova-compute build. Let’s consider just being able to test a new nova-compute with an existing cloud. Essentially we want to wire in a new shard of nova-compute. Fortunately nova-compute is intrinsically sharded: that’s its very model of operation.

(Figure: blog-testing.png)

Though it’s not strictly relevant here, consider that other components (like the DB) have no sharding mechanism in place today, so wiring in a new shard for one of those would be “tricky”.

The details may have changed since I last dug deep, but from memory nova-compute needs access to the message bus to communicate with the rest of nova, access to glance and the swift (or other) store that images are in, and obviously nova-compute needs appropriate local resources to run whatever compute workload it is going to serve out.

So wiring that in from a test node to an existing cloud seems pretty simple. We probably don’t want the services listening unsecured on the internet, so we’ll need a credential distribution system (e.g. Vault), and automation to look those credentials up and wire in the nova-compute instance with them.
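
As a sketch of what that automation might look like – assuming a Vault server holding per-run credentials and the hvac client library; the secret path and config keys are hypothetical:

# Fetch message-bus and glance credentials from Vault and render a minimal
# nova-compute config for the test shard. The Vault path here is hypothetical.
import os

import hvac

client = hvac.Client(url='https://vault.example.net:8200',
                     token=os.environ['VAULT_TOKEN'])
secret = client.read('secret/openstack-test/nova-compute')['data']

nova_conf = """
[DEFAULT]
transport_url = rabbit://{mq_user}:{mq_pass}@{mq_host}:5672/
[glance]
api_servers = {glance_url}
""".format(**secret)

with open('/etc/nova/nova-compute.conf', 'w') as conf_file:
    conf_file.write(nova_conf)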

There may be trust issues: are all components equally privileged in the system? This also shows up as a bug risk – how much damage could a broken but not malicious nova-compute do?

Harder cases – DDL

One common harder case is DDL – schema changes at the DB layer. I don’t have a good canned answer here, but roughly speaking in the context of tests we need to be able to:

  1. Try applying the DDL across the whole DB
  2. Run the code that works with the DB with the modified schema
  3. Be able to do that for many different patches

Right now we have machinery to do 1) against a static copy of various clouds’ DBs. 2) and 3) are almost at cross purposes: it may be necessary to serialise those tests – they are fewer than other code changes. One possible implementation would be to use an expand-contract SQL server migration strategy: expand to a new server, run the DDL, verify the cloud metrics don’t regress, then migrate back using the source server’s schema (ignoring missing columns, because if they’ve been dropped in the new schema then code is already not querying them).
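
As a very rough orchestration sketch of that flow – every helper named here is hypothetical; the point is the shape of the steps rather than any real API:

# Hypothetical flow for testing a proposed schema change against a copy of a
# real cloud's data without touching the source server.
def test_ddl_change(patch, source_db, metrics):
    replica = expand_to_new_server(source_db)    # replicate data to a scratch server
    try:
        apply_ddl(replica, patch.migrations)     # run the proposed schema change
        run_services_against(replica, patch)     # exercise code using the new schema
        assert not metrics.regressed(baseline=source_db, candidate=replica)
    finally:
        # Contract: fall back to the source server's schema, ignoring columns that
        # only exist in the new schema (dropped columns are no longer queried).
        contract_back(replica, schema_of(source_db))
        teardown(replica)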

Another possibility, given that these changes are rarer, is not to optimise the testing of them.

Harder cases – exotic components

Power machines, ESXi hypervisors, and other not-generally-available hypervisors would all be good to expose to developers – making it possible for them to verify changes to the code that interacts with them in real time. Ideally with more access than the current hands-off, gerrit-test-job-only approach.

Lead times

Today, I’m going to treat ‘in a release’ as delivered. I’m picking this definition because:

  • We can choose to make more releases
  • We don’t need to build consensus or whole new delivery stacks to try and get customers upgraded
  • We can always come back and define ‘delivered’ with more scope later

Lean methodology provides a number of tools for analysing lead times – it has been used successfully in many organisations, and is sufficiently robust and consistent in its results that Accelerate even cites adopting lean management practices as being predictive of performance. And then there is the whole question of what ‘delivered’ means.

And yes, we are not a company, we are many volunteers, but that merely adds corner cases – most of our volunteers are given tasks to work on within OpenStack, and have the time to work with an effective SDLC and change management process.

As I mentioned above, without some more detailed modelling it’s hard to say for sure what leads to the high lead times; but there are some things we can identify easily enough…

  1. We don’t treat each commit as a release. We do say that trunk should never be broken, but we’re not sure enough of our execution to actually tag each commit as a release and publish for consumption.
    1. Consider what we would need to solve to do this.
  2. We aren’t practicing CI. In particular:
    1. Merges (required to repair things that snuck in) often take much more than 10 minutes
    2. We’re not integrating the work-in-progress from developers early enough to avoid reintegration costs.
  3. We’re not practicing trunk based development: every outstanding patch chain is a branch, just in a different representation, and our branch lifetime clearly exceeds a day… and we have a large stabilisation period during the development cycle.
  4. Reviews – needs a deeper analysis to say if this is or isn’t a driver. I suspect it is, because nothing I hear or see shows this to have changed in any fundamental way.
  5. We don’t work in small batches: 6-month cycles are huge batches.
  6. We’re pretty poor at enabling team experimentation. I think this is due to layering: for example, we have N different API servers, so if one team wants to experiment, they create customer confusion due to yet-another-API idiom. If we had just one API server, changes to that would be happening from just one team, gaining much better integration and discussion characteristics. (For an example of having just one API server in a distributed system, consider k8s, which has just one primary API server – the kubelet API is not really customer facing.)
  7. We don’t manage work in progress well: this may not seem important, but it’s a Lean foundational practice. Think of it as a combination of not exceeding your bandwidth, and minimising context switches.

So what should we do to drive lead times down?

I propose setting a vision: 95% of patches that are either maintenance or part of an agreed current feature merge (or are completely rejected) the same day that they are uploaded to gerrit. (Patches for some completely random thing may obviously require considerably more effort to reason about.)

Then work back from that: what do we need to have in place to do that safely?
Yes it’s hard. That’s more of a reason to do it.

Delivering that will require better safety ropes: e.g. clearer contracts for components, better linting (maybe mypy), more willingness to roll forward, and consistent review latency (this is more about scheduling than about how many reviews any one person does).

The benefits could be immense though: if OpenStack is a juggernaut today, consider what it could be if we could respond nimbly to new user demands.

Blast radius containment

So this is about things like making releases and deployments much more robust to mistakes. For instance, imagine if every server could run in a shadow mode – where it receives traffic and operates on it, but marks any external operations it does as not-real. Then if it blows up we can detect that without destabilising a running version. (And the long-running supported test cloud would give a perfect place to do this.) So rollouts, rather than being atomic, become a series of small steps. The simplest form is just taking a stateless scale-out service and running two builds in parallel. That’s better than a binary old/new. Canary builds and rolling upgrades similarly.
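
A tiny sketch of the shadow-mode idea – a real implementation would have to thread this through drivers and external clients, but the core is just tagging externally visible side effects as not-real:

# Sketch: a process-wide shadow flag lets a service receive and process real
# traffic while suppressing (but logging) externally visible side effects.
import logging
import os

log = logging.getLogger(__name__)
SHADOW = os.environ.get('SERVICE_SHADOW_MODE') == '1'

def external_effect(description, apply_fn):
    """Run apply_fn for real, unless this instance is running in shadow mode."""
    if SHADOW:
        log.info('shadow mode: would have %s', description)
        return None
    return apply_fn()

# e.g. in a driver:
#   external_effect('attached volume %s' % vol_id,
#                   lambda: cinder.attach(vol_id, server_id))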

Now, since we defined ‘delivered’ as in a release, not ‘in use’, maybe we should ignore that operational blast radius and instead limit ourselves to the development side.

Even here there is a lot more sophistication that we can add: consider that for libraries our ‘fleet’ is basically every developer. Pinning all those dependencies like we do is a good step. What if we could actually deliver updates to 1% of our devs, then 10%, then all?

So we could have a pipeline:

  1. Unit test a consumer, raise its version for 1% of consumers.
  2. Watch for failures, raise the % until 100%

This would require a metrics channel (opt-in!), and some way of signalling the versions to choose from to development environments.
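
The 1% / 10% / 100% selection itself needs no central coordination: hashing a stable per-developer identifier into a bucket lets each development environment decide independently. A minimal sketch – the signalling channel that publishes the rollout percentage is the part we would still need to build:

import hashlib

def in_rollout(dev_id, package, rollout_percent):
    """True if this developer's environment should pick up the new version."""
    digest = hashlib.sha256(('%s:%s' % (dev_id, package)).encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# in_rollout('dev@example.com', 'oslo.db', 1)   -> roughly 1% of devs opt in
# in_rollout('dev@example.com', 'oslo.db', 10)  -> roughly 10%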

We could use multiple branches as another mechanism: if everyone works off of trunk, we optimise trunk merges to be no more than (say) 20 minutes, and code self-promotes to a tested branch, then a release branch, over a couple of hours. Failures would generate a proposed rollback straight into gerrit.

Wrapup

There’s a high cost of change in OpenStack – I don’t mean individual code changes, I mean changing e.g. policies, languages, architecture – lots of code, and thousands of affected people. A result of a high cost of change is a high risk of change: if a change makes things worse, it can take as long to back it out as it took to bring it in.

I’ll freely admit that I’m partly off in architecture-astronaut land here: there’s a huge gap of detail between what I’m describing and what would be needed to make it happen.

I have confidence in the community though: if we can just pull together some vision of what we want, we have the people and knowledge to execute on it.


Is OpenStack’s mission broken?

tl;dr:

  1. Betteridge’s law applies.
  2. Ease of development is self-inflicted and not mission creep.
  3. Ease of use is self-inflicted and not mission creep.
  4. Ease of operations is self-inflicted and not mission creep.
  5. I have concrete suggestions for 2/3/4 but to avoid writing a whole book I’m just going to tackle (2) today.

Warning: this is a little ranty. It’s not aimed at any individual; it just crystallised out after a couple of years focused on different things, and was seeded by Jay Pipes when he recently put a strawman up about two related discussions that we haven’t really had as a community:

  1. What should the scope of OpenStack’s mission be?
  2. A technical proposal for ‘mulligan’, a narrowly defined new mission.

And yes, I know that OpenStack has incredible velocity. Just imagine what it could be if the issues I describe didn’t exist.

So is it the mission?

I think OpenStack has lots of “issues”, to use the technical term, across, well, everything. I don’t think the mission is even slightly related to the problems though.

The mission has ultimately just brought a huge number of folk together with the idea that they might produce a thing that can act like a cloud.

This has been done before, by organisations like AWS, Microsoft and Google, and smaller players like Digital Ocean and Rackspace (before OpenStack).

I reject the idea that having such a big, hairy, inclusive mission is a problem.

We can be more rigorous about that though: if a smaller mission would structurally prevent a given issue, then it’s the mission that is the problem. Otherwise, it’s not.

I do think the mission is somewhat ridiculous, but there’s a phrase in some companies: a company’s mission defines what it doesn’t do, not what it does.

And I think the current OpenStack mission does that quite well: there are two basic filters that can be applied, and unless at least one matches, it’s out of scope for OpenStack.

  • Can you get $thing from a Public Cloud?
  • Do you uniquely need $thing to run a Cloud?

And yes, there are a billion things in the grey cloud around the edge.

Know what else has this problem? Linux. Well over ~3/5ths of its code is in that grey edge: 170M of core, 130M of architectures, 530M in drivers. x86 + ARM is 50M of that 130M of architectures.

Linux’s response has been dramatically different to ours though. They have a single conceptual project being built, with enormous configurability in how it’s deployed. We’ve decided that we’re building a billion different things under the same umbrella, and that comes down to a cultural norm.

Cultural norms and silos

Concretely, Swift and Nova – the two original projects – have never conceptually regarded themselves as one project.

Should they?

I honestly don’t know :). But by not being one project (with enormous configurability in how it’s deployed), we set a cultural expectation in OpenStack that variation in workload implied a new project and a new codebase.

Every split out takes years to accomplish – both the literal ones like Glance, and the moral ones like Neutron.

The lines for the split-outs are drawn inconsistently.

To illustrate this, ask yourself: what manages a node in an OpenStack cloud? What’s the component that is responsible for working with the machine’s actual resources, reporting usage, reporting back to service discovery, health checks, liveness etc.?

In a clean slate architecture you might design a single agent, and then make it extensible/modular. OpenStack has many separate agents, one per siloed team.

Similarly for the scheduling problem for net/disk/compute: there is an enormous vertical stack of cloud APIs that can be built on a solid base, many of which OpenStack has in its portfolio. But that stack is not being built on a common scheduler – and can’t be, because the cultural norm is to split things out, not to actually figure out how to maintain things more effectively without moving the code around.

Some things really are better off as separate projects – and I’m not talking monorepo vs repo-per-project; that’s really only about the ability to do some changes atomically. A reusable library like oslo.config is only reusable by being a separate project. oslo.db, though, exists solely because we have many separate projects that all look like ‘REST on one side, database on the other’. That is a concrete problem: high deployment overheads, redundant information in some places, inappropriate transaction boundaries in others. The objects work – passing structured objects around and centralising the DB access – makes things a lot better, but it’s broken into vertical silos much too early.

Our domain specific services include huge amounts of generic, common problem space code: persistence, placement, access control…

Cultural norms and agility

Back in the dawn of OpenStack, there were some very very strong personalities. Codebases got totally overhauled and replaced without code review. Distrust got baked in as another cultural norm. Code review became a control point. It’s extraordinarily common to spend weeks or months getting patches through.

In some of the most effective teams I’ve worked in code review is optional. Trust and iterate is the norm there: bypassing code review is a thing that needs to be justified, but code review is not how quality is delivered. Quality is delivered by continual improvement, rather than by the quality of any one individual commit.

A related thing is being super risk-averse around what lands in master (more on that below). Some very, very, very clever folk have written very clever code to facilitate this combination of siloed projects + trying super hard not to let regressions into master. This is very hard to deliver – and in fact we stepped back from an absolute approach there, about 4 years ago, to a model where we try very hard to prevent regressions just within a small set of connected projects.

OpenStack has a deeply split personality. Many folk want to build a downloadable cloud construction kit (e.g. Ubuntu). Many more want to build a downloadable cloud product (direct release users). And many wanted (are there still public clouds running master?) to be able to use master directly with confidence. This last use case is a major driver for wanting master to be regression free…

Agility requires the ability to react to new information in a short timeframe. Doing CD (continuous deployment) requires a pipeline that starts with code review and ends with deployed code. OpenStack doesn’t do that. There’s a huge discontinuity between upstream and actual deployments, and effectively none of the developers of any part of OpenStack upstream are doing operations day to day. Those that do – at Rackspace, previously at HP (where I was working when I was full time on OpenStack), and I’m going to presume at OVH and other public clouds – are having to separate out their operations work from their upstream changes.

Every initiative in a project will miss some details that have to be figured out later – that’s the nature of all but the most exacting software development processes, and those processes are hugely expensive (formal methods, just to start with). OpenStack copes with that by running huge planning cycles – 3-6 months apart.

Commits-as-control-points + long planning cycles + many developers not operating what they build => reaction to new information happens at a glacial scale.

To illustrate this, consider request tracing. 8 years ago Google released the Dapper whitepaper, Twitter wrote Zipkin and open sourced it, and we’re now at the point where distributed tracing is de rigueur – it’s one of the standard things a service operator will expect from any system. We spent years dealing with pushback from developers in service teams that didn’t understand the benefits of the proposed analogous system for OpenStack. Rackspace wrote their own and patched it in as part of their productionisation of master. Then we also got to have a debate about whether OpenStack should have one such system, or a plugin interface to allow Rackspace to not change. [Sidebar: Rackers, I love you and ❤ your company, but that drove me up the wall! I wish we’d managed to just join forces and get everyone to at least bring a damn tracing interface in for everything.]

Test reliability

With TripleO we had the idea that we’d run a cloud based on master, provide feedback on what didn’t work, and create a virtuous circle. I think that was ultimately flawed, because the existing silos (e.g. of Nova, or Glance) were not extended into owning those components within TripleO: TripleO was just another deployer, rather than part of the core feedback cycle.

More generally, we had a team of people (TripleO) running other people’s code (all of OpenStack – and commit rights were hard to get in other projects) with no SLA around that code.

I didn’t think of it this way at the time, for all that we understood that that was what we were doing, but that structure is actually structurally fragile: it’s the very antithesis of agile. When something broke it could stay broken for weeks, simply because the folk responsible for the break were not accountable for the non-brokenness of the system. (I’m not whinging about the teams we worked with – people did care, but caring and being accountable are fundamentally different things.)

There is another place with that pattern: devstack. Devstack is a code base that exists to deploy all the other OpenStack components. It’s the purest essence of ‘run other people’s code with no SLA’, and devstack is the engine for pre-merge and pre-review testing in OpenStack.

I now believe that to be a key problem for OpenStack. Monty loves to talk about how many clouds OpenStack deploys daily in testing. Every one of those test runs brings up from scratch some number of components (typically the dependency graph of the service under test) which have not changed and are not written by the author – and then, of course, the actual service being tested.

That’s structurally fragile: it’s running 5 or 10 times as much code as is relevant to the test being conducted. And the people able to fix any problems in those dependencies don’t feel the friction at the same time, or in the same way, as their users do. (This isn’t a critique of the people, it’s just maths.)

I’ll probably write more about this in detail later, as it ties into a larger discussion about testing and deployment of microservices, or testing in production. But imagine if we got rid of devstack for review and merge testing. It has several other use cases of course – ‘give me an OpenStack to hack on’ is an important, discrete test case, and folk probably care that that works. For simplicity I’m going to ignore that for now.

So, if we don’t use devstack, how do we deploy a cloud for pre-merge testing?

We don’t. We don’t need to. What we need to do is deploy the changed code into a cloud whose other components are expected to be compatible with that code. Devstack did this by taking a given branch of a bunch of components and bringing them up from scratch. Instead, we run a production-grade, monitored and alerted deployment of all the components. Possibly we run many such deployments, for configurations that cannot coexist (e.g. different federation modes in keystone?). The people answering the pages for those alerts could be the service developers, or it could be an operations team with escalation back to the developers as needed (to filter noise like ‘oh, cloud $X has just had an outage’). But ultimately the developers would be directly accountable in some realtime fashion.

Then the test workflow becomes:

  1. Build the code under test. (e.g. clean VM, pip install, whatever)
  2. Deploy that code into the existing cluster as a new shard
  3. Exercise it as desired
  4. Tear it down

Let’s use nova-compute as an example; a rough sketch of what this could look like follows the steps below.

  1. pip install
  2. Run nova-compute reporting to an existing API server with some custom label on the hypervisor to allow targeting workloads to it
  3. Deploy a VM targeted at it
  4. Tear it down
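
Step 3 could be as small as the following openstacksdk sketch; the cloud name, image/flavor/network IDs and the host-targeting spelling are all placeholders or assumptions rather than a worked-out recipe:

# Boot a VM pinned at the freshly wired-in nova-compute, check it goes ACTIVE,
# then clean up. Assumes a clouds.yaml entry named 'test-cloud' (placeholder).
import openstack

conn = openstack.connect(cloud='test-cloud')
server = conn.compute.create_server(
    name='compute-under-test-check',
    image_id='IMAGE_UUID', flavor_id='FLAVOR_UUID',       # placeholders
    networks=[{'uuid': 'NETWORK_UUID'}],                  # placeholder
    # Admin-only targeting of a specific host; the exact attribute spelling is
    # an assumption – a scheduler hint keyed on the custom label would also work.
    availability_zone='nova:compute-under-test')
server = conn.compute.wait_for_server(server)
print(server.status)
conn.compute.delete_server(server, ignore_missing=True)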

I’m sure this raises lots of omg-we-can’t-do-that-because-technical-reason-X-about-what-we-do-today.

That’s fine, but for the purposes of this discussion, consider the destination – not the path.

If we did this:

  • Individual test runs could use substantially less resources
  • And perform substantially less work
  • Which implies better performance
  • Failures due to other components than the service under test would be a thing of the past (when you’re on the hook for your service running reliably, you engineer it to do that)

I think this post is long enough, so let me recap briefly. If there is interest out there I can drill into what sort of changes would be needed to transition to such a system, the suggestions I have for ease of use and ease of operations, and I think I’m also ready to provide some discussion about what the architecture of OpenStack should be.

Recap: why is development hard

Cultural problem #1: silos rather than collaboration in place. Moving the code rather than working with others.

Cultural problem #2: excessive entry controls. Making each commit right rather than trending upwards with a low-latency, high change rate.

Cultural problem #3: developer feedback cycle is measured in weeks (optimistically), or years (realistically).

Technical problem #1: excessive code executed in tests: 80% of test activity is not testing the code under test.

Technical problem #2: our testing is optimised for new-cloud deployments: as our userbase grows upgrades become the common use case and testing should match that.

OpenStack Mitaka debrief

Well, last week was the 6-monthly OpenStack summit in Tokyo. It was fantastic to catch up with many folk, but with 5000 attendees, there are many more that I didn’t see than those that I did. Yet I find the sheer volume of face-to-face stuff nearly overwhelming. I wish it was quite a bit longer and less intense.

Over the next cycle I’ve committed to a few things…

  1. Kicking off TC leadership of scaling for OpenStack. That is, sparking the conversation with the broader community about what scaling means for us, and ensuring each project is paying some attention to it – in the same way that each project already pays attention to e.g. backwards compatibility – they can choose how much, and implementation and so on, but the basic user expectations and framework for thinking about it are shared across OpenStack. The performance working group is certainly related to this but scaling is different to performance.
  2. Replacing the oslo incubator process with one that creates the package straight away. This will go up as a spec for approval of course. The crux of the issue will be finding a way to preserve the freedom of early refactorings without API commitments, without breaking everything. The current approach in my head is to use versioned submodules within the package during the pre-1.0.0 phase, and liberally copy-paste things when API breaks are needed.
  3. Helping the app catalog folk a little bit by doing a review of their review guidelines – looking specifically for gaps (e.g. like the currently unsecured http attack vector).
  4. Start a broad discussion over changing the way we use minimum versions of requirements. Today we raise the minimum version of most requirements quite eagerly. Yet for some like libvirt we instead use feature detection and degrade gracefully when non-latest versions are installed. It seems likely that it would increase compatibility with distributions if we took that approach more widely, but we’d need some care to think through the ramifications.
  5. Kicking off a discussion about leadership training for TC & PTL members. We vote folk into these roles, but leading isn’t an innate skill. With our constituency of over two thousand developers, spending some money on good leadership training seems like a sound investment. If the TC agrees that it’s a good idea, my plan is to seek funding from the Board, and aim to make the training be a pre-summit event. This was suggested to me by Colette Alexander.
  6. Seek some more eyeballs on the oslo.messaging Kafka driver spec from the HP folk that have been working with Kafka.
  7. Establish connections between Yahoo & HP’s iLO team – they’re seeing the same sort of lockups we did with IPMI on the TripleO test cloud (and the infra-cloud folk are still seeing that) – so I want to see if we can get the bug fixed for everyone.
  8. Work up a clear spec on refactoring the testrepository and subunit2sql layers so that we have all the data store backends in one common repository, an HTTP REST API for consumers like openstack-health, and still have a good experience for CLI users.
  9. Lastly, but not least, work up a formal stabilisation cycle proposal to try and give everyone (product working group, users, core developers) what they want which we seem deadlocked on not doing today. The basic thing to me seems to be fear of the consequences of saying no to feature patches – for pretty good reason; many developers have their income directly tied to achieving things upstream, and when upstream says no, the ensuing discussion is fraught (and there is often information asymmetry present). What we probably need to do is find some balance point – and then socialise the plan very broadly – including the Board, so they can encourage member companies to look after their developers properly.

If any of these things are of interest to you, please feel free to reach out to me :).

Testrepository roadmap 2015/16

Testrepository has been moderately successful – it’s very good at some of the things it aspired to (e.g. debugging sporadic test failures in parallel environments), but other angles have not really been explored.

I’ve set some time aside to correct this, in large part to facilitate some important features for tempest (which has its concurrency currently built on the meta-runner included in testrepository – and I’d like to enable the tempest authors to avoid having to write gnarly concurrency code :))

So my plan is to tackle a few things in the lead up to, and perhaps just after the Tokyo OpenStack summit. I wanted to socialise the proposed changes though, and thus this blog post.

Profiles

Firstly, a long standing issue is that when one tests several different configurations, testrepository is poor at reporting failures that are configuration specific. For instance, imagine that your test suite is run with both Python 2.7 and 3.4, and both results are loaded into your repository. If a given test ‘X’ fails in the first run, and not the second… after the second run is loaded, it will be reported as ‘passing’.

My proposed fix for this is to call the name of each such run a ‘profile’ and use tags to differentiate between the two runs. So you’d tag the 2.7 run perhaps ‘py27’ and the second ‘py34’, and then tell testrepository that the ‘py27’ and ‘py34’ tags are being used to identify profiles. After that, testrepository will only consider two test results to apply to the same test if the tags match. Tags that are not specified as being for profiles (e.g. the worker-N tags that the testrepository runner adds to track the backends that tests run in) won’t be considered in that comparison. This will then allow testrepository to track that each run was separate and that the results are not meant to replace each other. The use of tags allows for test matrices too, in principle – consider Python version as one dimension, operating system version as another, and database engine as a third – it would be up to the user. I don’t plan to directly implement a matrix system in the first iteration. A different, more dynamic model is in principle possible: don’t tag things, just log events that will give clues and correlate later – that’s not precluded by this tag-based approach, and we can always add such a thing later.

The output for queries of the datastore needs to be updated though – we don’t currently report tags in e.g. ‘testr failing --list’. This is a little tricky: the listing format is intended to be a mix of nice-for-humans and machine consumption. Another approach we considered was to namespace the tests with the profile. This has a couple of disadvantages: it may break an unknown number of deployments if the chosen separator is already in use by people, and secondly, it mixes structured and free-form data in a lossy way. One example of that would be that we’d start interpreting all test ids to see whether they are – or are not – namespaced with a profile: that’s likely to be fragile, at best. On the other hand it would very easily fit into the list format – which is why it was appealing. On balance though, the fragility and conflation would just add technical debt. Instead, we’ll do the following:

  1. Anything that needs to output a flat list of tests will output that for just one profile. An option will be added to allow querying the profiles for which results might be given. The default will start erroring with a list of available profiles if more than one profile has been specified.
  2. We’ll define a minimal JSON schema for reporting multiple profiles in such places. The excellent jq tool can be used to manipulate that in shell command lines. A command line option will opt into receiving this.

Testrepository has two very related programs inside itself. There is the data store and the various queries it can do – e.g. ‘testr load’ and ‘testr failing’. Then there is the meta-runner, which knows how to run some test processes to execute tests. While strictly speaking this is optional, it’s been very convenient for working with Python tests to have the meta-runner connected to testr and able to do in-process querying.

The meta-runner will benefit from being updated as well. My intent is to make it capable of running all the tests from all the profiles the user specifies, storing that as one single run in the datastore. Two commands in particular need to change here – `testr list-tests` needs to change in line with the test listing above, and `testr run --load-list` needs to be taught how to deal with multiple profiles. I plan to add a command line option to tell it that JSON is being used, and to select tests across all profiles when a simple list or a test regex is given. Finally, the command line can benefit from an option to select one or more profiles.

Scheduling

The meta-runner has a crude scheduler – it balances based on historic performance prior to running any backend. An online scheduler will give much greater performance in both unseeded and skewed-data cases – e.g. if many long tests fail due to a bug, the run after that will often have some workers finishing well before others, leading to slow overall test times.

The plan here is to finish the implementation of bidirectional channels to test backends, and then dispatch work to them incrementally.
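
The difference between the current partitioner and an online scheduler is small in code but large in effect: instead of dividing the whole run up front from historic timings, keep every backend primed and hand out the next test whenever one actually finishes. A toy sketch – the backend objects and completion queue are stand-ins for the bidirectional channel work:

# Toy online dispatch loop: backends report themselves onto `completions`
# (e.g. a queue.Queue) when idle, and immediately receive the next test.
def online_schedule(tests, backends, completions):
    pending = sorted(tests, key=lambda t: -t.expected_runtime)  # longest first
    for backend in backends:            # prime every backend with one test
        if pending:
            backend.dispatch(pending.pop(0))
    while pending:
        backend = completions.get()     # blocks until some backend frees up
        backend.dispatch(pending.pop(0))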

Concurrency plans

Tempest wants to be able to run some tests completely independently, and then others can run together arbitrarily. To facilitate this, the online scheduler will be extended to permit describing an overall plan to run through – e.g. a list of segments, where each segment describes one or more tests that can be run together. The UI to supply that to the scheduler will probably start out as a JSON file listing exact test ids, and we can iterate from there based on experience.

Revisiting the Fixture API – handling leaky resources

Fixtures are one of the innovations I’m most happy with.

A Fixture is an enhanced context manager. The enhancements are:

  • There’s an API for gathering debugging information from the fixture (rather than depending on side effects such as the logging module or stdout). This makes it easy to attach log files from servers (for instance rabbitfixture does this).
  • There is glue to support composing other fixtures while still exposing errors from any fixture in the composed set.

OpenStack’s Neutron has been using fixtures in its test suite for some time, but is finding that writing correct fixtures is hard. In particular, they were leaking processes when a fixture would fail during setUp / __enter__ – and then not be cleaned up by the testtools / fixtures useFixture function.

There are several things we can do to improve the situation.

  • We could make the convenience APIs like useFixture add a try:/finally: and call cleanUp() when setUp fails. This involves making cleanUp() be callable in more situations than it is today.
  • We could make setUp itself do that, advising users to override a different function; this would hide the failure interactions internally, but wouldn’t benefit existing fixtures until they are rewritten to not override setUp.
  • We could provide a decorator that folk with fragile setUps (e.g. those that involve IO) could use to robustify their fixtures.

The highest leverage change is the first, but is it safe and suitable? Let’s look at PEP-343.

In PEP-343 we see the following translation of with expressions:

with EXPR as VAR:
    BLOCK

…which PEP-343 translates into:

mgr = (EXPR)
exit = type(mgr).__exit__  # not called yet
value = type(mgr).__enter__(mgr)
exc = True
try:
    try:
        VAR = value  # only if "as VAR" is present
        BLOCK
    except:
        # the exceptional case
        exc = False
        if not exit(mgr, *sys.exc_info()):
            raise
finally:
    # the normal and non-exceptional exits
    if exc:
        exit(mgr, None, None, None)

Note that __exit__ is only ever called once __enter__ has returned successfully – if __enter__ (our setUp) raises, nothing cleans up. This means that using a Fixture which may leak external resources when setUp fails is unsafe via with. Therefore we can’t use the first solution.

Decorators are nice, but somewhat noisy and opt-in. Both decorators and a different setUp in the base class will require extending the protocol to specify when cleanUp can be called more precisely.

If we make the documentation advise users to override a specific method, and have setUp do this in the event of failure, I think we’ll have somewhat more uptake. So – that’s the route I’m going to head down.
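
A sketch of that shape – roughly what the base class could do, with the user-facing override spelled here as _setUp; the names and error-handling details are illustrative rather than the final API:

# The base class owns setUp(), wraps the subclass's _setUp(), and guarantees
# cleanUp() runs if _setUp() raises, so partially acquired resources can't leak.
class Fixture(object):

    def setUp(self):
        self._cleanups = []
        try:
            self._setUp()
        except Exception:
            self.cleanUp()    # release anything already acquired
            raise             # callers still see the original failure

    def _setUp(self):
        """Template method: subclasses do their actual setup work here."""

    def addCleanup(self, cleanup, *args, **kwargs):
        self._cleanups.append((cleanup, args, kwargs))

    def cleanUp(self):
        while self._cleanups:
            cleanup, args, kwargs = self._cleanups.pop()
            cleanup(*args, **kwargs)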

There’s one more thing to consider, which is access to debugging information for failures in setUp. Since the object will have been cleaned up, accessing logs etc. will be hard. I think if we raise an additional exception into the MultiException with the details objects, it will be possible for fixtures to provide those details, though they will need buffering in memory (or some sophisticated lazy-delete logic, such as holding a reference to an unlinked fd).

Improving dependency handling upstream (for openstack)

This is, in part, a follow up to my post a few weeks ago.

I want to touch on the things we need to improve to have robust plumbing supporting OpenStack’s CI and devstack needs.

Extras

We want to be able to use ‘extras’ to declare the dependencies needed for different backends. This is a setuptools requirements syntax where a project can advertise additional dependencies for different use cases, which users (or other depending projects) can then trigger using '[]'. E.g. 'pip install requests[security]' says ‘install requests and the additional ‘security’ extras’. We don’t know yet whether we will use 'nova[mysql]' or 'nova oslo.db[mysql]', but something like that. To use this we need to:

  1. teach pbr about reflecting requirements into the 'extras_require' keyword to setup (because while setuptools supports it in setup.py, we want a constant value setup.py with everything about individual projects declarative).  James Polley has a patch for pbr.
  2. Fix pip to handle 'pip install ./nova[mysql]'. This is issue 1236 Рwhich has an open PR that may fix it. We should help review and test it.

Testing different setups may well need a similar facility, but it’s not clear yet how to best express that. We may need to standardise on using an extra called 'test' and just ensure our tox.ini knows to install that. That would be nice anyway, to get away from having to know about 'test_requirements.txt'.
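
For illustration, here is the plain setuptools spelling of backend-specific extras; the extra names and dependency lists are made up, and with the pbr patch this would be expressed declaratively instead:

from setuptools import setup

setup(
    name='nova',
    install_requires=['oslo.config', 'oslo.db'],
    extras_require={
        'mysql': ['PyMySQL'],
        'postgresql': ['psycopg2'],
        'test': ['testtools', 'testrepository'],
    },
)

# Consumers then opt in with e.g.:
#   pip install 'nova[mysql]'
#   pip install ./nova[mysql]   (needs the pip fix for local paths + extras)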

pip dependency resolution

Currently pip has a very straightforward resolution algorithm: only user-supplied requirements can conflict at all, and the first mention of any distribution causes a version to be selected that matches that mention – all other mentions are simply ignored. This is issue 988, and it’s one of a cluster that affect OpenStack. The impact on OpenStack is that we have things install ok with pip, and then break in CI, because an incompatible version is installed. I have a patch up for this. Early adopters solicited!

incremental installations need dependency resolution

Say you’ve installed Neutron, which depends on oslo.db >=1.10. And you then install an older Nova which depends on oslo.db <1.10. What should happen? Ideally an error in this case, because the requirements are disjoint. And if they do overlap, the installed version should be adjusted to be compatible. Right now, no error occurs and oslo.db will be downgraded, breaking Neutron. This is pip issue 2687. Currently no-one is working on this, and since it requires dependency resolution, fixing 988 first makes a lot of sense. It should be possible to at least make things error with a much shallower patch though, if someone wished to work on it right now – or you could build on top of my resolver branch. This has also been a cause of numerous CI failures when we do releases, typically right around the time the servers branch. One thing that might be nice for us, since we know a full set of working packages, is to be able to say upfront to pip what versions are compatible, and then let only the needed things be brought in. pip issue 2731.

PEP-426 environment markers need polish

PEP-426 introduced a micro-language for describing the situations when a particular dependency applies. For instance, to use argparse on Python < 2.7, you can say "python_version<'2.7'" as a marker for the argparse entry in your requirements. But there are some rough edges.

  • Some comparison operators are missing.
  • The documentation and user guidance needs improvement.
  • Environment markers can’t be used inside individual requirements, only as a filter on extras_require. To express the argparse example above today (using a working operator), you need to pass the following to setup().
    extras_require={':python_version=="2.6"': ['argparse']}

    It would be more straightforward to permit the syntax pip supports, where each requirement can be annotated with a marker.

    install_requires=['argparse:python_version=="2.6"']

    This might be setuptools issue 353.

  • pbr doesn’t reflect environment markers from its input files (requirements.txt etc.) into the setup keyword arguments. James Polley has a patch for this (the same one enabling extras support in setup.cfg).

pip handling setup_requires

We run into setup_requires in two places in OpenStack; firstly we use it ourselves for pbr, but to avoid triggering easy_install we manually install pbr everywhere ourselves. Secondly, projects that are in the transitive dependencies of OpenStack use setup_requires, and we end up triggering easy_install for them. easy_install is a concern for us because of decreased reliability and issues with corporate egress firewalls, and its security is not as robust as pip’s – and there’s no reason it should be, with pip being such a good tool.

However pip can’t handle setup_requires today. Doing so requires changes to setuptools and to pip.

  • setuptools needs some way to report to pip what the setup_requires are without triggering easy_install. Ronny Pfannschmidt has mentioned he may be working on this, but I’m not sure if there is a patch ready or not. A possible further enhancement would be to put the setup_requires in setup.cfg in a totally declarative fashion, but this may require environment marker support first, since the current procedural approach is very flexible and can take Python version and platform into account.
  • pip needs to be able to temporarily put things that it won’t be installing onto the PYTHONPATH for packages it is building. The current internals are not suited for this (the target and source and needs of requirements being downloaded are all confounded). However, once my resolver patch lands there will be a nice cache layer that can deliver a ready-to-install directory for any requirement, which should make a simple recursive implementation quite reasonable. The resolver work will probably need further refactoring to decouple the resolver from the user-supplied requirements, but compared to the ground already covered, that should be straightforward. One thing folk tackling this should be aware of is an open question around location requirements. Say someone is installing foo from a git repository. And foo is also a setup requirement of some other package bar being installed at the same time. Should that foo from git be used for the setup of bar? I’m not sure of the answer (what if the version of foo is incompatible with the version bar needs?) – but one is needed :).

So that’s about it – if you’re interested in helping the plumbing that supports OpenStack’s CI and devstack systems, please pick one of these issues and help out. Test patches, review code, write a patch, or just tell me why we don’t need to do something 🙂

Dealing with deps in OpenStack

We’ve got a problem in OpenStack.. dependency management.

In this post I explore it as input to the design summit session on this in Vancouver.

Goals

We have some goals that are broadly agreed:

  1. Guarantee co-installability of a single release of OpenStack
  2. Be able to deliver known-good installs of OpenStack at any point in time – e.g. ‘this is known to work’
  3. Deliver good, clear dependency metadata to redistributors
  4. Support CD deployments of OpenStack from git. Both production and devstack for developers to hack on/with
  5. Avoid firedrills in CI – both internal situations where we run incompatible things we produced, and external situations where some dependency releases a broken version, like the pycparsing one last week
  6. Deployments using the Python dependencies should be up to date and secure
  7. Support doing upgrades in the same Python environment

Assumptions

And we have some baseline assumptions:

  1. We cooperate with the Python ecosystem – publishing our libraries to PyPI for instance
  2. Every commit of server projects is a ‘release’ from the perspective of e.g. schema management
  3. Other things release when they release, not per-commit

The current approach uses a single global list of acceptable install requirements for all our projects, and then merges that into the git trees being tested during the test. Note in particular that this doesn’t take place for things not being tested, which we install from PyPI. We create a branch of that global list for each stable release, and we also create branches of nearly everything when we do the stable release – a system that has evolved in part due to the issues in CI when new releases would break stable releases. These new branches have tightly defined constraints – e.g. “DEP >= version-at-this-release, < next-point-release”. The idea behind this is that if the transitive closure of deps is constrained, we can install such a version from PyPI, and it won’t bring in a different version. One of the reasons we needed that was pip bug 988, where pip takes the first occurrence of a dependency, and so servers would depend on oslo.utils, which would depend on an unversioned cliff or some such, and if cliff wasn’t already installed we’d get the latest release of cliff. Now – semver says we’re keeping those things compatible, but mistakes happen, and for stable branches there’s really little reason to upgrade.

Issues

We have some practical issues with the current system:

  1. Just one dependency uncapped anywhere in the wider ecosystem (including packages outside of OpenStack) that depends on a dependency that we wanted to stay unchanged, and if that dep is encountered first by the pip scanner… game over. Worse, there are components out there that introspect the installed dependencies and fail hard if one is not listed as compatible, which takes a ‘testing with an unexpected version’ situation and makes it a hard error
  2. We have to run stable branches for everything, even things like OpenStackClient which are intended for end users, and are aimed at a semver rather than branched release model
  3. Due to PIP bug 2687 each time we call pip may introduce the skew that breaks the gate
  4. We don’t deliver goal 1: because we override the requirements at test time, the actual co-installability may be different, and we don’t know
  5. We deliver goal 2, but it’s hard to use: you have to dig through a specific CI log, and if the CI system has pruned it, you’re toast
  6. We don’t avoid external firedrills: because most of our external dependencies are broad, external releases break us trivially and frequently
  7. Lastly, our requirements are too tight to support upgrades: if bug 2687 was fixed, installing the first upgraded server component would error because its requirements are declared as being incompatible with the last release.

We do deliver goals 3, 4 and 6 though, which is good.

So what can we do differently? In an ideal world, can we get all 7 goals?

Proposal

I think we can. Here’s¬†one way it could work:

  1. We fix the two pip bugs above (I’m working on that now)
  2. We teach pip about constraints that apply *if* something is requested, without actually requesting it
  3. We change our project overrides in CI to use a single constraints file rather than merging into each project’s requirements
  4. The single constraints file would be exactly specified: “DEP == VERSION”, not semver or compatible matched.
  5. We make changes to the single constraints file by running a proposed set of constraints
  6. We find out that we should change the constraints file by having a periodic task which compares the constraints file to the published versions on PyPI and proposes changes to the constraints repository automatically
  7. We loosen up the constraints in all our release branches to permit upgrade co-installability

And some optional bits…

  1. We could start testing new-library old-servers again
  2. We could potentially change our branching strategy for non-server components, but I don’t think it harms things – it may just be unnecessary
  3. We could add periodic jobs for testing with unreleased versions of dependencies

Working through each point. Bug 988 causes compatible requirements to be ignored – if we have one constraint of “X > 1.4” and another of “X > 1.3, !=1.5.1” but the “> 1.4” constraint is encountered first, we can end up with 1.5.1 installed, even though it is explicitly excluded. Fixing this means that rather than having to have global knowledge of deps at the point where pip is being entered, we can have local knowledge about compatible versions in each package, and as long as the union of requirements is satisfiable, we’ll be ok. Bug 2687 causes the constraints that thing A had when it was installed by pip to be ignored by the requirements checking for thing B. For instance, pip install python-openstackclient after pip install nova will meet python-openstackclient’s requirements, even if that means breaking nova’s requirements.

The reason we can’t just use a requirements file today is that a requirements file specifies what needs to be installed as well as what versions are acceptable. We don’t want devstack, when configured for nova-network, to install neutron dependencies. But it would today, unless we put in place a bunch of complex processing logic. Whereas pip could do this very easily internally.
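
Concretely, the split might look like this – the versions are invented for the example, and the -c option is the sort of interface the proposed pip change would provide:

# upper-constraints.txt: one exact pin for everything we might install,
# maintained centrally and only applied to things actually being installed.
oslo.db==1.10.0
oslo.utils==1.6.0
cliff==1.12.0

# nova/requirements.txt: local knowledge only, broad enough to stay honest.
oslo.db>=1.7.0

# and the install step becomes:
#   pip install -c upper-constraints.txt -r requirements.txt ./nova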

Merging each requirement into the things we’re installing from git fails when we install releases – e.g. of client libraries – in particular because of the interactions with bug 988 above. A single constraints file could include all known-good versions of everything we might use, and would apply globally in concert with local project requirements. Best of both worlds, in theory 🙂

The use of inexact versions is a hard limitation today – we can’t upgrade multiple project trees’ local version needs atomically, and because we’re supplying all the version constraints in one place – the projects’ merged install requirements – they have to be broad enough to co-exist during changes to the requirements, and to remain co-installed during upgrades from release to release of OpenStack. But inexact versions lead to variation in CI – every single run becomes a gamble. The primary goal of CI is to tell us whether a new commit X meets all of our quality criteria – change one thing at a time. Running with every new version of every dependency doesn’t tell us more about X, it tells us about ecosystem things. Using exact constraints will solve this: we’ll decouple ‘update dependencies’ or ‘pycparsing Y is broken’ from testing X – e.g. ‘improve nova cells’.

We need to be able to update those dependencies though, and the existing global requirements mechanisms are pretty much right, they just need to work with a constraints file instead of patching each repo at test time. We will still want to check that the local requirements are compatible with the global constraints file.

One of the big holes such approaches have is that we may miss out on important improvements – security, performance or just plain old features – if we don’t update our constraints. So we need to be on top of that. A small amount of automation can give us a lot of assistance here. Just try the new versions and if they work – great. If they don’t, show a failing proposal where we can assess what to do.

As I mentioned earlier, today we can’t actually upgrade: kilo’s version locks exclude liberty versions of our libraries, meaning that trying to upgrade nova/kilo to nova/liberty will bring in library versions that conflict with the version deps neutron expresses. We need to open up the project-local requirements to avoid this – and we also need to make some guarantees about compatibility with our prior release in our library development (otherwise rebooting a server with only one component upgraded will be a gamble).

Making those guarantees will either require testing every commit against the prior server, or, if we can find some way of doing it, testing proposed releases against the prior servers – which would allow more latitude during development of our libraries. The use of constraints files will give us hermetic insulation against bad releases though – we’ll be able to stay productive while we fix the issue and issue a new, better release. The crucial thing is to have a tight feedback loop though – so I’m in favour of us either testing each commit against last-stable, or figuring out the ‘tests before releases’ logic (perhaps by removing direct tag access and instead having a thing we propose the intent to release to, as a review).

All this might be enough that we choose to make less stable branches of libraries and go back to plain semver – but it’s not a requirement: that’s something we can discuss in detail if people care, or just wait and see what the overheads and benefits of keeping those branches are.

Lastly, this new structure will make it possible, if we want to, to test that unreleased versions of external dependencies work with a given component, by using a periodic job. Why periodic? There are two sides to each dependency, and neither side would want their gate to wedge if an accident breaks the other side. E.g. using two of our own components – oslo.messaging and nova: oslo.messaging releases must not break nova, but an individual oslo.messaging commit isn’t necessarily constrained (if we have the before-release testing described above). External dependencies are exactly the same, except even less closely aligned than intra-OpenStack components. So running tests with a git version of e.g. libvirt in a periodic job might give us (and libvirt) valuable prior warning about issues.