Want me to work with you?

Reach out to me – I’m currently looking for something interesting to do. https://www.linkedin.com/in/rbtcollins/ and https://twitter.com/rbtcollins are good ways to grab me if you don’t already have my details.

Should you reach out to me? Maybe :). First, a little retrospective.

Three years ago, I wrote the following when reflecting on what I wanted to be doing:

Priorities (roughly ordered most to least important):

  • Keep living in Rangiora (family)
  • Up to moderate travel requirements – 4 trips a year + LCA/PyCon
  • Significant autonomy (not at the expense of doing the right thing for the company, just I work best with the illusion of free will 🙂 )
  • Be doing something that matters
    • -> Being open source is one way to this, but not the only one
  • Something cutting edge would be awesome
    • -> Rust / Haskell / High performance requirements / scale / ….
  • Salary

How well did that work for me? Pretty well. I had a satisfying job at VMware for 3 years, met some wonderful people, and achieved some very cool things. And those priorities above were broadly achieved.
The one niggle that stands out is this: did the things we were doing matter? Certainly there was no social impact – VMware isn’t a non-profit, sitting as it does right at the core of capitalism. There was direct connection and impact with the team, the staff we worked with and the users of the products, but it is hard to feel really connected through that: VMware is a very large company and there are many layers between users and developers.

We were quite early adopters of Kubernetes, which allowed me to deepen my Go knowledge and experience some more fun with AWS scale operations. I had many interesting discussions with colleagues there about the relative strengths of Python, Go, Rust and Java. (Hi Geoffrey).

Company culture is very important to me, and VMware has a fantastically supportive culture – one of the most supportive companies I’ve been in, bar none. It isn’t a truly remote-organised company though: rather it’s a bunch of offices that talk to each other, which I think is sad. True remote-first offers so much more engagement.

I enjoy building things to solve problems. I’ve either directly built, or shaped what is built, in all my most impactful and successful roles. Solving a problem once by hand is fine; solving it for years to come by creating a tool is far more powerful.

I seem to veer into toolmaking very often: giving other people the ability to solve their problems takes the power of a tool and multiplies it even further.

It should be no surprise then that I very much enjoy reading white papers like the original Dapper and MapReduce ones, LinkedIn’s Kafka paper, or, for more recent fodder, the Facebook Akkio paper. Excellent synthesis and toolmaking applied at industrial scale. I read those things and I want to be a part of the creation of those sorts of systems.

I was fortunate enough to take some time to go back to university part-time, which though logistically challenging is something I want to see through.

Thus I think my new roughly ordered (descending) list of priorities needs to be something like this:

  • Keep living in Rangiora (family)
  • Up to moderate travel requirements – 4 team-meeting trips a year + 2 conferences
  • Significant autonomy (not at the expense of doing the right thing for the company, just I work best with the illusion of free will 🙂 )
  • Be doing something that matters
    • Be working directly on a problem / system that has problems
  • Something cutting edge would be awesome
    • Rust / Haskell / High performance requirements / scale / ….
  • A generative (Westrum definition) + supportive company culture
  • Remote-first or at least very remote familiar environment
  • Support my part time study / self improvement initiative
  • Salary


Continuous Delivery and software distributors

Back in 2010 the continuous delivery meme was just gaining traction. Today it’s extremely well established… except in F/LOSS projects.

I want that to change, so I’m going to try and really bring together a technical view on how that could work – which may require multiple blog posts – and if it gets traction I’ll put my fingers where my thoughts are and get into specifics with any project that wants to do this.

This is however merely a worked model today: it may be possible to do things quite differently, and I welcome all discussion about the topic!

tl;dr

Pick a service discovery mechanism (e.g. environment variables), write two small APIs – one for flag delivery, with streaming updates, and one for telemetry, with an optional aggressive data hiding proxy, then use those to feed enough data to drive a true CI/CD cycle back to upstream open source projects.
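
As a tiny sketch of the service discovery piece, assuming environment variables as the mechanism (the variable names here are purely illustrative, not a proposed standard):

import os

# Hypothetical variables a deployment tool would set for each component.
FLAG_API = os.environ.get("UPSTREAM_FLAG_API")            # e.g. https://flags.example.org/v1
TELEMETRY_API = os.environ.get("UPSTREAM_TELEMETRY_API")  # e.g. https://telemetry-proxy.internal/v1

def apis_configured():
    # Components should degrade gracefully when the operator has not opted in.
    return FLAG_API is not None, TELEMETRY_API is not None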

Who is in?

Background

(This assumes you know what C/D is – if you don’t, go read the link above, maybe wikipedia etc, then come back.)

Consider a typical SaaS C/D pipeline:

git -> build -> test -> deploy

Here all stages are owned by the one organisation. Once deployed, the build is usable by users – it’s basically the simplest pipeline around.

Now consider a typical on-premise C/D pipeline:

git -> build -> test -> expose -> install

Here the last stage, the install stage, takes place in the user’s context, but it may be under the control of the creator, or it may be under the control of the user. For instance, Google Play updates on an Android phone: when one selects ‘Update Now’, the install phase is triggered. Leaving the phone running with power and Wi-Fi will trigger it automatically, and security updates can be pushed at any time. Continuing with Google Play as an example, the expose step here is an API call to upload precompiled packages, so while there are three parties, the distributor – Google – isn’t performing any software development activities (they do gatekeep, but not develop).

Where it gets awkward is when there are multiple parties doing development in the pipeline.

Distributing and C/D

Let’s consider an OpenStack cloud underlay circa 2015: an operating system, OpenStack itself, some configuration management tool (or tools), a log egress tool, a metrics egress handler, and hardware management vendor binaries. And let’s say we’re working on something reasonably standalone – say Horizon.

OpenStack for most users is something obtained from a vendor – e.g. Cisco, Canonical or Red Hat. The model here is that the vendor is responsible for what the user receives, so security fixes – in particular embargoed security fixes – cannot just be published publicly and then slowly propagate. They must reach users very quickly; often, ideally, before the public publication.

Now we have something like this:

upstream: git -> build -> test -> distribute, then the vendor runs its own on-prem pipeline: build -> test -> expose -> install


Can we not just say ‘the end of the C/D pipeline is a .tar.gz of Horizon at the distribute step’? Then every organisation can make their own decisions.

Maybe…

Why C/D?

  • Lower risk upgrades (smaller changes that can be reasoned about better; incremental enablement of new implementations to limit blast radius, decoupling shipping and enablement of new features)
  • Faster delivery of new features (less time dealing with failed upgrades == more time available to work on new features; finished features spend less time in inventory before benefiting users).
  • Better code hygiene (the same disciplines needed to make C/D safe also make more aggressive refactoring and tidiness changes safer to do, so it gets done more often).

1. If the upstream C/D pipeline stops at a tar.gz file, the lower-risk upgrade benefit is reduced or lost: the pipeline isn’t able to actually push all the way to installation, and thus we cannot tell when a particular upgrade workaround is no longer needed.

But Robert, that is the vendor’s problem!

I wish it was: in OpenStack so many vendors had the same problem that they created shared branches to work on it, then asked for shared time from the project to perform CI on those branches. The benefit is only realised when the developer who is responsible for creating the issue can fix it, and can be sure that the fix has been delivered; this means either knowing that every install will transiently install every intermediary version, or that they will keep every workaround for every issue for some minimum time period, or that there will be a pipeline that can actually deliver the software.

2. .tar.gz files are not installed and running systems. A key characteristic of a C/D pipeline is that it exercises the installation and execution of software; the ability to run a component up is quite tightly coupled to the component itself – for all that the ‘this is a process’ interface is very general, the specific ‘this is server X’ or ‘this is CLI utility Y’ interfaces are very concrete. Perhaps a container based approach, where a much narrower interface can be defined, could mitigate this aspect. Then even if different vendors use different config tools to do last mile config, the dev cycle knows that configuration and execution works. We need to make sure that we don’t separate the teams and their products though: the pipeline upstream must only test code that is relevant to upstream – and downstream likewise. We may be able to find a balance here, but I think more work articulating what that looks like is needed.

3. It will break the feedback cycle if the running metrics are not received upstream. Yes, we need to be careful of privacy aspects, but basic telemetry – the upgrade worked, the upgrade failed, here is a crash dump – these are the tools for sifting through failure at scale, and a number of open source projects like Firefox, Ubuntu and Chromium have adopted them with great success. Notably all three have direct delivery models: their preference is to own the relationship with the user and gather such telemetry directly.

C/D and technical debt

Sidebar: ignoring public APIs and external dependencies – because they form the contract that installations and end users interact with, which we can reasonably expect to be quite sticky – the rest of a system should be entirely up to the maintainers, right? Refactor the DB, switch frameworks, switch languages, clean up classes and so on. With microservices there is a grey area: APIs that other microservices use which are not publicly supported.

The grey area is crucial, because it is where development drag comes in: anything internal to the system can be refactored in a single commit, or in a series of small commits that is rolled up into one, or variations on this theme.

But some aspect that another discrete component depends upon, with its own delivery cycle: that cannot simply be fixed, and unless it was built with the same care public APIs were, it may well have poor scaling or performance characteristics that make fixing it very important.

Given two C/D’d components A and B, where A wants to remove some private API B uses, A cannot delete that API from its git repo until all B’s everywhere that receive A via C/D have been deployed with a version that does not use the private API.

That is, old versions of B place technical debt on A across the interfaces of A that they use. And this actually applies to public interfaces too – even if they are more sticky, we can expect the components of an ecosystem to update to newer APIs that are cheaper to serve, and laggards hold performance back, keep stale code alive in the codebase for longer and so on.

This places a secondary requirement on the telemetry: we need to be able to tell whether the fleet is upgraded or not.

So what does a working model look like?

I think we need a different diagram than the pipeline; the pipeline talks about the things most folk doing an API or similar project will have directly in hand, but it’s not actually the full story. The full story is rounded out with two additional features: feature flags and telemetry. And since we want to protect our users, and distributors will probably simply refuse to provide insights into actual users, let’s assume a near-zero-trust model around both.

Feature flags

As I discussed in my previous blog post, feature flags can be used for fairly arbitrary purposes, but in this situation, where trust is limited, I think we need to identify the crucial C/D enabling use cases, and design for them.

I think those can be reduced to soft launches – decoupling activating new code paths from getting them shipped out onto machines – and kill switches – killing off flawed / faulty code paths when they start failing, in advance of a massive cascade failure. Both can be implemented with essentially the same thing: some identifier for a code path and a percentage of the deployed base to enable it on. If we define this API with efficient streaming updates and a consistent service discovery mechanism for the flag API, then it could be replicated by vendors and other distributors, or even each user, pulling the feature API data downstream in near real time.
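
To make that concrete, here’s a rough sketch of the evaluation side, assuming a flag is just a code path identifier plus a rollout percentage (all names invented for illustration):

import hashlib

def flag_enabled(flag_name: str, rollout_percent: float, client_id: str) -> bool:
    """Decide whether a code path is active for this deployment.

    rollout_percent comes from the (streamed) flag API; client_id is any
    stable per-install identifier. Hashing keeps the decision consistent
    across restarts, so raising the percentage only adds new installs.
    """
    digest = hashlib.sha256(f"{flag_name}:{client_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < (rollout_percent / 100.0)

# A kill switch is then just pushing rollout_percent to 0 upstream and
# letting the streaming update propagate.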

Telemetry

The difficulty with telemetry APIs is that they can egress anything. OTOH this is open source code, so malicious telemetry would be visible. But we can structure it to make it harder to violate privacy.

What does the C/D cycle need from telemetry, and what privacy do we need to preserve?

This very much needs discussion with stakeholders, but as a first approximation: the C/D cycle depends on knowing what versions are out there and whether they are working. It depends on knowing what feature flags have actually activated in the running versions. It doesn’t depend on absolute numbers of either feature flags or versions.

Using Google Play again as an example, there is prior art – https://support.google.com/firebase/answer/6317485 – but I want to think truly minimally, because my goal is to enable C/D in situations with vastly different trust levels than Google Play has. However, perhaps this isn’t enough; perhaps we do need generic events and the ability to get deeper telemetry to enable confidence.

That said, let us sketch what an API document for that might look like:

project:
version:
health:
flags:
- name:
  value:

If that was reported by every deployed instance of a project, once per hour, maybe with a dependencies version list added to deal with variation in builds, it would trivially reveal the cardinality of reporters. Many reporters won’t care (for instance QA testbeds). Many will.

If we aggregate through a cardinality hiding proxy, then that vector is addressed – something like this:

- project:
  version:
  weight:
  health:
  flags:
  - name:
    value:
- project: ...

Because this data is really only best effort, such a proxy could be backed by memcache or even just an in-memory store, depending on what degree of ‘cloud-nativeness’ we want to offer. It would receive accurate data, then deduplicate to get relative weights, round those to (say) 5% as a minimum to avoid disclosing too much about long tail situations (and yes, the sum of 100 1% reports would exceed 100 :)), and then push that up.
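
A minimal sketch of that aggregation step (the 5% rounding and floor come from the description above; the field shapes are assumptions based on the sketches):

from collections import Counter

def aggregate(reports):
    """Collapse per-instance reports into weighted tuples for upstream.

    Each report is a dict with project/version/health and a flags mapping;
    the proxy only forwards relative weights, never absolute counts.
    """
    counts = Counter()
    for r in reports:
        key = (r["project"], r["version"], r["health"],
               tuple(sorted(r.get("flags", {}).items())))
        counts[key] += 1
    total = sum(counts.values()) or 1
    out = []
    for (project, version, health, flags), n in counts.items():
        weight = max(5, round(100 * n / total / 5) * 5)  # nearest 5%, floor at 5%
        out.append({"project": project, "version": version, "health": health,
                    "flags": dict(flags), "weight": weight})
    return out  # weights can sum to more than 100, as noted above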

Open Questions

  • Should library projects report, or are they only used in the context of an application/service?
    • How can we help library projects answer questions like ‘has every user stopped using feature Y so that we can finally remove it’ ?
  • Would this be enough to get rid of the fixation on using stable branches everyone seems to have?
    • If not why not?
  • What have I forgotten?

Feature flags

Feature toggles, feature flags – they’ve been written about a lot already (use a search engine :)), yet I feel like writing a post about them. Why? I’ve been personally involved in two from-scratch implementations, and it may be interesting for folk to read about that.

I say that lots has been written; http://featureflags.io/ (which appears to be a bit of an astroturf site for LaunchDarkly 😉 ) nevertheless has gathered a bunch of links to literature as well as a number of SDKs and the like; there are *other* FFaaS offerings than LaunchDarkly; I have no idea which I would use for my next project at this point – but hopefully you’ll have some tools to reason about that at the end of this piece.

I’m going to entirely skip over the motivation (go read those other pieces), other than to say that the evidence is in, trunk based development is better.



Humble, J. and Kim, G., 2018. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution.

A feature flag is a very simple thing: it is a value controlled outside of your development cycle that in turn controls the behaviour of your code. There are dozens of ways to implement that. Hash-defines and compile time flags have been used for a very long time – so long that we don’t even think of them as feature flags, but they are. So are configuration options in configuration files, in the broadest possible sense. The difference is largely in focus, and where the same system meets all parties’ needs, I think it’s entirely fine to use just the one system – that is what we did for Launchpad, and it worked quite well I think; as far as I know it hasn’t been changed. Specifically, in Launchpad the Zope runtime config is regular ZCML files on disk, and feature flags are complementary to that (but see the profiling example below).

Configuration tends to be thought of as “choosing behaviour after the system is compiled and before the process is started” – e.g. creating files on disk. But this is not always the case – some enterprise systems are notoriously flexible with database managed configuration rulesets which no-one can figure out – and we don’t want to create that situation.

Let’s generalise things a little – a flag could be configured over the lifetime of the binary (compile flag), of an execution (runtime flag / config file, or a one-time evaluation of some dynamic system), or over time (dynamically reconfigured from changed config files or some dynamic system); or configured based on user / team / organisation, URL path (of a web request, naturally :P), and generally any other thing that could be used in deciding whether to conditionally perform some code. It can also be useful to be able to randomly bucket some fraction of checks (e.g. 1/3 of all requests will go down this code path) – but do it consistently for the same browser.

Depending on what sort of system you are building, some of those sorts of scopes will be more or less important to you – for instance, if you are shipping on-premise software, you may well want to be turning unreleased features entirely off in the binary. If you are shipping a web API, doing soft launches with population rollouts and feature kill switches may be your priority.

Similarly, if you have an existing microservices architecture, having a feature flags aaS API is probably much more important (so that your different microservices can collaborate on in-progress features!) than if you have a monolithic DB where you put all your data today.

Ultimately you will end up with something that looks roughly like a key-value store: get_flag_value(flagname, context) -> value. Somewhere separate to your code base you will have a configuration store where you put rules that define how that key-value interface comes to a given value.
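
A hedged sketch of that shape – a rule store outside the code plus the single lookup function (the rule semantics here are invented purely for illustration):

# Rules live in a store outside the codebase; each rule scopes a flag to some
# part of the population and carries a priority so the most specific wins.
RULES = [
    # (flag, scope predicate, priority, value)
    ("new_checkout", lambda ctx: ctx.get("team") == "payments", 200, "on"),
    ("new_checkout", lambda ctx: True, 100, "off"),
]

def get_flag_value(flag_name, context, default=None):
    matches = [(priority, value) for name, pred, priority, value in RULES
               if name == flag_name and pred(context)]
    if not matches:
        return default
    return max(matches)[1]  # highest priority rule wins

get_flag_value("new_checkout", {"team": "payments"})  # -> "on"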

There are a few key properties that I consider beneficial in a feature flag system:

  • Graceful degradation
  • Permissionless / (or alternatively namespaced)
  • Loosely typed
  • Observable
  • Centralised
  • Dynamic

Graceful Degradation

Feature flags will be consulted from all over the place – browser code, templates, DB mapper, data exporters, test harnesses etc. If the flag system itself is degraded, you need the system’s behaviour to remain graceful, rather than stopping catastrophically. This often requires multiple different considerations: for instance, having sensible defaults for your flags (choose a default that is ok, and change the meaning of defaults as what is ‘ok’ changes over time), having caching layers to deal with internet flakiness or API blips back to your flag store, and making sure you have memory limits on local caches to prevent leaks. Different sorts of flag implementations have different failure modes: an API based flag system will be quite different to one stored in the same DB the rest of your code is using, which will be different to a process-startup command line option flag system.
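
Here’s a sketch of what that can look like in client code, assuming an API-style flag store (every name below is illustrative, not a real SDK):

import time

class DegradingFlagClient:
    """Wraps a remote flag store with defaults and a bounded, ageing cache."""

    def __init__(self, fetch, defaults, max_entries=1000, ttl=60):
        self._fetch = fetch        # callable(flag) -> value, may raise on blips
        self._defaults = defaults  # sensible behaviour when everything is down
        self._cache = {}           # flag -> (value, fetched_at)
        self._max = max_entries    # memory bound so the cache cannot leak
        self._ttl = ttl

    def get(self, flag):
        value, fetched_at = self._cache.get(flag, (None, 0))
        if time.time() - fetched_at < self._ttl:
            return value
        try:
            value = self._fetch(flag)
        except Exception:
            # Flag store blip: serve the stale value if we have one, else default.
            return value if fetched_at else self._defaults.get(flag)
        if len(self._cache) >= self._max:
            self._cache.pop(next(iter(self._cache)))  # crude eviction
        self._cache[flag] = (value, time.time())
        return value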

A second dimension where things can go wrong is dealing with missing or unexpected flags. Remember that your system changes over time: a new flag in the code base won’t exist in the database until after the rollout, and when a flag is deleted from the codebase, it may still be in your database. Worse, if you have multiple instances of a service running, you may have different code all examining the same flag at the same time, so operations like ‘we are changing the meaning of a flag’ won’t take place atomically.

Permissions

Flags have dual audiences; one part is pure dev: make it possible to keep integration costs and risks low by merging fully integrated code on a continual basis without activating not-yet-ready (or released!) codepaths. The second part is pure operations: use flags to control access to dark launches, demo new features, killswitch parts of the site during attack mitigation, target debug features to staff and so forth.

Your developers need some way to add and remove the flags needed in their inner loop of development – lifetimes of a few days for some flags.

Whoever is doing operations on prod though, may need some stronger guarantees – particularly they may need some controls over who can enable what flags. e.g. if you have a high control environment then team A shouldn’t be able to influence team B’s flags. One way is to namespace the flags and only permit configuration for the namespace a developer’s team(s) owns. Another way is to only have trusted individuals be able to set flags – but this obviously adds friction to processes.

Loose Typing

Some systems model the type of each flag: is it boolean, numeric, string etc. I think this is a poor idea mainly because it tends to interact poorly with the ephemeral nature of each deployment of a code base. If build X defines flag Y as boolean, and build X+1 defines it as string, the configuration store has to interact with both at the same time during rollouts, and do so gracefully. One way is to treat everything as a string and cast it to the desired type just in time, with failures being treated as default.
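
In code, that ‘store strings, cast at the edge’ approach can be as small as this (a sketch; the fallback behaviour is the interesting part):

def as_bool(raw, default=False):
    """Cast a stored string flag value just in time; bad values fall back."""
    if raw is None:
        return default
    normalised = str(raw).strip().lower()
    if normalised in ("1", "true", "on", "yes"):
        return True
    if normalised in ("0", "false", "off", "no"):
        return False
    return default

def as_int(raw, default=0):
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default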

Observable

Make sure that when a user reports crazy weird behaviour, you can figure out what values they had for which flags. For instance, in Launchpad we put them in the HTML.

Centralised

Having all your flags in one system lets you write generic tooling – such as ‘what flags are enabled in QA but not production’, or ‘what flags are set but have not been queried in the last month’. It is well worth the effort to build a single centralised system (or consume one such thing) and then use it everywhere. Writing adapters to different runtimes is relatively low overhead compared to rummaging through N different config systems because you can’t remember which one is running which platform.

Scope things in the system with a top level tenant / project style construct (any FFaaS will have this I’m sure :)).

Dynamic

There may be some parts of the system that cannot apply some flags rapidly, but generally speaking the less poking around that needs to be done to make something take effect the better. So build or buy a dynamic system, and if you want a ‘only on process restart’ model for some bits of it, just consult the dynamic system at the relevant time (e.g. during k8s Deployment object creation, or process startup, or …). But then everywhere else, you can react just-in-time; and even make the system itself be script driven.

The Launchpad feature flag system

I was the architect for Launchpad when the flag system was added. Martin Pool wanted to help accelerate feature development on Launchpad, and we’d all become aware of the feature flag style things hip groups like YouTube were doing; so he wrote a LEP: https://dev.launchpad.net/LEP/FeatureFlags , pushed that through our process and then turned it into code and docs (and once the first bits landed folk started using and contributing to it). Here’s a patch I wrote using the system to allow me to turn on Python profiling remotely. Here’s one added by William Grant to allow working around a crash in a packaging tool.

Launchpad has a monolithic data store, with bulk data federated out to various disk stores, but all relational data in one schema; we didn’t see much benefit in pushing for a dedicated API per se at that time – it can always be added later, as the design was deliberately minimal. The flags implementation is all in-process as a result, though there may be a JS thunk at this point – I haven’t gone looking. Permissions are done through trusted staff members, it is loosely typed and has an audit log for tracking changes.

The other one

The other one I was involved in was at VMware a couple of years ago now; it’s in-house, but there are some interesting anecdotes I can share. The thinking on feature flags when I started the discussion was that they were strictly configuration file settings. I was still finding my feet with the in-house Xenon framework at the time (I think this was week 3? 🙂), so I whipped up an API specification and a colleague (Tyler Curtis) turned that into a draft engine; it wasn’t the most beautiful thing, but it was still going strong and being enhanced by the team when I left earlier this year. The initial implementation had a REST API and a very basic set of scopes. That lasted about 18 months before tenant based scopes were needed and added. I had designed it with the intent of adding multi-arm bandit selection down the track, but we didn’t make the time to develop that capability, which is a bit sad.

Comparing that API with LaunchDarkly I see that they do support A/B trials but don’t have multivariate tests live yet, which suggests that they are still very limited in that space. I think there is room for some very simple home grown work in this area to pay off nicely for Symphony (the project codename the flags system was written for).

Should you run your own?

It would be very unusual to have PII or customer data in the flag configuration store, and you shouldn’t have access control lists in there either (in LP we did allow turning on code by group, which is somewhat similar). The point is that the very worst thing that can happen if someone else controls your feature flags and is malicious is actually not very bad. So as far as aaS vendor trust goes, not a lot of trust is needed to be pretty comfortable using one.

But, if you’re in a particularly high-trust environment, or you have no internet access, running your own may be super important, and then yeah, do it :). They aren’t big complex systems, even with multi-arm bandit logic added in (the difficulty there is the logic, not the processing).

Or if you think the prices being charged by the incumbents are ridiculous. Actually, perhaps hit me up and we’ll make a startup and do this right…

Should you build your own?

A trivial flag system + persistence could be as little as a few days work. Less if you grab an existing bolt-on for your framework. If you have multiple services, or teams, or languages… expect that to become the gift that keeps on giving, as you have to consolidate and converge across your organisation indefinitely. If you have the resources – great, not a problem.


I think most people will be better off taking one of the existing open source flag systems – perhaps https://unleash.github.io/ – and using it; even if it is more complex than a system tightly fitted to your needs, the benefit of having one that is a true API from the start will pay for itself the very first time you split a project, or want to report what features are on in dev and off in prod, not to mention multiple existing language bindings etc.

Christchurch

Trigger warning: this is likely going to make you angry, or sad, or sadmad, or more. OTOH I’m going to make the points I’ve got to make in a short and pointed fashion and you, dear reader, can go look up supporting or refuting documentation as you feel appropriate.

Christchurch recently suffered NZ’s largest modern mass murder. I say modern because and also.

This is the third time I’ve been “close” to murder. Once was the Aramoana massacre, when I lived roughly halfway between the city and Aramoana itself – the helicopter flight path was right overhead; once was moving to Sydney and, on the very first day after I arrived, taking the train into the city from a train station bathed in blood (with a helpful sign informing us that the police were seeking information that could lead to an arrest). And last Friday was Christchurch. This is why I have put “close” in quotes: close enough to be viscerally affected, but not directly affected – none of the folk that died were personally known to me in any of these cases. I have had people I know die, and that’s a whole different level. There are plenty of people writing from that perspective, and you should go and listen to them. I have, and am, and have friends in that set, who need all the support they can get at the moment.

And that is why I’m a little conflicted about writing this post. Because this tragedy is all the more tragic because of our failures leading up to it.

What failures you might ask? Ours as citizens, or our bureaucracies? Or governments?

Let’s look at this from a “Law & Order” perspective – means, motive, opportunity. The opportunity is something I think we should be proud of: folk being able to do what they want to do, inviting people they don’t know into their mosque, without fear. But means and motive…

Means

Means first. I wish I could say “I’m not going to engage in the nonsense debates about whether the sort of weapon the shooter had should be available for private citizen use or not.” But actually, I have to. Australia has ongoing problems with violence, but regulated guns heavily back in 1996. Since then, massacres have stopped being a thing in Australia; prior to that, they were happening on a regular basis. This doesn’t deny the truth that the same machines can be used to shoot rabbits, but the evidence seems to be that they are too easily abused, and there are many less over-the-top alternatives for the farm use case, for this to be a compelling tradeoff.

Would the attacker have been able to carry out the attack without semi-automatic rifles? Sure. But with less ammunition per clip, less accuracy per shot, less damage per shot.

Here’s the sad bit though. We tightened our gun laws after Aramoana in 1990. We didn’t tighten them again after Port Arthur. We didn’t tighten them again after Sandy Hook. Or any of the other mass murders overseas using military weapons. We have had decades of inaction while the evidence of the potential harms increased. Late last year Stuff published this article – which is actually good reporting! I quote:

It is a very sad fact that changes to gun regulation only come about in the wake of a tragedy: Aramoana, Port Arthur, the Dunblane massacre.

Since 1992, politicians have backed nervously away.

50 people died on Friday in large part because the murderer had access to effective means for mass murder, which we knew was effective, which our Police knew was effective, and yet we spent political capital on other complete BS (look up political scandals in NZ – I don’t have the heart to enumerate them right now).

Jacinda’s leadership right now consists of executing on a thing we’ve had queued up as necessary for – at the least – decades. It is an indictment on all of us here in NZ that 50 people had to die to get this done. I’m very sadmad right now. I don’t feel like I could have driven the debate in the right direction, but our leaders surely could have. And none of them are owning this.

Motive

It’s tempting to write the murderer off as being a bad human, or badly raised, or insane. But the reality as I understand it is that there were many signs of his intolerant, violent views, that he doesn’t have an obvious mental disorder, and that his violence was targeted: he’s not just plain evil and out to kill everyone…

In NZ at least we consider both intolerance and violence antisocial: we expect tolerance and peaceful discussion with each other as baseline characteristics of human beings.

And again, this is a thing we’ve seen before overseas. We’ve seen many murders done on the basis of violent intolerant ideology, but we haven’t actually adapted here to deal with it.

How are we, at a systemic level, engaging with the problem and addressing it? How do we help folk become tolerant? How do we give them other tools than violence? And if they are truly unable to use other tools, and unable to become tolerant, how do we safeguard ourselves?

But it’s worse: listening to my Māori compatriots, NZ is pretty racist, *at a minimum* structurally, with many more Māori in prison, and presumably wage inequality and other “invisible” discriminations. Not much point asking everyone to be super tolerant and friendly with each other if we’re not giving each other a fair go.

We do pretty great in how we bring up kids in early childhood and primary school, from what I can tell with my daughter’s school, but I have no idea about higher schooling these days, and what do we do for adults and immigrants? Science says that most immigrants are motivated flexible people (self selecting group), but for the tiny tiny number that aren’t: how do we help them?

Personally I think wealth inequality goes a long way towards sustaining discrimination – anyone thinking of the world as a zero-sum game is much more likely to be hellbent on keeping other folk down to keep themselves up – and I very much want to see that change in NZ: I’d like to see us introduce a UBI, get rid of the means testing on various social supports (like the dole), and generally toss away the neoliberal narrative that has poisoned us for 30 odd years.

Recent changes

I’ve had three pretty significant changes in my life recently. All are worth a little explanation.

Debian

I’ve resigned from my role as a Debian Developer. In truth I hadn’t been active for years, and stepping down just makes clear to everyone what the current status is. I may go back at some point, but I think there are fundamental changes needed to Debian – and most “distros” in fact – for it to really excel in our modern open source milieu. More on that another time, but the relevance here is that if I was to go back, it would be because the consensus around the mission has changed (or because I’ve decided the best thing I can do is try to shift that consensus).

LCA Papers

During the most recent Linux.conf.au we had a meeting amongst the physically present papers committee members. I’d expected that meeting to be about what things we could do to prepare for the next LCA papers process – things like adding blinding to the review system, or introducing paper assignments to facilitate a significantly larger papers committee. However, it turned out there was a bigger topic to discuss – diversity within the papers committee. The general mood in that room was that we had been failing to really shift the diversity dial amongst the papers committee and that new ways to shift it needed to be trialled. One thing in particular that was suggested was replacing the leadership (the theory being that leaders more deeply connected to non-cis-hetero-white-male people would find it easier to recruit those folk into the papers committee). I think many good points were raised, and that if we’d started (say) 6 years ago we could have tried hybrid approaches (e.g. delegating recruitment entirely to someone with such connectivity, or a hard quota).

I don’t think the room actually had consensus on how much diversity is needed… should the committee represent the current demographics of LCA attendees? Or of the open source community? Or of humans? Or should it exceed that diversity in order to counter-balance the current skewed demographics and help lead from in front? I’m not sure what my position is at this point – but I am sure the folk in the room would all have given somewhat different answers, and equally that all felt more was needed.

In the morning following that, I resigned. My reasons were very simple: my contribution to the papers committee, while (IMO) significant, is replaceable. There are other people with awareness of interesting projects in the web/infra/cloud/programming/build/vcs technology spaces – people who are not cis-hetero-white-male. By leaving I provide an opportunity for the LCA papers committee to increase its diversity much faster than if I stayed.

I deeply loved being on the papers committee; I took great pride in looking for the unknown presenters who could add surprising and fascinating things to LCA. I hope that the committee takes this priceless opportunity that we’ve given them to radically shift the amount of diversity in the team. And if it should pass that whatever target is reached, and they were to offer membership to me again in future, well then I’d be delighted to participate.

VMware

I joined VMware 2 and a half years ago to work on a suite of new SaaS products being built there. During that time I grew as an engineer, as you always hope to do; finishing up there as SRE architect (for one business unit). I have some thoughts I intend to pull together about what worked well, what didn’t, what things I’d try next time and so on, but those are not yet ready for publication. As of last week I’m no longer at VMware – I’m taking a small break for a bit; I plan to catch up on home and family things – we’ve had a couple of super hard years with e.g. Lynne’s health. I’m also going to do some of the more far-out ‘what if’ things that I haven’t had time to attempt while working. I may even get around to mass review and merging of various testing-cabal patches that I see backing up!

 

OpenStack and ease of development

In my last post, about cultural norms in OpenStack, I said that ease of development was a self inflicted issue. This was somewhat contentious 🙂 and some interest was expressed in a deeper dive. In that post I articulated three cultural problems and two technical ones.

What does success for developers look like?

I think independent of the scope of OpenStack, the experience for developers should have roughly the same features:

  1. global reasoning for changes should be rarely needed (or put another way, the architecture should make it possible to think about changes without trying to consider all of OpenStack and still get high quality results; this helps new developers make good decisions)
  2. the component being worked on should build quickly (keep local development cycles brisk)
  3. have comprehensive local unit tests (keep local development effective; low rate of defects escaping to functional/integration tests)
  4. be able to utilise project resources to perform adhoc exploration, integration, functional and scale tests (this allows developers to have sensibly sized development machines, while still ensuring what they build works in a system representative of our users).
  5. the lead time from getting the change finished locally to the developer no longer needing to shepherd the change through the system should be low (I won’t scare people by saying what I think it should be 🙂 ; this feature keeps cognitive load on developers from becoming a burden)
  6. failures after review should be a) localised, b) rare enough that the overhead of corrective action is tolerable and c) recovery should take place within a small number of hours at most (this keeps the project as a whole healthy and that means individual developers will be rarely impacted by failures from other developers changes)

We already do ok on a number of these things: the above is not a gap analysis.

Sidebar – Accelerate

About now I feel I have to mention Accelerate, a book that is the result of detailed research into software delivery performance – and its follow-up, the DORA 2018 State of DevOps report. The Puppet State of DevOps report is useful as well, though it focuses on different aspects – ones that are less generalisable to open source development in my view – and, interestingly, it seems to have reached entirely different conclusions around team choice :).

The particularly interesting thing for me is that this is academic grade research, showing causation and tying that back to specific practices: this gives us a solid basis for planning changes, rather than speculation that something will work.

These reports and research are looking into software delivery – which for OpenStack spans organisations: we build, then users deploy. So it’s not entirely clear that things generalise, nor is it clear how one might implement all the predictive practices because of that.

For instance, while Continuous Integration is something we can imagine doing in OpenStack (sorry folks, preflight testing and CI are really very very different things), Continuous Deployment would be a much more ambitious undertaking. Imagine it though: commit through to deployed on users’ clouds in a matter of hours. Wouldn’t that be something. Chrome and Firefox are two open source projects that have been evolving in this direction for some time, and we could well study them to learn what they have found to work and not work.

All that said, the construct – the metrics – that predict software delivery performance are:

  1. Release frequency
  2. Mean time to recovery
  3. Lead time (commit to value consumable)

There’s a separate construct (the Westrum organisational culture construct) for culture, and they also measured the effect on e.g. implementing Continuous Delivery on those metrics.

I highly recommend reading the book – perhaps start with the 2018 report for a taste, but the book has much more detail.

Where are the gaps

I haven’t looked particularly closely at the coupling in OpenStack recently, so for 1) I think folk actually landing changes should assess this. My sense is that we’re ok on this, but not great. In particular, anytime there is a big cross-project change – lots of involved commits, lots of sequencing – that’s something that needed global reasoning.

For 2), most of our stuff is in Python today, so build times aren’t a big issue.

For 3), we’re in pretty decent shape unit test wise, though the tests tend to be very slow (minutes or more to run), and I worry about skew between mocks and actual servers.

For 4) we do allow utilisation of project resources via gerrit pre-review tests and pre-merge tests, but there’s no provision for adhoc utilisation (that I know of), and as I described in my last post, I think we could get a lot more leverage out of the cloud resources if we had the ability to wire components under test into an existing, scaled, cloud.

For 5) I’d need to do some more detailed visualisation, or add a feature to stackalytics, but the sense from folk I speak to is that lead times are still enormous. I suspect there are two, or even three, distributions hiding in there (e.g. one for regular devs, and one for infrequent/new contributors) – but we can gather data on this. One important aspect is whether we should measure from ‘code committed (in a dev branch) to merged to master’, or ‘code committed to delivered’. It’s my view that measuring to delivery is critical, if we truly want to be driving benefits to our users. There is a corner case where those two things converge – trunk based development – but that is particularly challenging for open source projects. For instance, http://stackalytics.com/report/reviews/nova/open shows under ‘Change requests waiting for reviewers since the last vote or mark’ an average age of 144 days, with a max age of 709 days: that’s 2 years, 4 releases. That’s measuring time to git; if we measure time to delivered, then we need to add the time that changes sit in git before being included in a release – up to 6 months, though the ad hoc releases many projects are doing now are a great help. The stats shown aren’t particularly useful though: a) reviews that have merged already are not included in the stats, and b) there’s not enough information to start reasoning about why they have the age they do.

For 6), at the moment recovery is burdened by the slow merging process – the minimum time to recovery is the sum of the unavoidable steps in the merge / delivery process. Failure frequency (things breaking after the merge completes / is released) is fairly low, but we’re not particularly good at blast radius management – the all-or-nothing nature of change rollout today means there is no mitigation when things go wrong.

So I think there are significant gaps with room to improve on three things there:

  1. More efficient test/adhoc project resource utilisation
  2. Lead times
  3. Blast radius

Smarter testing

I covered this in my previous post in moderate detail, but it’s worth drilling in further at this point. I don’t think there is a silver bullet here; the machinery necessary to test a new database engine version with an existing cloud is very different in detail to that required to test a new nova-compute build. Let’s consider just being able to test a new nova-compute with an existing cloud. Essentially we want to wire in a new shard of nova-compute. Fortunately nova-compute is intrinsically sharded: that’s its very model of operation.

(figure: blog-testing.png – a new nova-compute shard wired from a test node into an existing cloud)

Though it’s not strictly relevant here, consider that other components (like the DB) have no sharding mechanism in place today, so wiring in a new shard for those would be “tricky”.

The details may have changed since I last dug deep, but from memory nova-compute needs access to the message bus to communicate with the rest of nova, access to glance and the swift (or other) store that images are in, and obviously appropriate local resources to run whatever compute workload it is going to serve out.

So wiring that in from a test node to an existing cloud seems pretty simple. We probably don’t want the services listening unsecured on the internet, so we’ll need a credential distribution system (e.g. vault), and automation to look those up and wire in the nova-compute instance with appropriate credentials.
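
As a hedged sketch of the automation half of that, here’s what pulling the wiring credentials from Vault and rendering a minimal config fragment could look like (the secret layout, paths and which options matter are assumptions; real deployments need far more than this):

import hvac  # Vault API client

def render_compute_config(vault_url, token, secret_path="test-cloud/nova"):
    """Fetch credentials for a test nova-compute and emit a config fragment."""
    client = hvac.Client(url=vault_url, token=token)
    secret = client.secrets.kv.v2.read_secret_version(path=secret_path)
    data = secret["data"]["data"]
    return (
        "[DEFAULT]\n"
        f"transport_url = {data['transport_url']}\n"  # message bus access
        "[glance]\n"
        f"api_servers = {data['glance_api']}\n"       # image store access
    )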

There may be trust issues: are all components equally privileged in the system? This also shows up as a bug risk – how much damage could a broken but not malicious nova-compute do?

Harder cases – DDL

One common harder case is DDL – schema changes at the DB layer. I don’t have a good canned answer here, but roughly speaking in the context of tests we need to be able to:

  1. Try applying the DDL across the whole DB
  2. Run the code that works with the DB with the modified schema
  3. Be able to do that for many different patches

Right now we have machinery to do 1) against a static copy of various clouds’ DBs. 2) and 3) are almost at cross purposes: it may be necessary to serialise those tests – they are fewer than other code changes. One possible implementation would be to use an expand-contract SQL server migration strategy: expand to a new server, run the DDL, verify the cloud metrics don’t regress, then migrate back using the source server’s schema (ignoring missing columns, because if they’ve been dropped in the new schema then code is already not querying them).
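
In pseudo-workflow form that test loop might look something like this (a simplified sketch; the clone/execute/drop primitives and the metrics check stand in for whatever the deployment already has):

def test_ddl_patch(source_db, ddl_statements, run_service_tests, metrics_ok):
    """Expand-contract style check for a candidate schema change.

    1. expand: replicate the production-shaped DB onto a scratch server
    2. apply the candidate DDL there and run the code against the new schema
    3. contract: discard the scratch server; the source schema never changed
    """
    scratch = source_db.clone()            # expand to a new server
    try:
        for stmt in ddl_statements:
            scratch.execute(stmt)          # apply the candidate DDL
        run_service_tests(scratch)         # run the code with the modified schema
        assert metrics_ok(scratch), "cloud metrics regressed under new schema"
    finally:
        scratch.drop()                     # contract: throw the scratch copy away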

Another possibility, given that these changes are rarer, is not to optimise the testing of them.

Harder cases – exotic components

Power machines, ESXi hypervisors, and other not generally-available hypervisors would all be good to expose to developers – making it possible for them to verify changes to the code that interacts with them in real time. Ideally with more access than the current hands-off gerrit-test-job-only approach.

Lead times

Today, I’m going to treat ‘in a release’ as delivered. I’m picking this definition because:

  • We can choose to make more releases
  • We don’t need to build consensus or whole new delivery stacks to try and get customers upgraded
  • We can always come back and define delivered with more scope later

Lean methodology provides a number of tools for analysing lead times – it has been used successfully in many organisations, and is sufficiently robust and consistent in its results that Accelerate even cites adopting lean management practices as being predictive of performance. And then there is the whole what-does-delivered-mean question.

And yes, we are not a company – we are many volunteers – but that merely adds corner cases: most of our volunteers are given tasks to work on within OpenStack, and have the time to work with an effective SDLC and change management process.

As I mentioned above, without some more detailed modelling it’s hard to say for sure what leads to the high lead times; but there are some things we can identify easily enough…

  1. We don’t treat each commit as a release. We do say that trunk should never be broken, but we’re not sure enough of our execution to actually tag each commit as a release and publish for consumption.
    1. Consider what we would need to solve to do this.
  2. We aren’t practicing CI. In particular:
    1. Merges (required to repair things that snuck in) often take much more than 10 minutes
    2. We’re not integrating the work-in-progress from developers early enough to avoid reintegration costs.
  3. We’re not practicing trunk based development: every outstanding patch chain is a branch, just in a different representation, and our branch lifetime clearly exceeds a day… and we have a large stabilisation period during the development cycle.
  4. Reviews – needs a deeper analysis to say if this is or isn’t a driver. I suspect it is, because nothing I hear or see shows this to have changed in any fundamental way.
  5. We don’t work in small batches: 6 month cycles is huge batches.
  6. We’re pretty poor at enabling team experimentation. I think this is due to layering: for example, we have N different API servers, so if one team wants to experiment, they create customer confusion due to yet-another-API idiom. If we had just one API server, changes to that would be happening from just one team, gaining much better integration and discussion characteristics. (For an example of having just one API server in a distributed system, consider k8s, which has just one primary API server – the kubelet API is not really customer facing.)
  7. We don’t manage work in progress well: this may not seem important, but its a Lean foundational practice. Think of it as a combination of not exceeding your bandwidth, and minimising context switches.

So what should we do to drive lead times down?

I propose setting a vision: 95% of patches that are either maintenance or part of an agreed current feature merge (or are completely rejected) the same day that they are uploaded to gerrit. (Patches for some completely random thing may obviously require considerably more effort to reason about.)

Then work back from that: what do we need to have in place to do that safely?
Yes it’s hard. That’s more of a reason to do it.

Delivering that will require better safety ropes: e.g. clearer contracts for components, better linting (maybe mypy), more willingness to roll forward, and consistent review latency (this is more about scheduling than about how many reviews any one person does).

The benefits could be immense though: if OpenStack is a juggernaut today, consider what it could be if we could respond nimbly to new user demands.

Blast radius containment

So this is about things like making releases and deployments much more robust to mistakes. For instance, imagine if every server could run in a shadow mode – where it receives traffic and operates on it, but marks any external operations it performs as not-real. Then if it blows up we can detect that without destabilising a running version (and the long running supported test cloud would give a perfect place to do this). So rollouts, rather than being atomic, become a series of small steps. The simplest form is just taking a stateless scale-out service and running two builds in parallel: that’s better than a binary old/new. Canary builds and rolling upgrades similarly.
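
A sketch of the shadow mode idea for a single external side effect (purely illustrative; names and the logging channel are invented):

def log_shadow_operation(name, payload):
    # In a real system this would feed a comparison / verification pipeline.
    print("shadow op:", name, payload)

class ShadowAwareClient:
    """Wraps outbound operations so a shadow deployment never mutates the world."""

    def __init__(self, real_client, shadow: bool):
        self._real = real_client
        self._shadow = shadow

    def create_volume(self, spec):
        if self._shadow:
            # Exercise everything up to the side effect, record intent, stop.
            log_shadow_operation("create_volume", spec)
            return {"id": "shadow-" + spec["name"], "status": "not-real"}
        return self._real.create_volume(spec)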

Now, since we defined ‘delivered’ as in a release, not ‘in use’, maybe we should ignore that operational blast radius and instead limit ourselves to the development side.

Even here there is a lot more sophistication we can add: consider that for libraries our ‘fleet’ is basically every developer. Pinning all those dependencies like we do is a good step. What if we could actually deliver updates to 1% of our devs, then 10%, then all?

So we could have a pipeline:

  1. Unit test a consumer, raise its version for 1% of consumers.
  2. Watch for failures, raise the % until 100%

This would require a metrics channel (opt-in!), and some way of signalling the versions to choose from to development environments.
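
A sketch of the version-selection side in a development environment (the signalling channel, the identifiers and the version numbers are all assumptions):

import hashlib

def pick_version(package, stable, candidate, rollout_percent, dev_id):
    """Choose which library version a given developer environment installs.

    rollout_percent is published alongside the candidate release; dev_id is a
    stable, opt-in identifier for the environment. Hashing keeps each dev on a
    consistent side as the percentage is raised from 1 -> 10 -> 100.
    """
    digest = hashlib.sha256(f"{package}:{dev_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return candidate if bucket < rollout_percent else stable

pick_version("somelib", "1.2.0", "1.3.0", 10, dev_id="dev-1234")  # ~10% get 1.3.0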

We could use multiple branches as another mechanism: if everyone works off of trunk, we optimise trunk merges to be no more than (say) 20 minutes, and code self promotes to a tested branch, then release branch over a couple of hours. Failures would generate a proposed rollback straight into gerrit.

Wrapup

There’s a high cost of change in OpenStack – I don’t mean individual code changes, I mean changing e.g. policies, languages, architecture – lots of code, and thousands of affected people. A result of a high cost of change is a high risk of change: if a change makes things worse, it can take as long to back it out as it took to bring it in.

I’ll freely admit that I’m partly off in architecture-astronaut land here: there’s a huge gap of detail between what I’m describing and what would be needed to make it happen.

I have confidence in the community though, if we can just pull some vision together about what we want, we have the people and knowledge to execute on it.

Is OpenStack’s mission broken?

tl;dr:

  1. Betteridge’s law applies.
  2. Ease of development is self inflicted and not mission creep.
  3. Ease of use is self inflicted and not mission creep.
  4. Ease of operations is self inflicted and not mission creep.
  5. I have concrete suggestions for 2/3/4 but to avoid writing a whole book I’m just going to tackle (2) today.

Warning: this is a little ranty. It’s not aimed at any individual; it just crystallised out after a couple of years focused on different things, and was seeded by Jay Pipes when he recently put a strawman up about two related discussions that we haven’t really had as a community:

  1. What should the scope of OpenStack’s mission be?
  2. a technical proposal for ‘mulligan’, a narrowly defined new mission. (And yes, I know that OpenStack has incredible velocity. Just imagine what it could be if the issues I describe didn’t exist.)

So is it the mission?

I think OpenStack has lots of “issues”, to use the technical term, across, well, everything. I don’t think the mission is even slightly related to the problems though.

The mission has ultimately just brought a huge number of folk together with the idea that they might produce a thing that can act like a cloud.

This has been done before: organisations like AWS, Microsoft, Google and smaller players like Digital Ocean and Rackspace (before OpenStack).

I reject the idea that having such a big, hairy, inclusive mission is a problem.

We can be more rigorous about that though: if a smaller mission would structurally prevent a given issue, then it’s the mission that is the problem. Otherwise, it’s not.

I do think the mission is somewhat ridiculous, but there’s a phrase in some companies: a company’s mission defines what it doesn’t do, not what it does.

And I think the current OpenStack mission does that quite well: there are two basic filters that can be applied, and unless at least one matches, it’s out of scope for OpenStack.

  • Can you get $thing from a Public Cloud?
  • Do you uniquely need $thing to run a Cloud?

And yes, there are a billion things in the grey cloud around the edge.

Know what else has this problem? Linux. Well over 3/5ths of its code is in that grey edge: 170M of core, 130M of architectures, 530M in drivers. x86 + ARM is 50M of that 130M of architectures.

Linux’s response has been dramatically different to ours though. They have a single conceptual project being built, with enormous configurability in how it’s deployed. We’ve decided that we’re building a billion different things under the same umbrella, and that comes down to a cultural norm.

Cultural norms and silos

Concretely, Swift and Nova – the two original projects – have never conceptually regarded themselves as one project.

Should they?

I honestly don’t know :). But by not being one project (with enormous configurability in how it’s deployed), we set a cultural expectation in OpenStack that variation in workload implied a new project and a new codebase.

Every split-out takes years to accomplish – both the literal ones like Glance, and the moral ones like Neutron.

The lines for the split-outs are drawn inconsistently.

To illustrate this, ask yourself: what manages a node in an OpenStack cloud? What’s the component that is responsible for working with the machine’s actual resources, reporting usage, reporting back to service discovery, health checks, liveness, etc.?

In a clean slate architecture you might design a single agent, and then make it extensible/modular. OpenStack has many separate agents, one per siloed team.
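For illustration only, a single extensible node agent might look something like this – the class and method names are hypothetical, not an existing OpenStack interface:

    # Hypothetical single node agent with one plugin per resource type,
    # rather than one agent per project.  Names here are made up.
    from abc import ABC, abstractmethod

    class ResourcePlugin(ABC):
        @abstractmethod
        def report_usage(self) -> dict:
            """Return current usage for this resource type."""

        @abstractmethod
        def healthcheck(self) -> bool:
            """Return True when this resource type is healthy on this node."""

    class NodeAgent:
        def __init__(self, plugins, service_discovery):
            self.plugins = plugins              # e.g. {"compute": ..., "network": ...}
            self.service_discovery = service_discovery

        def heartbeat(self):
            # Usage, health and liveness for every resource type flow
            # through the same agent and the same registration call.
            status = {
                name: {"usage": p.report_usage(), "healthy": p.healthcheck()}
                for name, p in self.plugins.items()
            }
            self.service_discovery.register(status)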

Similarly, consider the scheduling problem for net/disk/compute: there is an enormous vertical stack of cloud APIs that can be built on a solid base, many of which OpenStack has in its portfolio. But that stack is not being built on a common scheduler – and can’t be, because the cultural norm is to split things out, not to actually figure out how to maintain things more effectively without moving the code around.
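As a sketch of what a common scheduler could mean in practice – the resource-class names echo Placement, but everything else here is made up:

    # Illustrative only: one scheduling interface that compute, network and
    # storage requests could all flow through.
    from dataclasses import dataclass

    @dataclass
    class ResourceRequest:
        resource_class: str   # e.g. "VCPU", "DISK_GB", or a network resource
        amount: int

    def schedule(candidates, requests):
        """Pick the first node whose free resources satisfy every request.

        candidates: iterable of (node, {resource_class: free_amount}).
        A real scheduler would filter, weigh and reserve; this only shows
        the shape of a shared interface.
        """
        for node, free in candidates:
            if all(free.get(r.resource_class, 0) >= r.amount for r in requests):
                return node
        return None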

Some things really are better off as separate projects – and I’m not talking about monorepo vs repo-per-project; that’s really only about the ability to do some changes atomically. A reusable library like oslo.config is only reusable by being a separate project. oslo.db, though, exists solely because we have many separate projects that all look like ‘REST on one side, database on the other’. That is a concrete problem: high deployment overheads, redundant information in some places, inappropriate transaction boundaries in others. The objects work – passing structured objects around and centralising the DB access – makes things a lot better, but it’s broken into vertical silos much too early.
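A simplified stand-in for that pattern (not any real OpenStack API): structured objects travel between services, and only one place talks to the database.

    # Simplified sketch: services pass structured objects around, and a
    # single data-access layer owns persistence and transaction boundaries.
    from dataclasses import dataclass

    @dataclass
    class Instance:
        uuid: str
        host: str
        state: str

    class InstanceStore:
        """The one place allowed to touch the database."""

        def __init__(self, db):
            self.db = db   # stand-in for a real DB handle

        def get(self, uuid) -> Instance:
            return Instance(**self.db.fetch("instances", uuid))

        def save(self, instance: Instance) -> None:
            # Transaction boundaries live here, not in every service that
            # happens to want an instance record.
            self.db.update("instances", instance.uuid, vars(instance))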

Our domain-specific services include huge amounts of generic, common problem-space code: persistence, placement, access control…

Cultural norms and agility

Back in the dawn of OpenStack, there were some very very strong personalities. Codebases got totally overhauled and replaced without code review. Distrust got baked in as another cultural norm. Code review became a control point. It’s extraordinarily common to spend weeks or months getting patches through.

In some of the most effective teams I’ve worked in, code review is optional. Trust-and-iterate is the norm there: bypassing code review is a thing that needs to be justified, but code review is not how quality is delivered. Quality is delivered by continual improvement, rather than by the quality of any one individual commit.

A related thing is being super risk-averse around what lands in master (more on that below). Some very, very, very clever folk have written very clever code to facilitate this combination of siloed projects + trying super hard not to let regressions into master. This is very hard to deliver – and in fact we stepped back from an absolute approach there, about 4 years ago, to a model where we try very hard to prevent regressions just within a small set of connected projects.

OpenStack has a deeply split personality. Many folk want to build a downloadable cloud construction kit (e.g. Ubuntu). Many more want to build a downloadable cloud product (direct release users). And many wanted (are there still public clouds running master?) to be able to use master directly with confidence. This last use case is a major driver for wanting master to be regression-free…

Agility requires the ability to react to new information in a short timeframe. Doing CD (continuous deployment) requires a pipeline that starts with code review and ends with deployed code. OpenStack doesn’t do that. There’s a huge discontinuity between upstream and actual deployments, and effectively none of the developers of any part of OpenStack upstream are doing operations day to day. Those that do – at Rackspace, previously at HP (where I was working when I was full time on OpenStack), and I presume at OVH and other public clouds – are having to separate their operations work from their upstream changes.

Every initiative in a project will miss some details that have to be figured out later – that’s the nature of all but the most exacting software development processes, and those processes are hugely expensive (formal methods, just to start with). OpenStack copes with that by running huge planning cycles – 3-6 months apart.

Commits-as-control-points + long planning cycles + many developers not operating what they build => reaction to new information happens at a glacial pace.

To illustrate this, consider request tracing. 8 years ago Google released the Dapper whitepaper, Twitter wrote Zipkin and open-sourced it, and we’re now at the point where distributed tracing is de rigueur – it’s one of the standard things a service operator will expect for any system. We spent years dealing with pushback from developers in service teams that didn’t understand the benefits of the proposed analogous system for OpenStack. Rackspace wrote their own and patched it in as part of their productionisation of master. Then we also got to have a debate about whether OpenStack should have one such system, or a plugin interface to allow Rackspace not to change. [Sidebar: Rackers, I love you and :heart: your company, but that drove me up the wall! I wish we’d managed to just join forces and get everyone to at least bring a damn tracing interface in for everything.]
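To make “a damn tracing interface” concrete, something of roughly this shape would do – a hypothetical sketch, not OSProfiler or any existing OpenStack API; operators would plug in Zipkin, an in-house system, or a no-op backend:

    # Hypothetical minimal tracing interface shared by every service.
    from abc import ABC, abstractmethod
    from contextlib import contextmanager

    class Tracer(ABC):
        @abstractmethod
        def start_span(self, name, parent_id=None):
            """Start a span and return its id."""

        @abstractmethod
        def finish_span(self, span_id):
            """Record the end of a span."""

    class NoopTracer(Tracer):
        def start_span(self, name, parent_id=None):
            return None

        def finish_span(self, span_id):
            pass

    @contextmanager
    def traced(tracer, name, parent_id=None):
        # Usage: with traced(tracer, "nova.boot") as span_id: ...
        span_id = tracer.start_span(name, parent_id)
        try:
            yield span_id
        finally:
            tracer.finish_span(span_id)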

Test reliability

With TripleO we had the idea that we’d run a cloud based on master, provide feedback on what didn’t work, and create a virtuous circle. I think that was ultimately flawed, because the existing silos (e.g. Nova, or Glance) were not extended into owning those components within TripleO: TripleO was just another deployer, rather than part of the core feedback cycle.

More generally, we had a team of people (TripleO) running other people’s code (all of OpenStack – and commit rights were hard to get in other projects) with no SLA around that code.

I didn’t think of it that way at the time, for all that we understood that that was what we were doing, but that structure is structurally fragile: it’s the very antithesis of agile. When something broke it could stay broken for weeks, simply because the folk responsible for the break were not accountable for the non-brokenness of the system. (I’m not whinging about the teams we worked with – people did care, but caring and being accountable are fundamentally different things.)

There is another place with that pattern: devstack. Devstack is a codebase that exists to deploy all the other OpenStack components. It’s the purest essence of ‘run other people’s code with no SLA’, and devstack is the engine for pre-merge and pre-review testing in OpenStack.

I now believe that to be a key problem for OpenStack. Monty loves to talk about how many clouds OpenStack deploys daily in testing. Every one of those tests brings up, from scratch, some number of components (typically the dependency graph for the service under test) which have not changed and were not written by the author – and then, of course, the actual service being tested.

That’s structurally fragile: it’s running 5 or 10 times as much code as is relevant to the test being conducted. And the people able to fix any problems in those dependencies don’t feel the friction at the same time, in the same way, as their users do. (This isn’t a critique of the people, it’s just maths.)

I’ll probably write more about this in detail later, as it ties into a larger discussion about testing and deployment of microservices, or testing in production. But imagine if we got rid of devstack for review and merge testing. It has several other use cases of course – ‘give me an OpenStack to hack on’ is an important, discrete test case, and folk probably care that that works. For simplicity I’m going to ignore that for now.

So, if we don’t use devstack, how do we deploy a cloud for pre-merge testing?

We don’t. We don’t need to. What we need to do is deploy the changed code into a cloud whose other components are expected to be compatible with that code. Devstack did this by taking a given branch of a bunch of components and bringing them all up from scratch. Instead, we run a production-grade, monitored and alerted deployment of all the components. Possibly we run many such deployments, for configurations that cannot coexist (e.g. different federation modes in Keystone?). The people answering the pages for those alerts could be the service developers, or it could be an operations team with escalation back to the developers as needed (to filter noise like ‘oh, cloud $X has just had an outage’). But ultimately the developers would be directly accountable in some realtime fashion.

Then the test workflow becomes:

  1. Build the code under test. (e.g. clean VM, pip install, whatever)
  2. Deploy that code into the existing cluster as a new shard
  3. Exercise it as desired
  4. Tear it down

Let’s use nova-compute as an example.

  1. pip install
  2. Run nova-compute reporting to an existing API server, with some custom label on the hypervisor to allow targeting workloads at it
  3. Deploy a VM targeted at it
  4. Tear it down
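As a rough sketch of steps 3 and 4 with openstacksdk – assuming steps 1 and 2 have already registered the freshly built nova-compute against the long-lived cloud, and that the cloud, image, flavour, network and host names below are made-up placeholders:

    # Steps 3 and 4: boot a VM pinned to the compute node under test, then
    # tear it down.  All names here are placeholders.
    import openstack

    conn = openstack.connect(cloud="longlived-test-cloud")

    server = conn.compute.create_server(
        name="shard-smoke-test",
        image_id=conn.image.find_image("cirros").id,
        flavor_id=conn.compute.find_flavor("m1.tiny").id,
        networks=[{"uuid": conn.network.find_network("private").id}],
        # Nova's admin-only az:host syntax to target the new compute node.
        availability_zone="nova:compute-under-test",
    )
    try:
        conn.compute.wait_for_server(server)   # exercise it as desired
    finally:
        conn.compute.delete_server(server)     # tear it down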

I’m sure this raises lots of omg-we-can’t-do-that-because-technical-reason-X-about-what-we-do-today.

That’s fine, but for the purposes of this discussion, consider the destination – not the path.

If we did this:

  • Individual test runs could use substantially fewer resources
  • And perform substantially less work
  • Which implies better performance
  • Failures due to components other than the service under test would be a thing of the past (when you’re on the hook for your service running reliably, you engineer it to do that)

I think this post is long enough, so let me recap briefly. If there is interest out there I can drill into what sort of changes would be needed to transition to such a system, the suggestions I have for ease of use and ease of operations, and I think I’m also ready to provide some discussion about what the architecture of OpenStack should be.

Recap: why is development hard

Cultural problem #1: silos rather than collaboration in place. Moving the code rather than working with others.

Cultural problem #2: excessive entry controls. Make each commit right rather than trending upwards with a low-latency, high change rate.

Cultural problem #3: developer feedback cycle is measured in weeks (optimistically), or years (realistically).

Technical problem #1: excessive code executed in tests: 80% of test activity is not testing the code under test.

Technical problem #2: our testing is optimised for new-cloud deployments: as our userbase grows, upgrades become the common case, and testing should match that.