Continuous Delivery and software distributors

Back in 2010 the continuous delivery meme was just gaining traction. Today it’s extremely well established… except in F/LOSS projects.

I want that to change, so I’m going to try and really bring together a technical view on how that could work – which may require multiple blog posts – and if it gets traction I’ll put my fingers where my thoughts are and get into specifics with any project that wants to do this.

This is however merely a worked model today: it may be possible to do things quite differently, and I welcome all discussion about the topic!

tl;dr

Pick a service discovery mechanism (e.g. environment variables), write two small APIs – one for flag delivery, with streaming updates, and one for telemetry, with an optional aggressive data hiding proxy, then use those to feed enough data to drive a true CI/CD cycle back to upstream open source projects.

Who is in?

Background

(This assumes you know what C/D is – if you don’t, go read the link above, maybe wikipedia etc, then come back.)

Consider a typical SaaS C/D pipeline:

git -> build -> test -> deploy

Here all stages are owned by the one organisation. Once deployed, the build is usable by users – it’s basically the simplest pipeline around.

Now consider a typical on-premise C/D pipeline:

git -> build -> test -> expose -> install

Here the last stage, the install stage, takes place in the user’s context, but it may be under the control of the creator, or it may be under the control of the user. For instance, Google Play updates on an Android phone: when one selects ‘Update Now’, the install phase is triggered. Leaving the phone running with power and Wi-Fi will trigger it automatically, and security updates can be pushed anytime. Continuing with Google Play as an example, the expose step here is an API call to upload precompiled packages, so while there are three parties, the distributor – Google – isn’t performing any software development activities (they do gatekeep, but not develop).

Where it gets awkward is when there are multiple parties doing development in the pipeline.

Distributing and C/D

Let’s consider an OpenStack cloud underlay circa 2015: an operating system, OpenStack itself, some configuration management tool (or tools), a log egress tool, a metrics egress handler, and hardware management vendor binaries. And let’s say we’re working on something reasonably standalone. Say Horizon.

OpenStack for most users is something obtained from a vendor – e.g. Cisco, Canonical or Red Hat – and the model here is that the vendor is responsible for what the user receives; so security fixes – in particular embargoed security fixes – cannot simply be published publicly and then slowly propagate. They must reach users very quickly – often, ideally, before the public publication.

Now we have something like this:

upstream: git -> build -> test -> distribute
vendor:   distribute -> build -> test -> expose -> install


Can we not just say ‘the end of the C/D pipeline is a .tar.gz of Horizon at the distribute step’? Then every organisation can make their own decisions?

Maybe…

Why C/D?

  • Lower risk upgrades (smaller changes that can be reasoned about better; incremental enablement of new implementations to limit blast radius, decoupling shipping and enablement of new features)
  • Faster delivery of new features (less time dealing with failed upgrades == more time available to work on new features; finished features spend less time in inventory before benefiting users).
  • Better code hygiene (the same disciplines needed to make C/D safe also make more aggressive refactoring and tidiness changes safer to do, so it gets done more often).

1. If the upstream C/D pipeline stops at a tar.gz file, the lower-risk upgrade benefit is reduced or lost: the pipeline isn’t able to actually push all the way to installation, and thus we cannot tell when a particular upgrade workaround is no longer needed.

But Robert, that is the vendor’s problem!

I wish it was: in OpenStack so many vendors had the same problem that they created shared branches to work on it, then asked for shared time from the project to perform C/I on those branches. The benefit is only realised when the developer who is responsible for creating the issue can fix it, and can be sure that the fix has been delivered; this means either knowing that every install will transiently install every intermediary version, or that they will keep every workaround for every issue for some minimum time period, or that there will be a pipeline that can actually deliver the software.

2. .tar.gz files are not installed and running systems. A key characteristic of a C/D pipeline is that it exercises the installation and execution of software; the ability to run a component up is quite tightly coupled to the component itself: for all that the ‘this is a process’ interface is very general, the specific ‘this is server X’ or ‘this is CLI utility Y’ interfaces are very concrete. Perhaps a container based approach, where a much narrower interface can be defined in many ways, could be used to mitigate this aspect. Then even if different vendors use different config tools to do last mile config, the dev cycle knows that configuration and execution work. We need to make sure that we don’t separate the teams and their products though: the pipeline upstream must only test code that is relevant to upstream – and downstream likewise. We may be able to find a balance here, but I think more work articulating what that looks like is needed.

3. It will break the feedback cycle if the running metrics are not received upstream; yes, we need to be careful of privacy aspects, but basic telemetry – the upgrade worked, the upgrade failed, here is a crash dump – these are the tools for sifting through failure at scale, and a number of open source projects like Firefox, Ubuntu and Chromium have adopted them with great success. Notably all three have direct delivery models: their preference is to own the relationship with the user and gather such telemetry directly.

C/D and technical debt

Sidebar: ignoring public APIs and external dependencies – because they form the contract that installations and end users interact with, which we can reasonably expect to be quite sticky – the rest of a system should be entirely up to the maintainers, right? Refactor the DB, switch frameworks, switch languages, clean up classes and so on. With microservices there is a grey area: APIs that other microservices use which are not publicly supported.

The grey area is crucial, because it is where development drag comes in: anything internal to the system can be refactored in a single commit, or in a series of small commits that is rolled up into one, or variations on this theme.

But some aspect that another discrete component depends upon, with its own delivery cycle: that cannot be fixed so easily, and unless it was built with the same care public APIs were, it may well have poor scaling or performance characteristics that make fixing it very important.

Given two C/D’d components A and B, where A wants to remove some private API B uses, A cannot delete that API from its git repo until all B’s everywhere that receive A via C/D have been deployed with a version that does not use the private API.

That is, old versions of B place technical debt on A across the interfaces of A that they use. And this actually applies to public interfaces too – even if they are more sticky, we can expect the components of an ecosystem to update to newer APIs that are cheaper to serve, and laggards hold performance back, keep stale code alive in the codebase for longer and so on.

This places a secondary requirement on the telemetry: we need to be able to tell whether the fleet is upgraded or not.
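
As a rough illustration of what acting on that requirement could look like – assuming an aggregated fleet report shaped like the proxy output sketched later in this post; the names and version cutoff are illustrative, not a proposed standard – component A’s maintainers could gate removal of a private API on no remaining weight sitting on old versions of B:

# Hedged sketch: decide whether component A can drop a private API that older
# versions of component B still call. The report shape mirrors the aggregated
# telemetry sketched later in this post; names and the cutoff are illustrative.

MIN_SAFE_B_VERSION = (4, 2, 0)  # hypothetical: first B release that no longer uses the API

def parse_version(text):
    return tuple(int(part) for part in text.split("."))

def can_remove_private_api(report):
    """True only when no deployed weight remains on B versions older than the cutoff."""
    for entry in report:
        if entry["project"] == "B" and parse_version(entry["version"]) < MIN_SAFE_B_VERSION:
            return False
    return True

report = [
    {"project": "B", "version": "4.2.1", "weight": 95},
    {"project": "B", "version": "4.1.0", "weight": 5},  # a laggard still deployed
]
print(can_remove_private_api(report))  # False: the laggard blocks removal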

So what does a working model look like?

I think we need a different diagram than the pipeline; the pipeline talks about the things most folk doing an API or some such project will have directly in hand, but it’s not actually the full story. The full story is rounded out with two additional features: feature flags and telemetry. And since we want to protect our users, and distributors probably will simply refuse to provide insights onto actual users, let’s assume a near-zero-trust model around both.

Feature flags

As I discussed in my previous blog post, feature flags can be used for fairly arbitrary purposes, but in this situation, where trust is limited, I think we need to identify the crucial C/D enabling use cases, and design for them.

I think that those can be reduced to soft launches – decoupling activating new code paths from getting them shipped out onto machines – and kill switches – killing off flawed / faulty code paths when they start failing, in advance of a massive cascade failure; both of which we can implement with essentially the same thing: some identifier for a code path and then a percentage of the deployed base to enable it on. If we define this API with efficient streaming updates and a consistent service discovery mechanism for the flag API, then it could be replicated by vendors, other distributors or even each user, pulling the feature API data downstream in near real time.
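
A minimal sketch of what a consumer of such an API could look like – the environment variable names, endpoint and polling (rather than true streaming) are all assumptions for illustration:

# Hedged sketch of a flag consumer: discover the flag service via an
# environment variable, keep a local table of {code path identifier: rollout
# percentage} refreshed from it, and enable code paths for this installation’s
# pre-assigned bucket. A kill switch is the same flag driven back to 0%.
import json
import os
import urllib.request

FLAG_SERVICE_URL = os.environ.get("FLAG_SERVICE_URL", "http://localhost:8124/flags")
INSTALL_BUCKET = int(os.environ.get("INSTALL_BUCKET", "42"))  # 0-99, assigned once per install

flags = {}  # e.g. {"new-scheduler": 25}

def refresh_flags():
    """Poll for updates; a production version would stream them instead."""
    with urllib.request.urlopen(FLAG_SERVICE_URL, timeout=10) as response:
        flags.update(json.load(response))

def enabled(name):
    return INSTALL_BUCKET < flags.get(name, 0)  # unknown flags default to off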

Telemetry

The difficulty with telemetry APIs is that they can egress anything. OTOH this is open source code, so malicious telemetry would be visible. But we can structure it to make it harder to violate privacy.

What does the C/D cycle need from telemetry, and what privacy do we need to preserve?

This very much needs discussion with stakeholders, but as a first approximation: the C/D cycle depends on knowing what versions are out there and whether they are working. It depends on knowing what feature flags have actually been activated in the running versions. It doesn’t depend on absolute numbers of either feature flags or versions.

Using Google Play again as an example, there is prior art – https://support.google.com/firebase/answer/6317485 – but I want to think truly minimally, because the goal I have is to enable C/D in situations with vastly different trust levels than Google Play has. However, perhaps this isn’t enough; perhaps we do need generic events and the ability to get deeper telemetry to enable confidence.

That said, let us sketch what an API document for that might look like:

project:
version:
health:
flags:
- name:
  value:

If that was reported by every deployed instance of a project, once per hour, maybe with a dependencies version list added to deal with variation in builds, it would trivially reveal the cardinality of reporters. Many reporters won’t care (for instance QA testbeds). Many will.
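
For concreteness, a per-instance reporter for that document might be as small as the following sketch – JSON over HTTP, the TELEMETRY_URL variable and the field names are all assumptions, not a proposed wire format:

# Hedged sketch of the once-per-hour per-instance report described above.
import json
import os
import time
import urllib.request

TELEMETRY_URL = os.environ.get("TELEMETRY_URL", "http://localhost:8125/report")

def report_once(project, version, healthy, flags):
    payload = {
        "project": project,
        "version": version,
        "health": "ok" if healthy else "failing",
        "flags": [{"name": name, "value": value} for name, value in flags.items()],
    }
    request = urllib.request.Request(
        TELEMETRY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(request, timeout=10)
    except OSError:
        pass  # telemetry is best effort; never break the host service over it

while True:
    report_once("horizon", "17.0.1", True, {"new-scheduler": "on"})
    time.sleep(3600)  # once per hour, as suggested above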

If we aggregate through a cardinality hiding proxy, then that vector is addressed – something like this:

- project:
  version:
  weight:
  health:
  flags:
  - name:
    value:
- project: ...

Because this data is really only best effort, such a proxy could be backed by memcache or even just an in-memory store, depending on what degree of ‘cloud-nativeness’ we want to offer. It would receive accurate data, then deduplicate to get relative weights, round those to (say) 5% as a minimum to avoid disclosing too much about long tail situations (and yes, the sum of 100 1% reports would exceed 100 :)), and then push that up.
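
A sketch of that aggregation pass, assuming an in-memory store and the 5% floor described above (flags omitted for brevity; the data shapes are illustrative, not a proposed format):

# Hedged sketch of the cardinality-hiding proxy: collect accurate per-instance
# reports, deduplicate per reporter, convert counts to relative weights, and
# round to a 5% floor so long-tail configurations disclose less.
from collections import Counter

ROUNDING_FLOOR = 5  # percent

def aggregate(reports):
    """reports: iterable of (reporter_id, project, version, health) tuples."""
    latest = {}  # keep only the most recent report per reporter
    for reporter_id, project, version, health in reports:
        latest[reporter_id] = (project, version, health)
    counts = Counter(latest.values())
    total = sum(counts.values())
    aggregated = []
    for (project, version, health), count in counts.items():
        raw = 100 * count / total
        weight = max(ROUNDING_FLOOR, ROUNDING_FLOOR * round(raw / ROUNDING_FLOOR))
        aggregated.append({"project": project, "version": version,
                           "weight": weight, "health": health})
    return aggregated  # note: the weights can legitimately sum to more than 100

print(aggregate([
    ("qa-1", "horizon", "17.0.1", "ok"),
    ("prod-1", "horizon", "17.0.0", "ok"),
    ("prod-2", "horizon", "17.0.0", "ok"),
]))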

Open Questions

  • Should library projects report, or are they only used in the context of an application/service?
    • How can we help library projects answer questions like ‘has every user stopped using feature Y so that we can finally remove it’ ?
  • Would this be enough to get rid of the fixation on using stable branches everyone seems to have?
    • If not why not?
  • What have I forgotten?


Feature flags

Feature toggles, feature flags – they’ve been written about a lot already (use a search engine :)), yet I feel like writing a post about them. Why? I’ve been personally involved in two from-scratch implementations, and it may be interesting for folk to read about that.

I say that lots has been written; http://featureflags.io/ (which appears to be a bit of an astroturf site for LaunchDarkly 😉 ) nevertheless has gathered a bunch of links to literature as well as a number of SDKs and the like; there are *other* FFaaS offerings than LaunchDarkly; I have no idea which I would use for my next project at this point – but hopefully you’ll have some tools to reason about that at the end of this piece.

I’m going to entirely skip over the motivation (go read those other pieces), other than to say that the evidence is in, trunk based development is better.



Forsgren, N., Humble, J. and Kim, G., 2018. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution.

A feature flag is a very simple thing: it is a value controlled outside of your development cycle that in turn controls the behaviour of your code. There are dozens of ways to implement that. Hash-defines and compile time flags have been used for a very long time, so long that we don’t even think of them as feature flags, but they are. So are configuration options in configuration files, in the broadest possible sense. The difference is largely in focus, and where the same system meets all parties’ needs, I think it’s entirely fine to use just the one system – that is what we did for Launchpad, and it worked quite well I think – as far as I know it hasn’t been changed. Specifically, in Launchpad the Zope runtime config is regular ZCML files on disk, and feature flags are complementary to that (but see the profiling example below).

Configuration tends to be thought of as “choosing behaviour after the system is compiled and before the process is started” – e.g. creating files on disk. But this is not always the case – some enterprise systems are notoriously flexible with database managed configuration rulesets which no-one can figure out – and we don’t want to create that situation.

Let’s generalise things a little – a flag could be configured over the lifetime of the binary (compile flag), an execution (runtime flag / config file, or one-time evaluation of some dynamic system), time (dynamically reconfigured from changed config files or some dynamic system), or configured based on user/team/organisation, URL path (of a web request, naturally :P), and generally any other thing that could be utilised in making a decision about whether to conditionally perform some code. It can also be useful to be able to randomly bucket some fraction of checks (e.g. 1/3 of all requests will go down this code path) – but do it consistently for the same browser.
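
Consistent bucketing usually comes down to a stable hash of some identity – a browser cookie, a user id – so the same client keeps getting the same answer; a sketch (names are illustrative):

# Hedged sketch of consistent fractional bucketing: roughly 1/3 of clients take
# the new code path, and any given browser/user id always gets the same answer.
import hashlib

def in_rollout(flag_name, client_id, fraction):
    digest = hashlib.sha256(f"{flag_name}:{client_id}".encode()).hexdigest()
    return (int(digest, 16) % 1000) / 1000 < fraction

print(in_rollout("new-checkout-flow", "browser-cookie-abc123", 1 / 3))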

Depending on what sort of system you are building, some of those sorts of scopes will be more or less important to you – for instance, if you are shipping on-premise software, you may well want to be turning unreleased features entirely off in the binary. If you are shipping a web API, doing soft launches with population rollouts and feature kill switches may be your priority.

Similarly, if you have an existing microservices architecture, having a feature flags aaS API is probably much more important (so that your different microservices can collaborate on in-progress features!) than if you have a monolithic DB where you put all your data today.

Ultimately you will end up with something that looks roughly like a key-value store: get_flag_value(flagname, context) -> value. Somewhere separate to your code base you will have a configuration store where you put rules that define how that key-value interface comes to a given value.
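
A toy version of that shape, with the rules held in a structure you would keep in that separate configuration store (flag names, scopes and values here are all illustrative):

# Hedged sketch of the key-value shape: get_flag_value(flagname, context) -> value,
# driven by rules held outside the code base. Rules are checked in priority order;
# the first rule whose scope matches the caller's context wins.
RULES = {  # in practice this lives in a separate configuration store
    "beta-ui": [
        ({"team": "design"}, "on"),   # scope -> value
        ({}, "off"),                  # empty scope == default
    ],
}

def get_flag_value(flagname, context, default=None):
    for scope, value in RULES.get(flagname, []):
        if all(context.get(key) == wanted for key, wanted in scope.items()):
            return value
    return default

print(get_flag_value("beta-ui", {"user": "alice", "team": "design"}))  # "on"
print(get_flag_value("beta-ui", {"user": "bob"}))                      # "off"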

There are a few key properties that I consider beneficial in a feature flag system:

  • Graceful degradation
  • Permissionless (or alternatively namespaced)
  • Loosely typed
  • Observable
  • Centralised
  • Dynamic

Graceful Degradation

Feature flags will be consulted from all over the place – browser code, templates, DB mapper, data exporters, test harnesses etc. If the flag system itself is degraded, you need the system’s behaviour to remain graceful, rather than stopping catastrophically. This often requires multiple different considerations; for instance, having sensible defaults for your flags (choose a default that is ok, and change the meaning of defaults as what is ‘ok’ changes over time), having caching layers to deal with internet flakiness or API blips back to your flag store, and making sure you have memory limits on local caches to prevent leaks, and so forth. Different sorts of flag implementations have different failure modes: an API based flag system will be quite different to one stored in the same DB the rest of your code is using, which will be different to a process-startup command line option flag system.
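
In code, that degradation story is mostly ‘defaults plus a bounded cache’; a sketch of the shape, assuming some fetch callable that hits your real flag store and may fail:

# Hedged sketch of graceful degradation: every lookup has a sensible default,
# recent answers are cached so flag-store blips don't take the site down, and
# the cache is bounded so it cannot leak without limit.
from collections import OrderedDict

DEFAULTS = {"new-checkout-flow": "off"}  # chosen so "flag system down" is still ok
CACHE_LIMIT = 10_000

_cache = OrderedDict()

def get_flag(name, fetch):
    """fetch: callable hitting the real flag store; may raise during outages."""
    try:
        value = fetch(name)
        _cache[name] = value
        _cache.move_to_end(name)
        if len(_cache) > CACHE_LIMIT:  # bound memory use
            _cache.popitem(last=False)
        return value
    except Exception:
        # flag store degraded: fall back to the last known value, then the default
        return _cache.get(name, DEFAULTS.get(name))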

A second dimension where things can go wrong is dealing with missing or unexpected flags. Remember that your system changes over time: a new flag in the code base won’t exist in the database until after the rollout, and when a flag is deleted from the codebase, it may still be in your database. Worse, if you have multiple instances of a service running, you may have different code all examining the same flag at the same time, so operations like ‘we are changing the meaning of a flag’ won’t take place atomically.

Permissions

Flags have dual audiences; one part is pure dev: make it possible to keep integration costs and risks low by merging fully integrated code on a continual basis without activating not-yet-ready (or released!) codepaths. The second part is pure operations: use flags to control access to dark launches, demo new features, killswitch parts of the site during attack mitigation, target debug features to staff and so forth.

Your developers need some way to add and remove the flags needed in their inner loop of development – some flags will have lifetimes of only a few days.

Whoever is doing operations on prod, though, may need some stronger guarantees – in particular, they may need some controls over who can enable which flags. E.g. if you have a high-control environment then team A shouldn’t be able to influence team B’s flags. One way is to namespace the flags and only permit configuration for the namespaces a developer’s team(s) owns. Another way is to only have trusted individuals able to set flags – but this obviously adds friction to processes.
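
The namespacing option can be as blunt as a prefix check before accepting a configuration change; a sketch (team and namespace names are made up):

# Hedged sketch of namespaced flag permissions: a team may only configure flags
# inside namespaces it owns, e.g. team A cannot touch "team-b/..." flags.
TEAM_NAMESPACES = {
    "team-a": {"team-a/"},
    "team-b": {"team-b/"},
    "sre": {"team-a/", "team-b/", "ops/"},  # trusted operators span namespaces
}

def may_set_flag(team, flag_name):
    return any(flag_name.startswith(prefix) for prefix in TEAM_NAMESPACES.get(team, ()))

print(may_set_flag("team-a", "team-a/new-parser"))  # True
print(may_set_flag("team-a", "team-b/killswitch"))  # False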

Loose Typing

Some systems model the type of each flag: is it boolean, numeric, string etc. I think this is a poor idea mainly because it tends to interact poorly with the ephemeral nature of each deployment of a code base. If build X defines flag Y as boolean, and build X+1 defines it as string, the configuration store has to interact with both at the same time during rollouts, and do so gracefully. One way is to treat everything as a string and cast it to the desired type just in time, with failures being treated as default.
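
The ‘strings everywhere, cast at the edge’ approach looks something like this sketch:

# Hedged sketch of loose typing: the store only holds strings; callers ask for
# the type they want right now, and any cast failure quietly becomes the default.
def as_bool(raw, default=False):
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "on", "yes")

def as_int(raw, default=0):
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default

# Build X reads the flag as a boolean, build X+1 as an integer percentage;
# both coexist against the same stored string during a rollout.
stored = "25"
print(as_bool(stored))  # False - not a recognised truthy string
print(as_int(stored))   # 25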

Observable

Make sure that when a user reports crazy weird behaviour, that you can figure out what value they had for what flags. For instance, in Launchpad we put them in the HTML.

Centralised

Having all your flags in one system lets you write generic tooling – such as ‘what flags are enabled in QA but not production’, or ‘what flags are set but have not been queried in the last month’. It is well worth the effort to build a single centralised system (or consume one such thing) and then use it everywhere. Writing adapters to different runtimes is relatively low overhead compared to rummaging through N different config systems because you can’t remember which one is running which platform.
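
That kind of generic tooling falls out almost for free once everything is in one place; for instance, a ‘what is on in QA but not in production’ report could be a few lines – the environment names and the fetch helper here are assumptions:

# Hedged sketch of cross-environment tooling enabled by a centralised store:
# report flags enabled in QA but not in production. fetch_enabled_flags is a
# hypothetical helper returning the set of enabled flag names per environment.
def flags_only_in(env_a, env_b, fetch_enabled_flags):
    return fetch_enabled_flags(env_a) - fetch_enabled_flags(env_b)

snapshots = {
    "qa": {"new-checkout-flow", "beta-ui"},
    "production": {"beta-ui"},
}
print(flags_only_in("qa", "production", snapshots.get))  # {'new-checkout-flow'}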

Scope things in the system with a top level tenant / project style construct (any FFaaS will have this I’m sure :)).

Dynamic

There may be some parts of the system that cannot apply some flags rapidly, but generally speaking the less poking around that needs to be done to make something take effect the better. So build or buy a dynamic system, and if you want a ‘only on process restart’ model for some bits of it, just consult the dynamic system at the relevant time (e.g. during k8s Deployment object creation, or process startup, or …). But then everywhere else, you can react just-in-time; and even make the system itself be script driven.

The Launchpad feature flag system

I was the architect for Launchpad when the flag system was added. Martin Pool wanted to help accelerate feature development on Launchpad, and we’d all become aware of the feature flag style things hip groups like YouTube were doing; so he wrote a LEP: https://dev.launchpad.net/LEP/FeatureFlags , pushed that through our process and then turned it into code and docs (and once the first bits landed folk started using and contributing to it). Here’s a patch I wrote using the system to allow me to turn on Python profiling remotely. Here’s one added by William Grant to allow working around a crash in a packaging tool.

Launchpad has a monolithic data store, with bulk data federated out to various disk stores, but all relational data in one schema; we didn’t see much benefit in pushing for a dedicated API per se at that time – it can always be added later, as the design was deliberately minimal. The flags implementation is all in-process as a result, though there may be a JS thunk at this point – I haven’t gone looking. Permissions are done through trusted staff members, it is loosely typed and has an audit log for tracking changes.

The other one

The other one I was involved in was at VMware a couple of years ago now; it’s in-house, but there are some interesting anecdotes I can share. The thinking on feature flags when I started the discussion was that they were strictly configuration file settings – I was still finding my feet with the in-house Xenon framework at the time (I think this was week 3? 🙂) so I whipped up an API specification and a colleague (Tyler Curtis) turned that into a draft engine; it wasn’t the most beautiful thing, but it was still going strong and being enhanced by the team when I left earlier this year. The initial implementation had a REST API and a very basic set of scopes. That lasted about 18 months before tenant based scopes were needed and added. I had designed it with the intent of adding multi-arm bandit selection down the track, but we didn’t make the time to develop that capability, which is a bit sad.

Comparing that API with LaunchDarkly I see that they do support A/B trials but don’t have multivariate tests live yet, which suggests that they are still very limited in that space. I think there is room for some very simple home grown work in this area to pay off nicely for Symphony (the project codename the flags system was written for).

Should you run your own?

It would be very unusual to have PII or customer data in the flag configuration store, and you shouldn’t have access control lists in there either (in LP we did allow turning on code by group, which is somewhat similar). The point is that the very worst thing that can happen if someone else controls your feature flags and is malicious is actually not very bad. So as far as aaS vendor trust goes, not a lot of trust is needed to be pretty comfortable using one.

But, if you’re in a particularly high-trust environment, or you have no internet access, running your own may be super important, and then yeah, do it :). They aren’t big complex systems, even with multi-arm bandit logic added in (the difficulty there is the logic, not the processing).

Or if you think the prices being charged by the incumbents are ridiculous. Actually, perhaps hit me up and we’ll make a startup and do this right…

Should you build your own?

A trivial flag system plus persistence could be as little as a few days’ work. Less if you grab an existing bolt-on for your framework. If you have multiple services, or teams, or languages… expect that to become the gift that keeps on giving, as you have to consolidate and converge across your organisation indefinitely. If you have the resources – great, not a problem.


I think most people will be better off taking one of the existing open source flag systems – perhaps https://unleash.github.io/ – and using it; even if it is more complex than a system tightly fitted to your needs, the benefit of having one that is a true API from the start will pay for itself the very first time you split a project, or want to report what features are on in dev and off in prod, not to mention multiple existing language bindings etc.

Christchurch

Trigger warning: this is likely going to make you angry, or sad, or sadmad, or more. OTOH I’m going to make the points I’ve got to make in a short and pointed fashion and you, dear reader, can go look up supporting or refuting documentation as you feel appropriate.

Christchurch recently suffered NZ’s largest modern mass murder. I say modern because and also.

This is the third time I’ve been “close” to murder. Once was the Aramoana massacre, where I lived roughly halfway between the city and Aramoana itself – the helicopter flight path was right overhead; once was moving to Sydney and, on the very first day after I arrived, taking the train into the city from a train station bathed in blood (with a helpful sign informing us that the police were seeking information that could lead to an arrest). And last Friday was Christchurch. And this is why I have put “close” in quotes: close enough to be viscerally affected, but not directly affected: none of the folk that died were personally known to me in any of these cases. I have had people I know die and that’s a whole different level. There are plenty of people writing from that perspective, and you should go and listen to them. I have, and am, and have friends in that set, who need all the support they can get at the moment.

And that is why I’m a little conflicted about writing this post. Because this tragedy is all the more tragic because of our failures leading up to it.

What failures you might ask? Ours as citizens, or our bureaucracies? Or governments?

Lets look at this from a “Law & Order” perspective – means, motive, opportunity. The opportunity is something I think we should be proud of: folk being able to do what they want to do, inviting people they don’t know into their mosque, without fear. But means and motive…

Means

Means first. I wish I could say “I’m not going to engage in the nonsense debates about whether the sort of weapon the shooter had should be available for private citizen use or not.” But actually, I have to. Australia has ongoing problems with violence, and regulated guns heavily back in 1996. Since then, massacres have stopped being a thing in Australia. Prior to that, massacres were happening on a regular basis. This doesn’t deny the truth that the same machines can be used to shoot rabbits, but the evidence seems to be that they are too easily abused, and there are enough less-over-the-top alternatives for the farm use case that it isn’t a compelling tradeoff.

Would the attacker have been able to carry out the attack without semi-automatic rifles? Sure. But with less ammunition per clip, less accuracy per shot, less damage per shot.

Here’s the sad bit though. We tightened our gun laws after Aramoana in 1990. We didn’t tighten them again after Port Arthur. We didn’t tighten them again after Sandy Hook. Or any of the other mass murders overseas using military weapons. We have had decades of inaction while the evidence of the potential harms increased. Late last year Stuff published this article – which is actually good reporting! I quote:

It is a very sad fact that changes to gun regulation only come about in the wake of a tragedy: Aramoana, Port Arthur, the Dunblane massacre.

Since 1992, politicians have backed nervously away.

50 people died on Friday in large part because the murderer had access to effective means for mass murder, which we knew was effective, which our Police knew was effective, while we spent our political capital on other complete BS (look up political scandals in NZ – I don’t have the heart to enumerate them right now).

Jacinda’s leadership right now consists of executing on a thing we’ve had queued up as necessary for – at the least – decades. It is an indictment on all of us here in NZ that 50 people had to die to get this done. I’m very sadmad right now. I don’t feel like I could have driven the debate in the right direction, but our leaders surely could have. And none of them are owning this.

Motive

It’s tempting to write the murderer off as being a bad human, or badly raised, or insane. But the reality as I understand it is that there were many signs of his intolerant, violent views, that he doesn’t have an obvious mental disorder, and that his violence was targeted: he’s not just plain evil and out to kill everyone…

In NZ at least we consider both intolerance and violence antisocial: we expect tolerance and peaceful discussion with each other as baseline characteristics of human beings.

And again, this is a thing we’ve seen before overseas. We’ve seen many murders done on the basis of violent intolerant ideology, but we haven’t actually adapted here to deal with it.

How are we, at a systemic level, engaging with the problem and addressing it? How do we help folk become tolerant? How do we give them other tools than violence? And if they are truly unable to use other tools, and unable to become tolerant, how do we safeguard ourselves?

But it’s worse: listening to my Māori compatriots, NZ is pretty racist, *at a minimum* structurally, with many more Māori in prison, and presumably wage inequality and other “invisible” discriminations. Not much point asking everyone to be super tolerant and friendly with each other if we’re not giving each other a fair go.

We do pretty great in how we bring up kids in early childhood and primary school, from what I can tell with my daughter’s school, but I have no idea about higher schooling these days, and what do we do for adults and immigrants? Science says that most immigrants are motivated flexible people (self selecting group), but for the tiny tiny number that aren’t: how do we help them?

Personally I think wealth inequality goes a long way towards sustaining discrimination – anyone thinking of the world as a zero-sum game is much more likely to be hellbent on keeping other folk down to keep themselves up – and I very much want to see that change in NZ: I’d like to see us introduce a UBI, get rid of the means testing on various social supports (like the dole), and generally toss away the neoliberal narrative that has poisoned us for 30 odd years.