Feature flags

Feature toggles, feature flags – they’ve been written about a lot already (use a search engine :)), yet I feel like writing a post about them. Why? I’ve been personally involved in two from-scratch implementations, and it may be interesting for folk to read about that.

I say that lots has been written; http://featureflags.io/ (which appears to be a bit of an astroturf site for LaunchDarkly 😉 ) nevertheless has gathered a bunch of links to literature as well as a number of SDKs and the like; there are *other* FFaaS offerings than LaunchDarkly; I have no idea which I would use for my next project at this point – but hopefully you’ll have some tools to reason about that at the end of this piece.

I’m going to entirely skip over the motivation (go read those other pieces), other than to say that the evidence is in, trunk based development is better.



Forsgren, N., Humble, J. and Kim, G., 2018. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution.

A feature flag is a very simple thing: it is a value controlled outside of your development cycle that in turn controls the behaviour of your code. There are dozens of ways to implement that. Hash-defines and compile-time flags have been used for a very long time – so long that we don’t even think of them as feature flags, but they are. So are configuration options in configuration files, in the broadest possible sense. The difference is largely in focus, and where the same system meets all parties’ needs, I think it’s entirely fine to use just the one system – that is what we did for Launchpad, and it worked quite well I think – as far as I know it hasn’t been changed. Specifically, in Launchpad the Zope runtime config is regular ZCML files on disk, and feature flags are complementary to that (but see the profiling example below).

Configuration tends to be thought of as “choosing behaviour after the system is compiled and before the process is started” – e.g. creating files on disk. But this is not always the case – some enterprise systems are notoriously flexible with database managed configuration rulesets which no-one can figure out – and we don’t want to create that situation.

Let’s generalise things a little – a flag could be configured over the lifetime of the binary (compile flag), of an execution (runtime flag/config file, or one-time evaluation of some dynamic system), over time (dynamically reconfigured from changed config files or some dynamic system), or configured based on user/team/organisation, URL path (of a web request, naturally :P), and generally any other thing that could be used in deciding whether to conditionally perform some code. It can also be useful to be able to randomly bucket some fraction of checks (e.g. 1/3 of all requests will go down this code path) – but do it consistently for the same browser.

Depending on what sort of system you are building, some of those sorts of scopes will be more or less important to you – for instance, if you are shipping on-premise software, you may well want to be turning unreleased features entirely off in the binary. If you are shipping a web API, doing soft launches with population rollouts and feature kill switches may be your priority.
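
That consistent population bucketing can be as simple as hashing a stable identifier into a percentage bucket. Here’s a minimal sketch – the hashing scheme and all names are my own illustration, not any particular product’s API:

import hashlib

def in_rollout(flag_name, stable_id, percentage):
    # Hash the flag name together with a stable identifier (for example a
    # browser or user id) so the same client always lands in the same bucket.
    digest = hashlib.sha256("{}:{}".format(flag_name, stable_id).encode()).hexdigest()
    return int(digest, 16) % 100 < percentage

# Roughly a third of clients take the new code path, and a given client
# gets the same answer on every request.
print(in_rollout("new-renderer", "browser-abc123", 33))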

Similarly, if you have an existing microservices architecture, having a feature flags aaS API is probably much more important (so that your different microservices can collaborate on in-progress features!) than if you have a monolithic DB where you put all your data today.

Ultimately you will end up with something that looks roughly like a key-value store: get_flag_value(flagname, context) -> value. Somewhere separate to your code base you will have a configuration store where you put rules that define how that key-value interface comes to a given value.
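
A minimal sketch of that interface, with the rule store as a plain in-process dictionary (all names here are illustrative, not any particular system’s API):

# A hypothetical, minimal rule store: rules live outside the code base and
# evaluation is just "first rule whose conditions match the context wins".
RULES = {
    "new-renderer": [
        ({"team": "web"}, "on"),   # the web team sees the feature
        ({}, "off"),               # everyone else gets the default rule
    ],
}

def get_flag_value(flagname, context, default=None):
    for conditions, value in RULES.get(flagname, []):
        if all(context.get(key) == wanted for key, wanted in conditions.items()):
            return value
    return default

print(get_flag_value("new-renderer", {"team": "web", "user": "alice"}))   # on
print(get_flag_value("unknown-flag", {"team": "web"}, default="off"))     # off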

There are a few key properties that I consider beneficial in a feature flag system:

  • Graceful degradation
  • Permissionless / (or alternatively namespaced)
  • Loosely typed
  • Observable
  • Centralised
  • Dynamic

Graceful Degradation

Feature flags will be consulted from all over the place – browser code, templates, DB mapper, data exporters, test harnesses etc. If the flag system itself is degraded, you need the system’s behaviour to remain graceful rather than stopping catastrophically. This often requires multiple different considerations; for instance, having sensible defaults for your flags (choose a default that is ok, and change the meaning of defaults as what is ‘ok’ changes over time), having caching layers to deal with internet flakiness or API blips back to your flag store, and making sure you have memory limits on local caches to prevent leaks, and so forth. Different sorts of flag implementations have different failure modes: an API based flag system will be quite different to one stored in the same DB the rest of your code is using, which will be different to a process-startup command line option flag system.
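
For instance, a thin wrapper along these lines covers the default and caching points – a sketch only, with the actual call to your flag store passed in as a parameter, since that part is whatever your system uses:

import time

_CACHE = {}                # flagname -> (value, fetched_at)
_CACHE_TTL = 60            # seconds: a slightly stale value beats falling over
_CACHE_MAX_ENTRIES = 1000  # bound the cache so it can't leak memory

def get_flag(flagname, default, fetch):
    # `fetch` is whatever talks to the real flag store (API call, DB query,
    # ...) and is assumed to raise on failure.
    now = time.monotonic()
    try:
        value = fetch(flagname)
    except Exception:
        cached = _CACHE.get(flagname)
        if cached is not None and now - cached[1] < _CACHE_TTL:
            return cached[0]
        return default
    if flagname in _CACHE or len(_CACHE) < _CACHE_MAX_ENTRIES:
        _CACHE[flagname] = (value, now)
    return value

# e.g. get_flag("new-renderer", "off", flag_service_lookup)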

A second dimension where things can go wrong is dealing with missing or unexpected flags. Remember that your system changes over time: a new flag in the code base won’t exist in the database until after the rollout, and when a flag is deleted from the codebase, it may still be in your database. Worse, if you have multiple instances of a service running, you may have different code all examining the same flag at the same time, so operations like ‘we are changing the meaning of a flag’ won’t take place atomically.

Permissions

Flags have dual audiences; one part is pure dev: make it possible to keep integration costs and risks low by merging fully integrated code on a continual basis without activating not-yet-ready (or released!) codepaths. The second part is pure operations: use flags to control access to dark launches, demo new features, killswitch parts of the site during attack mitigation, target debug features to staff and so forth.

Your developers need some way to add and remove the flags needed in their inner loop of development; some of those flags will only live for a few days.

Whoever is doing operations on prod though, may need some stronger guarantees – particularly they may need some controls over who can enable what flags. e.g. if you have a high control environment then team A shouldn’t be able to influence team B’s flags. One way is to namespace the flags and only permit configuration for the namespace a developer’s team(s) owns. Another way is to only have trusted individuals be able to set flags – but this obviously adds friction to processes.

Loose Typing

Some systems model the type of each flag: is it boolean, numeric, string etc. I think this is a poor idea mainly because it tends to interact poorly with the ephemeral nature of each deployment of a code base. If build X defines flag Y as boolean, and build X+1 defines it as string, the configuration store has to interact with both at the same time during rollouts, and do so gracefully. One way is to treat everything as a string and cast it to the desired type just in time, with failures being treated as default.
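
A sketch of that just-in-time casting (my own illustration):

def as_bool(raw, default=False):
    # Everything is stored as a string; interpret it at the point of use and
    # fall back to the default on anything unexpected.
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in ("1", "true", "on", "yes"):
        return True
    if value in ("0", "false", "off", "no"):
        return False
    return default

def as_int(raw, default=0):
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default

print(as_bool("true"), as_int("banana", default=5))   # True 5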

Observable

Make sure that when a user reports crazy weird behaviour, you can figure out what value they had for which flags. For instance, in Launchpad we put them in the HTML.

Centralised

Having all your flags in one system lets you write generic tooling – such as ‘what flags are enabled in QA but not production’, or ‘what flags are set but have not been queried in the last month’. It is well worth the effort to build a single centralised system (or consume one such thing) and then use it everywhere. Writing adapters to different runtimes is relatively low overhead compared to rummaging through N different config systems because you can’t remember which one is running which platform.
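
Once everything is in one store, that tooling can be trivial – for instance, a sketch of the ‘enabled in QA but not production’ report, assuming you can dump each environment’s flags as a name-to-value mapping:

def enabled_in_qa_only(qa_flags, prod_flags):
    # Both arguments are flagname -> value mappings dumped from the store;
    # treat empty/"off"/"false"/missing as disabled.
    disabled = ("", "off", "false", None)
    return sorted(
        name for name, value in qa_flags.items()
        if value not in disabled and prod_flags.get(name) in disabled
    )

print(enabled_in_qa_only(
    {"new-renderer": "on", "beta-search": "off"},
    {"beta-search": "off"},
))   # ['new-renderer']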

Scope things in the system with a top level tenant / project style construct (any FFaaS will have this I’m sure :)).

Dynamic

There may be some parts of the system that cannot apply some flags rapidly, but generally speaking the less poking around that needs to be done to make something take effect, the better. So build or buy a dynamic system, and if you want an ‘only on process restart’ model for some bits of it, just consult the dynamic system at the relevant time (e.g. during k8s Deployment object creation, or process startup, or …). But then everywhere else, you can react just-in-time; and even make the system itself be script driven.
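
A sketch of that split – read the store once at startup for the genuinely restart-scoped settings, and go to it live for everything else (the store here is just a dict, standing in for whatever your dynamic system is):

class Flags:
    def __init__(self, store):
        # `store` is anything with a dict-like get(); here a plain dict.
        self._store = store
        # Restart-scoped settings: read once, frozen for the process lifetime.
        self.worker_count = int(store.get("worker-count", "4"))

    def enabled(self, name, default=False):
        # Everything else is consulted just-in-time, so changes in the
        # store take effect on the very next check.
        return self._store.get(name, str(default)).lower() in ("1", "true", "on")

flags = Flags({"worker-count": "8", "new-renderer": "on"})
print(flags.worker_count, flags.enabled("new-renderer"))   # 8 True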

The Launchpad feature flag system

I was the architect for Launchpad when the flag system was added. Martin Pool wanted to help accelerate feature development on Launchpad, and we’d all become aware of the feature flag style things hip groups like YouTube were doing; so he wrote a LEP: https://dev.launchpad.net/LEP/FeatureFlags , pushed that through our process and then turned it into code and docs (and once the first bits landed folk started using and contributing to it). Here’s a patch I wrote using the system to allow me to turn on Python profiling remotely. Here’s one added by William Grant to allow working around a crash in a packaging tool.

Launchpad has a monolithic data store, with bulk data federated out to various disk stores, but all relational data in one schema; we didn’t see much benefit in pushing for a dedicated API per se at that time – it can always be added later, as the design was deliberately minimal. The flags implementation is all in-process as a result, though there may be a JS thunk at this point – I haven’t gone looking. Permissions are done through trusted staff members, it is loosely typed and has an audit log for tracking changes.

The other one

The other one I was involved in was at VMware a couple of years ago now; it’s in-house, but there are some interesting anecdotes I can share. The thinking on feature flags when I started the discussion was that they were strictly configuration file settings – I was still finding my feet with the in-house Xenon framework at the time (I think this was week 3? 🙂) so I whipped up an API specification and a colleague (Tyler Curtis) turned that into a draft engine; it wasn’t the most beautiful thing, but it was still going strong and being enhanced by the team when I left earlier this year. The initial implementation had a REST API and a very basic set of scopes. That lasted about 18 months before tenant-based scopes were needed and added. I had designed it with the intent of adding multi-arm bandit selection down the track, but we didn’t make the time to develop that capability, which is a bit sad.

Comparing that API with LaunchDarkly I see that they do support A/B trials but don’t have multivariate tests live yet, which suggests that they are still very limited in that space. I think there is room for some very simple home grown work in this area to pay off nicely for Symphony (the project codename the flags system was written for).

Should you run your own?

It would be very unusual to have PII or customer data in the flag configuration store, and you shouldn’t have access control lists in there either (in LP we did allow turning on code by group, which is somewhat similar). The point is that the very worst thing that can happen if someone else controls your feature flags and is malicious is actually not very bad. So as far as aaS vendor trust goes, not a lot of trust is needed to be pretty comfortable using one.

But, if you’re in a particularly high-trust environment, or you have no internet access, running your own may be super important, and then yeah, do it :). They aren’t big complex systems, even with multi-arm bandit logic added in (the difficulty there is the logic, not the processing).

Or if you think the prices being charged by the incumbents are ridiculous. Actually, perhaps hit me up and we’ll make a startup and do this right…

Should you build your own?

A trivial flag system + persistence could be as little as a few days work. Less if you grab an existing bolt-on for your framework. If you have multiple services, or teams, or languages.. expect that to become the gift that keeps on giving as you have to consolidate and converge across your organisation indefinitely. If you have the resources – great, not a problem.


I think most people will be better off taking one of the existing open source flag systems – perhaps https://unleash.github.io/ – and using it; even if it is more complex than a system tightly fitted to your needs, the benefit of having one that is a true API from the start will pay for itself the very first time you split a project, or want to report what features are on in dev and off in prod, not to mention multiple existing language bindings etc.

Learning is hard

I feel like I’m taking a big personal risk writing this, even though I know the internet is large and probably no-one will read this :-).

So, dear reader, please be gentle.

As we grow – as people, as developers, as professionals – some lessons are hard to learn (e.g. you have to keep trying and trying to learn the task), and some are hard to experience (they might still be hard to learn, but just being there is hard itself…). I want to talk about a particular lesson I started learning in late 2008/early 2009 – while I was at Canonical – sadly one of those that was hard to experience.

At the time I was one of the core developers on Bazaar, and I was feeling pretty happy about our progress, how bzr was developing, features, community etc. There was a bunch of pressure on to succeed in the marketplace, but that was ok, challenges bring out the stubborn in me :). There was one glitch though – we’d been having a bunch of contentious code reviews, and my manager (Martin Pool) was chatting to me about them.

I was – as far as I could tell – doing precisely the right thing from a peer review perspective: I was safeguarding the project, preventing changes that didn’t fit properly, or that reduced key aspects – performance, usability – from landing until they were fixed.

However, the folk on the other side of the review were feeling frustrated – that nothing they could do would fix it – and generally very unhappy. Reviews and design discussions would grind to a halt, and they felt I was the cause. [They were right].

And here was the thing – I simply couldn’t understand the issue. I was doing my job; I wasn’t angry at the people submitting code; I wasn’t hostile; I wasn’t attacking them (but I was being, shall we say, frank about the work being submitted). I remember saying to Martin one day ‘look, I just don’t get it – can you show me what I said wrong?’ … and he couldn’t.

Canonical has a 360° review system – every 6 months / year (it changed over time) you review your peers, subordinate(s) and manager(s), and they review you. Imagine my surprise – I was used to getting very positive reports with some constructive suggestions – when I scored low on a bunch of the inter-personal metrics in the review. Martin explained that it was the reviews thing – folk were genuinely unhappy, even as they commended me on my technical merits. Further to that, he said that I really needed to stop worrying about technical improvement and focus on this inter-personal stuff.

Two really important things happened around this time. Firstly, Steve Alexander, who was one of my managers-once-removed at the time, reached out to me and suggested I read a book – Getting out of the box – and that we might have a chat about the issue after I had read it. I did so, and we chatted. That book gave me a language and viewpoint for thinking about the problem. It didn’t solve it, but it meant that I ‘got it’, which I hadn’t before.

So then the second thing happened – we had a company all-hands and I got to chat with Claire Davis (head of HR at Canonical at the time) about what was going on. To this day I remember the sheer embarrassment I felt when she told me that the broad perception of me amongst other teams’ managers was – and I paraphrase a longer, more nuanced conversation here – “technically fantastic but very scary to have on the team – will disrupt and cause trouble”.

So, at this point about 6 months had passed, and I knew what I wanted – I wanted folk to want to work with me, to find my presence beneficial and positive on both technical and team aspects. I already knew then that what I seek is technical challenges: I crave novelty, new challenges, new problems. Once things become easy, it can all too easily slip into tedium. So at that time my reasoning was somewhat selfish: how was I to get challenges if no-one wanted to work with me except in extremis?

I spent the next year working on myself as much as specific projects: learning more and more about how to play well with others.

In June 2010 I got a performance review I could be proud of again – I was – in no way – perfect, but I’d made massive strides. This journey had also made huge improvements to my personal life – a lot of stress between Lynne and me had gone away. Shortly after that I was invited to apply for a new role within Canonical as Technical Architect for Launchpad – and Francis Lacoste told me that it was only due to my improved ability to play well with others that I was even considered. I literally could not have done the job 18 months before. I got the job, and I think I did pretty well – in fact I was awarded an internal ‘Spotlight on Success’ award for what we (it was a whole Launchpad team effort) achieved while I was in that role.

So, what did I change/learn? There’s just a couple of key changes I needed to make in myself, but a) they aren’t sticky: if I get overly tired, ye old terrible Robert can leak out, and b) there’s actually a /lot/ of learnable skills in this area, much of which is derived – lots of practice and critical self review is a good thing. The main thing I learnt was that I was Selfish. Yes – capital S. For instance, in a discussion about adding working tree filter to bzr, I would focus on the impact/risk on me-and-things-I-directly-care-about: would it make my life harder, would it make bzr slower, was there anything that could go wrong. And I would spend only a little time thinking about what the proposer needed: they needed support and assistance making their idea reach the standards the bzr community had agreed on. The net effect of my behaviours was that I was a class A asshole when it came to getting proposals into a code base.

The key things I had to change were:

  1. I need to think about the needs of the person I’m speaking to *and not my own*. [That’s not to say you should ignore your needs, but you shouldn’t dwell on them: if they are critical, your brain will prompt you].
  2. There’s always a reason people do things: if it doesn’t make sense, ask them!  [The crucial conversations books have some useful modelling here on how and why people do things, and on how-and-why conversations and confrontations go bad and how to fix them.]

Ok so this is all interesting and so forth, but why the blog post?

Firstly, I want to thank four folk who were particularly instrumental in helping me learn this lesson: Martin, Steve, Claire and of course my wife Lynne – I owe you all an unmeasurable debt for your support and assistance.

Secondly, I realised today that while I’ve apologised one on one to particular folk who I knew I’d made life hard for, I’d never really made a widespread apology. So here it is: I spent many years as an ass, and while I didn’t mean to be one, intent doesn’t actually count here – actions do. I’m sorry for making your life hell in the past, and I hope I’m doing better now.

Lastly, if I’m an ass to you now, I’m sorry, I’m probably regressing to old habits because I’m too tired – something I try to avoid, but it’s not always possible. Please tell me, and I will go get some sleep then come and apologise to you, and try to do better in future.

Less-assily-yrs,
Rob

Launchpadlib without gnome-keyring

Recently I’ve been doing my personal development SSH’d into my personal laptop. I found that launchpadlib (which various projects use for release automation) was failing – the gnome keyring API threw an error because the keyring was locked, and python-keyring didn’t try to unlock it.

I needed a workaround to be able to release stuff, and with a bit of digging and help from #launchpad, came up with this:

mkdir ~/.cache/keyring
mkdir ~/.local/share/python_keyring
cat > ~/.local/share/python_keyring/keyringrc.cfg << EOF
[backend]
default-keyring=keyring.backend.UncryptedFileKeyring
keyring-path=/home/robertc/.cache/keyring/
EOF

(There is already encryption in place, so I chose an uncrypted store – read the keyring source to find other alternatives).

With this done, I can now use lp-shell etc over SSH, for when I’m not physically at my machine.

Minimising downtime for schema changes with PostgreSQL

Edits: Corrected the description of the slony bug, and noted that there is a typo on the lazr_postgresql PYPI page.

Two years ago Launchpad did schema changes once a month. Everyone would cross their fingers and hope while the system administrators took all the application servers offline, patched the database with a month’s worth of work and brought up the servers again running the new QA’d codebase.

This had two problems:

  1. Due to the complexity of the system – something like 300 processes have to be stopped or inhibited to take everything offline – the downtime was often about 90 minutes, irrespective of the schema patch duration. [Some of the processes don’t like being interrupted at all].
  2. We simply could not deliver any change in less than 1 week, with the average latency for something that jumped all the queues still being 2 weeks.

About a year ago we wanted to increase the rate at which schema changes could be carried out – the efforts to speed Launchpad up had consumed most of the low hanging fruit, and more and more schema patches were required. We didn’t want to introduce additional 90 minute downtime windows though. Adopting incremental migrations – the sort of change process described in various places on the internet – seemed like a good way to make it possible to apply the schema changes without this slow shutdown-and-restart step, which was required because the pre-patch codebase couldn’t speak to the new schema. We could optimise each patch to be very fast by avoiding anything that causes a full table scan or table rewrite (such as adding indices, or adding columns with a non-NULL default value). That would let us avoid the 90 minutes of downtime caused by stopping and restarting everything.

However, that wasn’t sufficient – the reason Launchpad ended up doing monthly downtime is that previous attempts to do more frequent schema changes had too high a failure rate. A key reason for patch deployment time blowing out when everything wasn’t shut down was that Launchpad is a very busy system – with the use of Slony, schema changes require an exclusive lock on all tables. [More recent versions of Slony only lock some tables, but it still requires very widespread locks for most DDL operations.] We’re doing nearly 10 thousand transactions per minute, and at any point in time there are locks open on some table in the system: it was highly improbable, and effectively impossible, for slonik to get an exclusive lock on all tables in a reasonable timeframe. Background tasks that take many minutes to complete exacerbate this – we can’t just block new transactions long enough to deliver all the in-flight web pages and let locks clear that way.

PGBouncer turns out to be an ideal tool here. If you route all your connections through PGBouncer, you have a single point you can deliberately interrupt to clear all database locks in a second or so (it takes time for backends to all notice that their clients have gone).

So we combined these things to get what we called ‘Fast Down Time’ or FDT.  We set the following rules for developers:

  1. Any schema patch had to complete in <= 15 seconds in our schema staging environment (which has a full copy of the production DB), or we’d roll it back and redesign.
  2. Any patch could change either code or schema, never both. Schema patches were to land on a separate branch and would be promoted to trunk only after deployment. That branch also receives automated merges from trunk after every commit to trunk, so it’s running the latest code.

This meant that we could be confident in QA: we would QA the new schema and the application process with the current live code (we deploy trunk multiple times a day). We published some documentation about how to write fast schema patches to help socialise the approach.

Then we wrote an automated tool that would:

  1. Check for known fragile processes and abort if any were found.
  2. Check for very long transactions and abort if any were found.
  3. Shutdown pgbouncer, disconnecting all clients instantly.
  4. Use slonik to apply one or more schema patches.
  5. Start pgbouncer back up again.
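
A heavily simplified sketch of that sequence – using pgbouncer’s admin console KILL/RESUME commands rather than the full shutdown/restart the real tool did, and assuming a slonik script file plus made-up connection details; the real FDTv1 lives in the Launchpad history as noted below:

import subprocess
import psycopg2

def apply_schema_patch(slonik_script, db="launchpad"):
    # Steps 1 and 2 (checking for fragile processes and long-running
    # transactions) are elided here.
    admin = psycopg2.connect(
        "dbname=pgbouncer host=127.0.0.1 port=6432 user=pgbouncer")
    admin.autocommit = True          # the admin console has no transactions
    cur = admin.cursor()
    # Step 3: drop every client and server connection instantly; the
    # database stays paused, so new clients queue until RESUME.
    cur.execute("KILL %s" % db)
    try:
        # Step 4: apply the schema patch via Slony.
        subprocess.check_call(["slonik", slonik_script])
    finally:
        # Step 5: let clients back in.
        cur.execute("RESUME %s" % db)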

The code for this (call it FDTv1) is in the Launchpad source code history – it’s pretty entangled, but it’s there for grabbing if you need it. Read on to see why it’s only available in the history 🙂

The result was wonderful – we were immediately able to deploy schema changes with <= 90 seconds of downtime, which was significantly less than the 5 minutes our stakeholders had agreed to as a benchmark – if we were under 5 minutes, we could schedule downtime once a day rather than once a month. We had to fix some API client code to retry more reliably, and likewise fix a few minor bugs in the database connection handling logic in the appservers, but all in all it was a pretty smooth project. Along the way we spun off a small Python helper to run and control pgbouncer, which let us write effective tests for the connection handling code paths.

This gave us the following workflow for making schema changes:

  1. Land and deploy an incremental schema change.
  2. Land and deploy any indices that need to be added – these are deployed live using CREATE INDEX CONCURRENTLY.
  3. Land and deploy code changes to populate any additional fields/tables from both application servers, and from cron – we do a bulk backfill that does many small transactions while walking over the entire dataset that needs to be updated / populated.
  4. Land and deploy code changes to drop references to the old schema, whatever it was.
  5. Land and deploy an incremental schema change to finalise the change – such as making a new column NOT NULL once the backfill is complete.

This looks long and unwieldy, but it’s worth noting that it’s actually just repeated application of a smaller primitive:

  1. Make a schema change that is fast and compatible with existing code.
  2. Change code to take advantage of the changed schema.

Pretty much any change that is desired can be done using this single primitive.
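
As a concrete, hypothetical example of that primitive applied to adding a column (table and column names invented for illustration): deploy a nullable, default-free column addition first, backfill it from code in many small transactions, and only later make it NOT NULL:

import psycopg2

# Deploy 1 (schema, fast):  ALTER TABLE person ADD COLUMN display_name text;
#   -- nullable, no default, so no table rewrite and no long-held lock.
# Deploy N (schema, fast, after the backfill below has finished):
#   ALTER TABLE person ALTER COLUMN display_name SET NOT NULL;

def backfill_display_name(batch_size=1000):
    # Deploy 2+ (code): populate the new column in many small transactions
    # so no lock is held for long.
    conn = psycopg2.connect("dbname=launchpad_dev")
    while True:
        with conn, conn.cursor() as cur:   # commits each batch on exit
            cur.execute(
                """UPDATE person SET display_name = name
                   WHERE id IN (SELECT id FROM person
                                WHERE display_name IS NULL
                                LIMIT %s)""",
                (batch_size,))
            if cur.rowcount == 0:
                return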

We wanted to go further though – the multiple stages required for complex migrations became a burden with one change a day. Fortunately PostgreSQL now includes its own replication engine, which replicates the WAL logs rather than installing triggers on all tables like Slony.

Stuart, our intrepid DBA, migrated Launchpad to PostgreSQL 9.1, updated the FDT tool to work with native replication, and migrated Launchpad off of Slony. The result is again wonderful – the overhead in doing a schema patch, with all the protection I described above, is now ~5 seconds. We can do incremental changes in less time than it takes your browser to figure out that a given server is offline. We’re now negotiating with the Launchpad stakeholders to get multiple downtime windows each day, with this almost unnoticeable, super reliable process in place.

Reliability-wise, FDT has been superb. We’ve had 2 failures: one where we believe we encountered a bug in Slony when we dropped the id column from two tables in one patch (we replaced the autoincrement column as PK with a naturally unique column), and one where we landed a patch that worked on staging but led to lock contention in production – so the patch applied, but the system was very unhealthy after that until we fixed it. That’s after doing approximately 60 patches over a 1 year period.

We’re partway through extracting the patching logic from Launchpad’s code base into a reusable tool, but the basic principles will apply to any PostgreSQL environment. Note that there is a typo on the PYPI page – the actual Launchpad project is at https://launchpad.net/lazr.postgresql.

Public service announcement: signals implies reentrant code even in Python

This is a tiny PSA prompted by my digging into a deadlock condition in the Launchpad application servers.

We were observing a small number of servers stopping cold when we did log rotation, with no particular rhyme or reason.

tl;dr: do not call any non-reentrant code from a Python signal handler. This includes the signal handler itself, queueing tools, multiprocessing, anything with locks (including RLock).

Tracking this down I found we were using an RLock from within the signal handler (via a library…) – so I filed a bug upstream: http://bugs.python.org/issue13697

Some quick background: when a signal is received by Python, the VM sets a status flag saying that signal X has been received and returns. The next chance that thread 0 gets to run bytecode (and it’s always thread 0), the signal handler in Python itself runs. For builtin handlers this is pretty safe – e.g. for SIGINT a KeyboardInterrupt is raised. For custom signal handlers, the current frame is pushed and a new stack frame created, which is used to execute the signal handler.

Now this means that the previous frame has been interrupted without regard for your code: it might be part way through evaluating a multi-condition if statement, or between receiving the result of a function and storing it in a variable. It’s just suspended.

If the code you call somehow ends up calling that suspended function (or other methods on the same object, or variations on this theme), there is no guarantee about the state of the object; it becomes very hard to reason about.

Consider, for instance, a writelines() call, which you might think is safe. If the internal implementation is ‘for line in lines: foo.write(line)’, then a signal handler which also calls writelines() could have its output appear between any two of the lines written by the interrupted call.
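
Here’s a contrived sketch of that interleaving – the writelines implementation is simplified to a plain loop to make the race visible:

import signal
import sys

def writelines(lines):
    # Simplified stand-in for a non-reentrant writelines() implementation.
    for line in lines:
        sys.stdout.write(line)

def handler(signum, frame):
    # Runs on the main thread, potentially between two writes in the loop
    # above if the signal arrives mid-call.
    writelines(["[log rotation marker]\n"])

signal.signal(signal.SIGUSR1, handler)
# If SIGUSR1 arrives while this call is in progress, the marker appears
# between two of these lines rather than before or after the whole call.
writelines(["line one\n", "line two\n", "line three\n"])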

True reentrancy is a step up from multithreading in terms of nastiness, primarily because guarding against it is very hard: a non-reentrant lock around the area needing guarding will force either a deadlock or an exception from your reentered code; a reentrant lock around it will provide no protection. Both of these things apply because the reentering occurs within the same thread – kind of like a generator, but without any control or influence on what happens.

Safe things to do are:

  • Calling code which is threadsafe and only other threads will be concurrently calling.
  • Performing ‘atomic’ operations (any C function is atomic as far as signal handling in Python is concerned) such as list.append, or ‘foo = 1’. (Note the use of a constant: anything obtained by reading can be subject to reentrancy races [unless you take care :)].)

In Launchpad’s case, we will be setting a flag variable unconditionally from the signal handler, and the next log write that occurs will lock out other writers, consult the flag, and if needed do a rotation, resetting the flag. Writes after the rotation signal which don’t see the new flag would be ok. The only possible race is a write to the flag variable that isn’t seen by an in-progress or other-thread log write.
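
A sketch of that scheme (names are hypothetical; the real Launchpad code differs):

import signal
import threading

_rotate_requested = False      # set by the signal handler, cleared under the lock
_write_lock = threading.Lock()

def _on_rotate_signal(signum, frame):
    global _rotate_requested
    # The only work done in the handler: an atomic assignment of a constant.
    _rotate_requested = True

signal.signal(signal.SIGUSR1, _on_rotate_signal)

class RotatableLog:
    def __init__(self, path):
        self._path = path
        self._file = open(path, "a")

    def write(self, message):
        global _rotate_requested
        with _write_lock:      # locks out other writer threads, not the handler
            if _rotate_requested:
                _rotate_requested = False
                self._file.close()
                self._file = open(self._path, "a")   # stand-in for real rotation
            self._file.write(message)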

That is all.