How well did that work for me? Pretty good. I had a good satisfying job at VMware for 3 years, met some wonderful people, achieved some very cool things. And those priorities above were broadly achieved. The one niggle that stands out was this – Did the things we were doing matter? Certainly there was no social impact – VMware isn’t a non-profit, being right at the core of capitalism as it is. There was direct connection and impact with the team, the staff we worked with and the users of the products… but it is a bit hard to feel really connected through that: VMware is a very large company and there are many layers between users and developers.
We were quite early adopters of Kubernetes, which allowed me to deepen my Go knowledge and experience some more fun with AWS scale operations. I had many interesting discussions about the relative strengths of Python, Go, Rust and Java with colleagues there. (Hi Geoffrey).
Company culture is very important to me, and VMware has a fantastically supportive culture. One of the most supportive companies I’ve been in, bar none. It isn’t a truly remote-organised company though: rather it’s a bunch of offices that talk to each other, which I think is sad. True remote-first offers so much more engagement.
I enjoy building things to solve problems. I’ve either directly built, or shaped what is built, in all my most impactful and successful roles. Solving a problem once by hand is fine; solving it for years to come by creating a tool is far more powerful.
I seem to veer into toolmaking very often: giving other people the ability to solve their problems takes the power of a tool and multiplies it even further.
It should be no surprise then that I very much enjoy reading white papers like the original Dapper and Map-reduce ones, LinkedIn’s Kafka or for more recent fodder the Facebook Akkio paper. Excellent synthesis and toolmaking applied at industrial scale. I read those things and I want to be a part of the creation of those sorts of systems.
I was fortunate enough to take some time to go back to university part-time, which though logistically challenging is something I want to see through.
Thus I think my new roughly ordered (descending) list of priorities needs to be something like this:
Keep living in Rangiora (family)
Up to moderate travel requirements – 4 team-meeting trips a year + 2 conferences
Significant autonomy (not at the expense of doing the right thing for the company, just I work best with the illusion of free will 🙂 )
Be doing something that matters
Be working directly on a problem / system that has problems
Feature toggles, feature flags – they’ve been written about a lot already (use a search engine :)), yet I feel like writing a post about them. Why? I’ve been personally involved in two from-scratch implementations, and it may be interesting for folk to read about that.
I say that lots has been written; http://featureflags.io/ (which appears to be a bit of an astroturf site for LaunchDarkly 😉 ) nevertheless has gathered a bunch of links to literature as well as a number of SDKs and the like; there are *other* FFaaS offerings than LaunchDarkly; I have no idea which I would use for my next project at this point – but hopefully you’ll have some tools to reason about that at the end of this piece.
I’m going to entirely skip over the motivation (go read those other pieces), other than to say that the evidence is in, trunk based development is better.
A feature flag is a very simple thing: it is a value controlled outside of your development cycle that in turn controls the behaviour of your code. There are dozens of ways to implement that. Hash-defines and compile time flags have been used for a very long time, so long that we don’t even think of them as feature flags, but they are. So are configuration options in configuration files in the broadest possible sense. The difference is largely in focus, and where the same system meets all parties’ needs, I think it’s entirely fine to use just the one system – that is what we did for Launchpad, and it worked quite well I think – as far as I know it hasn’t been changed. Specifically in Launchpad the Zope runtime config is regular ZCML files on disk, and feature flags are complementary to that (but see the profiling example below).
Configuration tends to be thought of as “choosing behaviour after the system is compiled and before the process is started” – e.g. creating files on disk. But this is not always the case – some enterprise systems are notoriously flexible with database managed configuration rulesets which no-one can figure out – and we don’t want to create that situation.
Let’s generalise things a little – a flag could be configured over the lifetime of the binary (compile flag), execution (runtime flag/config file or one-time evaluation of some dynamic system), time (dynamically reconfigured from changed config files or some dynamic system), or configured based on user/team/organisation, URL path (of a web request naturally :P), and generally any other thing that could be utilised in making a decision about whether to conditionally perform some code. It can also be useful to be able to randomly bucket some fraction of checks (e.g. 1/3 of all requests will go down this code path)... but do it consistently for the same browser.
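One common way to get that "random but consistent" bucketing is to hash a stable client identifier together with the flag name. This is only a sketch of the idea, not how Launchpad or any particular product does it; `in_rollout`, `client_id` and the 10,000-bucket granularity are names and choices I've made up for illustration.

```python
import hashlib


def in_rollout(flag_name: str, client_id: str, fraction: float) -> bool:
    """Return True for a stable share of client ids, chosen per flag.

    `client_id` is assumed to be something that stays the same for one
    browser, such as a long-lived cookie value (hypothetical).
    """
    # Hash the flag name with the client id so each flag gets its own split.
    digest = hashlib.sha256(f"{flag_name}:{client_id}".encode("utf-8")).digest()
    # Reduce the hash to one of 10,000 buckets and compare with the threshold.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < fraction * 10_000
```

Because the answer depends only on the flag name and the identifier, the same browser keeps landing on the same side of the split for as long as the fraction is unchanged, and no per-browser state needs to be stored anywhere.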
Depending on what sort of system you are building, some of those sorts of scopes will be more or less important to you – for instance, if you are shipping on-premise software, you may well want to be turning unreleased features entirely off in the binary. If you are shipping a web API, doing soft launches with population rollouts and feature kill switches may be your priority.
Similarly, if you have an existing microservices architecture, having a feature flags aaS API is probably much more important (so that your different microservices can collaborate on in-progress features!) than if you have a monolithic DB where you put all your data today.
Ultimately you will end up with something that looks roughly like a key-value store: get_flag_value(flagname, context) -> value. Somewhere separate to your code base you will have a configuration store where you put rules that define how that key-value interface comes to a given value.
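To make that shape concrete, here is a minimal in-memory sketch of a rule store behind a `get_flag_value` call. The `Rule` and `FlagStore` classes, the scope names and the priority field are all assumptions of mine for illustration; a real system would keep the rules in a database or behind an API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rule:
    scope: str         # hypothetical scope names: "default", "team", "user", ...
    match: str         # value the context must carry; ignored for "default"
    value: str
    priority: int = 0  # higher priority rules are consulted first


@dataclass
class FlagStore:
    rules: Dict[str, List[Rule]] = field(default_factory=dict)

    def add_rule(self, flag_name: str, rule: Rule) -> None:
        self.rules.setdefault(flag_name, []).append(rule)

    def get_flag_value(self, flag_name: str, context: Dict[str, str],
                       default: Optional[str] = None) -> Optional[str]:
        # Walk the rules from highest to lowest priority; first match wins.
        candidates = sorted(self.rules.get(flag_name, []),
                            key=lambda rule: -rule.priority)
        for rule in candidates:
            if rule.scope == "default" or context.get(rule.scope) == rule.match:
                return rule.value
        # Unknown flag, or no rule applied: fall back to the caller's default.
        return default
```

Calling code only ever sees `get_flag_value(flagname, context)`; everything about who gets which value lives in the rules, outside the code base.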
There are a few key properties that I consider beneficial in a feature flag system:
Permissionless / (or alternatively namespaced)
Feature flags will be consulted from all over the place – browser code, templates, DB mapper, data exporters, test harnesses etc. If the flag system itself is degraded, you need the system’s behaviour to remain graceful, rather than stopping catastrophically. This often requires multiple different considerations; for instance, having sensible defaults for your flags (choose a default that is ok, change the meaning of defaults as what is ‘ok’ changes over time), having caching layers to deal with internet flakiness or API blips back to your flag store, making sure you have memory limits on local caches to prevent leaks and so forth. Different sorts of flag implementations have different failure modes: an API based flag system will be quite different to one stored in the same DB the rest of your code is using, which will be different to a process-startup command line option flag system.
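As a rough illustration of those considerations working together, the sketch below wraps a flag lookup with coded defaults, a time-limited cache, a size cap on that cache, and a fallback to stale values when the store is unreachable. `ResilientFlags` and its parameters are hypothetical names; `fetch` stands in for whatever call reaches your flag store.

```python
import time
from collections import OrderedDict


class ResilientFlags:
    def __init__(self, fetch, defaults, ttl=30.0, max_entries=1000,
                 clock=time.monotonic):
        self._fetch = fetch          # callable(flag_name) -> value; may raise
        self._defaults = defaults    # coded defaults, used as a last resort
        self._ttl = ttl
        self._max_entries = max_entries
        self._clock = clock
        self._cache = OrderedDict()  # flag_name -> (value, fetched_at)

    def get(self, flag_name):
        now = self._clock()
        cached = self._cache.get(flag_name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        try:
            value = self._fetch(flag_name)
        except Exception:
            # Store unreachable: prefer a stale value, then the coded default.
            if cached is not None:
                return cached[0]
            return self._defaults.get(flag_name)
        self._cache[flag_name] = (value, now)
        self._cache.move_to_end(flag_name)
        # Cap memory use by evicting the least recently refreshed entries.
        while len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)
        return value
```

The clock is injectable only so the behaviour can be tested without sleeping. Whether serving a stale value is better than serving the default is a per-flag judgement; this sketch simply picks stale-first.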
A second dimension where things can go wrong is dealing with missing or unexpected flags. Remember that your system changes over time: a new flag in the code base won’t exist in the database until after the rollout, and when a flag is deleted from the codebase, it may still be in your database. Worse, if you have multiple instances running of a service, you may have different code all examining the same flag at the same time, so operations like ‘we are changing the meaning of a flag’ won’t take place atomically.
Flags have dual audiences; one part is pure dev: make it possible to keep integration costs and risks low by merging fully integrated code on a continual basis without activating not-yet-ready (or released!) codepaths. The second part is pure operations: use flags to control access to dark launches, demo new features, killswitch parts of the site during attack mitigation, target debug features to staff and so forth.
Your developers need some way to add and remove the flags needed in their inner loop of development; some flags will have lifetimes of only a few days.
Whoever is doing operations on prod though, may need some stronger guarantees – particularly they may need some controls over who can enable what flags. e.g. if you have a high control environment then team A shouldn’t be able to influence team B’s flags. One way is to namespace the flags and only permit configuration for the namespace a developer’s team(s) owns. Another way is to only have trusted individuals be able to set flags – but this obviously adds friction to processes.
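The namespacing option can be very small in practice. In this sketch flag names are assumed to look like `namespace.flag`, and `namespace_owners` is a hypothetical mapping from namespace to the owning team; neither convention comes from a real product.

```python
class PermissionDenied(Exception):
    """Raised when someone tries to configure a flag they do not own."""


def check_can_set(flag_name, user_teams, namespace_owners):
    # Everything before the first dot is treated as the namespace.
    namespace, dot, _ = flag_name.partition(".")
    if not dot:
        raise PermissionDenied(f"{flag_name!r} has no namespace")
    owner = namespace_owners.get(namespace)
    if owner is None or owner not in user_teams:
        raise PermissionDenied(f"{flag_name!r} is outside your namespaces")
```

A flag store would call `check_can_set` before accepting a rule change; reading flags stays open to everyone.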
Some systems model the type of each flag: is it boolean, numeric, string etc. I think this is a poor idea mainly because it tends to interact poorly with the ephemeral nature of each deployment of a code base. If build X defines flag Y as boolean, and build X+1 defines it as string, the configuration store has to interact with both at the same time during rollouts, and do so gracefully. One way is to treat everything as a string and cast it to the desired type just in time, with failures being treated as default.
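A sketch of that "everything is a string, cast just in time" approach follows. The helper names and the set of accepted spellings for true and false are my own choices, not taken from any particular flag system.

```python
_TRUE = {"1", "true", "on", "yes"}
_FALSE = {"0", "false", "off", "no"}


def as_bool(raw, default):
    """Interpret a stored string as a boolean, or fall back to `default`."""
    if raw is None:
        return default
    text = raw.strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    # Unparseable (perhaps written for a newer build): behave as if unset.
    return default


def as_int(raw, default):
    """Interpret a stored string as an integer, or fall back to `default`."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default
```

With this, build X can read a flag through `as_bool` while build X+1 reads the same stored string some other way, and neither build crashes on a value meant for the other.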
Make sure that when a user reports crazy weird behaviour, that you can figure out what value they had for what flags. For instance, in Launchpad we put them in the HTML.
Having all your flags in one system lets you write generic tooling – such as ‘what flags are enabled in QA but not production’, or ‘what flags are set but have not been queried in the last month’. It is well worth the effort to build a single centralised system (or consume one such thing) and then use it everywhere. Writing adapters to different runtimes is relatively low overhead compared to rummaging through N different config systems because you can’t remember which one is running which platform.
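Once everything lives in one place, both of those example reports are a few lines each. The data shapes here are assumptions for the sake of the sketch: each environment is a plain dict of flag name to stored value, `"on"` is taken to mean enabled, and `last_queried` is a dict of flag name to the time it was last read.

```python
from datetime import datetime, timedelta


def enabled_in_first_only(first_env, second_env):
    """Flags that are on in `first_env` but not on in `second_env`."""
    return sorted(
        name for name, value in first_env.items()
        if value == "on" and second_env.get(name) != "on"
    )


def unqueried_flags(configured, last_queried, now, max_age=timedelta(days=30)):
    """Configured flags that were never read, or not read within `max_age`."""
    return sorted(
        name for name in configured
        if name not in last_queried or now - last_queried[name] > max_age
    )
```

For example, `enabled_in_first_only(qa_flags, prod_flags)` answers "what is on in QA but not production", and the second function is a starting point for finding flags that are ready to be deleted.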
Scope things in the system with a top level tenant / project style construct (any FFaaS will have this I’m sure :)).
There may be some parts of the system that cannot apply some flags rapidly, but generally speaking the less poking around that needs to be done to make something take effect the better. So build or buy a dynamic system, and if you want an ‘only on process restart’ model for some bits of it, just consult the dynamic system at the relevant time (e.g. during k8s Deployment object creation, or process startup, or …). But then everywhere else, you can react just-in-time; and even make the system itself be script driven.
The Launchpad feature flag system
I was the architect for Launchpad when the flag system was added. Martin Pool wanted to help accelerate feature development on Launchpad, and we’d all become aware of the feature flag style things hip groups like YouTube were doing; so he wrote a LEP: https://dev.launchpad.net/LEP/FeatureFlags , pushed that through our process and then turned it into code and docs (and once the first bits landed folk started using and contributing to it). Here’s a patch I wrote using the system to allow me to turn on Python profiling remotely. Here’s one added by William Grant to allow working around a crash in a packaging tool.
Launchpad has a monolithic data store, with bulk data federated out to various disk stores, but all relational data in one schema; we didn’t see much benefit in pushing for a dedicated API per se at that time – it can always be added later, as the design was deliberately minimal. The flags implementation is all in-process as a result, though there may be a JS thunk at this point – I haven’t gone looking. Permissions are done through trusted staff members, it is loosely typed and has an audit log for tracking changes.
The other one
The other one I was involved in was at VMware a couple of years ago now; it’s in-house, but there are some interesting anecdotes I can share. The thinking on feature flags when I started the discussion was that they were strictly configuration file settings – I was still finding my feet with the in-house Xenon framework at the time (I think this was week 3? 🙂), so I whipped up an API specification and a colleague (Tyler Curtis) turned that into a draft engine; it wasn’t the most beautiful thing but it was still going strong and being enhanced by the team when I left earlier this year. The initial implementation had a REST API and a very basic set of scopes. That lasted about 18 months before tenant based scopes were needed and added. I had designed it with the intent of adding multi-arm bandit selection down the track, but we didn’t make the time to develop that capability, which is a bit sad.
Comparing that API with LaunchDarkly I see that they do support A/B trials but don’t have multivariate tests live yet, which suggests that they are still very limited in that space. I think there is room for some very simple home grown work in this area to pay off nicely for Symphony (the project codename the flags system was written for).
Should you run your own?
It would be very unusual to have PII or customer data in the flag configuration store; and you shouldn’t have access control lists in there either (in LP we did allow turning on code by group, which is somewhat similar). The point is that the very worst thing that can happen if someone else controls your feature flags and is malicious is actually not very bad. So as far as aaS vendor trust goes, not a lot of trust is needed to be pretty comfortable using one.
But, if you’re in a particularly high-trust environment, or you have no internet access, running your own may be super important, and then yeah, do it :). They aren’t big complex systems, even with multi-arm bandit logic added in (the difficulty there is the logic, not the processing).
Or if you think the prices being charged by the incumbents are ridiculous. Actually, perhaps hit me up and we’ll make a startup and do this right…
Should you build your own?
A trivial flag system + persistence could be as little as a few days work. Less if you grab an existing bolt-on for your framework. If you have multiple services, or teams, or languages.. expect that to become the gift that keeps on giving as you have to consolidate and converge across your organisation indefinitely. If you have the resources – great, not a problem.
I think most people will be better off taking one of the existing open source flag systems – perhaps https://unleash.github.io/ – and using it; even if it is more complex than a system tightly fitted to your needs, the benefit of having one that is a true API from the start will pay for itself the very first time you split a project, or want to report what features are on in dev and off in prod, not to mention multiple existing language bindings etc.
I’ve had three pretty significant changes in my life recently. All are worth a little explanation.
I’ve resigned from my role as a Debian Developer. In truth I hadn’t been active for years, and stepping down just makes clear to everyone what the current status is. I may go back at some point, but I think there are fundamental changes needed to Debian – and most “distros” in fact – for it to really excel in our modern open source milieu. More on that another time, but the relevance here is that if I was to go back, it would be because the consensus around the mission has changed (or because I’ve decided the best thing I can do is try to shift that consensus).
During the most recent Linux.conf.au we had a meeting amongst the physically present papers committee members. I’d expected that meeting to be about what things we could do to prepare for the next LCA papers process – things like adding blinding to the review system, or introducing paper assignments to facilitate a significantly larger papers committee. However, it turned out there was a bigger topic to discuss – diversity within the papers committee. The general mood in that room was that we had been failing to really shift the diversity dial amongst the papers committee and that new ways to shift it needed to be trialled. One thing in particular that was suggested was replacing the leadership (the theory being that leaders more deeply connected to non-cis-hetero-white-male people would find it easier to recruit those folk into the papers committee). I think many good points were raised, and that if we’d started (say) 6 years ago we could have tried hybrid approaches (e.g. delegating recruitment entirely to someone with such connectivity, or a hard quota).
I don’t think the room actually had consensus on how much diversity is needed… should the committee represent the current demographics of LCA attendees? Or of the open source community? Or of humans? Or should it exceed the diversity in order to counter-balance the current skewed demographics and help lead from in front? I’m not sure what my position is at this point – but I am sure the folk in the room would all have given somewhat different answers, and equally that all felt more was needed.
In the morning following that I resigned. My reasons were very simple: my contributions to the papers committee, while (IMO) significant, are replaceable. There are other people with awareness of interesting projects in web/infra/cloud/programming/build/vcs technology spaces. People who are not cis-hetero-white-male; by leaving I provide an opportunity for the LCA papers committee to increase its diversity much faster than if I stayed.
I deeply loved being on the papers committee; I took great pride in looking for the unknown presenters who could add surprising and fascinating things to LCA. I hope that the committee take this priceless opportunity that we’ve given them to radically shift the amount of diversity in the team. And if it should come to pass that whatever target is reached, and they were to offer membership to me again, well then I’d be delighted to participate in future.
I joined VMware 2 and a half years ago to work on a suite of new SaaS products being built there. During that time I grew as an engineer, as you always hope to do; finishing up there as SRE architect (for one business unit). I have some thoughts I intend to pull together about what worked well, what didn’t, what things I’d try next time and so on, but those are not yet ready for publication. As of last week I’m no longer at VMware – I’m taking a small break for a bit; I plan to catch up on home and family things – we’ve had a couple of super hard years with e.g. Lynne’s health. I’m also going to do some of the more far-out ‘what if’ things that I haven’t had time to attempt while working. I may even get around to mass review and merging of various testing-cabal patches that I see backing up!