Rustup CI / test suite performance

Rustup (the community package manager for the Rust language) was starting to really suffer: CI times were up at around one hour.

We’ve made some strides in bringing this down.

Caching factory for test scenarios

The first thing, which achieved about a 30% reduction in test time, was to stop recreating all of the test context for every test.

Rustup tests the download/installation/upgrade of distributions of Rust. To avoid downloading gigabytes in the test suite, the suite creates mocks of the published Rust artifacts. These mocks are GPG signed and compressed with multiple compression methods, both of which are quite heavyweight operations to perform – and not actually the interesting code under test to execute.

Previously, every test was entirely hermetic: each test constructed its own mock server state, and usually that state was never modified during the test.

There were two cases where the state was modified. One, a small number of tests of error conditions such as GPG signature failures. And two, quite a number of tests of temporal behaviour: for instance, install nightly at time A, then, with a newer server state, perform a rustup update and check that a new version is downloaded and installed.

We’re partway through this migration, but compare these two tests:

fn check_updates_some() {
    check_update_setup(&|config| {
        set_current_dist_date(config, "2015-01-01");
        config.expect_ok(&["rustup", "update", "stable"]);
        config.expect_ok(&["rustup", "update", "beta"]);
        config.expect_ok(&["rustup", "update", "nightly"]);
        set_current_dist_date(config, "2015-01-02");
        config.expect_stdout_ok(
            &["rustup", "check"],
            for_host!(
                r"stable-{0} - Update available : 1.0.0 (hash-stable-1.0.0) -> 1.1.0 (hash-stable-1.1.0)
beta-{0} - Update available : 1.1.0 (hash-beta-1.1.0) -> 1.2.0 (hash-beta-1.2.0)
nightly-{0} - Update available : 1.2.0 (hash-nightly-1) -> 1.3.0 (hash-nightly-2)
"
            ),
        );
    })
}

fn check_updates_some() {
    test(&|config| {
        config.with_scenario(Scenario::ArchivesV2_2015_01_01, &|config| {
            config.expect_ok(&["rustup", "toolchain", "add", "stable", "beta", "nightly"]);
        });
        config.with_scenario(Scenario::SimpleV2, &|config| {
            config.expect_stdout_ok(
                &["rustup", "check"],
                for_host!(
                    r"stable-{0} - Update available : 1.0.0 (hash-stable-1.0.0) -> 1.1.0 (hash-stable-1.1.0)
beta-{0} - Update available : 1.1.0 (hash-beta-1.1.0) -> 1.2.0 (hash-beta-1.2.0)
nightly-{0} - Update available : 1.2.0 (hash-nightly-1) -> 1.3.0 (hash-nightly-2)
"
                ),
            );
        })
    })
}

The former version mutates the date with set_current_dist_date; the new version uses two scenarios, one for the earlier time and one for the later time. This permits the server state for each scenario to be constructed only once and shared. On a per-test basis it can remove as much as 50% of the test's runtime.
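
To make the caching concrete, here is a minimal sketch of the kind of scenario cache this enables. The Scenario names match the test above, but the OnceLock-based cache and the helper functions are hypothetical illustrations, not the actual rustup test support code:

use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Mutex, OnceLock};

#[allow(non_camel_case_types)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Scenario {
    SimpleV2,
    ArchivesV2_2015_01_01,
}

// Build each mock server state (GPG signing, multi-format compression, ...)
// at most once per process, then reuse the resulting directory for every
// test that asks for the same scenario.
fn cached_scenario_dir(scenario: Scenario) -> PathBuf {
    static CACHE: OnceLock<Mutex<HashMap<Scenario, PathBuf>>> = OnceLock::new();
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    cache
        .lock()
        .unwrap()
        .entry(scenario)
        .or_insert_with(|| build_mock_dist_server(scenario)) // expensive, runs once
        .clone()
}

// Hypothetical: create the signed and compressed mock artifacts on disk.
fn build_mock_dist_server(_scenario: Scenario) -> PathBuf {
    PathBuf::from("/tmp/mock-dist")
}

// Hypothetical: each test still gets its own sandbox; only the expensive
// construction is shared, so tests remain isolated from each other.
fn with_scenario(scenario: Scenario, sandbox: &Path, f: impl FnOnce(&Path)) {
    let cached = cached_scenario_dir(scenario);
    let _ = (&cached, sandbox); // copy or hard-link `cached` into `sandbox` here
    f(sandbox)
}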

Single binary for the integration test suite

The next major gain was moving from 14 separate integration test binaries to just one. This reduces the cost of linking the test binaries, all of which link in the same library. It also permits us to see unused functions in our test support library, which helps with cleaning up cruft rather than letting it accumulate.
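
The shape of the change is roughly this (a sketch, not rustup's actual file layout): a single tests/suite/main.rs that pulls each former test binary in as a module, so cargo builds and links one test executable instead of fourteen.

// tests/suite/main.rs: the single integration test binary cargo now builds.
// In a real tree each module would live in its own file (`mod cli_misc;`
// pointing at tests/suite/cli_misc.rs); inline modules are used here only to
// keep the sketch self-contained. Module names are illustrative.

mod cli_misc {
    // formerly its own test binary, e.g. tests/cli-misc.rs
    #[test]
    fn smoke() {
        assert_eq!(2 + 2, 4);
    }
}

mod dist_install {
    // formerly another stand-alone binary; now it shares one link step, and
    // dead code in the shared test support library becomes visible to lints.
    #[test]
    fn smoke() {
        assert_eq!(1 + 1, 2);
    }
}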

Hard linking rather than copying ‘rustup-init’

Part of the setup for each test is an installed rustup environment. Why not start from scratch every time? Well, we obviously have tests that do that, but most tests are focused on steps beyond the new-user case. Setting up an installed rustup environment has a few steps, but the notable ones are copying a binary of rustup into the test sandbox, and hard linking it under various names: cargo, rustc, rustup and so on.

A debug build of rustup is ~20MB. Running 400 tests means about 8GB of IO; on some platforms most of that IO won’t hit disk, on others it will.

In review now is a PR that changes the initial copy to a hard link: we hard link the rustup-init built by cargo into each test sandbox, and then hard link that under the various binary names. That saves the 8GB of IO, which isn't much from some perspectives, but it adds pressure on the page cache and is wasted work. One wrinkle is NTFS's very low limit of 1023 links per file; to mitigate that we count the links made to rustup-init and generate a new inode for the original before failures can happen.
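
A minimal sketch of the approach (not the actual PR): hand out hard links from the cargo-built rustup-init, count how many have been issued, and fall back to a real copy to create a fresh inode before NTFS's limit is reached. The helper and its names below are illustrative only.

use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_LINKS: u64 = 1000; // stay safely below NTFS's 1023-links-per-file limit

static LINKS_ISSUED: AtomicU64 = AtomicU64::new(0);

/// Place rustup-init into a test sandbox without copying ~20MB each time.
fn provision_rustup_init(source: &Path, sandbox: &Path) -> io::Result<PathBuf> {
    let dest = sandbox.join("rustup-init");
    if LINKS_ISSUED.fetch_add(1, Ordering::SeqCst) < MAX_LINKS {
        // Shares the inode with the cargo-built binary: near-zero IO.
        fs::hard_link(source, &dest)?;
    } else {
        // Too many links against this inode: make a real copy once to get a
        // fresh inode (in the real change the counter would reset against it).
        fs::copy(source, &dest)?;
    }
    // The sandbox's proxy names (cargo, rustc, rustup, ...) are then hard
    // linked to the sandbox's own rustup-init, which is cheap and local.
    for name in ["cargo", "rustc", "rustup"] {
        fs::hard_link(&dest, sandbox.join(name))?;
    }
    Ok(dest)
}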

Future work

In GitHub Actions this lowers our test time to 19 minutes for Linux and 24 minutes for Windows, which is a lot better but not great.

I plan on experimenting with separate actions for building release artifacts and doing CI tests – at the moment we have the same action do both, but they don’t share artifacts in the cache in any meaningful way, so we can probably gain parallelism there, as well as turning off release builds entirely for CI.

We should finish the cached test context work and use it everywhere.

We're also looking at having fewer integration tests and more narrow, close-to-the-code tests.

hyper combinators in Rust

Recently I read Michael Snoyman’s post on combining Axum, Hyper, Tonic and Tower. While his solution worked, it irked me – it seemed like there should be a much tighter solution possible.

I can deep dive into the code in a later post perhaps, but I think there are four points of difference. One, since the post was written Axum has started boxing its routes, so the enum dispatch approach taken, which delivers low overhead, actually has no benefit today.

Two, while writing out the entire type by hand has some benefits, async code is much more pithy.

Three, the code in the post is entirely generic, except for the routing function itself.

And four, the outer Service<AddrStream> is an unnecessary layer to abstract over: given the constraint that the inner Service must take Request<..>, it is possible to skip a couple of helpers and instead work directly with Service<Request<..>>.

So, onto a pithier version.

First, the app server code itself.

use std::{convert::Infallible, net::SocketAddr};

use axum::routing::get;
use hyper::{server::conn::AddrStream, service::make_service_fn};
use hyper::{Body, Request};
use tonic::async_trait;

use demo::echo_server::{Echo, EchoServer};
use demo::{EchoReply, EchoRequest};

struct MyEcho;

#[async_trait]
impl Echo for MyEcho {
    async fn echo(
        &self,
        request: tonic::Request<EchoRequest>,
    ) -> Result<tonic::Response<EchoReply>, tonic::Status> {
        Ok(tonic::Response::new(EchoReply {
            message: format!("Echoing back: {}", request.get_ref().message),
        }))
    }
}

#[tokio::main]
async fn main() {
    let addr = SocketAddr::from(([0, 0, 0, 0], 3000));

    let axum_service = axum::Router::new().route("/", get(|| async { "Hello world!" }));

    let grpc_service = tonic::transport::Server::builder()
        .add_service(EchoServer::new(MyEcho))
        .into_service();

    let both_service =
        demo_router::Router::new(axum_service, grpc_service, |req: &Request<Body>| {
            Ok::<bool, Infallible>(
                req.headers().get("content-type").map(|x| x.as_bytes())
                    == Some(b"application/grpc"),
            )
        });

    let make_service = make_service_fn(move |_conn: &AddrStream| {
        let both_service = both_service.clone();
        async { Ok::<_, Infallible>(both_service) }
    });

    let server = hyper::Server::bind(&addr).serve(make_service);

    if let Err(e) = server.await {
        eprintln!("server error: {}", e);
    }
}

Note the Router: it takes the two services and an Fn to determine which to use for any given request. Then we just drop that composed service into make_service_fn and we're done.

Next up we have the Router implementation. This is generic across any two Service<Request<...>> types, as long as their body Data types are Into<Bytes> and their errors are Into<Box<dyn Error>>.

use std::{future::Future, pin::Pin, task::Poll};

use http_body::combinators::UnsyncBoxBody;
use hyper::{body::HttpBody, Body, Request, Response};
use tower::Service;

#[derive(Clone)]
pub struct Router<First, Second, F> {
    first: First,
    second: Second,
    discriminator: F,
}

impl<First, Second, F> Router<First, Second, F> {
    pub fn new(first: First, second: Second, discriminator: F) -> Self {
        Self {
            first,
            second,
            discriminator,
        }
    }
}

impl<First, Second, FirstBody, FirstBodyError, SecondBody, SecondBodyError, F, FErr>
    Service<Request<Body>> for Router<First, Second, F>
where
    First: Service<Request<Body>, Response = Response<FirstBody>>,
    First::Error: Into<Box<dyn std::error::Error + Send + Sync>> + 'static,
    First::Future: Send + 'static,
    First::Response: 'static,
    Second: Service<Request<Body>, Response = Response<SecondBody>>,
    Second::Error: Into<Box<dyn std::error::Error + Send + Sync>> + 'static,
    Second::Future: Send + 'static,
    Second::Response: 'static,
    F: Fn(&Request<Body>) -> Result<bool, FErr>,
    FErr: Into<Box<dyn std::error::Error + Send + Sync>> + Send + 'static,
    FirstBody: HttpBody<Error = FirstBodyError> + Send + 'static,
    FirstBody::Data: Into<bytes::Bytes>,
    FirstBodyError: Into<Box<dyn std::error::Error + Send + Sync>> + 'static,
    SecondBody: HttpBody<Error = SecondBodyError> + Send + 'static,
    SecondBody::Data: Into<bytes::Bytes>,
    SecondBodyError: Into<Box<dyn std::error::Error + Send + Sync>> + 'static,
{
    type Response = Response<
        UnsyncBoxBody<
            <hyper::Body as HttpBody>::Data,
            Box<dyn std::error::Error + Send + Sync + 'static>,
        >,
    >;
    type Error = Box<dyn std::error::Error + Send + Sync + 'static>;
    type Future =
        Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>> + Send + 'static>>;

    fn poll_ready(
        &mut self,
        cx: &mut std::task::Context<'_>,
    ) -> std::task::Poll<Result<(), Self::Error>> {
        match self.first.poll_ready(cx) {
            Poll::Ready(Ok(())) => match self.second.poll_ready(cx) {
                Poll::Ready(Ok(())) => Poll::Ready(Ok(())),
                Poll::Ready(Err(e)) => Poll::Ready(Err(e.into())),
                Poll::Pending => Poll::Pending,
            },
            Poll::Ready(Err(e)) => Poll::Ready(Err(e.into())),
            Poll::Pending => Poll::Pending,
        }
    }

    fn call(&mut self, req: Request<Body>) -> Self::Future {
        let discriminant = { (self.discriminator)(&req) };
        let (first, second) = if matches!(discriminant, Ok(false)) {
            (Some(self.first.call(req)), None)
        } else if matches!(discriminant, Ok(true)) {
            (None, Some(self.second.call(req)))
        } else {
            (None, None)
        };
        let f = async {
            Ok(match discriminant.map_err(Into::into)? {
                true => second
                    .unwrap()
                    .await
                    .map_err(Into::into)?
                    .map(|b| b.map_data(Into::into).map_err(Into::into).boxed_unsync()),
                false => first
                    .unwrap()
                    .await
                    .map_err(Into::into)?
                    .map(|b| b.map_data(Into::into).map_err(Into::into).boxed_unsync()),
            })
        };
        Box::pin(f)
    }
}

Interesting things here – I use boxed_unsync to abstract over the concrete body type, and I implement the future using async code rather than as a separate struct. It becomes much smaller, even after a few bits of extra type constraining.

One thing that flummoxed me for a little while was the need to capture the future for the underlying response outside of the async block. Failing to do so provokes a 'static requirement which was tricky to debug. Fortunately there is already a bug open against rustc about making this easier to diagnose. The underlying problem is that if you create the async block and then dereference self inside it, the impl backing .first has to live for an arbitrary time. Whereas by capturing the future immediately, only the future itself has to live for an arbitrary time, and that doesn't require changing the signature of the function.
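
A minimal sketch of the pattern, with generic names rather than the Router above, that keeps the 'static requirement on the inner future only:

use std::future::Future;
use tower::Service;

struct Wrapper<S> {
    inner: S,
}

impl<S> Wrapper<S> {
    fn call<Req>(
        &mut self,
        req: Req,
    ) -> impl Future<Output = Result<S::Response, S::Error>> + 'static
    where
        S: Service<Req>,
        S::Future: 'static,
        S::Response: 'static,
        S::Error: 'static,
    {
        // Capture the inner future *now*: the async block owns it, so only
        // S::Future (and its output types) must satisfy 'static.
        let fut = self.inner.call(req);
        // Writing `async move { self.inner.call(req).await }` instead would
        // capture the borrow of `self` in the returned future, surfacing a
        // 'static requirement on `self` that this signature cannot meet.
        async move { fut.await }
    }
}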

This is almost worth turning into a crate – I couldn’t see an existing one when I looked, though it does end up rather small – < 100 lines. What do you all think?

A moment of history

I’ve been asked more than once what it was like at the beginning of Ubuntu, before it was a company, when an email from someone I’d never heard of came into my mailbox.

We’re coming up on 20 years now since Ubuntu was founded, and I had cause to do some spelunking into IMAP archives recently… while there I took the opportunity to grab the very first email I received.

The Ubuntu long shot succeeded wildly. Of course, we liked to joke about how spammy those emails were: cold-calling a raft of Debian developers with job offers, some of them were closer to phishing attacks :). This very early one – I was the second employee (though I started at 4 days a week to transition my clients gradually) – was less so.

I think it's interesting though to note how explicitly this was framed as a gamble: a time-limited experiment, funded for a year. As the company scaled, this very rapidly became a hiring problem and the horizon had to be pushed out to 2 years to get folk to join.

And of course, while we started with arch in earnest, we rapidly hit significant usability problems. Some of these were solvable with porcelain and shallow non-architectural changes, and we built first patches, and then the bazaar VCS project, to tackle those. But others were not: for instance, I recall exceeding the 32K hard link limit on ext3 due to a single long history during a VCS conversion. The sum of these challenges led us to create the bzr project, a ground-up rethink of our version control needs, architecture, implementation and user experience. While ultimately git has conquered all, bzr had – still has, in fact – extremely loyal advocates, due to its laser-sharp focus on usability.

Anyhow, here it is: one of the original no-name-here-yet, aka Ubuntu, introductory emails (with permission from Mark, of course). When I clicked through to the website Mark provided, there was a link there to a fantastical website about a space tourist… not what I had expected to be reading in Adelaide during LCA 2004.


From: Mark Shuttleworth <xxx@xxx>
To: Robert Collins <xxx@xxx>
Date: Thu, 15 Jan 2004, 04:30

Tom Lord gave me your email address, I believe he’s
already sent you the email that I sent him so I’m sure
you have some background.

In short, I am going to fund some open source
development for a year. This is part of a new project
that I will be getting off the ground in the coming
weeks. I don’t know where it will lead, it’s flying in
the face of a stiff breeze but I think at the end of
the day it will at least fund a few very good open
source developers for a full year to work on the
projects they like most.

One of the pieces of the puzzle is high end source
code management. I’ll be looking to build an
infrastructure that will manage source code for
between 100 and 8000 open source projects (yes,
there’s a big difference between the two, I don’t know
at which end of the spectrum we will be at the end of
the year but our infrastructure will have to at least
be capable of scaling to the latter within two years)
with upwards of 2000 developers, drawing code from a
variety of sources, playing with it and spitting it
out regularly in nice packages.

Arch and Subversion seem to be the two leading
contenders for “next generation open source sccm”. I’d
be interested in your thoughts on the two of them, and
how they stack up. I’m looking to hire one person who
will lead that part of the effort. They’ll work alone
from home, and be responsible for two things. First,
extending the tool (arch or svn) in ways that help the
project. Such extensions will be released under an
open source licence, and hopefully embraced by the
tools maintainers and included in the mainline code
for the tool. And second, they will be responsible for
our large-scale implementation of SCCM, using that
tool, and building the management scripts and other
infrastructure to support such a large, and hopefully
highly automated, set of repositories.

Would you be interested in this position? What
attributes and experience do you think would make you
a great person to have on the team? What would your
salary expectation be, as a monthly figure, for a one
year contract full time?

I’m currently on your continent, well, just off it. On
Lizard Island, up North. Am headed today for Brisbane,
then on the 17th to Launceston via Melbourne. If you
happen to be on any of those stops, would you be
interested in meeting up to discuss it further?

If you’re curious you can find out a bit more about me
at www.markshuttleworth.com. This project is much
lower key than some of what you’ll find there. It’s a
very long shot indeed. But if at worst all that
happens is a bunch of open source work gets funded at
my expense I’ll feel it was money well spent.

Cheers,
Mark

=====

“Good judgement comes from experience, and often experience
comes from bad judgement” – Rita Mae Brown


Strength training from home

For the last year I've been incrementally moving away from lifting static weights and towards bodyweight-based exercises, or callisthenics. I've been doing this for a number of reasons, including better avoidance of injury (if I collapse, the entire stack is dynamic; if a bar held above my head drops on me, most of the weight is just dead weight – ouch), accessibility during travel – most hotel gyms are very poor – and functional relevance – I literally never need to put 100 kg on my back, but I do climb stairs, for instance.

Covid-19 shutting down the gym where I train is therefore only a mild inconvenience for me, because even though I don't usually do it, I am able to do nearly all my workouts entirely from home. And I thought a post about this approach might be of interest to other folk newly separated from their training facilities.

I've gotten most of my information from a few different YouTube channels:

There are many more channels out there, and I encourage you to go and look and read and find out what works for you. Those 5 are my greatest hits, if you will. I've bought the FitnessFAQs exercise programs to help me with my training, and they are indeed very effective.

While you don’t need a gymnasium, you do need some equipment, particularly if you can’t go and use a local park. Exactly what you need will depend on what you choose to do – for instance, doing dips on the edge of a chair can avoid needing any equipment, but doing them with some portable parallel bars can be much easier. Similarly, doing pull ups on the edge of a door frame is doable, but doing them with a pull-up bar is much nicer on your fingers.

Depending on your existing strength you may not need bands, but I certainly did. Buying rings is optional – I love them, but they aren’t needed to have a good solid workout.

I bought parallettes for working on the planche. Parallel bars for dips and rows. A pull-up bar for pull-ups and chin-ups, though with the rings you can add flys, rows, face-pulls, unstable push-ups and more. The rings. And a set of 3 bands that combine for 7 different support amounts.

In terms of routine, I do an upper/lower split, with 3 days on upper body, one day off, one day on lower, and the weekends off entirely. I was doing 2 days on lower body, but found I was over-training with Aikido later that same day.

On upper body days I’ll do (roughly) chin ups or pull ups, push ups, rows, dips, hollow body and arch body holds, handstands and some grip work. Today, as I write this on Sunday evening, 2 days after my last training day on Friday, I can still feel my lats and biceps from training Friday afternoon. Zero issue keeping the intensity up.

For lower body, I’ll do pistol squats, nordic drops, quad extensions, wall sits, single leg calf raises, bent leg calf raises. Again, zero issues hitting enough intensity to achieve growth / strength increases. The only issue at home is having a stable enough step to get a good heel drop for the calf raises.

If you haven’t done bodyweight training at all before, when starting, don’t assume it will be easy – even if you’re a gym junkie, our bodies are surprisingly heavy, and there’s a lot of resistance just moving them around.

Good luck, train well!

2019 in the rearview

2019 was a very busy year for us. I hadn’t realised how busy it was until I sat down to write this post. There’s also some moderately heavy stuff in here – if you have topics that trigger you, perhaps make sure you have spoons before reading.

We had all the usual stuff. Movies – my top two were Alita and Abominable though the Laundromat and Ford v Ferrari were both excellent and moving pieces. I introduced Cynthia to Teppanyaki and she fell in love with having egg roll thrown at her face hole.

When Cynthia started school we dropped gymnastics due to the time overload – we wanted some downtime for her to process after school, and with violin having started that year she was just looking so tired after a full day of school we felt it was best not to have anything on. Then last year we added in a specific learning tutor to help with the things that she approaches differently to the other kids in her class, giving 2 days a week of extra curricular activity after we moved swimming to the weekends.

At the end of last year she was finally chipper and with it most days after school, and she had been begging to get into more stuff, so we all got together and negotiated drama class and Aikido.

The drama school we picked, HSPA, is pretty amazing. Cynthia adored her first teacher there, and while upset at a change when they rearranged classes slightly, is again fully engaged and thrilled with her time there. Part of the class is putting on a full scale production – they did a version of the Happy Prince near the end of term 3 – and every student gets a part, with the ability for the older students to audition for more parts. On the other hand she tells me tonight that she wants to quit. So shrug, who knows :).

I last did martial arts when I took Aikido with sensei Darren Friend at Aikido Yoshinkai NSW back in Sydney, in the late 2000s. And there was quite a bit less of me then. Cynthia had been begging to take a martial art for about 4 years, and we'd said that when she was old enough, we'd sign her up, so this year we both signed up for Aikido at the Rangiora Aikido Dojo. The Rangiora dojo is part of the NZ organisation Aikido Shinryukan, which is part of the larger Aikikai style – quite different from, yet the same as, the Yoshinkai Aikido that I had been learning. There have been quite a few moments where I have had to go back to something core – such as my stance – and unlearn it, to learn the Aikikai technique.

Cynthia has found the group learning dynamic a bit challenging – she finds the explanations (needed when there are twenty kids of a range of ages and a range of experience, from new intakes each term through to ones that have been doing it for 5 or so years) get boring, and I can see her just switch off. Then she misses the actual new bit of information she didn't have previously :(. Which then frustrates her. But she absolutely loves doing it, and she's made a couple of friends there (everyone is positive and friendly, but there are some girls that like to play with her after the kids' lesson).

I have gotten over the body disconnect and awkwardness and things are starting to flow; I'm starting to be able to reason about things without just freezing in overload all the time, so that's not bad after a year. However, the extra weight is making my forward rolls super, super awkward. I can backward roll easily, with moderately good form; for forward rolls, though, my upper body strength is far from what's needed to support my weight through the start of the roll – my arm just collapses – so I'm in a sort of limbo: if I get the moment just right I can just start the contact on the shoulder, but if I get it slightly wrong, it hurts quite badly. And since I don't want large-scale injuries, doing the higher rolls is very unnerving for me. I suspect it's 90% psychological, but I am not sure how to get from where I am to having confidence in my technique, other than rinse-and-repeat. My hip isn't affecting training much, and sensei Chris seems to genuinely like training with Cynthia and me, which is very nice: we feel welcomed and included in the community.

Speaking of my hip – earlier this year something ripped cartilage in my right hip, and I ended up having to have an MRI scan (those machines sound exactly like a dot matrix printer) to diagnose it. Interestingly, having the MRI improved my symptoms, but we are sadly in hurry-up-and-wait mode. Before the MRI, I'd wake up at night with some soreness, my right knee bent, foot on the bed, then sleepily let my leg collapse sideways to the right – and suddenly be awake in screaming agony as the joint opened up with every nerve at its disposal. When the MRI was done, they pumped the joint full of local anaesthetic for two purposes: one is to get a clean read on the joint, and the second is so that they can distinguish between referred pain from the surrounding tissue and pain from the joint itself. It is to be expected with a joint issue that the local will make things feel better (duh), for up to a day or so while the local dissipates. The expression on the specialist's face when I told him that I had had a permanent improvement trackable to the MRI date was priceless.

Now, when I wake up with joint pain and my leg sleepily falls back to the side, it's only mildly uncomfortable, and I readjust without being brought to screaming awakeness. Similarly, early in Aikido training many activities would trigger pain, and now there's only a couple of things that do. In another 12 or so months, if the joint hasn't fully healed, I'll need to investigate options such as stem cells (which the specialist was negative about) or steroids (which he was more negative about) or surgery (which he was even more negative about). My theory about the improvement is that the ripped cartilage was sitting badly and the inflation for the MRI allowed it to settle back into the appropriate place (and perhaps start healing better). I'm told that reducing inflammation systemically is a good option. Turmeric time.

Sadly Cynthia has had some issues at school – she doesn't fit the average mould, and while widespread bullying doesn't seem to be a thing, there is enough of it, and she receives enough of it, that it's impacted her happiness more than a little – this blows up at school and at home as well. We've been trying a few things to improve this – helping her understand why folk behave badly, and what to do in the moment (e.g. this video), but also that anything that goes beyond speech is assault and she needs to report that to us or teachers no matter what.

We've also had some remarkably awful interactions with another family at the school. We thought we had a friendly relationship, but I managed to trigger a complete meltdown of the relationship – not by doing anything objectively wrong, but because we had (unknown to me) different folkways, and some perfectly routine and normal behaviour turned out to be stressful and upsetting to them, and then they didn't discuss it with us at all until it had brewed up in their heads into a big mess… and it's still not resolved (and may not ever be: they are avoiding us both).

I weighed in at 110kg this morning. On Jan the 4th 2019 I was 130.7kg. On Feb 1 2018 I was 115.2kg. This year I peaked at 135.4kg, and got down to 108.7kg before Christmas food set in. That's pretty happy-making, all things considered. Last year I was diagnosed with coitus headaches and, though I didn't know it at the time, the medicine I was put on has a known side effect of weight gain. And it did – I had put it down to ongoing failure to manage my diet properly, but once my weight loss doctor gave me an alternative prescription for the headaches, I was able to start losing weight immediately. Sadly, though the weight gain through 2018 was effortless, losing the weight through 2019 was not. Doable, but not effortless. I saw a neurologist for the headaches when they recurred in 2019, and got a much more informative readout on them, how to treat them and so on – basically the headaches can be thought of as an instability in the system, and the medicine's goal is to stabilise things; once stable for a decent period, we can attempt to remove the crutch. Often that's successful, sometimes not, sometimes it's successful on a second or third try. Sometimes you're stuck with it forever. I've been eating a keto / LCHF diet – not super strict keto, though Jonie would like me to be on that; I don't have the will power most of the time – there's a local truck stop that sells killer hotdogs. And I simply adore them.

I started this year working for one of the largest companies on the planet – VMware. I left there in February and wrote a separate post about that. I followed that job with nearly the polar opposite – a startup working on a blockchain content distribution system. I wrote about that too. Changing jobs is hard in lots of ways – for instance I usually make friendships at my jobs, and those suffer some when you disappear to a new context – not everyone makes connections with you outside of the job context. Then there's the somewhat non-rational emotional impact of not being in paid employment. The puritans have a lot to answer for. I'm there again, looking for work (and hey, if you're going to be at Linux.conf.au (Gold Coast, Australia, January 13-17) I'll be giving a presentation about some of the interesting things I got up to in the last job interregnum I had).

My feet have been giving me trouble for a couple of years now. My podiatrist is reasonably happy with my progress – and I can certainly walk further than I could – I even did some running earlier in the year, until I got shin splints. However, I seem to have hypersensitive soles, so she can't correct my pronation until we fix that, which at least for now means a 5-minute session where I touch my feet, someone else does, then something smooth, then something rough – called “sensory massage”.

In 2017 and 2018 I injured myself at the gym, and in 2019 I wanted to avoid that, so I sought out ways to reduce injury. Moving away from machines was a big part of that; more focus on technique another part. But perhaps the largest part was moving from lifting dead weight to focusing on body weight exercises – callisthenics. This shifts from a dead weight to control when things go wrong, to an active weight, which can help deal with whatever has happened. So far at least, this has been pretty successful – although I’ve had minor issues – I managed to inflame the fatty pad the olecranon displaces when your elbow locks out – I’m nearly entirely transitioned to a weights-free program – hand stands, pistol squats, push ups, dead hangs and so on. My upper body strength needs to come along some before we can really go places though… and we’re probably going to max out the hamstring curl machine (at least for regular two-leg curls) before my core is strong enough to do a Nordic drop.

Lynne has been worried about injuring herself with weight lifting at the gym for some time now, but recently saw my physio – Ben Cameron at Pegasus PhysioSouth – who is excellent, and he suggested that she could have less chronic back pain if she took weights back up again. She's recently told me that I'm allowed one ‘told you so’ about this, since she found herself in a spot where previously she would have put herself in a poor lifting position, but the weight training gave her a better option and she intuitively used it, avoiding pain. So that's a good thing – complicated because of her body's complicated history, but an excellent trainer and physio team are making progress.

Earlier this year she had a hell of a fright, with a regular eye checkup getting referred into a ‘you are going blind; maybe tomorrow, maybe within 10 years’ nightmare scenario. Fortunately a second opinion got a specialist who probably knows the same amount but was willing to communicate it with actual words… Lynne has a condition which diabetes (type I or II) can affect, and she has a vein that can alter state somewhat arbitrarily but will probably only degrade slowly, particularly if Lynne’s diet is managed as she has been doing.

Diet wise, Lynne also has been losing some weight, but this is complicated by her chronic idiopathic pancreatitis. That's code for ‘it keeps happening and we don't know why’ pancreatitis. We've consulted a specialist in the North Island who comes highly recommended by Lynne's GP, who said that rapid weight loss is a little-known but possible cause of pancreatitis – and that fits the timelines involved. So Lynne needs to lose weight to manage the onset of type II diabetes. But not too fast, to avoid pancreatitis, which will hasten the onset of type II diabetes. Aiee. Slow but steady – she's working with the same doctor I am for that, and a similar diet, though lower on the fats as she has no gall… bladder.

In April our kitchen waste pipe started chronically blocking, and investigation with a drain robot revealed a slump in the pipe. Ground-penetrating radar revealed an anomaly under the garage… and this escalated. We're going to have to move out of the house for a week while half the house's carpets are lifted, grout is pumped into the foundations to tighten it all back up again – and hopefully they don't over-pump it – and then it all gets replaced. Oh, and it looks like the drive will be replaced again, to fix the slumped pipe permanently. It will be lovely when done but right now we're facing a wall of disruption and argh.

Around September, I think, we managed to have a gas poisoning scare – our gas hob was left on and triggered a fireball which fortunately only scared Lynne rather than flambéing her face. We did not, however, know how much exposure we'd had to the LPG, nor to partially combusted gas – which produces toxic CO as a by-product – so there was a trip into the hospital for observation with Cynthia, with Lynne opting out. Lynne and Cynthia had had plenty of the basic symptoms – headaches, dizziness and so on – at the time, but after waiting for 2 hours in the ER queue that had faded. Le sigh. The hospital, bless their cotton socks, don't have the necessary equipment to diagnose CO poisoning without a pretty invasive blood test, but still took Cynthia's vitals using methods (manual observation and an infra-red reader) that are confounded by the carboxyhemoglobin that forms from the CO that has been inhaled. Pretty unimpressed – our GP was livid. (This is one recommended protocol.) Oh, and when we got our gas hob checked out – as we were not sure if we had left it on, or it had been misbehaving – it turned out to have never been safe, got decertified and the pipe cut at the regulator. So we're cooking on a portable induction hob for now.

When we moved to Rangiora I was travelling a lot more, Christchurch itself had poorer air quality than Rangiora, and our financial base was a lot smaller. Now, Rangiora's population has nearly doubled (13k to 19k conservatively – and that's ignoring the surrounds that use Rangiora as a base), we have more to work with, the air situation in Christchurch has improved massively, and even a busy year's travel is less than I was doing before Cynthia came along. We're looking at moving – we're not sure where yet; maybe more country, maybe more city.

One lovely bright spot over the last few years has been reconnecting with friends from school, largely on Facebook – some of whom I had forgotten that I knew back at school – I had a little clique but was not very aware of the wider school population in hindsight (this was more than a little embarrassing to me, as I didn’t want to blurt out “who are you?!”) – and others whom I had not :). Some of these reconnections are just light touch person-X exists and cares somewhat – and that’s cool. One in particular has grown into a deeper friendship than we had back as schoolkids, and I am happy and grateful that that has happened.

Our cats are fat and happy. Well mostly. Baggy is fat and stressed and spraying his displeasure everywhere whenever the stress gets too much :(. Cynthia calls him Mr Widdlepants. The rest of the time he cuddles and purrs and is generally happy with life. Dibbler and Kitten-of-the-wild are relatively fine with everything.

Cynthia’s violin is coming along well. She did a small performance for her classroom (with her teacher) and wowed them. I’ve been inspired to start practising trumpet again. After 27 years of decay my skills are decidedly rusty, but they are coming along. Finding arrangements for violin + trumpet is a bit challenging, and my sight-reading-with-transposition struggles to cope, but we make do. Lynne is muttering about getting a clarinet or drum-kit and joining in.

So, 2019. Whew. I hope yours was less stressful and had as many or more bright points than ours. Onwards to 2020.

A Cachecash retrospective

In June 2019 I started a new role as a software engineer at a startup called Cachecash. Today is probably the last day of payroll there, and as is my usual practice, I'm going to reflect back on my time there. Less commonly, I'm going to do so in public, as we're about to open the code (yay), and it's not a mega-corporation with everything shuttered up (also yay).

Framing

This is intended to be a blameless reflection on what has transpired. Blameless doesn’t mean inaccurate; but it means placing the focus on the process and system, not on the particular actor that happened to be wearing the hat at the time a particular event happened. Sometimes the system is defined by the actors, and in that case – well, I’ll let you draw your own conclusions if you encounter that case.

A retrospective that we can't learn from is useless. Worse than useless, because it takes time to write and time to read and that time is lost to us forever. So if a thing is a particular way, it is going to get said. Not to be mean, but because false niceness will waste everyone's time. Mine and my ex-colleagues', whose time I respect. And yours, if you are still reading this.

What was Cachecash

Cachecash was a startup – still is in a very technical sense, corporation law being what it is. But it is still a couple of code bases – and a nascent open source project (which will hopefully continue) – built to operationalise and productise this research paper that the Cachecash founders wrote.

What it isn't any more is a company investing significant amounts of time and money, in the form of engineering, to make those code bases better.

Cachecash was also a team of people. That obviously changed over time, but at the time I write this it is:

  • Ghada
  • Justin
  • Kevin
  • Marcus
  • Petar
  • Robert
  • Scott

And we’re all pretty fantastic, if you ask me :).

Technical overview

The CAPNet paper that I linked above doesn't describe a product. What it describes is a system that permits paying caches (think squid/varnish etc) for transmitting content to clients, while also detecting attempts by such caches to claim payment when they haven't transmitted, or to collude with a client and claim to have transmitted more than they actually did. A classic incentives-aligned scheme.

Note that there is no blockchain involved at this layer.

The blockchain was added into this core system as a way to build a federated marketplace – the idea was that the blockchain provided a suitable substrate for negotiating the purchase and sale of contracts that would be audited using the CAPNet accounting system, the payments could be micropayments back onto the blockchain, and so on – we’d avoid the regular financial system, and we wouldn’t be building a fragile central system that would prevent other companies also participating.

Miners would mine coins, publishers would buy coins then place them in escrow as a promise to pay caches to deliver content to clients, and a client would deliver proof of delivery back to the cache which would then claim payment from the publisher.

Technical Challenges

There were a few things that turned up as significant issues. In no particular order:

The protocol

The protocol itself adds additional round trips to multiple peers – in its ‘normal’ configuration the client ends up running GRPC connections (web-GRPC for browsers) to 5 endpoints (with all the normal windowing concerns, but potentially over QUIC), and then gets chunks of content in batches (concurrently) from 4 of the peers, runs a small crypto brute force operation on the combined result, and then moves onto the next group of content. This should be sounding suspiciously like TCP – it is basically a window management problem, and it has exactly the same performance management problems – fast start, maximum window size, how far to reduce it when problems are suffered. But accentuated: those 4 cache peers can all suffer their own independent noise problems, or be hostile. They can also suffer correlated problems: they might all be in the same datacentre, or all be run by a hostile actor, or the client might be on a hostile WiFi link, or the client's OS/browser might be hostile. Let's just say that there is a long, rich road for optimising this new protocol to make it fast, robust and reliable. Much as it has taken many years to evolve HTTP into QUIC, drawing upon techniques like forward error correction rather than retries, similar techniques will need to be applied to give this protocol similar performance characteristics. And evolving the protocol while maintaining the security properties is a complicated task, with three actors involved, who may collude in various ways.
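
For illustration, here is a minimal sketch of the per-window shape described above; all of the names (CacheClient, get_chunk, decode_group) are hypothetical stand-ins, not the Cachecash API:

use futures::future::try_join_all;

struct CacheClient; // stand-in for one GRPC / web-GRPC cache connection

impl CacheClient {
    // Hypothetical: fetch this cache's share of one chunk group.
    async fn get_chunk(&self, group: u64) -> std::io::Result<Vec<u8>> {
        let _ = group;
        Ok(Vec::new())
    }
}

// Hypothetical: the small crypto brute force over the combined chunks.
fn decode_group(chunks: &[Vec<u8>]) -> Vec<u8> {
    chunks.concat()
}

async fn fetch_window(caches: &[CacheClient; 4], group: u64) -> std::io::Result<Vec<u8>> {
    // All four responses must land before decoding can start, so a slow or
    // hostile cache stalls the whole window, which is where the TCP-like
    // questions (fast start, window size, back-off) come from.
    let chunks = try_join_all(caches.iter().map(|c| c.get_chunk(group))).await?;
    Ok(decode_group(&chunks))
}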

An early performance analysis I did on the Go implementation showed that the brute-forcing work was a bottleneck: while the time taken (once optimised) was entirely modest for any small amount of data, the delay added per window element acts as a brake on performance for high-capacity, low-latency links. For a 1Gbps, 25ms RTT link I estimated a need for 8 cores doing crypto brute forcing on the client.

JS

Cachecash is essentially implementing a new network protocol. There are some great hooks these days in browsers, and one can hook in and provide streams to things like video players to let them get one segment of video. However, for downloading an entire file – for instance, if one is downloading a full video – it is not so easy. This bug, open for 2 years now, covers the standards-based way to do it. Even the non-standards-based way to do it involves buffering the entire content in memory, oh, and reflecting everything through a static GitHub service worker. (You can of course host such a static page yourself, but then the whole idea of this federated distributed system breaks down a little.)

Our initial JS implementation was getting under 512KBps with all-local servers – part of that was the bandwidth delay product issue mentioned above. Moving to getting chunks of content from each cache concurrently using futures improved that up to 512KBps, but that's still shocking for a system we want to be able to compete with the likes of YouTube, Cloudflare and Akamai.

One of the hot spots turned out to be calculating SHA-256 values – the CAPNet algorithm calculates thousands (it's tunable, but 8k in the set I was analysing) of independent SHAs per chunk of received data. This is a problem: in-browser SHA routines, even the recent natively hosted ones, are slow per SHA. They are not slow per byte. Most folk want to make a small number of SHA calculations – maybe thousands in total, not tens of thousands per MB of data received. So we wrote an implementation of the core crypto routines in Rust WASM, which took our performance locally up to 2MBps in Firefox and 6MBps in Chromium.
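
A minimal sketch of that idea (not the actual Cachecash code; the batching interface is hypothetical): do a whole batch of digests on the Rust/WASM side and cross the JS boundary once per batch rather than once per SHA.

use sha2::{Digest, Sha256};
use wasm_bindgen::prelude::*;

/// Hash `count` equal-sized slices of `data` and return the 32-byte digests
/// concatenated, so the JS side pays one WASM boundary crossing per batch
/// rather than one digest call per SHA.
#[wasm_bindgen]
pub fn sha256_batch(data: &[u8], count: usize) -> Vec<u8> {
    let chunk_len = (data.len() / count.max(1)).max(1);
    let mut out = Vec::with_capacity(count * 32);
    for chunk in data.chunks(chunk_len).take(count) {
        out.extend_from_slice(Sha256::digest(chunk).as_slice());
    }
    out
}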

It is also possible we’d show up as crypto-JS at that point and be blacklisted as malware!

Blockchain

Having chosen to involve a blockchain in the stack we had to deal with that complexity. We chose to take bitcoin's good bits and run with those rather than either running a sidechain, trying to fit new transaction types into bitcoin itself, or trying to shoehorn our particular model into e.g. Ethereum. This turned out to be a fairly large amount of work: not the core chain itself – cloning the parts of bitcoin that we wanted was very quick – but layering on the changes that we needed, to start dealing with escrows and negotiating parameters between components and so forth. And some of the operational challenges below turned up here as well, even just in developer test setups (in particular endpoint discovery).

Operational Challenges

The operational model was pretty interesting. The basic idea was that eventually there would be this big distributed system, a bit-coin like set of miners etc, and we’d be one actor in that ecosystem running some subset of the components, but that until then we’d be running:

  • A centralised ledger
  • Centralised random number generation for the micropayment system
  • Centralised deployment and operations for the cache fleet
  • Software update / vetting for the publisher fleet
  • Software update / publishing for the JS library
  • Some number of seed caches
  • Demo publishers to show things worked
  • Metrics, traces, chain explorer, centralised logging

We had most of this live and running in some fashion for most of the time I was there – we evolved it and improved it a number of times as we iterated on things. Where appropriate we chose open source components like Jaeger, Prometheus and Elasticsearch. We also added policy layers on top of them to provide rate limiting and anti-spoofing facilities. We deployed stuff in AWS, with EKS, and there were glitches and things to workaround but generally only a tiny amount of time went into that part of it. I think I spent a day on actual operations a month, or thereabouts.

Other parties were then expected to bring along additional caches to expand the network, additional publishers to expand the content accessible via the network, and clients to use the network.

Ensuring a process run by a third party is network reachable by a browser over HTTPS is a surprisingly non-simple problem. We partly simplified it by mandating that they run a docker container that we supplied, but there's still the chance that they are running behind a firewall with asymmetric ingress. And after that we still need a domain name for their endpoint. You can give every cache a CNAME in a dedicated subdomain – say using their public key as the subdomain, so that only that cache can issue requests to update their endpoint information in DNS. It is all solvable, but doing it so that the amount of customer interaction and handholding is reduced to the bare minimum is important: a user with a fleet of 1000 machines doesn't want to talk to us 1000 times, and we don't want to talk to them either. But this was another bit of this-isn't-really-distributed-is-it grit in the distributed ointment.

Adoption Challenges

ISPs with large fleets of machines are in principle happy to sell capacity on them in return for money – yay. But we have no revenue stream at the moment, so they aren't really incentivised to put effort in; it becomes a matter of principle, not a fiscal “this is 10x better for my business” imperative. And right now, it's 10x slower than HTTP. Or more.

Content owners with large amounts of content being delivered without a CDN would like a radically cheaper CDN. Except – we're not actually radically cheaper on a cost structure basis. Current CDNs charge a premium for their 2nd and 3rd generation products because no-one else offers what they offer – seamless in-request edge computing. But the ISP that is contributing a cache to the fleet is going to want the cache paid for, and that's the same cost structure as existing CDNs – who often have a free entry tier. We might have been able to make our network cheaper eventually, but I'm just not sure about the radically cheaper bit.

Content owners who would like a CDN marketplace where the CDN caches are competing with each other – driving costs down – rather than the CDN operators competing, would absolutely love us. But I rather suspect that those owners want more sophisticated offerings. To be clear, I wasn't on the customer development team, and didn't get much in the way of customer development briefings. But things like edge computing workers, where completely custom code can run in the CDN network, adjacent to one's user, are much more powerful offerings than simple static content shipping, and are offered by all major CDNs. These are trusted services – the CAPNet paper doesn't solve the problem of running edge code and providing proof that it was run. Enarx might go some way, or even a long way, towards running such code in an untrusted context, but providing a proof that it was run – so that running it can become a mining or mining-like operation – is a whole other question. Without such an answer, an edge computing network starts to depend on trusting the caches' behaviour a lot more all over again – the network has no proof of execution to depend on.

Rapid adjustment – load spikes – is another possible use case, but the use of the blockchain to negotiate escrows actually seemed to work against our ability to offer that. Akamai defines a load spike in a time frame faster than many blockchains can decide that a transaction has actually been accepted. Off-chain transactions are of course a known thing in the blockchain space, but again that becomes additional engineering.

Our use of a new network protocol – for all that it was layered on standard web technology – made it harder for potential content owners to adopt our technology. Rather than “we have 200 local proxies that will deliver content to your users, just generate a url of the form X.Y.Z”, our solution is “we do not trust the 200 local proxies that we have, so you need to run complicated JS in your browser/phone app etc” to verify that the proxies are actually doing their job. This is better in some ways – precisely because we don't trust those proxies – but it also increases the runtime cost of using the service, the integration cost of adopting the service, and the complexity of debugging issues receiving content via the service.

What did we learn?

It is said that “A startup is an organization formed to search for a repeatable and scalable business model.” What did we uncover in our search? What can we take away going forward?

In principle we have a classic two sided market – people with excess capacity close to users want to sell it, and people with excess demand for their content want to buy delivery capacity.

The baseline market is saturated. The market as a whole is on its third or perhaps fourth (depending on how you define things) major iteration of functionality.

Content delivery purchasers are OK with trusting their suppliers: any supply chain fraud happening in this space at the moment is so small that no-one I heard from is talking about it.

Some of the things we were doing don’t seem to have been important to the customers we talked to – I don’t have a great read on this, but in particular, the blockchain aspect seems to have been more important to our long term vision than to the 2-sided market place that we perceived. It would be fascinating to me to validate that somehow – would cache capacity suppliers be willing to trust us enough to sell capacity to us with just the auditing mechanism, without the blockchain? Would content providers be happy buying credit from us rather than from a neutral exchange?

What did I learn?

I think in hindsight my startup muscles were atrophied – it had been some years since Canonical, and it took a few months to start really thinking lean-startup again on a personal basis. That's ok, because I was hired to build systems. But it's not great, because I can do better. So number one: think lean-startup and really step up to help with learning and validation.

I levelled up my Go skills. That was really nice – Kevin has deep knowledge there, and though I've written Go before I didn't have a good appreciation for style or aesthetics, or why. I do now. Where before I'd say ‘I'm happy to dive in but it's not a language I feel I really know’, I am now happy to say that I know Go. More to learn – there always is – but in a good place.

I did a similar thing with my JS skills, but not to the same degree: I climbed fairly deeply into the JS client – which is written in TypeScript – converted its bundling system to webpack to work better with Rust-WASM, and so on. It's still not my go-to place, but I'm much more comfortable there now.

And of course playing with Rust-WASM was pure delight. Markus and I are both Rust aficionados, and having a genuine reason to write some Rust code for work was just delightful. Finding this bug was just a bonus :).

It was also really, really nice being back in a truly individual contributor role for a while. I really enjoyed being able to just fix bugs and get on with things while I got my bearings. I've ended up doing a bit more leadership – refining requirements, translating between idea and specification and the like – recently, but still about 80% of my time has been sit-down-and-code, and that really is a pleasant holiday.

What am I going to change?

I’m certainly going to get a new job :). If you’re hiring, hit me up. (If you don’t have my details already, linkedin is probably best).

I think the core thing I need to do is better align my day-to-day work with the needs of customer development: I don't want to take on or take over the customer development role – for startups that is often best done in person with a customer, and I'm happy remote – but the more I can connect what I'm trying to achieve with what will get customers to pay us, the more successful any business I'm working in will be. This may be a case for non-vanity metrics, or talking more with the customer development team, or – well, I don't know exactly what it will look like until I see the context I end up in, but I think more connection will be important.

And I think the second major thing is to find a better balance between individual contribution and leadership. I love individual contribution, it is perhaps the least stressful and most Zen place to be. But it is also the least effective unless the project has exactly one team member. My most impactful and successful roles have been leadership roles, but the pure leadership role with no individual contribution slowly killed me inside. Pure individual contribution has been like I imagine crack to be, and perhaps just as toxic in the long term.

Rust and distributions

Daniel wrote a lovely blog post about Rust’s ability to be included in distributions, both as a language that you can get via the distribution, and as the language that components of the distribution are being written in.

I think this is a great goal to raise and I have just a few thoughts and quibbles. First I want to acknowledge and agree with him on the Rust community: it's so very nice, and he is doing a great thing as rustup lead. I wish I had more time to put in; I have more things I want to contribute to rustup. I'll try to get back to the meetings soon.

On trust

I completely agree about the need for the crates index improvements: without those we cannot have a mirror network, and that's a significant issue for offline users and slow-region users.

On curl|sh though

It isn't the worst possible thing: for all that it's “untrusted bootstrapping”, the actual thing downloaded is HTTPS-secured etc, and so is the rustup binary itself. Put another way, I think the horror is more perceptual than analyzed risk. Someone that trusts Verisign etc enough to download the Debian installer over it has exactly the same risk as someone trusting Verisign enough to download rustup at that point in time.

Cross-signing that curl|sh script with per-distro keys or something seems pretty ridiculous to me, since the root of trust is still that first download; unless you're wandering up to someone who has bootstrapped their compiler by hand (to avoid reflections-on-trust attacks), to get an installer, to build a system, to then do reproducible builds, to check that other systems are actually safe… aieeee.

I think it's easier to package the curl|sh shell script in Debian itself, perhaps? apt install get-rustup; then if/when rustup becomes packaged, the user instructions don't change but the root of trust would: get-rustup would be updated to not download rustup but to trigger a different package install, and so forth.

I don't think it's desirable, though, to have distribution forks of the content that rustup manages – Debian+Redhat+Suse+… builds of nightly Rust with all the things failing or not, and so on – I don't see who that would help. And if we don't have that, then the root of trust would still not be shifted under the GPG keychain – it would still be the HTTPS infrastructure for downloading Rust toolchains plus the integrity of the rustup toolchain builds themselves. Making rustup, which currently shares that trust, have a different trust root seems pointless.

On duplication of dependencies

I think Debian needs to become more inclusive here, not Rustup. Debian has spent – pauses, counts – yes, DECADES rejecting multiple entire ecosystems because of a prejudiced view about what the Right Way to manage dependencies is. And they are not right in a universal sense. They were right in an engineering sense: given their constraints (builds are expensive, bandwidth is expensive, disk is expensive), they are right. But those are not universal constraints, and seeking to impose those constraints on Java and Node has been an unmitigated disaster. It hasn't made those upstreams better, or more secure, or systematically fixed problems for users. I have another post on this, so rather than repeating myself I'm going to stop here :).

I think Rust has – like those languages – made the crucial choice, important for maintainer and engineering efficiency, to embrace incremental change across libraries, with the consequence that dependencies don't shift atomically. And sure, this is basically incompatible with the Debian packaging world view, which says that point and patch releases of a library are not distinct packages, and thus the shared libs for these things all coexist in the same file on disk. Boom! Crash!

I assert that it is entirely possible to come up with a reasonable design for managing a repository of software that doesn't make this conflation, would allow actual point and patch releases to exist as they are for the languages that have this characteristic, and be amenable to automation, auditing and reporting for security issues. E.g. modernise Debian to cope with this fundamentally different language design decision… which would make Java and Node and Rust work so very much better.

Alternatively, if Debian doesn’t want to make it possible to natively support languages that have made this choice, Debian could:

  • ship static-but-for-system-libs builds
  • not include things written in rust
  • ask things written in rust to converge their dependencies again and again and again (and only update them when the transitive dependencies across the entire distro have converged)

I have a horrible suspicion about which Debian will choose to do :(. The blinkers / echo chamber are so very strong in that community.

For Windows

We got to parity with Linux for IO for non-McAfee users, but I guess there are a lot of them out there; we probably need to keep pushing on tweaking it until it works better for them too; perhaps autodetect McAfee and switch to minimal? I agree that making Windows users – as I am these days – feel tier one would be nice :). Maybe a survey of user experience would be a good starting point.

Shared libraries

Perhaps generating versioned symbols automatically and building many versions of the crate and then munging them together? But I'd also like to point out again that the whole focus on shared libraries is a bit of a distribution blind spot, and looking at the vast amount of software distribution occurring in app stores and their model suggests different ways of dealing with these things. See also the fairly specific suggestion I make about the packaging system in Debian that is the root of the problem, in my entirely humble view.

Bonus

John Goerzen posted an entirely different thing recently, but in it he discusses programs that don't properly honour terminfo. Sadly I happen to know that large chunks of the Rust ecosystem assume that everything is ANSI these days, and it certainly sounds like, at least for John, that isn't true. So that's another way in which Rust could be more inclusive – use these things that have been built, rather than being modern and new age and reinventing the 95% match.

Want me to work with you?

Reach out to me – I’m currently looking for something interesting to do. https://www.linkedin.com/in/rbtcollins/ and https://twitter.com/rbtcollins are good ways to grab me if you don’t already have my details.

Should you reach out to me? Maybe :). First, a little retrospective.

Three years ago, I wrote the following when reflecting on what I wanted to be doing:

Priorities (roughly ordered most to least important):

  • Keep living in Rangiora (family)
  • Up to moderate travel requirements – 4 trips a year + LCA/PyCon
  • Significant autonomy (not at the expense of doing the right thing for the company, just I work best with the illusion of free will 🙂 )
  • Be doing something that matters
    • -> Being open source is one way to this, but not the only one
  • Something cutting edge would be awesome
    • -> Rust / Haskell / High performance requirements / scale / ….
  • Salary

How well did that work for me? Pretty good. I had a good satisfying job at VMware for 3 years, met some wonderful people, achieved some very cool things. And those priorities above were broadly achieved.
The one niggle that stands out was this – did the things we were doing matter? Certainly there was no social impact – VMware isn't a non-profit, being right at the core of capitalism as it is. There was direct connection and impact with the team, the staff we worked with and the users of the products… but it is just a bit hard to feel really connected through that: VMware is a very large company and there are many layers between users and developers.

We were quite early adopters of Kubernetes, which allowed me to deepen my Go knowledge and experience some more fun with AWS scale operations. I had many interesting discussions about the relative strengths of Python, Go, Rust and Java with colleagues there. (Hi Geoffrey).

Company culture is very important to me, and VMware has a fantastically supportive culture. One of the most supportive companies I've been in, bar none. It isn't a truly remote-organised company though: rather it's a bunch of offices that talk to each other, which I think is sad. True remote-first offers so much more engagement.

I enjoy building things to solve problems. I’ve either directly built, or shaped what is built, in all my most impactful and successful roles. Solving a problem once by hand is fine; solving it for years to come by creating a tool is far more powerful.

I seem to veer into toolmaking very often: giving other people the ability to solve their problems takes the power of a tool and multiplies it even further.

It should be no surprise then that I very much enjoy reading white papers like the original Dapper and Map-reduce ones, LinkedIn’s Kafka or for more recent fodder the Facebook Akkio paper. Excellent synthesis and toolmaking applied at industrial scale. I read those things and I want to be a part of the creation of those sorts of systems.

I was fortunate enough to take some time to go back to university part-time, which though logistically challenging is something I want to see through.

Thus I think my new roughly ordered (descending) list of priorities needs to be something like this:

  • Keep living in Rangiora (family)
  • Up to moderate travel requirements – 4 team-meeting trips a year + 2 conferences
  • Significant autonomy (not at the expense of doing the right thing for the company, just I work best with the illusion of free will 🙂 )
  • Be doing something that matters
    • Be working directly on a problem / system that has problems
  • Something cutting edge would be awesome
    • Rust / Haskell / High performance requirements / scale / ….
  • A generative (Westrum definition) + supportive company culture
  • Remote-first or at least very remote familiar environment
  • Support my part time study / self improvement initiative
  • Salary

Continuous Delivery and software distributors

Back in 2010 the continuous delivery meme was just gaining traction. Today it's extremely well established… except in F/LOSS projects.

I want that to change, so I’m going to try and really bring together a technical view on how that could work – which may require multiple blog posts – and if it gets traction I’ll put my fingers where my thoughts are and get into specifics with any project that wants to do this.

This is however merely a worked model today: it may be possible to do things quite differently, and I welcome all discussion about the topic!

tl;dr

Pick a service discovery mechanism (e.g. environment variables), write two small APIs – one for flag delivery, with streaming updates, and one for telemetry, with an optional aggressive data hiding proxy, then use those to feed enough data to drive a true CI/CD cycle back to upstream open source projects.
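As a purely hypothetical illustration of the shape I have in mind – the trait names, environment variables and signatures below are mine, not anything that exists today – the two APIs and the service discovery step might look like:

use std::collections::HashMap;

/// Hypothetical flag-delivery interface: consumers can read the current
/// value and subscribe to streaming updates.
trait FlagDelivery {
    fn current(&self, flag: &str) -> Option<String>;
    fn subscribe(&mut self, callback: Box<dyn Fn(&str, &str) + Send>);
}

/// Hypothetical telemetry interface: best-effort, fire-and-forget reporting.
trait Telemetry {
    fn report(&self, version: &str, health: &str, flags: &HashMap<String, String>);
}

/// Service discovery via environment variables, as suggested above.
fn discover_endpoints() -> (Option<String>, Option<String>) {
    (
        std::env::var("FLAG_API_URL").ok(),
        std::env::var("TELEMETRY_API_URL").ok(),
    )
}

fn main() {
    let (flag_api, telemetry_api) = discover_endpoints();
    println!("flag API: {:?}, telemetry API: {:?}", flag_api, telemetry_api);
}

The point of keeping both interfaces this small is that any vendor or user can stand up their own implementation behind the same discovery mechanism.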

Who is in?

Background

(This assumes you know what C/D is – if you don’t, go read the link above, maybe wikipedia etc, then come back.)

Consider a typical SaaS C/D pipeline:

git -> build -> test -> deploy

Here all stages are owned by the one organisation. Once deployed, the build is usable by users – it's basically the simplest pipeline around.

Now consider a typical on-premise C/D pipeline:

git -> build -> test -> expose -> install

Here the last stage, the install stage, takes place in the user's context, but it may be under the control of the creator, or it may be under the control of the user. For instance, Google Play updates on an Android phone: when one selects 'Update Now', the install phase is triggered. Leaving the phone running with power and Wi-Fi will trigger it automatically, and security updates can be pushed anytime. Continuing the use of Google Play as an example, the expose step here is an API call to upload precompiled packages, so while there are three parties, the distributor – Google – isn't performing any software development activities (they do gatekeep, but not develop).

Where it gets awkward is when there are multiple parties doing development in the pipeline.

Distributing and C/D

Let's consider an OpenStack cloud underlay circa 2015: an operating system, OpenStack itself, some configuration management tool (or tools), a log egress tool, a metrics egress handler, hardware management vendor binaries. And let's say we're working on something reasonably standalone. Say horizon.

OpenStack for most users is something obtained from a vendor, e.g. Cisco or Canonical or RedHat. And the model here is that the vendor is responsible for what the user receives; so security fixes – in particular embargoed security fixes – cannot be published publicly and then slowly propagate. They must reach users very quickly – often, ideally, before the public disclosure.

Now we have something like this:

upstream ends with distribution, then vendor does an on-prem pipeline


Can we not just say 'the end of the C/D pipeline is a .tar.gz of horizon at the distribute step'? Then every organisation can make their own decisions?

Maybe…

Why C/D?

  • Lower risk upgrades (smaller changes that can be reasoned about better; incremental enablement of new implementations to limit blast radius, decoupling shipping and enablement of new features)
  • Faster delivery of new features (less time dealing with failed upgrades == more time available to work on new features; finished features spend less time in inventory before benefiting users).
  • Better code hygiene (the same disciplines needed to make C/D safe also make more aggressive refactoring and tidiness changes safer to do, so it gets done more often).

1. If the upstream C/D pipeline stops at a tar.gz file, the lower-risk upgrade benefit is reduced or lost: the pipeline isn't able to actually push all the way to installation, and thus we cannot tell when a particular upgrade workaround is no longer needed.

But Robert, that is the vendor's problem!

I wish it was: in OpenStack so many vendors had the same problem that they created shared branches to work on it, then asked for shared time from the project to perform C/I on those branches. The benefit is only realised when the developer who is responsible for creating the issue can fix it and can be sure that the fix has been delivered; this means either knowing that every install will transiently install every intermediate version, or that they will keep every workaround for every issue for some minimum time period, or that there will be a pipeline that can actually deliver the software.

2. .tar.gz files are not installed and running systems. A key characteristic of a C/D pipeline is that it exercises the installation and execution of software; the ability to run a component up is quite tightly coupled to the component itself – for all that the 'this is a process' interface is very general, the specific 'this is server X' or 'this is CLI utility Y' interfaces are very concrete. Perhaps a container based approach, where a much narrower interface can be defined, could be used to mitigate this aspect. Then even if different vendors use different config tools to do last mile config, the dev cycle knows that configuration and execution works. We need to make sure that we don't separate the teams and their products though: the pipeline upstream must only test code that is relevant to upstream – and downstream likewise. We may be able to find a balance here, but I think more work articulating what that looks like is needed.

3. It will break the feedback cycle if the running metrics are not received upstream; yes, we need to be careful of privacy aspects, but basic telemetry – the upgrade worked, the upgrade failed, here is a crash dump – these are the tools for sifting through failure at scale, and a number of open source projects like Firefox, Ubuntu and Chromium have adopted them with great success. Notably all three have direct delivery models: their preference is to own the relationship with the user and gather such telemetry directly.

C/D and technical debt

Sidebar: ignoring public APIs and external dependencies, because they form the contract that installations and end users interact with, which we can reasonably expect to be quite sticky, the rest of a system should be entirely up to the maintainers, right? Refactor the DB; switch frameworks; switch languages; clean up classes and so on. With microservices there is a grey area: APIs that other microservices use which are not publicly supported.

The grey area is crucial, because it is where development drag comes in: anything internal to the system can be refactored in a single commit, or in a series of small commits that is rolled up into one, or variations on this theme.

But an aspect that another discrete component, with its own delivery cycle, depends upon: that cannot be changed so freely, and unless it was built with the same care public APIs were, it may well have poor scaling or performance characteristics that make fixing it very important.

Given two C/D’d components A and B, where A wants to remove some private API B uses, A cannot delete that API from its git repo until all B’s everywhere that receive A via C/D have been deployed with a version that does not use the private API.

That is, old versions of B place technical debt on A across the interfaces of A that they use. And this actually applies to public interfaces too – even if they are more sticky, we can expect the components of an ecosystem to update to newer APIs that are cheaper to serve, and laggards hold performance back, keep stale code alive in the codebase for longer and so on.

This places a secondary requirement on the telemetry: we need to be able to tell whether the fleet is upgraded or not.

So what does a working model look like?

I think we need a different diagram than the pipeline; the pipeline talks about the things most folk doing an API or some such project will have directly in hand, but it's not actually the full story. The full story is rounded out with two additional features: feature flags and telemetry. And since we want to protect our users, and distributors will probably simply refuse to provide insights into actual users, let's assume a near-zero-trust model around both.

Feature flags

As I discussed in my previous blog post, feature flags can be used for fairly arbitrary purposes, but in this situation, where trust is limited, I think we need to identify the crucial C/D enabling use cases, and design for them.

I think that those can be reduced to soft launches – decoupling activating new code paths from getting them shipped out onto machines – and kill switches – killing off flawed/faulty code paths when they start failing, in advance of a massive cascade failure. Both can be implemented with essentially the same thing: some identifier for a code path and then a percentage of the deployed base to enable it on. If we define this API with efficient streaming updates and a consistent service discovery mechanism, then it could be replicated by vendors and other distributors or even each user, pulling the feature API data downstream in near real time.
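To make that concrete, here is a minimal Rust sketch of such a rule – the struct, field names and hashing choice are all my own illustration, not a proposed standard:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A hypothetical rollout rule: an identifier for a code path plus the
/// percentage of the deployed base it should be enabled for.
struct RolloutRule {
    code_path: String,
    percent_enabled: u8, // 0 acts as a kill switch, 100 is fully launched
}

impl RolloutRule {
    /// Bucket an installation deterministically into [0, 100) so the same
    /// install always gets the same answer for a given percentage.
    fn enabled_for(&self, install_id: &str) -> bool {
        let mut hasher = DefaultHasher::new();
        (install_id, &self.code_path).hash(&mut hasher);
        (hasher.finish() % 100) < u64::from(self.percent_enabled)
    }
}

fn main() {
    let rule = RolloutRule {
        code_path: "new-scheduler".to_string(),
        percent_enabled: 25, // soft launch to roughly 25% of installs
    };
    println!("enabled: {}", rule.enabled_for("install-1234"));
}

Deterministic bucketing matters here: a given install should keep the same answer as the percentage ramps up, rather than flapping on every evaluation.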

Telemetry

The difficulty with telemetry APIs is that they can egress anything. OTOH this is open source code, so malicious telemetry would be visible. But we can structure it to make it harder to violate privacy.

What does the C/D cycle need from telemetry, and what privacy do we need to preserve?

This very much needs discussion with stakeholders, but at a first approximation: the C/D cycle depends on knowing what versions are out there and whether they are working. It depends on knowing what feature flags have actually been activated in the running versions. It doesn't depend on absolute numbers of either feature flags or versions.

Using Google Play again as an example, there is prior art – https://support.google.com/firebase/answer/6317485 – but I want to think truly minimally, because the goal I have is to enable C/D in situations with vastly different trust levels than Google Play has. However, perhaps this isn't enough: perhaps we do need generic events and the ability to get deeper telemetry to enable confidence.

That said, let us sketch what an API document for that might look like:

project:
version:
health:
flags:
- name:
  value:

If that was reported by every deployed instance of a project, once per hour, maybe with a dependencies version list added to deal with variation in builds, it would trivially reveal the cardinality of reporters. Many reporters won’t care (for instance QA testbeds). Many will.

If we aggregate through a cardinality hiding proxy, then that vector is addressed – something like this:

- project:
  version:
  weight:
  health:
  flags:
  - name:
    value:
- project: ...

Because this data is really only best effort, such a proxy could be backed by memcache or even just an in-memory store, depending on what degree of ‘cloud-nativeness’ we want to offer. It would receive accurate data, then deduplicate to get relative weights, round those to (say) 5% as a minimum to avoid disclosing too much about long tail situations (and yes, the sum of 100 1% reports would exceed 100 :)), and then push that up.
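As a sketch of the proxy's core logic – purely illustrative, with types and the 5% floor handling of my own devising – the deduplication and rounding might look like:

use std::collections::HashMap;

/// One accurate report as received from a deployed instance.
/// (Flags are omitted here for brevity.)
#[derive(Hash, PartialEq, Eq, Clone)]
struct Report {
    project: String,
    version: String,
    health: String,
}

/// Aggregate raw reports into relative weights, rounding up to a 5% floor
/// so long-tail configurations are not individually identifiable.
fn aggregate(reports: &[Report]) -> Vec<(Report, u32)> {
    let total = reports.len().max(1) as f64;
    let mut counts: HashMap<Report, usize> = HashMap::new();
    for r in reports {
        *counts.entry(r.clone()).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(report, n)| {
            let pct = (n as f64 / total * 100.0).round() as u32;
            (report, pct.max(5)) // 5% floor; weights may sum to more than 100
        })
        .collect()
}

fn main() {
    let reports = vec![
        Report { project: "horizon".into(), version: "1.2.0".into(), health: "ok".into() },
        Report { project: "horizon".into(), version: "1.1.0".into(), health: "ok".into() },
        Report { project: "horizon".into(), version: "1.2.0".into(), health: "ok".into() },
    ];
    for (r, weight) in aggregate(&reports) {
        println!("{} {} {}: ~{}%", r.project, r.version, r.health, weight);
    }
}

Because only the rounded weights leave the proxy, the upstream project learns which versions and flags are live and healthy without learning how many installations any one reporter has.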

Open Questions

  • Should library projects report, or are they only used in the context of an application/service?
    • How can we help library projects answer questions like ‘has every user stopped using feature Y so that we can finally remove it’ ?
  • Would this be enough to get rid of the fixation on using stable branches everyone seems to have?
    • If not why not?
  • What have I forgotten?

Feature flags

Feature toggles, feature flags – they’ve been written about a lot already (use a search engine :)), yet I feel like writing a post about them. Why? I’ve been personally involved in two from-scratch implementations, and it may be interesting for folk to read about that.

I say that lots has been written; http://featureflags.io/ (which appears to be a bit of an astroturf site for LaunchDarkly 😉 ) nevertheless has gathered a bunch of links to literature as well as a number of SDKs and the like; there are *other* FFaaS offerings than LaunchDarkly; I have no idea which I would use for my next project at this point – but hopefully you’ll have some tools to reason about that at the end of this piece.

I’m going to entirely skip over the motivation (go read those other pieces), other than to say that the evidence is in, trunk based development is better.



Humble, J. and Kim, G., 2018. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution.

A feature flag is a very simple thing: it is a value controlled outside of your development cycle that in turn controls the behaviour of your code. There are dozens of ways to implement that. Hash-defines and compile-time flags have been used for a very long time, so long that we don't even think of them as feature flags, but they are. So are configuration options in configuration files in the broadest possible sense. The difference is largely in focus, and where the same system meets all parties' needs, I think it's entirely fine to use just the one system – that is what we did for Launchpad, and it worked quite well I think – as far as I know it hasn't been changed. Specifically in Launchpad the Zope runtime config is regular ZCML files on disk, and feature flags are complementary to that (but see the profiling example below).

Configuration tends to be thought of as “choosing behaviour after the system is compiled and before the process is started” – e.g. creating files on disk. But this is not always the case – some enterprise systems are notoriously flexible with database managed configuration rulesets which no-one can figure out – and we don’t want to create that situation.

Let's generalise things a little – a flag could be configured over the lifetime of the binary (compile flag), execution (runtime flag/config file or one-time evaluation of some dynamic system), time (dynamically reconfigured from changed config files or some dynamic system), or configured based on user/team/organisation, URL path (of a web request naturally :P), and generally any other thing that could be utilised in making a decision about whether to conditionally perform some code. It can also be useful to be able to randomly bucket some fraction of checks (e.g. 1/3 of all requests will go down this code path)… but do it consistently for the same browser.

Depending on what sort of system you are building, some of those sorts of scopes will be more or less important to you – for instance, if you are shipping on-premise software, you may well want to be turning unreleased features entirely off in the binary. If you are shipping a web API, doing soft launches with population rollouts and feature kill switches may be your priority.

Similarly, if you have an existing microservices architecture, having a feature flags aaS API is probably much more important (so that your different microservices can collaborate on in-progress features!) than if you have a monolithic DB where you put all your data today.

Ultimately you will end up with something that looks roughly like a key-value store: get_flag_value(flagname, context) -> value. Somewhere separate to your code base you will have a configuration store where you put rules that define how that key-value interface comes to a given value.
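A minimal sketch of that interface in Rust – the rule model here is simply the smallest thing that demonstrates the shape, not a recommendation:

use std::collections::HashMap;

/// A hypothetical rule: if the context has `key` equal to `matches`, the
/// flag takes `value`. A rule with no key acts as the default.
struct Rule {
    flag: String,
    key: Option<String>,
    matches: Option<String>,
    value: String,
}

/// The key-value interface described above: first matching rule wins.
fn get_flag_value(rules: &[Rule], flag: &str, context: &HashMap<String, String>) -> Option<String> {
    rules
        .iter()
        .find(|r| {
            r.flag == flag
                && match (&r.key, &r.matches) {
                    (Some(k), Some(m)) => context.get(k) == Some(m),
                    _ => true, // unconditional default rule
                }
        })
        .map(|r| r.value.clone())
}

fn main() {
    let rules = vec![
        Rule { flag: "new-ui".into(), key: Some("team".into()), matches: Some("qa".into()), value: "on".into() },
        Rule { flag: "new-ui".into(), key: None, matches: None, value: "off".into() },
    ];
    let mut ctx = HashMap::new();
    ctx.insert("team".to_string(), "qa".to_string());
    println!("{:?}", get_flag_value(&rules, "new-ui", &ctx)); // Some("on")
}

The important part is the separation: the code base only ever calls the lookup, while the rules live in the separate configuration store and can change without a deploy.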

There are a few key properties that I consider beneficial in a feature flag system:

  • Graceful degradation
  • Permissionless / (or alternatively namespaced)
  • Loosely typed
  • Observable
  • Centralised
  • Dynamic

Graceful Degradation

Feature flags will be consulted from all over the place – browser code, templates, DB mapper, data exporters, test harnesses etc. If the flag system itself is degraded, you need the system's behaviour to remain graceful, rather than stopping catastrophically. This often requires multiple different considerations; for instance, having sensible defaults for your flags (choose a default that is ok, and change the meaning of defaults as what is 'ok' changes over time), having caching layers to deal with internet flakiness or API blips back to your flag store, making sure you have memory limits on local caches to prevent leaks and so forth. Different sorts of flag implementations have different failure modes: an API based flag system will be quite different to one stored in the same DB the rest of your code is using, which will be different to a process-startup command line option flag system.
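One way to express the 'sensible defaults plus bounded cache' idea – again a sketch with made-up names, assuming the backend call can fail at any time:

use std::collections::HashMap;

/// Hypothetical degradation-friendly wrapper around a flaky flag backend.
struct ResilientFlags {
    cache: HashMap<String, String>,
    max_cached: usize,
}

impl ResilientFlags {
    /// `fetch` stands in for the real backend call, which may fail.
    fn get(
        &mut self,
        flag: &str,
        default: &str,
        fetch: impl Fn(&str) -> Result<String, String>,
    ) -> String {
        match fetch(flag) {
            Ok(value) => {
                // Bound the cache so a degraded backend can't leak memory.
                if self.cache.len() < self.max_cached || self.cache.contains_key(flag) {
                    self.cache.insert(flag.to_string(), value.clone());
                }
                value
            }
            // Backend down: prefer the last known value, else the default.
            Err(_) => self.cache.get(flag).cloned().unwrap_or_else(|| default.to_string()),
        }
    }
}

fn main() {
    let mut flags = ResilientFlags { cache: HashMap::new(), max_cached: 1000 };
    let value = flags.get("new-ui", "off", |_| Err("backend unreachable".to_string()));
    println!("new-ui = {}", value); // falls back to "off"
}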

A second dimension where things can go wrong is dealing with missing or unexpected flags. Remember that your system changes over time: a new flag in the code base won’t exist in the database until after the rollout, and when a flag is deleted from the codebase, it may still be in your database. Worse, if you have multiple instances running of a service, you may have different code all examining the same flag at the same time, so operations like ‘we are changing the meaning of a flag’ won’t take place atomically.

Permissions

Flags have dual audiences; one part is pure dev: make it possible to keep integration costs and risks low by merging fully integrated code on a continual basis without activating not-yet-ready (or released!) codepaths. The second part is pure operations: use flags to control access to dark launches, demo new features, killswitch parts of the site during attack mitigation, target debug features to staff and so forth.

Your developers need some way to add and remove the flags needed in their inner loop of development; some flags will have lifetimes of just a few days.

Whoever is doing operations on prod though, may need some stronger guarantees – particularly they may need some controls over who can enable what flags. e.g. if you have a high control environment then team A shouldn’t be able to influence team B’s flags. One way is to namespace the flags and only permit configuration for the namespace a developer’s team(s) owns. Another way is to only have trusted individuals be able to set flags – but this obviously adds friction to processes.

Loose Typing

Some systems model the type of each flag: is it boolean, numeric, string etc. I think this is a poor idea mainly because it tends to interact poorly with the ephemeral nature of each deployment of a code base. If build X defines flag Y as boolean, and build X+1 defines it as string, the configuration store has to interact with both at the same time during rollouts, and do so gracefully. One way is to treat everything as a string and cast it to the desired type just in time, with failures being treated as default.
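For instance, a sketch of the 'string until the last possible moment' approach (my own illustration, using only the standard library):

/// Parse a string-typed flag just in time; anything unparseable is
/// treated as if the flag were unset and the default applies.
fn flag_as<T: std::str::FromStr>(raw: Option<&str>, default: T) -> T {
    raw.and_then(|s| s.parse().ok()).unwrap_or(default)
}

fn main() {
    // Build X reads the flag as a bool, build X+1 as a number; both degrade safely.
    println!("{}", flag_as::<bool>(Some("true"), false)); // true
    println!("{}", flag_as::<u32>(Some("true"), 10)); // parse fails, falls back to 10
    println!("{}", flag_as::<u32>(None, 10)); // 10
}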

Observable

Make sure that when a user reports crazy weird behaviour, you can figure out what value they had for which flags. For instance, in Launchpad we put them in the HTML.

Centralised

Having all your flags in one system lets you write generic tooling – such as ‘what flags are enabled in QA but not production’, or ‘what flags are set but have not been queried in the last month’. It is well worth the effort to build a single centralised system (or consume one such thing) and then use it everywhere. Writing adapters to different runtimes is relatively low overhead compared to rummaging through N different config systems because you can’t remember which one is running which platform.

Scope things in the system with a top level tenant / project style construct (any FFaaS will have this I’m sure :)).

Dynamic

There may be some parts of the system that cannot apply some flags rapidly, but generally speaking the less poking around that needs to be done to make something take effect the better. So build or buy a dynamic system, and if you want a ‘only on process restart’ model for some bits of it, just consult the dynamic system at the relevant time (e.g. during k8s Deployment object creation, or process startup, or …). But then everywhere else, you can react just-in-time; and even make the system itself be script driven.

The Launchpad feature flag system

I was the architect for Launchpad when the flag system was added. Martin Pool wanted to help accelerate feature development on Launchpad, and we’d all become aware of the feature flag style things hip groups like YouTube were doing; so he wrote a LEP: https://dev.launchpad.net/LEP/FeatureFlags , pushed that through our process and then turned it into code and docs (and once the first bits landed folk started using and contributing to it). Here’s a patch I wrote using the system to allow me to turn on Python profiling remotely. Here’s one added by William Grant to allow working around a crash in a packaging tool.

Launchpad has a monolithic data store, with bulk data federated out to various disk stores, but all relational data in one schema; we didn’t see much benefit in pushing for a dedicated API per se at that time – it can always be added later, as the design was deliberately minimal. The flags implementation is all in-process as a result, though there may be a JS thunk at this point – I haven’t gone looking. Permissions are done through trusted staff members, it is loosely typed and has an audit log for tracking changes.

The other one

The other one I was involved in was at VMware a couple of years ago now; it's in-house, but there are some interesting anecdotes I can share. The thinking on feature flags when I started the discussion was that they were strictly configuration file settings – I was still finding my feet with the in-house Xenon framework at the time (I think this was week 3? 🙂), so I whipped up an API specification and a colleague (Tyler Curtis) turned that into a draft engine; it wasn't the most beautiful thing, but it was still going strong and being enhanced by the team when I left earlier this year. The initial implementation had a REST API and a very basic set of scopes. That lasted about 18 months before tenant based scopes were needed and added. I had designed it with the intent of adding multi-arm bandit selection down the track, but we didn't make the time to develop that capability, which is a bit sad.

Comparing that API with LaunchDarkly I see that they do support A/B trials but don’t have multivariate tests live yet, which suggests that they are still very limited in that space. I think there is room for some very simple home grown work in this area to pay off nicely for Symphony (the project codename the flags system was written for).

Should you run your own?

It would be very unusual to have PII or customer data in the flag configuration store, and you shouldn't have access control lists in there either (in LP we did allow turning on code by group, which is somewhat similar). The point is that the very worst thing that can happen if someone else controls your feature flags and is malicious is actually not very bad. So as far as aaS vendor trust goes, not a lot of trust is needed to be pretty comfortable using one.

But, if you’re in a particularly high-trust environment, or you have no internet access, running your own may be super important, and then yeah, do it :). They aren’t big complex systems, even with multi-arm bandit logic added in (the difficulty there is the logic, not the processing).

Or if you think the prices being charged by the incumbents are ridiculous. Actually, perhaps hit me up and we’ll make a startup and do this right…

Should you build your own?

A trivial flag system plus persistence could be as little as a few days' work. Less if you grab an existing bolt-on for your framework. If you have multiple services, or teams, or languages… expect that to become the gift that keeps on giving as you have to consolidate and converge across your organisation indefinitely. If you have the resources – great, not a problem.


I think most people will be better off taking one of the existing open source flag systems – perhaps https://unleash.github.io/ – and using it; even if it is more complex than a system tightly fitted to your needs, the benefit of having one that is a true API from the start will pay for itself the very first time you split a project, or want to report what features are on in dev and off in prod, not to mention multiple existing language bindings etc.