OpenStack and ease of development

In my last post, about cultural norms in OpenStack, I said that ease of development was a self-inflicted issue. This was somewhat contentious 🙂 and I’ve had some interest expressed in a deeper dive. In that post I articulated three cultural problems and two technical ones.

What does success for developers look like?

I think that, independent of the scope of OpenStack, the experience for developers should have roughly the same features:

  1. global reasoning about changes should rarely be needed (put another way, the architecture should make it possible to think about a change without trying to consider all of OpenStack and still get high quality results; this helps new developers make good decisions)
  2. the component being worked on should build quickly (keep local development cycles brisk)
  3. have comprehensive local unit tests (keep local development effective; low rate of defects escaping to functional/integration tests)
  4. be able to utilise project resources to perform adhoc exploration, integration, functional and scale tests (this allows developers to have sensibly sized development machines, while still ensuring what they build works in a system representative of our users).
  5. the lead time from a change being finished locally to the developer no longer needing to shepherd it through the system should be low (I won’t scare people by saying what I think it should be 🙂 ; this keeps cognitive load on developers from becoming a burden)
  6. failures after review should be a) localised, b) rare enough that the overhead of corrective action is tolerable and c) recovered from within a small number of hours at most (this keeps the project as a whole healthy, which means individual developers will rarely be impacted by failures from other developers’ changes)

We already do ok on a number of these things: the above is not a gap analysis.

Sidebar – Accelerate

About now I feel I have to mention Accelerate, a book that is the result of detailed research into software delivery performance – and its follow-up report, the DORA 2018 state of devops report. The Puppet state-of-devops report is useful as well, though it focuses on different aspects – ones that are less generalisable to open source development in my view. Interestingly, it seems to have reached entirely different conclusions around team choice :).

The particularly interesting thing for me is that this is academic grade research, showing causation and tying that back to specific practices: this gives us a solid basis for planning changes, rather than speculation that something will work.

These reports and research are looking into software delivery – which for OpenStack spans organisations: we build, then users deploy. So it’s not entirely clear that the findings generalise, nor is it clear how one might implement all the predictive practices because of that.

For instance, while Continuous Integration is something we can imagine doing in OpenStack (sorry folks, preflight testing and CI are really very very different things), Continuous Deployment would be a much more ambitious undertaking.

Imagine it though: commit through to deployed on users’ clouds in a matter of hours. Wouldn’t that be something. Chrome and Firefox are two open source projects that have been evolving in this direction for some time, and we could well study them to learn what they have found to work and not work.

All that said, the construct – the metrics – that predict software delivery performance are:

  1. Release frequency
  2. Mean time to recovery
  3. Lead time (commit to value consumable)

There’s a separate construct (the Westrum organisational culture construct) for culture, and they also measured the effect on e.g. implementing Continuous Delivery on those metrics.

I highly recommend reading the book – perhaps start with the 2018 report for a taste, but the book has much more detail.

Where are the gaps?

I haven’t looked particularly closely at the coupling in OpenStack recently, so for 1) I think folk actually landing changes should assess this. My sense is that we’re ok on this, but not great. In particular, any time there is a big cross-project effort – lots of involved commits, lots of sequencing – that’s something that needed global reasoning.

For 2), most of our stuff is in Python today, so build times aren’t a big issue.

For 3), we’re in pretty decent shape unit-test-wise, though the tests tend to be very slow (minutes or more to run), and I worry about skew between mocks and actual servers.

For 4) we do allow utilisation of project resources via gerrit pre-review and pre-merge tests, but there’s no provision for adhoc utilisation (that I know of), and as I described in my last post, I think we could get a lot more leverage out of the cloud resources if we had the ability to wire components under test into an existing, scaled, cloud.

For 5) I’d need to do some more detailed visualisation, or add a feature to stackalytics, but the sense from folk I speak to is that lead times are still enormous. I suspect there are two, or even three, distributions hiding in there (e.g. one for regular devs, and one for infrequent/new contributors) – but we can gather data on this. One important aspect is whether we should measure from ‘code committed (in a dev branch) to merged to master’, or ‘code committed to delivered’. It’s my view that measuring to delivery is critical if we truly want to be driving benefits to our users. There is a corner case where those two things converge – trunk based development – but that is particularly challenging for open source projects.

For instance, http://stackalytics.com/report/reviews/nova/open shows under ‘Change requests waiting for reviewers since the last vote or mark’ an average age of 144 days, with a max age of 709 days: that’s 2 years, 4 releases. That’s measuring time to git; if we measure time to delivered, then we need to add the time that changes sit in git before being included in a release – up to 6 months, though the adhoc releases many projects are doing now are a great help. The stats shown aren’t particularly useful though – a) reviews that have already merged are not included and b) there’s not enough information to start reasoning about why they have the age they do.
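As a starting point for that data gathering, here’s a rough sketch (not production code) of pulling commit-to-merge lead times straight from Gerrit’s REST API; the host name and the timestamp handling are assumptions about the OpenStack Gerrit of the time:

import json
from datetime import datetime

import requests

GERRIT = "https://review.openstack.org"

def merged_changes(project, limit=300):
    # Gerrit prefixes JSON responses with )]}' to defeat XSSI, hence text[4:].
    resp = requests.get("%s/changes/" % GERRIT,
                        params={"q": "project:%s status:merged" % project,
                                "n": limit})
    resp.raise_for_status()
    return json.loads(resp.text[4:])

def lead_time_days(change):
    # 'created' is the first upload, 'submitted' is the merge to master.
    parse = lambda ts: datetime.strptime(ts[:26], "%Y-%m-%d %H:%M:%S.%f")
    delta = parse(change["submitted"]) - parse(change["created"])
    return delta.total_seconds() / 86400.0

times = sorted(lead_time_days(c) for c in merged_changes("openstack/nova"))
print("median commit-to-merge lead time: %.1f days" % times[len(times) // 2])

Splitting that by author experience would show whether the multiple distributions I suspect are really hiding in there.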

For 6), recovery at the moment is burdened by the slow merging process – the minimum time to recovery is the sum of the unavoidable steps in the merge / delivery process. Failure frequency (things breaking after the merge completes / is released) is fairly low, but we’re not particularly good at blast radius management – the all-or-nothing nature of change rollout today means there is no mitigation when things go wrong.

So I think there are significant gaps, with room to improve on three things:

  1. More efficient test/adhoc project resource utilisation
  2. Lead times
  3. Blast radius

Smarter testing

I covered this in my previous post in moderate detail, but it’s worth drilling in further at this point. I don’t think there is a silver bullet here; the machinery needed to test a new database engine version with an existing cloud is very different in detail to that required to test a new nova-compute build. Let’s consider just being able to test a new nova-compute with an existing cloud. Essentially we want to wire in a new shard of nova-compute. Fortunately nova-compute is intrinsically sharded: that’s its very model of operation.


Though it’s not strictly relevant here, consider that other components (like the DB) have no sharding mechanism in place today, so wiring in a new shard for one of those would be “tricky”.

The details may have changed since I last dug deep, but from memory nova-compute needs access to the message bus to communicate with the rest of nova, access to glance and to the swift (or other) store that images are in, and obviously appropriate local resources to run whatever compute workload it is going to serve out.

So wiring that in from a test node to an existing cloud seems pretty simple. We probably don’t want the services listening unsecured on the internet, so we’ll need a credential distribution system (e.g. vault), and automation to look those credentials up and wire up the nova-compute instance with them.
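As a sketch of the shape of that automation – assuming Vault’s KV v2 HTTP API, a hypothetical ‘nova-test-shard’ secret holding the transport URL and glance endpoint, and illustrative rather than checked nova.conf option names – it could be as small as:

import configparser

import requests

VAULT_ADDR = "https://vault.example.com:8200"  # hypothetical Vault endpoint
VAULT_TOKEN = "s.xxxx"                         # injected by the test harness

def read_secret(path):
    # KV v2 read: GET /v1/secret/data/<path>, payload nested under data.data.
    resp = requests.get("%s/v1/secret/data/%s" % (VAULT_ADDR, path),
                        headers={"X-Vault-Token": VAULT_TOKEN})
    resp.raise_for_status()
    return resp.json()["data"]["data"]

creds = read_secret("nova-test-shard")
conf = configparser.ConfigParser()
conf["DEFAULT"] = {"transport_url": creds["transport_url"],
                   "host": "test-shard-%s" % creds["shard_id"]}
conf["glance"] = {"api_servers": creds["glance_api"]}
with open("nova-test.conf", "w") as f:
    conf.write(f)
# then: nova-compute --config-file nova-test.conf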

There may be trust issues: are all components equally privileged in the system? This also shows up as a bug risk – how much damage could a broken but not malicious nova-compute do?

Harder cases – DDL

One common harder case is DDL – schema changes at the DB layer. I don’t have a good canned answer here, but roughly speaking in the context of tests we need to be able to:

  1. Try applying the DDL across the whole DB
  2. Run the code that works with the DB with the modified schema
  3. Be able to do that for many different patches

Right now we have machinery to do 1) against a static copy of various clouds’ DBs. 2) and 3) are almost at cross purposes: it may be necessary to serialise those tests – they are fewer than other code changes. One possible implementation would be to use an expand-contract SQL server migration strategy: expand to a new server, run the DDL, verify the cloud metrics don’t regress, then migrate back using the source server’s schema (ignoring missing columns, because if they’ve been dropped in the new schema then code is already not querying them).

Another possibility, given that these changes are rarer, is not to optimise the testing of them.

Harder cases – exotic components

Power machines, ESXi hypervisors, and other not-generally-available hypervisors would all be good to expose to developers – make it possible for them to verify changes to the code that interacts with them – in real time. Ideally with more access than the current hands-off gerrit-test-job-only approach.

Lead times

Today, I’m going to treat ‘in a release’ as delivered. I’m picking this definition because:

  • We can choose to make more releases
  • We don’t need to build consensus or whole new delivery stacks to try and get customers upgraded
  • We can always come back and define ‘delivered’ with more scope later

Lean methodology provides a number of tools for analysing lead times – it has been used successfully in many organisations; sufficiently robust and consistent in its results that Accelerate even cites adopting lean management practices as being predictive for performance. And then there is the whole question of what ‘delivered’ means.

And yes, we are not a company, we are many volunteers, but that merely adds corner cases – most of our volunteers are given tasks to work on within OpenStack, and have the time to work with an effective SDLC and change management process.

As I mentioned above, without some more detailed modelling it’s hard to say for sure what leads to the high lead times; but there are some things we can identify easily enough…

  1. We don’t treat each commit as a release. We do say that trunk should never be broken, but we’re not sure enough of our execution to actually tag each commit as a release and publish for consumption.
    1. Consider what we would need to solve to do this.
  2. We aren’t practicing CI. In particular:
    1. Merges (required to repair things that snuck in) often take much more than 10 minutes
    2. We’re not integrating the work-in-progress from developers early enough to avoid reintegration costs.
  3. We’re not practicing trunk based development: every outstanding patch chain is a branch, just in a different representation, and our branch lifetime clearly exceeds a day… and we have a large stabilisation period during the development cycle.
  4. Reviews – needs a deeper analysis to say if this is or isn’t a driver. I suspect it is, because nothing I hear or see shows this to have changed in any fundamental way.
  5. We don’t work in small batches: 6 month cycles is huge batches.
  6. We’re pretty poor at enabling team experimentation. I think this is due to layering: for example, we have N different API servers, so if one team wants to experiment, they create customer confusion due to yet-another-API idiom. If we had just one API server, changes to that would be happening from just one team, gaining much better integration and discussion characteristics. (For an example of having just one API server in a distributed system, consider k8s, which has just one primary API server – the kubelet API is not really customer facing.)
  7. We don’t manage work in progress well: this may not seem important, but it’s a Lean foundational practice. Think of it as a combination of not exceeding your bandwidth, and minimising context switches.

So what should we do to drive lead times down?

I propose setting a vision: 95% of patches that are either maintenance or part of an agreed current feature merge (or are completely rejected) the same day that they are uploaded to gerrit. (Patches for some completely random thing may obviously require considerably more effort to reason about.)

Then work back from that: what do we need to have in place to do that safely?
Yes it’s hard. That’s more of a reason to do it.

Delivering that will require better safety ropes: clearer contracts for components, better linting (maybe mypy), more willingness to roll forward, and consistent review latency (which is more about scheduling than about how many reviews any one person does).
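To make the ‘better linting’ point concrete, here’s the sort of contract mypy can enforce mechanically (a toy example, not real nova code):

from typing import Optional

def find_flavor(name: str) -> Optional[dict]:
    """Return the flavor record, or None if it doesn't exist."""
    ...

flavor = find_flavor("m1.tiny")
print(flavor["vcpus"])  # mypy flags this: flavor may be None, so check it first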

The benefits could be immense though: if OpenStack is a juggernaut today, consider what it could be if we could respond nimbly to new user demands.

Blast radius containment

So this is about things like making releases and deployments much more robust to mistakes. For instance, imagine if every server could run in a shadow mode – where it receives traffic, operates on it, but marks any external operations it performs as not-real. Then if it blows up we can detect that without destabilising a running version. (And the long running supported test cloud would give a perfect place to do this.) So rollouts, rather than being atomic, become a series of small steps. The simplest form is just taking a stateless scale-out service and running 2 builds in parallel. That’s better than a binary old/new. Canary builds and rolling upgrades are similar.
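A minimal sketch of the shadow mode idea – a made-up decorator, not anything that exists in OpenStack today – might look like:

import functools

SHADOW_MODE = False  # in reality this would come from the service's config

def external_operation(func):
    """Mark an operation that touches the real world; shadow instances skip it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if SHADOW_MODE:
            # Record what would have happened, but don't actually do it.
            print("shadow: would call %s(%r, %r)" % (func.__name__, args, kwargs))
            return None
        return func(*args, **kwargs)
    return wrapper

@external_operation
def attach_volume(server_id, volume_id):
    ...  # the real storage backend call would live here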

Now, since we defined ‘delivered’ as in a release, not ‘in use’, maybe we should ignore that operational blast radius and instead limit ourselves to the development side.

Even here there is a lot more sophistication we can add: consider that for libraries our ‘fleet’ is basically every developer. Pinning all those dependencies like we do is a good step. What if we could actually deliver updates to 1% of our devs, then 10%, then all?

So we could have a pipeline:

  1. Unit test a consumer, raise its version for 1% of consumers.
  2. Watch for failures, raise the % until 100%

This would require a metrics channel (opt-in!), and some way of signalling the versions to choose from to development environments.
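The version-selection side of that is simple enough to sketch: hash each development environment into a stable bucket, and only environments below the current rollout percentage get the candidate version (the names here are made up):

import hashlib

def chosen_version(machine_id, package, stable, candidate, rollout_percent):
    """Deterministically place a dev environment in or out of the rollout."""
    digest = hashlib.sha256(("%s:%s" % (machine_id, package)).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return candidate if bucket < rollout_percent else stable

# e.g. offer a new oslo.config to 10% of development environments
print(chosen_version("dev-laptop-1234", "oslo.config", "6.4.0", "6.5.0", 10))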

We could use multiple branches as another mechanism: if everyone works off of trunk, we optimise trunk merges to be no more than (say) 20 minutes, and code self-promotes to a tested branch, then a release branch, over a couple of hours. Failures would generate a proposed rollback straight into gerrit.

Wrapup

There’s a high cost of change in OpenStack – I don’t mean individual code changes, I mean changing e.g. policies, languages, architecture – lots of code, and thousands of affected people. A result of a high cost of change is a high risk of change: if a change makes things worse, it can take as long to back it out as it took to bring it in.

I’ll freely admit that I’m partly off in architecture-astronaut land here: there’s a huge gap of detail between what I’m describing and what would be needed to make it happen.

I have confidence in the community though: if we can just pull some vision together about what we want, we have the people and knowledge to execute on it.


Is OpenStack’s mission broken?

tl;dr:

  1. Betteridge’s law applies.
  2. Ease of development is self inflicted and not mission creep.
  3. Ease of use is self inflicted and not mission creep.
  4. Ease of operations is self inflicted and not mission creep.
  5. I have concrete suggestions for 2/3/4 but to avoid writing a whole book I’m just going to tackle (2) today.

Warning: this is a little ranty. It’s not aimed at any individual; it just crystallised out after a couple of years focused on different things, and was seeded by Jay Pipes when he recently put a strawman up about two related discussions that we haven’t really had as a community:

  1. What should the scope of OpenStack’s mission be?
  2. A technical proposal for ‘mulligan’, a narrowly defined new mission.

And yes, I know that OpenStack has incredible velocity. Just imagine what it could be if the issues I describe didn’t exist.

So is it the mission?

I think OpenStack has lots of “issues”, to use the technical term, across, well, everything. I don’t think the mission is even slightly related to the problems though.

The mission has ultimately just brought a huge number of folk together with the idea that they might produce a thing that can act like a cloud.

This has been done before: organisations like AWS, Microsoft, Google and smaller players like Digital Ocean and Rackspace (before OpenStack).

I reject the idea that having such a big, hairy, inclusive mission is a problem.

We can be more rigorous about that though: if a smaller mission would structurally prevent a given issue, then it’s the mission that is the problem. Otherwise, it’s not.

I do think the mission is somewhat ridiculous, but there’s a phrase in some companies: a company’s mission defines what it doesn’t do, not what it does.

And I think the current OpenStack mission does that quite well: there are two basic filters that can be applied, and unless at least one matches, it’s out of scope for OpenStack.

  • Can you get $thing from a Public Cloud?
  • Do you uniquely need $thing to run a Cloud?

And yes, there are a billion things in the grey cloud around the edge.

Know what else has this problem? Linux. Well over 3/5ths of its code is in that grey edge: 170M of core, 130M of architectures, 530M in drivers. x86 + ARM is 50M of that 130M of architectures.

Linux’s response has been dramatically different to ours though. They have a single conceptual project being built, with enormous configurability in how it’s deployed. We’ve decided that we’re building a billion different things under the same umbrella, and that comes down to a cultural norm.

Cultural norms and silos

Concretely, Swift and Nova – the two original projects – have never conceptually regarded themselves as one project.

Should they?

I honestly don’t know :). But by not being one project (with enormous configurability in how it’s deployed), we set a cultural expectation in OpenStack that variation in workload implied a new project and new codebase.

Every split out takes years to accomplish – both the literal ones like Glance, and the moral ones like Neutron.

The lines for the split-outs are drawn inconsistently.

To illustrate this, ask yourself: what manages a node in an OpenStack cloud? What’s the component that is responsible for working with the machine’s actual resources, reporting usages, reporting back to service discovery, healthchecks, liveness etc?

In a clean slate architecture you might design a single agent, and then make it extensible/modular. OpenStack has many separate agents, one per siloed team.

Similarly the scheduling problem for net/disk/compute: there is an enormous vertical stack of cloud-APIs that can be built on a solid base, many of which OpenStack has in its portfolio. But that stack is not being built on a common scheduler – and can’t be because the cultural norm is to split things out, not to actually figure out how to maintain things more effectively without moving the code around.

Some things really are better off as separate projects – and I’m not talking monorepo vs repo-per-project, that’s really only about the ability to do some changes atomically. A reusable library like oslo.config is only reusable by being a separate project. oslo.db, though, exists solely because we have many separate projects that all look like ‘REST on one side, database on the other’. That is a concrete problem: high deployment overheads, redundant information in some places, inappropriate transaction boundaries in others. The objects work – passing structured objects around and centralising the DB access – makes things a lot better, but it’s broken into vertical silos much too early.

Our domain specific services include huge amounts of generic, common problem space code: persistence, placement, access control…

Cultural norms and agility

Back in the dawn of OpenStack, there were some very very strong personalities. Codebases got totally overhauled and replaced without code review. Distrust got baked in as another cultural norm. Code review became a control point. It’s extraordinarily common to spend weeks or months getting patches through.

In some of the most effective teams I’ve worked in code review is optional. Trust and iterate is the norm there: bypassing code review is a thing that needs to be justified, but code review is not how quality is delivered. Quality is delivered by continual improvement, rather than by the quality of any one individual commit.

A related thing is being super risk averse around what lands in master (more on that below). Some very very very clever folk have written very clever code to facilitate this combination of siloed projects + trying super hard not to let regressions into master. This is very hard to deliver – and in fact we stepped back from an absolute approach there, about 4 years ago, to a model where we try very hard to prevent regressions just within a small set of connected projects.

OpenStack has a deeply split personality. Many folk want to build a downloadable cloud construction kit (e.g. Ubuntu). Many more want to build a downloadable cloud product (direct release users). And many wanted (are there still public clouds running master?) to be able to use master directly with confidence. This last use case is a major driver for wanting master to be regression free…

Agility requires the ability to react to new information in a short timeframe. Doing CD (continuous deployment) requires a pipeline that starts with code review and ends with deployed code. OpenStack doesn’t do that. There’s a huge discontinuity between upstream and actual deployments, and effectively none of the developers of any part of OpenStack upstream are doing operations day to day. Those that do – at Rackspace, previously at HP (where I was working when I was full time on OpenStack), and I’m going to presume at OVH and other public clouds – are having to separate out their operations work from their upstream changes.

Every initiative in a project will miss some details that have to be figured out later – that’s the nature of all but the most exacting software development processes, and those processes are hugely expensive (formal methods, just to start with). OpenStack copes with that by running huge planning cycles – 3-6 months apart.

Commits-as-control-points + long planning cycles + many developers not operating what they build => reaction to new information happens at a glacial scale.

To illustrate this, consider request tracing. 8 years ago Google released the Dapper whitepaper, Twitter wrote Zipkin and open sourced it, and we’re now at the point where distributed tracing is de rigueur – it’s one of the standard things a service operator will expect for any system. We spent years dealing with pushback from developers in service teams that didn’t understand the benefits of the proposed analogous system for OpenStack. Rackspace wrote their own and patched it in as part of their productionisation of master. Then we also got to have a debate about whether OpenStack should have one such system, or a plugin interface to allow Rackspace to not change. [Sidebar: Rackers, I love you and ♥ your company, but that drove me up the wall! I wish we’d managed to just join forces and get everyone to at least bring a damn tracing interface in for everything.]

Test reliability

With TripleO we had the idea that we’d run a cloud based on master, provide feedback on what didn’t work, and create a virtuous circle. I think that that was ultimately flawed because the existing silos (e.g. of Nova, or Glance) were not extended into owning those components within TripleO: TripleO was just another deployer, rather than part of the core feedback cycle.

More generally, we had a team of people (TripleO) running other people’s code (all of OpenStack and commit rights were hard to get in other projects) with no SLA around that code.

I didn’t think of this that way at the time, for all that we understood that that was what we were doing, but that structure is actually structurally fragile: it’s the very antithesis of agile. When something broke it could stay broken for weeks, simply because the folk responsible for the break were not accountable for the non-brokenness of the system. (I’m not whinging about the teams we worked with – people did care, but caring and being accountable are fundamentally different things.)

There is another place with that pattern: devstack. Devstack is a code base that exists to deploy all the other openstack components. It’s the purest essence of ‘run other people’s code with no SLA’, and devstack is the engine for pre-merge and pre-review testing in OpenStack.

I now believe that to be a key problem for OpenStack. Monty loves to talk about how many clouds OpenStack deploys daily in testing. Every one of those tests is running some number of components (typically the dependency graph for the service under test) which have not changed and are not written by the author, from scratch. And then of course the actual service being tested.

That’s structurally fragile: it’s running 5 or 10 times as much code as is relevant to the test being conducted. And the people able to fix any problems in those dependencies don’t feel the friction at the same time, in the same way, as their users do. (This isn’t a critique of the people, it’s just maths.)

I’ll probably write more about this in detail later, as it ties into a larger discussion about testing and deployment of microservices, or testing in production. But imagine if we got rid of devstack for review and merge testing. It has several other use cases of course – ‘give me an OpenStack to hack on’ is an important, discrete test case, and folk probably care that that works. For simplicity I’m going to ignore that for now.

So, if we don’t use devstack, how do we deploy a cloud for pre-merge testing?

We don’t. We don’t need to. What we need to do is deploy the changed code into a cloud whose other components are expected to be compatible with that code. Devstack did this by taking a given branch of a bunch of components and bringing them up from scratch. Instead, we run a production grade, monitored and alerted deployment of all the components. Possibly we run many such deployments, for configurations that cannot coexist (e.g. different federation modes in keystone?). The people answering the pages for those alerts could be the service developers, or it could be an operations team with escalation back to the developers as needed (to filter noise like ‘oh, cloud $X has just had an outage’). But ultimately the developers would be directly accountable in some realtime fashion.

Then the test workflow becomes:

  1. Build the code under test. (e.g. clean VM, pip install, whatever)
  2. Deploy that code into the existing cluster as a new shard
  3. Exercise it as desired
  4. Tear it down

Let’s use nova-compute as an example (a rough sketch of automating this follows the list).

  1. pip install
  2. Run nova-compute reporting to an existing API server with some custom label on the hypervisor to allow targeting workloads to it
  3. Deploy a VM targeted at it
  4. tear it down
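Here is that rough sketch of the four steps – assuming the test node already has a nova-test.conf pointing at the existing cloud (as discussed in the other post), and that the shard is reachable via a made-up availability zone called test-shard-az:

import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Install the code under test (ideally into a clean virtualenv).
run("pip", "install", "-e", ".")

# 2. Start nova-compute as a new shard of the existing cloud.
compute = subprocess.Popen(["nova-compute", "--config-file", "nova-test.conf"])

# 3. Exercise it: boot a VM pinned to the test shard.
run("openstack", "server", "create", "--image", "cirros", "--flavor", "m1.tiny",
    "--availability-zone", "test-shard-az", "--wait", "smoke-vm")

# 4. Tear down.
run("openstack", "server", "delete", "--wait", "smoke-vm")
compute.terminate()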

I’m sure this raises lots of omg-we-can’t-do-that-because-technical-reason-X-about-what-we-do-today.

That’s fine, but for the purposes of this discussion, consider the destination – not the path.

If we did this:

  • Individual test runs could use substantially less resources
  • And perform substantially less work
  • Which implies better performance
  • Failures due to other components than the service under test would be a thing of the past (when you’re on the hook for your service running reliably, you engineer it to do that)

I think this post is long enough, so let me recap briefly. If there is interest out there I can drill into what sort of changes would be needed to transition to such a system, the suggestions I have for ease of use and ease of operations, and I think I’m also ready to provide some discussion about what the architecture of OpenStack should be.

Recap: why is development hard

Cultural problem #1: silos rather than collaboration in place. Moving the code rather than working with others.

Cultural problem #2: excessive entry controls. Make each commit right rather than trend upwards with a low-latency high change rate.

Cultural problem #3: developer feedback cycle is measured in weeks (optimistically), or years (realistically).

Technical problem #1: excessive code executed in tests: 80% of test activity is not testing the code under test.

Technical problem #2: our testing is optimised for new-cloud deployments: as our userbase grows upgrades become the common use case and testing should match that.

Money doesn’t matter

Well, obviously it does. But the whole ‘government cannot pay for healthcare’, or land, or education: that’s nonsense.

And any politician that claims that is either ignorant, or has an agenda that involves deliberate repression of the population.

These are strong claims, so let me break it down. Also, I’m not an economist, if I’ve gotten the wrong end of the stick economics-wise, I’ll happily update this or at least add errata to it…

Money isn’t wealth. It’s a thing you can exchange for other things, but it itself is not wealth. Easy example: when countries have had runaway inflation, and the price of e.g. potatoes has been going up 100% a day, it doesn’t matter how much money you have, you will eventually be unable to buy potatoes. But a potato farmer with 10s of thousands of potatoes won’t run out and go hungry.

We use money to scale our society. Without money, we have some problems. Firstly, if I want something you have, but I don’t have anything you want, I have to find someone who wants something I have, and something you want that they don’t want, and then do that trade, then come back to you to trade the thing you wanted for what I wanted. This quickly becomes a bottleneck on actually getting stuff done. Secondly, once someone, say a potato farmer :), has what they want right now, they will be very hard to trade with : if they trade potatoes for things they don’t want, they are gambling that other folk will want them in the future. That requires everyone to become a good gambler on the future value of things.

But just like money isn’t wealth, money also isn’t work. We work to exchange our time for wealth; except money isn’t wealth, so really we’re exchanging our time for this thing we can exchange for the actual things we want. Governments *literally* create money anytime they want, and they destroy it at will too. If there’s too much money floating around, then (at whatever prices folk are used to) everything will be purchasable, and it’s very likely folk selling stuff will run out and raise prices as a result. Then it becomes harder to buy stuff, although everyone that received those raised prices has more money to buy with, so this continues for a while: this is inflation.

Too little money, and things that could be sold won’t sell, because there isn’t enough money at the prices folk are used to, and the folk selling don’t want to “lose money” (which is odd, because money is a promise not a thing, so if you’re in a deflationary situation, selling *right now* may well be better than holding on and selling later :)), so they will be slow to lower prices, will receive less either way, and just like with increased prices, the decrease gets spread amongst the participants – vendors, owners, employees.

But these things don’t happen instantly: there’s slack in the system.

So what does matter? What actually matters is a combination of resources and productivity: those are the things that determine whether we, as a society, can produce enough things for our people to have what they want and need. For instance, building a house needs the following resources: land, building materials, labour, power, as well as ongoing supplies of power, water and sewage processing.

If, given the people currently in our country, and what they are being paid to do today, we have both enough resources, and enough labour-and-productivity, to house, feed, heat, transport and entertain everyone, then the failure to do so is not one of money but one of choice. That builder friend you know who doesn’t have work right now could be building a house for that other friend you’ve got whose family is sleeping in a garage. The builder isn’t working because the family in question can’t afford to pay for the land or the resources, so the builder has nowhere to do the building, nor any stuff to make the building out of.

The core choice is: do we as a society think it’s reasonable that anyone should have to sleep rough, or miss out on school, or any of a thousand examples of poverty, when we’ve got the resources and production capability to fix it? Do we think that? Really? And what are we willing to do to fix it? Right now, a lot of the production capability of our society is owned by 1% of our society. So less than 1% of people are deciding what is made and how it’s made.

Now, there’s a bunch of curly questions like, what about the foreign account deficit? What about the fact that lots of land is already owned by someone? How do we fairly get that family the house they deserve? Won’t some people just ride on the coat-tails of others? Isn’t this going to require taking things other people have already earnt?

These are all fair questions. My answers to those are:

  • If everyone had their needs met we’d have many more people contributing to creative things we can sell to foreign countries, more than enough to address any changes in the foreign account deficit from sorting things out here.
  • Our current system has huge wealth inequality; it doesn’t matter whether that inequality is in the form of money, or ownership of things, either we leave that 1% controlling 99%, or we redistribute things in some equitable ongoing basis. Wealth taxes, CGTs, estate taxes. Lots of options.
  • I’m not sure. I think ultimately it means capping the maximum wealth ratio between our richest and poorest people. e.g. the more wealth you have the more you’re taxed, until eventually – at say 500K/year (gross) wealth growth – your marginal tax rate becomes 90%, and at some higher figure, say 1M/year (gross) wealth growth, your marginal tax rate exceeds 95%. That way wealthy folk get to choose what things they keep: there’s no central planning department or other bureaucracy involved.
  • Folk already ride on the coat tails of other people. But it’s nowhere near as simple as ‘those dole bludgers’. Folk on the pension don’t work. Folk with ‘passive income’ (read: investments whose growth is high enough those folk don’t need to work). School kids. And yes, folk on the dole. For some folk on the dole, the marginal tax rate already exceeds 100% – there are some steps in our tax system that make part time work while receiving the dole very very hard. Home makers are also something we support as a society, though less directly. But let’s assume fully 10% of the country simply don’t want to work. Consider this in productivity terms. We get 10% less things done. Big deal. We’ve enough resources and people to deliver those essentials: food, shelter, power, education, with waaay less than 90% of our workforce. And as automation improves expect that 90% to drop down towards 10%. At that point we’d want 90% of folk not working, I suspect.
  • Yes, folk will have to get taxed on what they have, not just on what they are gaining. This makes sense though: we want the system to slowly drive equity for everyone. (Not equality, and not sameness, just equity.) Taxing what you have is actually a lot fairer than taxing what you earn. Because if you have nothing, but start earning a lot, you’re starting way behind everyone else, so not taxing you much is pretty nice. And if you have a lot, but aren’t earning anymore, not taxing you is really just giving you a free pass: supporting you in terms of every single shared resource and infrastructure.

 

Monads and Python

When I wrote this I was going to lead in by saying: I’ve been spending a chunk of time recently thinking about how best to represent Monads in Python. Then I forgot I had this draft for 3 years. So.. I *did* spend a chunk of time. Perhaps it will be of interest anyway… though I had not finished it (otherwise it wouldn’t still be a draft, would it :))

Why would I do this? Because there are some nifty things you get with them: you get some very mature patterns for dealing with errors (Either, Maybe), with nondeterminism (List), with DSLs (Free).

Why wouldn’t you do this? Because you get some baggage. There are two bits in particular. Firstly, Monads solve a problem Python doesn’t have. Consider:

x = read_file('fred')
y = delete_file('fred')

In Haskell, the compiler is free to run those functions in either order as there is no data dependency between them. In Python, it is not – the order is specified directly by the code. Haskell requires a data dependency to force ordering (and in fact RealWorld in order to distinguish different invocations of IO). So to define a sequence here, Haskell defines a new operator (really just an infix function) called bind (>>= in Haskell). You then create a function to run after the monad does whatever it needs to do. Whenever you see code like this in Haskell:

do x <- action1
   y <- action2
   return (x + y)

-- which desugars to:
action1 >>= \x ->
  action2 >>= \y ->
    return (x + y)

A direct transliteration into Python is possible in a few ways. One of the key things though is to preserve the polymorphism – bind is dependent on the monad instance in use, and the original code is valid under many instances.

def action1(m): return m.unit(1)
def action2(m): return m.unit(2)
m = MonadInstance()
action1(m).bind(
    lambda m, x: action2(m).bind(
        lambda m, y: m.unit(x+y)))

In this style functions in a Monad would take a monad instance as a parameter and use that to access the type. Note in particular that the behavior of bind is involved at every step here.

I’ve recently been diving down into Effect¬†as part of preparing my talk for Kiwi PyCon. Effect was described to me as modelling the Free monad, and I wrote my talk on that basis – only to realise, in doing so, that it doesn’t. The Free monad models a domain specific language – it lets you write interpreters for such a language, and thanks to the¬†lazy nature of Haskell, you essentially end up iterating over a (potentially) infinitely recursive structure until the program ends – the Free bind method steps forward once. This feels very similar to Effect in some ways. Its also used (in some cases) for similar reasons: to let more code be pure and thus reliably testable.

But writing an interpreter for Effect is very different to writing one for Free. Compare these blog posts¬†with the howto for Effect. In the Free Monad the interpreter can hand off to different interpreters at any point. In Effect, a single performer is given just a single Intent, and Intents just return plain values. Its up to the code that processes values and returns new Effect’s to perform flow control.

That said, they are very similar in feel: it feels like one is working with data, not code. Except, in Haskell, its possible to use do notation to write code in the Free monad in imperative style… but Effect provides no equivalent facility.

This confused me, so I reached out to Chris and we had a really fascinating chat about it. He pointed me at another way that Haskellers separate out IO for testing. That approach is to create a class specifically for the IO in your code and have two implementations. One production one and one test implementation. In Python:

class Impure:
    def readline(self):
        raise NotImplementedError(self.readline)
...
class Production:
    def readline(self):
        return sys.stdin.readline()
...
class Test:
    def __init__(self, inputs):
        self.inputs = inputs
    def readline(self):
        return self.inputs.pop(0)
...

Then you write code using that directly.

def echo(impl):
    impl.writeline(impl.readline())

This seems to be a much more direct way to achieve the goal of being able to write pure testable code. And it got me thinking about the actual basic premise of porting monads to Python.

The goal is to be able to write Pythonic, pithy code that takes advantage of the behaviour in the bind for that monad. Let’s consider Maybe.

class Something:
    def __init__(self, thing):
        self.thing = thing
    @classmethod
    def unit(klass, thing):
        return Something(thing)
    def bind(self, l):
        return l(self, self.thing)
    def __str__(self):
        return str(self.thing)

def action1(m): return m.unit(1)
def action2(m): return m.unit(2)

m = Something
r = action1(m).bind(
    lambda m, x: action2(m).bind(
        lambda m, y: m.unit(x+y)))
print("%s" % r)
# 3

Trivial so far, though having to wrap the output types in our functions is a bit ick. Let’s add in None to our example.

class Nothing:
    def bind(self, l):
        return self
    def __str__(self):
        return "Nothing"
def action1(m): return Nothing()
def action2(m): return m.unit(2)
m = Something
r = action1(m).bind(
    lambda m, x: action2(m).bind(
        lambda m, y: m.unit(x+y)))
print("%s" % r)
# Nothing

The programmable semicolon aspect of monads comes in from the bind method – between each bit of code we write, Something chooses to call forward, and Nothing bypasses our code entirely.

But we can’t use that unless we start writing our normally straightforward code such that every statement becomes a closure – which we don’t want. So we want to interfere with the normal process by which Python chooses to run new code.

There is a mechanism that Python gives us where we get control over that: generators. While they are often used for concurrency, they can also be used for flow control.

Representing monads as generators has been done here, here, and don’t forget other languages like Scala.

The problem is that it’s still not regular Python code, and it’s still somewhat mental gymnastics. Natural for someone that’s used to thinking in those patterns, and it works beautifully in Haskell, or Rust, or other languages.

There are two fundamental underpinnings behind this for Haskell: type control from context rather than as part of the call signature, and do notation, which makes code using it look like Python. In Python we are losing the notation, but gaining the bind operator on the Maybe monad, which short circuits Nothing to Nothing across an arbitrary depth of computation.

What else short circuits across an arbitrary depth of computation?

Exceptions.

This won’t give the full generality of Monads (for instance, a Monad that short circuits up to 50 steps but no more is possible) – but it’s possibly enough for the common cases.

Python basically is do notation, and if we just had some way of separating out the side effects from the pure code, we’d have pure code. And we have that from above.

So there you have it, a three-year-old mull: perhaps we shouldn’t port Monads to Python at all, and instead just (a short sketch follows the list):

  • Write pure code
  • Use a strategy object to represent impure activity
  • Use exceptions to handle short circuiting of code
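A tiny sketch of what that combination looks like, in the spirit of the Production/Test split above (the NotFound exception and the store objects are just illustrative stand-ins):

class NotFound(Exception):
    """Plays the role Nothing played: stops the computation wherever it's raised."""

def lookup_age(store, name):
    # Pure logic: no IO of its own, easy to test with a fake store.
    record = store.get(name)
    if record is None:
        raise NotFound(name)
    return record["age"]

def describe(store, name):
    try:
        return "%s is %d" % (name, lookup_age(store, name))
    except NotFound:
        return "no such person"

# The store is the strategy object: a real database wrapper in production,
# a dict-backed fake in tests.
class TestStore:
    def __init__(self, data):
        self.data = data
    def get(self, name):
        return self.data.get(name)

print(describe(TestStore({"alice": {"age": 30}}), "alice"))  # alice is 30
print(describe(TestStore({}), "bob"))                        # no such person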

I think there is room, if we wanted to, to do a really nice, syntax-integrated Monad style facility in Python (and Maybe would be a great reference case for it), but generator overloading – possibly async – might let a nicer thing be done; I haven’t investigated that yet.

SkyDNS in Kubernetes 1.3 local clusters

If you want to run kubernetes locally – not in a VM – then you’ll probably also want DNS service integration to work. That’s fine, except by default it doesn’t work :(. This may be due to DNS being a built-in add-on now, but the current docs around that are inconsistent – referencing the deleted 1.2 dns addon docs :/.

I’ve put a pull request up to fix the errors I encountered trying to use the local-up-cluster script per the current in-tree documentation in build. You also need to run it slightly differently than the basic docs suggest. The basic setup (sensibly) doesn’t listen on 0.0.0.0, avoiding exposing your insecure cluster to the world. But you’re going to be partitioning off your machine into containers, and the kube-dns component which handles DNS integration needs to talk to the kubernetes API, so you need to override that.

sudo KUBE_ENABLE_CLUSTER_DNS=true API_HOST_IP=0.0.0.0 hack/local-up-cluster.sh

Will run a local cluster for you with DNS happily working, assuming the other preconditions (like – you’re not using 10.0.0.0/8) needed to run a local cluster are true. You can start with no environment variables set at all to check that that works – kubernetes itself runs happily with no DNS integration. Note though, that if you have DNS enabled, it has to work, or the kubernetes API itself will fail to register endpoints, and then gets itself firewalled off.

Some quick debugging things I found useful.

Find the pod

$ cluster/kubectl.sh --namespace kube-system get pods
NAME READY STATUS RESTARTS AGE
kube-dns-v18-mi26o 3/3 Running 0 18m

Check it has registered endpoints successfully

$ cluster/kubectl.sh --namespace kube-system get ep
NAME ENDPOINTS AGE
kube-dns 172.17.0.2:53,172.17.0.2:53 18m

Check its logs

$ cluster/kubectl.sh logs --namespace kube-system kube-dns-v18-mi26o -c kubedns
....

Deploy something and check it both can use DNS and is listed in DNS

I made a trivial Ubuntu image with a little more in it:

$ cat rob/Dockerfile
FROM ubuntu

RUN apt-get update
RUN apt-get install -y iputils-ping curl openssh-client iproute2 dnsutils
RUN apt-get clean && rm -rf /var/lib/apt/lists/*

Which I then deploy via a trivial definition:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  namespace: default
spec:
  containers:
  - image: ubuntu-debug
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    name: ubuntu
  restartPolicy: Always

And a call to kubectl:

$ cluster/kubectl.sh create -f rob/ubuntu.yaml

And if successfully integrated with DNS, it will be registered with DNS under A-B-C-D.default.pod.cluster.local.

$ cluster/kubectl.sh exec ubuntu -ti /bin/bash
root@ubuntu:/# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
48: eth0@if49: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.3/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:3/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever
root@ubuntu:/# ping 172-17-0-3.default.pod.cluster.local
PING 172-17-0-3.default.pod.cluster.local (172.17.0.3) 56(84) bytes of data.
64 bytes from ubuntu (172.17.0.3): icmp_seq=1 ttl=64 time=0.013 ms
^C
--- 172-17-0-3.default.pod.cluster.local ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.013/0.013/0.013/0.000 ms

diagnosing flaky tests

Victor (here, here) grabbed me on IRC yesterday for some testr help. He had a super frustrating bug in glance: about one in thirty unit test runs, a test would fail. He’d spent hours tracking it down and still couldn’t reliably reproduce it.

Part of that was due to the glance tests taking a few minutes to run, so each iteration was slow, but another part was a lack of familiarity with the test tooling we use in OpenStack which can give rich data to help analyse such things.

I helped him out – and this post is a step by step handbook of what I did so that I can point people at it 🙂

tl;dr

  1. start by duplicating the environment
  2. setup automation so you are only doing the interesting (or at least not time consuming) bits
  3. bisect and bisect and bisect

Firstly, I pulled down exactly the same code he was working on:

cd glance; git review -d 250083

This let me try to reproduce the thing. However, my normal reproduction facility couldn’t be used because the glance testr configuration depended on invoking testr within lockutils-wrapper. I’m still working through the implications, but for the short term I moved that to be testr’s problem.

So now I could make a python 34 venv and run testr directly:

tox -epy34 --notest; . .tox/py34/bin/activate; testr run --parallel

This is pretty important – it lets me get under the setup.py wrapper that projects use and now I have more control over what is happening. Plus I’m not dealing with tox recreating the venv or anything like that.

It turned out that only the unit tests had been ported to Python3, so I needed to filter down to just those tests. And because I didn’t want to sit here watching it, I set testr off to find a reproduction example on its own:

testr run --parallel --until-failure tests.unit

This runs the same set of tests – whatever you’ve specified in the normal way – in parallel, in a loop. It specifically reschedules and starts new backends (processes that are actually executing test code) each time around, so it’s very close to just scripting it in shell around testr, with only minor differences (such as not re-querying all the tests each time, because testr knows the full set already).

After an hour or so I had toggled back to look at the terminal, and there was a lovely backtrace and information on the failure. It looks something like this:

running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \
lockutils-wrapper \
${PYTHON:-python} -m subunit.run discover -t ./ ./glance/tests --load-list /tmp/tmpafpyzyd5
Ran 2 tests in 0.485s (+0.011s)
PASSED (id=1614)
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \
lockutils-wrapper \
${PYTHON:-python} -m subunit.run discover -t ./ ./glance/tests --load-list /tmp/tmpafpyzyd5
Traceback (most recent call last):
...
glance.common.exception.NotFound: b'Image not found'
======================================================================
FAIL: glance.tests.unit.v1.test_api.TestGlanceAPI.test_upload_image_http_nonexistent_location_url
tags: worker-0
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/robertc/work/openstack/glance/glance/tests/unit/v1/test_api.py", line 1149, in test_upload_image_http_nonexistent_location_url
self.assertEqual(404, res.status_int)
File "/home/robertc/work/openstack/glance/.tox/py34/lib/python3.4/site-packages/testtools/testcase.py", line 350, in assertEqual
self.assertThat(observed, matcher, message)
File "/home/robertc/work/openstack/glance/.tox/py34/lib/python3.4/site-packages/testtools/testcase.py", line 435, in assertThat
raise mismatch_error
testtools.matchers._impl.MismatchError: 404 != 201
Ran 2 tests in 0.419s (-0.061s)
FAILED (id=1615, failures=1 (+1))

Except that the id was much lower, there were multiple concurrent processes being run on each iteration, and the number of tests run much higher – 506 in fact. The important bit to me was the id, because with that we could get programmatic on the problem.

While testr has support for automatically isolating faults, this depends on deterministic behaviour – there isn’t [yet] any support for things that fail 1 time in N when doing bisection, looking for cross-test interactions and so forth. So normally, one would just run:

testr run --analyze-isolation

And testr would churn away and give a useful answer; in this case I needed to do it by hand. I think there is room for getting really smart about dealing with this sort of situation, but the simplest method is to just repeat each test (a run of some X tests looking for a failure) until some confidence level is reached, rather than assuming a pass is actually a pass.

To do that we needed to do two things: we needed a set of tests to run, and a way to reduce the set and repeat. We could use –until-failure as a way to repeat a given test, and stop it after a couple of hours if it hadn’t failed.

Extracting the tests that a given backend ran is straightforward, if not something you’d just luck upon. If the run id you want to investigate is 6, and the backend that you want to report on is worker-0 (see the test tag in the error report above):

cat .testr/6 | subunit-1to2 | subunit-filter -s --xfail --with-tag=worker-0 | subunit-ls > worker-0

This takes the subunit stream from the repository, which is in the legacy (version 1) format, upgrades it to subunit v2, then includes successful tests and expected failures, but only includes tests run on worker-0, pulls out the test ids (that’s the subunit-ls bit) and writes them into a file ‘worker-0’.

To run just those tests:

testr run --load-list worker-0

More interestingly though, lets start by not running all the tests that took place after our failure. Inside the file it looks like this:

...
glance.tests.unit.v1.test_api.TestGlanceAPI.test_update_deleted_image
glance.tests.unit.v1.test_api.TestGlanceAPI.test_update_image_size_header_too_big
glance.tests.unit.v1.test_api.TestGlanceAPI.test_update_public_image
glance.tests.unit.v1.test_api.TestGlanceAPI.test_upload_image_http_nonexistent_location_url
glance.tests.unit.v1.test_api.TestImageSerializer.test_meta
glance.tests.unit.v1.test_api.TestImageSerializer.test_show
...

Note that the test we’re interested in is in the middle there – though the file looks sorted, that’s due to the test backend; what we have is the actual order tests executed in (and we don’t need to worry about concurrency because we pulled out just one backend process, and in Python unittest that’s single-threaded).

First step then is to delete all the tests after the one we care about:

...
glance.tests.unit.v1.test_api.TestGlanceAPI.test_update_deleted_image
glance.tests.unit.v1.test_api.TestGlanceAPI.test_update_image_size_header_too_big
glance.tests.unit.v1.test_api.TestGlanceAPI.test_update_public_image
glance.tests.unit.v1.test_api.TestGlanceAPI.test_upload_image_http_nonexistent_location_url

And then we don’t want to run all the tests: we’re assuming there is an interaction with a single other test leaving a stale process or something, so we want to run the one that failed, and half of the earlier tests; if they end up being reliable, we switch to the half we hadn’t been running, and then repeat the process – take half, run until we’re satisfied it’s reliable, repeat.

The way I do this by hand is to just edit the text file, making a new copy each step, so that I can backtrack easily. So delete half the preceding lines, save to a new file, then run:

testr run --load-list newfile --until-failure

Walk away, do something else, and then come back in a couple of hours.
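If editing the list file by hand gets tedious, a tiny helper can do the halving for you – a sketch, assuming the last line of the file is the failing test and everything above it is the candidate set:

import sys

def halve(in_path, out_path):
    with open(in_path) as f:
        lines = [l for l in f.read().splitlines() if l]
    earlier, failing = lines[:-1], lines[-1]
    kept = earlier[:len(earlier) // 2] + [failing]
    with open(out_path, "w") as f:
        f.write("\n".join(kept) + "\n")
    print("kept %d of %d earlier tests plus %s" % (len(kept) - 1, len(earlier), failing))

if __name__ == "__main__":
    halve(sys.argv[1], sys.argv[2])
# usage: python halve.py worker-0 newfile && testr run --load-list newfile --until-failure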

This worked: from 506 tests on the worker, I then had about 300 that ran before the failing test after the first trim, so 150 was my first bisection, which ran for a couple of hours before failing. Then 75, then 35, then 18, then 9, then 5, then 2, then 1, then – 0 – that’s right, eventually we found that the failing test could fail on its own!

And from that Victor was able to dig deep into what it was doing with confidence – he found a race condition in the test server setup stuff (I haven’t looked closely – I’m parroting what he said on IRC) – and is confident he’s found the bug. Yay!

I also ran one of the smaller sets overnight using Python 2.7, and that didn’t fail at all, so I suspect the failure is in some area that was masked by Python 2.7/eventlet-on-2.7 handling of *something*. We saw a bug in subunit of that nature earlier this year, where a different (but legitimate) behaviour in eventlet on 3.4 led to subunit dropping writes silently. That’s fixed now in both eventlet and subunit 🙂

signalling via exit status in Python

A common idiom in non-trivial command line tools is to have more than two return codes. For instance, diff uses 0 for ‘same inputs’, 1 for ‘different inputs’, 2 for ‘trouble’.

Doing that in Python is a little harder though, and since I’ve gotten it wrong in the past, I want to write it down for both myself and anyone else contemplating it.

The issue is that both your program and the Python VM itself can fail, and so if you attempt to use a status code in common with those the Python VM uses for failures, you have to make sure that the meanings are at least broadly compatible. There’s also a bug in existing Python releases that will cause an exit status of 0 sometimes when an error is actually appropriate.

I’ve only researched this on CPython; it’s possible that other Python VMs behave differently, and as far as I know this is not a language spec issue (but perhaps it should be).

tl;dr:

  1. Always flush stdout and stderr yourself, even when signalling errors.
  2. Never use status 1 or 2 for non-error conditions.
  3. (Provisional) don’t use status 120 at all.

Details:

CPython exits with 0 when the interpreter cleanup code fails to flush stdout/stderr, even though that would be an error if it happened earlier. To address that, add an explicit flush of both streams before your program ends. We may end up making CPython exit with 120 when the stdout/err flushing fails. There’s also a possibility that a very early threading error may result in a 0 exit code, though I haven’t managed to make this actually happen yet.

CPython exits with 1 when site.py fails to import, so using 1 for non-error conditions makes it hard for callers to discriminate between your meaning and site.py failures.

CPython exits with 2 when CLI arguments fail to parse, so using 2 for non-error conditions is similar there. optparse also uses 2 for this, so even if you are using a different interpreter, it is not a safe status code to reuse with different semantics.
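Putting those three rules together, a diff-like tool might end up looking something like this (the specific non-error code of 3 is just a choice that stays clear of 1, 2 and 120):

import sys

SAME = 0        # non-error: inputs matched
DIFFERENT = 3   # non-error: inputs differed (avoid 1 and 2, which CPython uses)
TROUBLE = 4     # error: something went wrong

def main(argv):
    try:
        with open(argv[1]) as a, open(argv[2]) as b:
            status = SAME if a.read() == b.read() else DIFFERENT
    except (IndexError, OSError) as e:
        print("trouble: %s" % e, file=sys.stderr)
        status = TROUBLE
    # Flush explicitly so a failed flush can't be silently swallowed at interpreter exit.
    sys.stdout.flush()
    sys.stderr.flush()
    return status

if __name__ == "__main__":
    sys.exit(main(sys.argv))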