Monads and Python

When I wrote this I was going to lead in by saying: I’ve been spending a chunk of time recently thinking about how best to represent Monads in Python. Then I forgot I had this draft for 3 years. So.. I *did* spend a chunk of time. Perhaps it will be of interest anyway… though I had not finished it (otherwise it wouldn’t still be draft would it :))

Why would I do this? Because there are some nifty things you get with them: you get some very mature patterns for dealing with error (Either, Maybe), with nondeterminism (List), with DSLs (Free).

Why wouldn’t you do this? Because you get some baggage. There are two bits in particular. firstly, Monads solve a problem Python doesn’t have. Consider:

x = read_file('fred')
y = delete_file('fred')

In Haskell, the compiler is free to run those functions in either order as there is no data dependency between them. In Python, it is not – the order is specified directly by the code. Haskell requires a data dependency to force ordering (and in fact RealWorld in order to distinguish different invocations of IO). So to define a sequence here it defines a new operator (really just an infix function) called bind (>>= in haskell). You then create a function to run after the monad does whatever it needs to do. Whenever you see code like this in Haskell:

do x <- action1
     y >=
  \x action2 >>=
     \y return x+y

A direct transliteration into Python is possible a few ways. One of the key things though is to preserve the polymorphism – bind is dependent on the monad instance in use, and the original code is valid under many instances.

def action1(m): return m.unit(1)
def action2(m): return m.unit(2)
m = MonadInstance()
    lambda m, x: action2(m).bind(
        lambda m, y: m.unit(x+y)))

In this style functions in a Monad would take a monad instance as a parameter and use that to access the type. Note in particular that the behavior of bind is involved at every step here.

I’ve recently been diving down into Effect as part of preparing my talk for Kiwi PyCon. Effect was described to me as modelling the Free monad, and I wrote my talk on that basis – only to realise, in doing so, that it doesn’t. The Free monad models a domain specific language – it lets you write interpreters for such a language, and thanks to the lazy nature of Haskell, you essentially end up iterating over a (potentially) infinitely recursive structure until the program ends – the Free bind method steps forward once. This feels very similar to Effect in some ways. Its also used (in some cases) for similar reasons: to let more code be pure and thus reliably testable.

But writing an interpreter for Effect is very different to writing one for Free. Compare these blog posts with the howto for Effect. In the Free Monad the interpreter can hand off to different interpreters at any point. In Effect, a single performer is given just a single Intent, and Intents just return plain values. Its up to the code that processes values and returns new Effect’s to perform flow control.

That said, they are very similar in feel: it feels like one is working with data, not code. Except, in Haskell, its possible to use do notation to write code in the Free monad in imperative style… but Effect provides no equivalent facility.

This confused me, so I reached out to Chris and we had a really fascinating chat about it. He pointed me at another way that Haskellers separate out IO for testing. That approach is to create a class specifically for the IO in your code and have two implementations. One production one and one test implementation. In Python:

class Impure:
    def readline(self):
        raise NotImplementedError(self.readline)
class Production:
    def readline(self):
        return sys.stdin.readline()
class Test:
    def __init__(self, inputs):
        self.inputs = inputs
    def readline(self):
        return self.inputs.pop(0)

Then you write code using that directly.

def echo(impl):

This seems to be a much more direct way to achieve the goal of being able to write pure testable code. And it got me thinking about the actual basic premise of porting monads to Python.

The goal is to be able to write Pythonic, pithy code that takes advantage of the behaviour in the bind for that monad. Lets consider Maybe.

class Something:
    def __init__(self, thing):
        self.thing = thing
def unit(klass, thing):
    return Something(thing)
def bind(self, l):
    return l(self, self.thing)
def __str__(self):
    return str(self.thing)
def action1(m): return m.unit(1)
def action2(m): return m.unit(2)
m = Something
r = action1(m).bind(
    lambda m, x: action2(m).bind(
        lambda m, y: m.unit(x+y)))
print("%s" % r)
# 3

Trivial so far, though having to wrap the output types in our functions is a bit ick. Lets add in None to our example.

class Nothing:
    def bind(self, l):
        return self
    def __str__(self):
        return "Nothing"
def action1(m): return Nothing()
def action2(m): return m.unit(2)
m = Something
r = action1(m).bind(
    lambda m, x: action2(m).bind(
        lambda m, y: m.unit(x+y)))
print("%s" % r)
# Nothing

The programmable semicolon aspect of monads comes in from the bind method – between each bit of code we write, Something chooses to call forward, and Nothing bypasses our code entirely.

But we can’t use that unless we start writing our normally straight forward code such that every statement becomes a closure – which we don’t want.. so we want to interfere with the normal process by which Python chooses to run new code.

There is a mechanism that Python gives us where we get control over that: generators. While they are often used for concurrency, they can also be used for flow control.

Representing monads as generators has been done here, here, and don’t forget other languages like Scala.

The problem is, that its still not regular Python code, and its still somewhat mental gymnastics. Natural for someone thats used to thinking in those patterns, and it works beautiful in Haskell, or Rust, or other languages.

There are two fundamental underpinnings behind this for Haskell; type control from context rather than as part of the call signature and do notation which makes code using it look like Python.  In python we are losing the notation, but gaining the bind operator on the Maybe monad which short circuits Nothing to Nothing across an arbitrary depth of of computation.

What else short circuits across an arbitrary depth of computation?


This won’t give the full generality of Monads (for instance, a Monad that short circuits up to 50 steps but no more is possible) – but its possibly

Python basically is do notation, and if we just had some way of separating out the side effects from the pure code, we’d have pure code. And we have that from above.

So there you have it, a three year old mull: perhaps we shouldn’t port Monads to Python at all, and instead just:

  • Write pure code
  • Use a strategy object to represent impure activity
  • Use exceptions to handle short circuiting of code

I think there is room if we wanted to to do a really nice, syntax integrated Monad style facility in Python (and Maybe would be a great reference case for it), but generator overloading – possibly async might let a nicer thing be done but I haven’t investigated that yet.


Testrepository roadmap 2015/16

Testrepository has been moderately successful – its very good at some of the things it aspired to (e.g. debugging sporadic test failures in parallel environments), but other angles have not really been explored.

I’ve set some time aside to correct this, in large part to facilitate some important features for tempest (which has its concurrency currently built on the meta-runner included in testrepository – and I’d like to enable the tempest authors to avoid having to write gnarly concurrency code :))

So my plan is to tackle a few things in the lead up to, and perhaps just after the Tokyo OpenStack summit. I wanted to socialise the proposed changes though, and thus this blog post.


Firstly, a long standing issue is that when one tests several different configurations, testrepository is poor at reporting failures that are configuration specific. For instance, imagine that your test suite is run with both Python 2.7 and 3.4, and both results are loaded into your repository. If a given test ‘X’ fails in the first run, and not the second… after the second run is loaded, it will be reported as ‘passing’.

My proposed fix for this is to call the name of each such run a ‘profile’ and use tags to differentiate between the two runs. So you’d tag the 2.7 run perhaps ‘py27’ and the second ‘py34’, and then tell testrepository that the ‘py27’ and ‘py34’ tags are being used to identify profiles. After that testrepository will only consider two test to apply to the same test if the tags match. Tags that are not specified as being for profiles (e.g. the worker-N tags that the testrepository runner adds to track backends that tests run in) won’t be considered in that comparison. This well then allow testrepository to track that each run was separate and the results are not meant to replace each other. The use of tags allows for test matrices too, in principle– consider python version as one dimension, operating system version as another, and database engine as a third — it would be up to the user. I don’t plan to directly implement a matrix system in the first iteration. A different, more dynamic model is in principle possible: don’t tag things, just log events that will give clues and correlate later – thats not precluded by this tag based approach, and we can always add such a thing later.

The output for the queries of the datastore need to be updated though – we don’t currently report tags in e.g. ‘testr failing –list’. This is a little tricky: the listing format is intended to be a mix of nice-for-humans, and machine consumption. Another approach we considered was to namespace the tests with the profile. This has a couple of disadvantages: it may break an unknown number of deployments if the chosen separator is already in use by people, and secondly, it mixes structured and free-form data in a lossy way. One example of that would be that we’d start interpreting all test ids to see if they are – or are not – namespaced with a profile : thats likely to be fragile, at best. On the other hand it would very easily fit into the list format – which is why it was appealing. On balance though, the fragility and conflation would just add technical debt. Instead, we’ll do the following:

  1. Anything that needs to output a flat list of tests will output that for just one profile. An option will be added to allow querying the profiles for which results might be given. The default will start erroring with a list of available profiles if more than one profile has been specified.
  2. We’ll define a minimal JSON schema for reporting multiple profiles in such places. The excellent jq tool can be used to manipulate that in shell command lines. A command line option will opt into receiving this.

Testrepository has two very related programs inside itself. There is the data store and the various queries it can do – e.g. ‘testr load’ and ‘testr failing’. Then there is the meta-runner, which knows how to run some test processes to execute tests. While strictly speaking this is optional, its been very convenient for working with Python tests to have the meta-runner connected to testr and able to do in-process querying.

The meta-runner will benefit from being updated as well. My intent is to make it capable of running all the tests from all the profiles the user specifies, storing that as one single run in the datastore. Two commands in particular need to change here – `testr list-tests` needs to change in line with the test listing above, and `testr run –load-list` needs to be taught how to deal with multiple profiles. I plan to add a command line option to tell it that JSON is being used, and to select tests across all profiles when a simple list or a test regex is given. Finally the command line can benefit from a command line option to select one or more profiles.


The meta-runner has a crude scheduler – it balances based on historic performance prior to running any backend. An online scheduler will give much greater performance in both unseeded, and skewed data cases- e.g.if many long tests fail due to a bug the run after that will often have some workers finishing well before others – leading to slow test times.

The plan here is to finish the implementation of bidirectional channels to test backends, and then dispatch work to them incrementally

Concurrency plans

Tempest wants to be able to run some tests completely independently, and then others can run together arbitrarily. To facilitate this, the online scheduler will be extended to permit describing an overall plan to run through – e.g. a list of segments, where each segment describes one or more tests that can be run together. The UI to supply that to the scheduler will probably start out as a JSON file listing exact test ids and we can iterate from there based on their experience.

Revisiting the Fixture API – handling leaky resources

Fixtures are one of the innovations I’m most happy with.

A Fixture is an enhanced context manager. The enhancements are:

  • There’s an API for gathering debugging information from the fixture (rather than depending on side effects such as the logging module or stdout). This makes it easy to attach log files from servers (for instance rabbitfixture does this).
  • There is glue to support composing other fixtures while still exposing errors from any fixture in the composed set.

OpenStack’s Neutron has been using fixtures in its test suite for some time, but is finding that writing correct fixtures is hard. In particular, they were leaking processes when a fixture would fail during setUp / __enter__ – and then not be cleaned up by the testtools / fixtures useFixture function.

There are several things we can do to improve the situation.

  • We could make the convenience APIs like useFixture add a try:/finally: and call cleanUp() when setUp fails. This involves making cleanUp() be callable in more situations than it is today.
  • We could make setUp itself do that, advising users to override a different function; this would hide the failure interactions internally, but wouldn’t benefit existing fixtures until they are rewritten to not override setUp.
  • We could provide a decorator that folk with fragile setUp’s (e.g. those that involve IO) could use to robustify their fixtures.

The highest leverage change is the first, but is it safe and suitable? Lets look at PEP-343.

In PEP-343 we see the following translation of with expressions:

with EXPR as VAR:
mgr = (EXPR)
exit = type(mgr).__exit__
value = type(mgr).__enter__(mgr)
exc = True
        VAR = value
        exc = False
        if not exit(mgr, *sys.exc_info()):
    if exc:
        exit(mgr, None, None, None)

This means that using a Fixture which may leak external resources when setUp fails is unsafe via with. Therefore we can’t use the first solution.

Decorators are nice, but somewhat noisy and opt-in. Both decorators and a different setUp in the base class will require extending the protocol to specify when cleanUp can be called more precisely.

If we make the documentation advise users to override a specific method, and setUp does this in the event of failure, I think we’ll have somewhat more uptake. So – thats the route I’m going to head down.

There’s one more thing to consider, which is access to debugging information of failures in setUp. Since the object will have been cleaned up, accessing logs etc will be hard. I think if we raise an additional exception into the MultiException with the details objects, it will be possible for fixtures to provide those details, though they will need buffering in memory (or some sophisticated lazy-delete logic such as holding a reference to an unlinked fd).

Subunit and subtests

Python 3 recently introduced a nice feature – subtests. When I was putting subunit version 2 together I tried to cater for this via a heuristic approach – permitting the already known requirement that some tests which are reported are not runnable be combined with substring matching to identify subtests.

However that has panned out poorly, when I went to integrate this with testr the code started to get fugly.

So, I’m going to extend the StreamResult API to know about subtests, and issue a subunit protocol bump – to 2.1 – to add a new field for labelling subtest events. My plan is to make this build a recursive tree structure – that is given test “test_foo” with subtest “i=3” which the Python subtest code would identify as “test_foo (i=3)”, they should be identified in StreamResult as test_id “test_foo (i=3)” and parent_test_id “test_foo”. This can then nest arbitrarily deep if test runners decide to do that, and the individual runnability becomes up to the test runner, not testrepository / subunit / StreamResult.

subunit version 2 progress

Subunit V2 is coming along very well.

Current status:

  • I have a complete implementation of the StreamResult API up as a patch for testtools. Thats 2K LOC including comeprehensive tests.
  • Similarly, I have an implementation of a StreamResult parser and emitter for subunit. Thats 1K new LOC including comprehensive tests, and another 500 lines of churn where I migrate all the subunit filters to v2.
  • pdb debugging works through subunit v2, permitting dropping into a debugger to work. Yay.

Remaining things to do:

  • Update the other language bindings – the C library in particular.
  • Teach testrepository to expect v2 input (and probably still store v1 for a while)
  • Teach testrepository to use pipes for the stdin of test runner backends, and some control mechanism to switch input between different backends.
  • Discuss the in-Python API with more folk.
  • Get code merged 🙂

Simpler is better – a single event type for StreamResult

StreamResult, covered in my last few blog posts, has panned out pretty well.

Until that is, that I sat down to do a serialised version of it. It became fairly clear that the wire protocol can be very simple – just one event type that has a bunch of optional fields – test ids, routing code, file data, mime-type etc. It is up to the recipient at the far end of a stream to derive semantic meaning, which means that encoding a lot of rules (such as a data packet can have either a test status or file data) into the wire protocol isn’t called for.

If the wire protocol doesn’t have those rules, Python parsers that convert a bytestream into StreamResult API calls will have to manually split packets that have both status() and file() data in them… this means it would be impossible to create many legitimate bytestreams via the normal StreamResult API.

That seems to be an unnecessary restriction, and thinking about it, having a very simple ‘here is an event about a test run’ API that carries any information we have and maps down a very simple wire protocol should be about as easy to work with as the current file or status API.

Most combinations of file+status parameters is trivially interpretable, but there is one that had no prior definition – a test_status with no test id specified. Files with no testid are easily considered as ‘global scope’ for their source, so perhaps test_status should be treated the same way? [Feedback in comments or email please]. For now I’m going to leave the meaning undefined and unconstrained.

So I’m preparing a change to my patchset for StreamResult to:

  • Drop the file() method altogether.
  • Add file_bytes, mime_type and eof parameters to status().
  • Make the test_id and test_status parameters to status() optional.

This will make the API trivially serialisable (both to JSON or protobufs or whatever, or to the custom binary format I’m considering for subunit), and equally trivially parsable, which I think is a good thing.

First experience implementing StreamResult

My last two blog posts were largely about the needs of subunit, but a key result of any protocol is how easy working with it in a high level language is.

In the weekend and evenings I’ve done an implementation of a new set of classes – StreamResult and friends – that provides:

  • Adaption to and from the existing TestResult APIs (the 2.6 and below API, 2.7 API, and the testtools extended API).
  • Multiplexing multiple streams together.
  • Adding timing data to a stream if it is absent.
  • Summarising a stream.
  • Copying a stream to multiple outputs
  • A split out API for instructing a test run to stop.
  • A simple test-at-a-time stream processor that makes it easy to just deal with tests rather than the innate complexities of an event based interface.

So far the code has been uniformly simple to write. I started with an API that included an ‘estimate’ function, which I’ve since removed – I don’t believe the complexity is justified; enumeration is not significantly more expensive than counting, and runners that want to be efficient can either not enumerate or remember the enumeration from prior runs.

The documentation in the linked pull request is a good place to start to get a handle on the API; I’d love feedback.

Next steps for me are to do a subunit protocol revision that maps to the Python API, both parser and generator and see how it feels. One wrinkle there is that the reason for doing this is to fix intrinsic limits in the existing protocol – so doing forward and backward wire protocol compatibility would defeat the point. However… we can make the output side explicitly choose a protocol version, and if we can autodetect the protocol version in the parser, even if we cannot handle mixed streams we can get the benefits of the new protocol once data has been detected. That said, I think we can start without autodetection during prototyping, and add it later. Without autodetection, programs like TestRepository will need configuration options to control what protocol variant to expect. This could be done by requiring this new protocol and providing a stream filter that can be deployed when needed.