Time to revise the subunit protocol

Subunit is seven and a half years old now – Conrad Parker and I first sketched it out at a CodeCon – camping and coding, a brilliant combination – in mid 2005.

revno: 1
committer: Robert Collins <robertc@robertcollins.net>
timestamp: Sat 2005-08-27 15:01:20 +1000
message:  design up a protocol with kfish

It has proved remarkably resilient as a protocol – the basic nature hasn’t changed at all, even though we’ve added tags, timestamps, and support for attachments of arbitrary size.

However, a growing number of irritations have built up with it. I think it is time to design another iteration of the protocol, one that will retain the positive qualities of the current protocol while helping it become suitable for the next 7 years. Ideally we can keep compatibility and make it possible for a single stream to be represented in either format.

Existing design

The existing design is a mostly human readable, line orientated protocol that can be sniffed out from the regular output of ‘make’ or other build systems. Binary attachments are done using HTTP chunking, and the parser has to maintain state about the current test, tags, timing data and test progression [a simple stack of progress counters]. How to arrange subunit output is undefined, as is how to select tests to run.

This makes writing a parser quite easy, and the tagging and timestamp facilities allow multiplexing streams from two or more concurrent test runs into one with good fidelity – but it also requires that state be buffered until the end of a test, as the stream cannot represent two tests executing at once.

Dealing with debuggers

The initial protocol was intended to support dropping into a debugger – just pass each line read through to stdout, and connect stdin to the test process, and voilà, you have a working debugger connection. This works, but the current line based parsers make using it tedious – the line buffered nature makes feedback on what has been typed fiddly, and stdout tends to be buffered, leading to an inability to see print statements and the like. All fixable in principle, right?

When running two or more test processes, which test process should stdin be connected to? What if two or more drop into a debugger at once? Which process receives what you type is more luck than anything else.

We’ve added some idioms in testrepository that control test execution via a similar but different format – one test per line to list tests, and runners that permit listing and selecting by such a list. This works well, but the inconsistency with subunit itself is a little annoying – you need two parsers and two output formats.

Good points

The current protocol is extremely easy to implement for emitters, and the arbitrary attachments and tagging features have worked extremely well. There is a comprehensive Python parser which maps everything into Python unittest API calls (an extended version of the standard, with good backwards compatibility).

Pain points

The debugging support was a total failure, and the way the parser throws its toys out of the pram when a test process corrupts an outcome line is extremely frustrating (other tests execute, but the parser sees them as non-subunit chatter and passes the lines on through stdout).

Dealing with concurrency

The original design didn’t cater for concurrency. There are two broad concurrency problems – the corruption issue (see below for more detail) and multiplexing, which itself raises several issues. Consider two levels of nested concurrency: a supervisor process such as testrepository starts 2 (or more, but 2 is sufficient to reason about the issue) subsidiary worker processes (I1 and I2), each of which starts 2 subsidiary processes of its own (W1, W2, W3, W4). Each of the 4 leaf processes outputs subunit, which gets multiplexed in the 2 intermediary processes, and then again in the supervisor. Why would there be two layers? A concrete example is using testrepository to coordinate test runs on multiple machines at once, with each machine running a local testrepository to broker tests amongst the local CPUs. This could be done with 4 separate ssh sessions and no intermediaries, but that only removes a fraction of the issues. What issues?

First, consider some stdout chatter that W1 outputs. That will get passed to I1, and from there to the supervisor, and captured. But nothing marks the chatter as belonging to W1: there is no way to tell where it came from. If W1 happened to fail and a diagnostic message was printed, we’ve lost information – or at best muddled it all up.

Secondly, buffering – imagine that a test on W1 hangs. I1 will know that W1 is running a test, but has no way to tell the supervisor (and thus the user) that this is the case, short of writing to stdout [and causing a *lot* of noise if it happens often]. We could have I1 write to stdout only if W1’s test is taking more than 5 seconds or something – but this is a workaround for a limitation of the protocol. Adding to the confusion, the clocks on W1 and W3 may be badly skewed, so timestamps for everything have to be carefully synchronised by the multiplexer.

Thirdly, scheduling – if W1/W2 are on a faster machine than W3/W4, then partitioning equal-timed tests onto each machine will leave one machine idle while the other is still running. It would be nice to be able to pass further tests to the faster machine when it goes idle, rather than having to start a new runner each time.

Lastly, what to do when W1 and W2 both wait for user input on stdin (e.g. passphrases, debugger input, $other)? Naively connecting stdin to all processes doesn’t work well. A GUI supervisor could connect a separate fd to each of I1 and I2, but that doesn’t help when it is W1 and W2 reading from stdin.

So the additional requirements over baseline subunit are:

  1. make it possible for stdout and stderr output to be captured from W1 and routed through I1 to the supervisor without losing its origin. It might be chatter from a noisy test, or it might be build output. Either way, the user will probably benefit if we can capture it and show it to them later when they review the test run. The supervisor should probably show it immediately as well – the protocol doesn’t need to care about that, just make it possible.
  2. make it possible to pass information about tests that have not completed through one subunit stream while they are still incomplete.
  3. make it possible (but optional) to pass tests to run to a running process that accepts subunit.
  4. make it possible to route stdin to a specific currently running process like W1. This and point 3 suggest that we need a bidirectional protocol rather than the solely unidirectional protocol we have today. I don’t know of a reliable portable way to tell when some process is seeking such input, so that will be up to the user I think. (e.g. printing (pdb) to stdout might be a sufficiently good indicator.)

Dealing with corruption

Consider the following subunit fragment:

test: foo
starting serversuccess:foo

This is a classic example of corruption: the test ‘foo’ started a server and helpfully wrote to stdout explaining that it did that, but missed the newline. As a result the success message for the test wasn’t printed on a line of its own, and the subunit parser will believe that foo never completed. Every subsequent test is then ignored. This is usually easy to identify and fix, but it’s a head-scratcher when it happens. Another way it can happen is when a build tool like ‘make’ runs tests in parallel and they output subunit onto the same stdout file handle. A third way is when a build tool like make runs two separate test scripts serially, and the first one starts a test but errors hard and doesn’t finish it. That looks like:

test: foo
test: bar
success: bar

One way this sort of corruption can be mitigated is to put subunit on its own file descriptor, but this has several caveats: it is harder to tunnel through things like ssh, and it doesn’t solve the failing test script case.

I think it is unreasonable to require a protocol where arbitrary interleaving of bytes between different test runner streams will work – so the ‘make -j2’ case can be ignored at the wire level – though we should create a simple way to safely mux the output from such tests when they execute.

The root of the issue is that a dropped update leaves bad state in the parser and it never recovers. So some way to recover, or less state to carry in the parser, would neatly solve things. I favour reducing parser state as that should shift stateful complexity onto end nodes / complex processors, rather than being carried by every node in the transmission path.

Dependencies

Various suggestions have been made – JSON, Protobufs, etc…

A key design goal of the first subunit was a low barrier to entry. We keep that by being backward compatible, but making the new revision easy to work with is also a worthy goal.

High level proposal

A packetised, length-prefixed binary protocol. Each packet contains:

  • a small signature and a length
  • a routing code
  • a binary timestamp in UTC
  • a set of UTF8 tags (active only, no negative tags)
  • a content tag – one of (estimate + number, stdin, stdout, stderr, test- + test id)
  • a test status – one of (exists/inprogress/xfail/xsuccess/success/fail/skip)
  • an attachment name, mime type, a last-block marker and a block of bytes

The content tags:

  • estimate – the stream is reporting how many tests are expected to run. It affects everything with the same routing code only, and replaces (doesn’t adjust) any current estimate for that routing code. An estimate packet of 0 can be used to say that a routing target has shut down and cannot run more tests. Routing codes can be used by a subunit aware runner to separate out the threads in a single process, or even just separate ‘TestSuite’ objects within a single test run (though doing so means that they will need to process subunit and strip packets on stdin). This supersedes the stack of progress indicators that current subunit has. Estimates cannot have test status or attachments.
  • stdin/stdout/stderr – a packet of data for one of these streams. The routing code identifies the test process that the data came from/should go to in the tree of test workers. These packets cannot have test status but should have a non-empty attachment block.
  • test- + testid – a packet of data for a single test. Test status may be included, as may attachment name, mime type, last-block and binary data.

Test status values are pretty ordinary. Exists is used to indicate a test that can be run when listing tests, and inprogress is used to report a test that has started but not necessarily completed.

Attachment names must be unique per routing code + testid.
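
To make the shape concrete, here is a rough Python sketch of serialising one test packet. Everything below – the signature byte, field order, status codes and size encodings – is an assumption for illustration, not the final wire format:

import struct
import time

SIGNATURE = 0xB3  # hypothetical one-byte packet signature
# Hypothetical numeric codes for the test statuses listed above.
STATUS = {'exists': 0, 'inprogress': 1, 'xfail': 2, 'xsuccess': 3,
          'success': 4, 'fail': 5, 'skip': 6}

def make_packet(route_code, test_id, status, tags=(), attachment=b'',
                attachment_name='', mime_type='', last_block=True):
    body = bytearray()
    # Binary UTC timestamp; a double is used here purely for brevity.
    body += struct.pack('>d', time.time())
    # Length-prefixed UTF8 fields: routing code, tags, content tag,
    # attachment name, mime type.
    for field in (route_code, ','.join(tags), 'test-' + test_id,
                  attachment_name, mime_type):
        raw = field.encode('utf8')
        body += struct.pack('>H', len(raw)) + raw
    body += struct.pack('>BB', STATUS[status], 1 if last_block else 0)
    body += struct.pack('>I', len(attachment)) + attachment
    # Signature + whole-packet length let a reader skip or resync cheaply.
    return struct.pack('>BI', SIGNATURE, len(body)) + bytes(body)

packet = make_packet('1/2', 'foo', 'success', tags=('worker-0',))

The leading signature and length pair is what the recovery, debugging and multiplexing sections below lean on.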

So how does this line up?

Interleaving and recovery

We could dispense with interleaving and say the streams are wholly binary, or we can say that packets can start either after a \n or directly after another packet. If binary-only were the approach, it would be possible to write a filter that applies the newline heuristic (or even looks for packet headers at every byte offset). I think mandating that packets start adjacent to another packet or after a \n is a small concession to make, and will avoid tools like testrepository forcing users to always configure a heuristic filter. Non-subunit content can be wrapped in subunit for forwarding (the I1 in the W1->I1->supervisor chain would do the wrapping). This won’t eliminate corruption, but it will localise it and permit the stream to recover: the test that was corrupted will show up as incomplete, or with incomplete attachment data.
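
As a minimal sketch of that heuristic – reusing the hypothetical signature byte and length prefix from the packet sketch above – a reader positioned at a line boundary (or directly after a packet) could classify the next input like this:

import struct

def read_event(stream, signature=0xB3):
    # Called only at a line boundary or directly after a packet, per the
    # proposed rule; the signature byte value is an assumption.
    first = stream.read(1)
    if not first:
        return None  # end of stream
    if first[0] == signature:
        (length,) = struct.unpack('>I', stream.read(4))
        return ('packet', stream.read(length))
    # Anything else is chatter: consume the line and wrap it for forwarding.
    return ('chatter', first + stream.readline())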

Listing

Test listing would emit many small non-timestamped packets. It may be useful to have a wrapper packet for bulk amounts of fine grained data like listing is, or for multiplexers with many input streams that will often have multiple data packets available to write at once.

Selecting tests to run

Same as for listing. While passing regexes down to the test runner to select groups of tests is a valid use case, that’s not something subunit needs to worry about: if the selection is not the result of the supervisor selecting by test id, then it is known at the start of the test run and can just be a command line parameter to the backend. Subunit is relevant for passing instructions to a runner mid-execution. Because the supervisor cannot just hand out some tests and wait for the thing it ran to report that it can accept incremental tests on stdin, supervisor processes will need to be informed about that capability out of band.
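
For illustration, handing out more work mid-run could look like the fragment below, assuming the hypothetical make_packet helper from earlier, ‘exists’ as the enumeration status, and a runner_stdin pipe to the backend:

# Hand two more tests to a runner that accepts subunit on stdin.
# runner_stdin is a hypothetical binary pipe to the backend process.
for test_id in ('test_foo', 'test_bar'):
    runner_stdin.write(make_packet('', test_id, 'exists'))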

Debugging

Debugging is straightforward. The parser can read the first 4 or so bytes of a packet one at a time to determine whether it is a packet or a line of stdout/stderr, and then either read to the end of the line, or the binary length of the packet. So we combine a few things: non-subunit output should be wrapped and presented to the user; subunit that is being multiplexed and forwarded should have a routing code prepended to the packet (e.g. I1 would prepend ‘1’ or ‘2’ to indicate which of W1/W2 the content came from, and then forward the packet; S would prepend ‘1’ or ‘2’ to indicate I1/I2 – the routing code is a path through the tree of forwarding processes); the UI the user is using needs to supply some means to switch where stdin is attached; and stdin input should be routed via stdin packets. When there is no routing code left, the packet should be entirely unwrapped and presented as raw bytes to the process in question.
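
A sketch of the two routing directions, writing route codes as ‘/’-separated hops (the separator and the helper names are assumptions):

def wrap_upward(route_code, hop):
    # A forwarder prepends its own hop, so the supervisor sees e.g.
    # '1/2' for W2 behind I1.
    return hop if not route_code else hop + '/' + route_code

def route_stdin_down(route_code, data, children, local_stdin):
    # Peel one hop per forwarding process; when no code is left, the
    # bytes are unwrapped and handed raw to the local process.
    if not route_code:
        local_stdin.write(data)
        return
    head, _, rest = route_code.partition('/')
    children[head].send_stdin_packet(rest, data)  # hypothetical child API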

Multiplexing

Very straightforward – unwrap the outer layer of the packet, add or adjust the routing code, and serialise a header + adjusted length + the rest of the packet as-is. No buffering is needed, so the supervisor can show in-progress tests (and how long they have been running for).
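
Per packet, the multiplex step is tiny. A sketch, reusing wrap_upward from the debugging sketch, with serialise standing in (hypothetically) for re-emitting the adjusted header plus the untouched body:

def mux_one(hop, packet, out):
    # 'packet' is (route_code, rest_of_packet_bytes) as read from child
    # 'hop'. A real multiplexer would drive this from select()/poll over
    # all child pipes; no per-test buffering is needed, so in-progress
    # tests stay visible upstream.
    route_code, body = packet
    out.write(serialise(wrap_upward(route_code, hop), body))  # hypothetical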

Parsing / representation in Python or other languages

The parser should be very simple to write. Once parsed, this will be fundamentally different from the existing Python TestCase->TestResult API that is in use today. However, it should be easy to write two adapters: old-style <-> new-style. Old-style -> new-style is useful for running existing test suites and generating subunit, because that way the subunit generation is transparent. New-style -> old-style is useful for using existing test reporting facilities (such as junitxml or html TestResult objects) with subunit streams.

Importantly though, a new TestResult style that supports the features of this protocol would enable some useful features for regular Python test suites:

  • Concurrent tests (e.g. in multiprocessing) wouldn’t need multiplexers and special adapters – a regular single TestResult with a simple mutex around it would be able to handle concurrent execution of tests, show hung tests, etc.
  • The routing of input to a particular debugger instance also applies to a simple python process running tests via multiprocessing, so the routing feature would help there.
  • The listing facility and incrementally running tests would be useful too, I think – we could get to running tests concurrently with test collection still happening, though this would apply to parts of unittest beyond just the TestResult.

The API might be something like:

class StreamingResult(object):
    """Sketch of a sink for parsed stream events."""
    def startTestRun(self):
        pass
    def stopTestRun(self):
        pass
    def estimate(self, count, route_code=None):
        # An updated expected-test count for one routing code.
        pass
    def stdin(self, bytes, route_code=None):
        pass
    def stdout(self, bytes, route_code=None):
        pass
    def test(self, test_id, status, attachment_name=None, attachment_mime=None, attachment_eof=None, attachment_bytes=None):
        # A status and/or attachment-block event for a single test.
        pass

This would support just-in-time debugging by wiring pdb up to the stdin/stdout handlers of the result object, rather than the actual stdin/stdout of the process – a simple matter once written. Alternatively, the test runner could replace sys.stdin/stdout etc. with thunk file-like objects, which might be a good idea anyway to capture spurious output happening during a test run. That would permit pdb to Just Work (even if the test process is internally running concurrent tests… until it has two pdb objects running concurrently 🙂
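
A minimal sketch of such a thunk, assuming the StreamingResult API above:

class StdoutThunk(object):
    # File-like object that turns writes into stdout events on a result.
    # Installed as sys.stdout it captures spurious prints and lets pdb
    # output travel over the stream; a readable twin fed from stdin
    # packets would complete the loop.
    def __init__(self, result, route_code=None):
        self.result = result
        self.route_code = route_code

    def write(self, data):
        self.result.stdout(data.encode('utf8'), route_code=self.route_code)
        return len(data)

    def flush(self):
        pass  # events are emitted eagerly; nothing to buffer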

Generating new streams

Should be very easy in anything except shell. For shell, we can have a command line tool that, when invoked, outputs a subunit stream for one instruction – e.g. ‘test foo completed + some attachments’ or ‘test foo starting’.

11 thoughts on “Time to revise the subunit protocol”

  1. I tried to reply yesterday. tl;dr, NEEDSINFO

    – I don’t understand the protocol you are describing. If you could present a packet diagram or a grammar or something, that would help a lot. Maybe some example packets.

    – Are you developing your own binary protocol anyway? Why? The “Dependencies” section makes it clear that it wasn’t an option for v1, but there seems to be no reason now.

    – I’m struggling to follow the “Selecting” case, I think it’s because ‘supervisor’, ‘backend’, and ‘runner’ are terms that now carry weight but aren’t defined.

    – `disttrial` has some prior art here. It’s built using AMP, a bidirectional asynchronous binary protocol, which might break a couple of requirements here, but it’s worth mentioning.

  2. I’ll do a detailed wire protocol – EBNF probably – tomorrow. It is a lot easier to reason about behaviour on the wire IME; I think that’s a large reason. I’m also hesitant to use something with arbitrary extensibility – I’d like to know where and how it is extending. Perhaps I should use protobufs etc., but that also increases porting overhead for new languages.

    backend and runner and supervisor are ambiguous because testrepository is a runner runner :). The supervisor is a process that has a UI on one side and a number of subunit emitting processes on the other. Those are backends from the supervisor’s perspective. When the supervisor’s UI is subunit, then the supervisor can act as a backend for another supervisor…

    disttrial is indeed interesting, but AFAIK it doesn’t concern itself with muxing, having a single-depth hub and spoke network. Got a link to an English description of its workings?

  3. (somewhat drive-by comments)

    I don’t see anything here that would conflict with using protobufs (or Thrift, or probably Datomic.)

    If you want to write your own byte-packing code yourself (maybe to run on a teeny microcomputer), producing protobufs will not be much harder than the format you allude to here.

    On the other hand protobufs seem easier in some ways:

    – the bottom layer of encoding is already done for people who want to read/write it from a new language
    – importantly, probably more safely against malicious input than a by-hand parser
    – you get a formal grammar essentially for free
    – there is a clear systematic way to add new fields in future, without breaking old peers, which seems likely to be useful

    1. It is true – or msgpack or bson or … However, most (or all) of those options are not designed for even slightly noisy channels. For instance msgpack doesn’t have a checksum itself, and will happily serialise 2GB of data in a single stream – and when reading it back does a single malloc for that rather than chunking it. It is a lot easier to be sure of the semantics if I do the whole stack – though sure, it won’t be as simple as just reusing an existing stack. I ended up doing my own wire protocol definition after having another look around, and it performs acceptably, particularly compared to the (not particularly slow) line based v1 parser.

      The extensibility thing is a mixed blessing. As far as malicious input goes, there are lots of ways to get things wrong – see the msgpack issue above, which is arguably intrinsic to the msgpack definition.

  4. What do you think about providing more information about the program being tested as a part of the stream? Just a space to include the program’s name and version (and exact code revision, where possible) would be quite useful. Right now we seem to just assume that a particular subunit stream is for a particular revision of a particular program, but there’s nothing to specify either way. Granted, my measure of that situation is from frantically poking at testrepository / subunit / testtools / bzr to do what I need them to do in a hurry, so I could be missing something obvious 🙂

    1. That is an interesting question; on the one side, does the protocol need to know, and on the other side, is the stream the right unit to assess this on?

      Let’s consider a related bit of information: what version of a dependency (e.g. subvertpy for bzr-svn) is present when the test run is started.

      Further, consider a distributed testing case, where we have two machines A and B both running tests (partitioned from the whole suite). In this scenario we have three streams: A – tests from A, B – tests from B, and C – the multiplexed stream once testrepository has combined the streams. Ideally we could recover this dependency information for any test, regardless of source stream A or B. So ‘stream’ is the wrong granularity.

      In V2 we have the route_code (or tags could be used for V1 compat) which can uniquely group all the tests for the same context within a stream. So we could associate some structured data with that.

      A slightly weaker thing though, that doesn’t need any test support, is to just use an attachment.

      Given a subunit v2 serialiser, something like:

      serializer.status(file_name="version-info", file_bytes=bzrlib.version_info….().encode('utf8'), mime_type="text/json; charset=utf8")

      emitted from the test runner will capture that data, and you can recover it at any later point by starting with the test, and then looking for the version-info that matches the route code within the combined stream.

  5. A good starting point for a protocol like this would be bencode (http://en.wikipedia.org/wiki/Bencode). It is the encoding for BitTorrent, and while it is a headache for humans to read, it is very machine readable and easy to extend to handle more types. I have been looking at using it for a testing log format for a while now.

    1. So, bencode is an OK protocol, but it has a limitation here: it has no synchronisation primitives – it’s either in sync, or it’s broken. That makes it unsuitable for use in noisy environments (which testing can, all too often, be).
