I feel like I’m taking a big personal risk writing this, even though I know the internet is large and probably no-one will read this :-).
So, dear reader, please be gentle.
As we grow – as people, as developers, as professionals – some lessons are hard to learn (e.g. you have to keep trying and trying to learn the task), and some are hard to experience (they might still be hard to learn, but just being there is hard itself…). I want to talk about a particular lesson I started learning in late 2008/early 2009 – while I was at Canonical – sadly one of those that was hard to experience.
At the time I was one of the core developers on Bazaar, and I was feeling pretty happy about our progress, how bzr was developing, features, community etc. There was a bunch of pressure on to succeed in the marketplace, but that was ok, challenges bring out the stubborn in me :). There was one glitch though – we’d been having a bunch of contentious code reviews, and my manager (Martin Pool) was chatting to me about them.
I was – as far as I could tell – doing precisely the right thing from a peer review perspective: I was safeguarding the project, preventing changes that didn’t fit properly, or that reduced key aspects – performance, usability – from landing until they were fixed.
However, the folk on the other side of the review were feeling frustrated, that nothing they could do would fix it, and generally very unhappy. Reviews and design discussions would grind to a halt, and they felt I was the cause. [They were right].
And here was the thing – I simply couldn’t understand the issue. I was doing my job; I wasn’t angry at the people submitting code; I wasn’t hostile; I wasn’t attacking them (but I was being shall we say frank about the work being submitted). I remember saying to Martin one day ‘look, I just don’t get it – can you show me what I said wrong?’ … and he couldn’t.
Canonical has a 360′ review system – every 6 months / year (it changed over time) you review your peers, subordinate(s) and manager(s), and they review you. Imagine my surprise – I was used to getting very positive reports with some constructive suggestions – when I scored low on a bunch of the inter-personal metrics in the review. Martin explained that it was the reviews thing – folk were genuinely unhappy, even as they commended me on my technical merits. Further to that, he said that I really needed to stop worrying about technical improvement and focus on this inter-personal stuff.
Two really important things happened around this time. Firstly, Steve Alexander, who was one of my managers-once-removed at the time, reached out to me and suggested I read a book – Getting out of the box – and that we might have a chat about the issue after I had read it. I did so, and we chatted. That book gave me a language and viewpoint for thinking about the problem. It didn’t solve it, but it meant that I ‘got it’, which I hadn’t before.
So then the second thing happened – we had a company all hands and I got to chat with Claire Davis (head of HR at Canonical at the time) about what was going on. To this day I remember the sheer embarrassment I felt when she told me that the broad perception of me amongst other teams’ managers was – and I paraphrase a longer, more nuanced conversation here – “technically fantastic but very scary to have on the team – will disrupt and cause trouble”.
So, at this point about 6 months had passed, and I knew what I wanted – I wanted folk to want to work with me, to find my presence beneficial and positive on both technical and team aspects. I already knew then that what I seek is technical challenges: I crave novelty, new challenges, new problems. Once things become easy, it can all too easily slip into tedium. So at that time my reasoning was somewhat selfish: how was I to get challenges if no-one wanted to work with me except in extremis?
I spent the next year working on myself as much as specific projects: learning more and more about how to play well with others.
In June 2010 I got a performance review I could be proud of again – I was – in no way – perfect, but I’d made massive strides. This journey had also made huge improvements to my personal life – a lot of stress between Lynne and I had gone away. Shortly after that I was invited to apply for a new role within Canonical as Technical Architect for Launchpad – and Francis Lacoste told me that it was only due to my improved ability to play well with others that I was even considered. I literally could not have done the job 18 months before. I got the job, and I think I did pretty well – in fact I was awarded an internal ‘Spotlight on Success’ award for what we (it was a whole Launchpad team effort) achieved while I was in that role.
So, what did I change/learn? There’s just a couple of key changes I needed to make in myself, but a) they aren’t sticky: if I get overly tired, ye old terrible Robert can leak out, and b) there’s actually a /lot/ of learnable skills in this area, much of which is derived – lots of practice and critical self review is a good thing. The main thing I learnt was that I was Selfish. Yes – capital S. For instance, in a discussion about adding working tree filter to bzr, I would focus on the impact/risk on me-and-things-I-directly-care-about: would it make my life harder, would it make bzr slower, was there anything that could go wrong. And I would spend only a little time thinking about what the proposer needed: they needed support and assistance making their idea reach the standards the bzr community had agreed on. The net effect of my behaviours was that I was a class A asshole when it came to getting proposals into a code base.
The key things I had to change were:
- I need to think about the needs of the person I’m speaking to *and not my own*. [That’s not to say you should ignore your needs, but you shouldn't dwell on them: if they are critical, your brain will prompt you].
- There’s always a reason people do things: if it doesn’t make sense, ask them! [The crucial conversations books have some useful modelling here on how and why people do things, and on how-and-why conversations and confrontations go bad and how to fix them.]
Ok so this is all interesting and so forth, but why the blog post?
Firstly, I want to thank four folk who were particularly instrumental in helping me learn this lesson: Martin, Steve, Claire and of course my wife Lynne – I owe you all an unmeasurable debt for your support and assistance.
Secondly, I realised today that while I’ve apologised one on one to particular folk who I knew I’d made life hard for, I’d never really made a widespread apology. So here it is: I spent many years as an ass, and while I didn’t mean to be one, intent doesn’t actually count here – actions do. I’m sorry for making your life hell in the past, and I hope I’m doing better now.
Lastly, if I’m an ass to you now, I’m sorry, I’m probably regressing to old habits because I’m too tired – something I try to avoid, but it’s not always possible. Please tell me, and I will go get some sleep then come and apologise to you, and try to do better in future.
Filed under: Uncategorized | 9 Comments
Tags: Bazaar, Canonical, Launchpad, ubuntu
I’ve transitioned to a new key – announcement here or below. If you’ve signed my key in the past please consider signing my new key to get it integrated into the web of trust. Thanks!
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1,SHA256

Sun, 2013-10-13

Time for me to migrate to a new key (shockingly late - sorry!).

My old key is set to expire early next year. Please use my new key effective immediately. If you have signed my old key then please sign my key - this message is signed by both keys (and the new key is signed by my old key).

old key:
pub 1024D/FBD3EB8E 2002-07-20
Key fingerprint = 9222 8732 859D 25CC 2560 B617 867B F9A9 FBD3 EB8E

new key:
pub 4096R/AAC0E286 2013-10-13
Key fingerprint = 8244 0CEA B440 83C7 9431 D2CC 298E 9A19 AAC0 E286

The new key is up on the keyservers, so you can just pull it from there.

- -Rob
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)

iEYEARECAAYFAlJZ8FEACgkQhnv5qfvT644WxACfWBoKdVW+YDrMR1H9IY6iJUk8
ZC8AoIMRc55CTXsyn3S7GWCfOR1QONVhiQEcBAEBCAAGBQJSWfBRAAoJEInv1Yjp
ddbfbvgIAKDsvPLQil/94l7A3Y4h4CME95qVT+m9C+/mR642u8gERJ1NhpqGzR8z
fNo8X3TChWyFOaH/rYV+bOyaytC95k13omjR9HmLJPi/l4lnDiy/vopMuJaDrqF4
4IS7DTQsb8dAkCVMb7vgSaAbh+tGmnHphLNnuJngJ2McOs6gCrg3Rb89DzVywFtC
Hu9t6Sv9b0UAgfc66ftqpK71FSo9bLQ4vGrDPsAhJpXb83kOQHLXuwUuWs9vtJ62
Mikb0kzAjlQYPwNx6UNpQaILZ1MYLa3JXjataAsTqcKtbxcyKgLQOrZy55ZYoZO5
+qdZ1+wiD3+usr/GFDUX9KiM/f6N+Xo=
=EVi2
-----END PGP SIGNATURE-----
Filed under: Uncategorized | Leave a Comment
Tags: Debian, gpg, ubuntu
Python 3 recently introduced a nice feature – subtests. When I was putting subunit version 2 together I tried to cater for this via a heuristic approach – permitting the already known requirement that some tests which are reported are not runnable be combined with substring matching to identify subtests.
However that has panned out poorly: when I went to integrate this with testr the code started to get fugly.
So, I’m going to extend the StreamResult API to know about subtests, and issue a subunit protocol bump – to 2.1 – to add a new field for labelling subtest events. My plan is to make this build a recursive tree structure – that is given test “test_foo” with subtest “i=3” which the Python subtest code would identify as “test_foo (i=3)”, they should be identified in StreamResult as test_id “test_foo (i=3)” and parent_test_id “test_foo”. This can then nest arbitrarily deep if test runners decide to do that, and the individual runnability becomes up to the test runner, not testrepository / subunit / StreamResult.
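A rough sketch of how a consumer might rebuild that tree, assuming each event carries a `test_id` and an optional `parent_test_id` (the proposed new field). The `build_tree` and `walk` helpers and the pair layout are hypothetical, invented for illustration; they are not part of subunit or testtools.

```python
from collections import defaultdict


def build_tree(events):
    """Group test ids under their parents.

    `events` is an iterable of (test_id, parent_test_id) pairs, where
    parent_test_id is None for top-level tests. This pair layout is an
    assumption made for illustration only.
    """
    children = defaultdict(list)
    for test_id, parent_test_id in events:
        children[parent_test_id].append(test_id)
    return children


def walk(children, parent=None, depth=0):
    """Yield (depth, test_id) depth-first, so nesting can be arbitrarily deep."""
    for test_id in children.get(parent, []):
        yield depth, test_id
        yield from walk(children, test_id, depth + 1)


events = [
    ("test_foo", None),
    ("test_foo (i=3)", "test_foo"),
    ("test_foo (i=4)", "test_foo"),
]
tree = build_tree(events)
for depth, test_id in walk(tree):
    print("  " * depth + test_id)
```

Whether a given node is individually runnable stays outside this structure, which matches the intent of leaving that decision to the test runner.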
Filed under: Uncategorized | Leave a Comment
Tags: Python, Subunit, testing, testrepository, testtools, upstream
The Rackspace docs describe how to use Rackspace’s custom extensions, but not how to use plain ol’ nova. Using plain nova is important if you want cloud portability in your scripts.
So – for future reference – these are the settings:
Filed under: Uncategorized | Leave a Comment
Subunit V2 is coming along very well.
- I have a complete implementation of the StreamResult API up as a patch for testtools. That’s 2K LOC including comprehensive tests.
- Similarly, I have an implementation of a StreamResult parser and emitter for subunit. That’s 1K new LOC including comprehensive tests, and another 500 lines of churn where I migrate all the subunit filters to v2.
- pdb debugging works through subunit v2: dropping into a debugger now works. Yay.
Remaining things to do:
- Update the other language bindings – the C library in particular.
- Teach testrepository to expect v2 input (and probably still store v1 for a while)
- Teach testrepository to use pipes for the stdin of test runner backends, and some control mechanism to switch input between different backends.
- Discuss the in-Python API with more folk.
- Get code merged :)
Filed under: Uncategorized | 8 Comments
Tags: Python, Subunit, testing, testrepository, testsupport, testtools, unittest
I’ve been hitting the limits of gigabit ethernet at home for quite a while now, and as I spend more time working with cloud technologies this started to frustrate me.
I’d heard of other folk getting good results with second hand Infiniband cards and decided to give it a go myself.
I bought two Voltaire dual-port Infiniband adapters – each a 4X SDR PCI-E x4 card. Add in a 2 metre 8470 cable, and we’re in business.
There are other, more comprehensive guides around to setting this up – e.g. http://davidhunt.ie/wp/?p=2291 or http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html
On ubuntu the hardware was autodetected; all I needed to do was:
    modprobe ib_ipoib
    sudo apt-get install opensm # on one machine
And configure /etc/network/interfaces – e.g.:
    iface ib1 inet static
        address 192.168.2.3
        netmask 255.255.255.0
        network 192.168.2.0
        up echo connected >`find /sys -name mode | grep ib1`
        up echo 65520 >`find /sys -name mtu | grep ib1`
With no further tuning I was able to get 2Gbps doing linear file copies via Samba, which I suspect is rather pushing the limits of my circa 2007 home server – I’ll investigate further to identify where the bottlenecks are, but the networking itself I suspect is ok – netperf got me 6.7Gbps in a trivial test.
Filed under: Uncategorized | 4 Comments
Tags: cloud, infiniband, performance, ubuntu
StreamResult, covered in my last few blog posts, has panned out pretty well.
Until that is, that I sat down to do a serialised version of it. It became fairly clear that the wire protocol can be very simple – just one event type that has a bunch of optional fields – test ids, routing code, file data, mime-type etc. It is up to the recipient at the far end of a stream to derive semantic meaning, which means that encoding a lot of rules (such as a data packet can have either a test status or file data) into the wire protocol isn’t called for.
If the wire protocol doesn’t have those rules, Python parsers that convert a bytestream into StreamResult API calls will have to manually split packets that have both status() and file() data in them… this means it would be impossible to create many legitimate bytestreams via the normal StreamResult API.
That seems to be an unnecessary restriction, and thinking about it, having a very simple ‘here is an event about a test run’ API that carries any information we have and maps down a very simple wire protocol should be about as easy to work with as the current file or status API.
Most combinations of file+status parameters are trivially interpretable, but there is one that had no prior definition – a test_status with no test id specified. Files with no testid are easily considered as ‘global scope’ for their source, so perhaps test_status should be treated the same way? [Feedback in comments or email please]. For now I’m going to leave the meaning undefined and unconstrained.
So I’m preparing a change to my patchset for StreamResult to:
- Drop the file() method altogether.
- Add file_bytes, mime_type and eof parameters to status().
- Make the test_id and test_status parameters to status() optional.
This will make the API trivially serialisable (both to JSON or protobufs or whatever, or to the custom binary format I’m considering for subunit), and equally trivially parsable, which I think is a good thing.
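To make the "trivially serialisable" claim concrete, here is a minimal sketch of a single `status()` call with every field optional, plus a JSON round trip. `SketchStreamResult`, `to_json` and `from_json` are hypothetical names, and the `file_name` parameter is my assumption (the bullets above name only `file_bytes`, `mime_type` and `eof`). This is not the testtools implementation.

```python
import base64
import json


class SketchStreamResult:
    """Hypothetical stand-in for the proposed API; not the testtools class."""

    def __init__(self):
        self.events = []

    def status(self, test_id=None, test_status=None, file_name=None,
               file_bytes=None, mime_type=None, eof=False):
        # One event type: every field is optional and carries what we know.
        self.events.append({
            "test_id": test_id,
            "test_status": test_status,
            "file_name": file_name,  # assumed parameter, see lead-in
            "file_bytes": file_bytes,
            "mime_type": mime_type,
            "eof": eof,
        })


def to_json(event):
    """Serialise one event; bytes are base64-encoded so JSON can carry them."""
    out = dict(event)
    if out["file_bytes"] is not None:
        out["file_bytes"] = base64.b64encode(out["file_bytes"]).decode("ascii")
    return json.dumps(out, sort_keys=True)


def from_json(text):
    """Inverse of to_json."""
    event = json.loads(text)
    if event["file_bytes"] is not None:
        event["file_bytes"] = base64.b64decode(event["file_bytes"])
    return event


result = SketchStreamResult()
result.status(test_id="test_foo", test_status="inprogress")
result.status(test_id="test_foo", file_name="log", file_bytes=b"hello",
              mime_type="text/plain", eof=True)
round_tripped = [from_json(to_json(event)) for event in result.events]
print(round_tripped == result.events)
```

Because there is only one event shape, the parser never has to split a packet into separate calls, which is the restriction the change is meant to remove.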
Filed under: Uncategorized | Leave a Comment
Tags: Python, Subunit, testing, testsupport, testtools, unittest
My last two blog posts were largely about the needs of subunit, but a key result of any protocol is how easy working with it in a high level language is.
In the weekend and evenings I’ve done an implementation of a new set of classes – StreamResult and friends – that provides:
- Adaption to and from the existing TestResult APIs (the 2.6 and below API, 2.7 API, and the testtools extended API).
- Multiplexing multiple streams together.
- Adding timing data to a stream if it is absent.
- Summarising a stream.
- Copying a stream to multiple outputs
- A split out API for instructing a test run to stop.
- A simple test-at-a-time stream processor that makes it easy to just deal with tests rather than the innate complexities of an event based interface.
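As an illustration of why these pieces compose easily, here is a small sketch of two of them: copying a stream to multiple outputs, and adding timing data when it is absent. The class names and the single keyword-only `status()` method are simplifications I have assumed for the example; the real classes live in the testtools patch and may look different.

```python
import datetime


class ListResult:
    """Records every event it is given (hypothetical helper)."""

    def __init__(self):
        self.events = []

    def status(self, **event):
        self.events.append(event)


class CopyResult:
    """Forward each event to several targets."""

    def __init__(self, targets):
        self.targets = list(targets)

    def status(self, **event):
        for target in self.targets:
            target.status(**event)


class TimestampingResult:
    """Add a timestamp to events that lack one, then forward them."""

    def __init__(self, target, clock=None):
        self.target = target
        # Default clock is UTC wall time; a fixed clock can be injected.
        self.clock = clock or (
            lambda: datetime.datetime.now(datetime.timezone.utc))

    def status(self, timestamp=None, **event):
        if timestamp is None:
            timestamp = self.clock()
        self.target.status(timestamp=timestamp, **event)


log_a, log_b = ListResult(), ListResult()
stream = TimestampingResult(CopyResult([log_a, log_b]))
stream.status(test_id="test_foo", test_status="success")
print(log_a.events == log_b.events)
```

Each wrapper only needs to understand the one event call, so stacking them is just object composition.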
So far the code has been uniformly simple to write. I started with an API that included an ‘estimate’ function, which I’ve since removed – I don’t believe the complexity is justified; enumeration is not significantly more expensive than counting, and runners that want to be efficient can either not enumerate or remember the enumeration from prior runs.
The documentation in the linked pull request is a good place to start to get a handle on the API; I’d love feedback.
Next steps for me are to do a subunit protocol revision that maps to the Python API, both parser and generator and see how it feels. One wrinkle there is that the reason for doing this is to fix intrinsic limits in the existing protocol – so doing forward and backward wire protocol compatibility would defeat the point. However… we can make the output side explicitly choose a protocol version, and if we can autodetect the protocol version in the parser, even if we cannot handle mixed streams we can get the benefits of the new protocol once data has been detected. That said, I think we can start without autodetection during prototyping, and add it later. Without autodetection, programs like TestRepository will need configuration options to control what protocol variant to expect. This could be done by requiring this new protocol and providing a stream filter that can be deployed when needed.
Filed under: Uncategorized | Leave a Comment
Tags: performance, Python, Subunit, testing, testrepository, testsupport, testtools, unittest
Of course, as happens sadly often, the scope creeps..
Additional pain points
Zope’s test runner runs things that are not tests, but which users want to know about – ‘layers’. At the moment these are reported as individual tests, but this is problematic in a couple of ways. Firstly, the same ‘test’ runs on multiple backend runners, so timing and stats get more complex. Secondly, if a layer fails to setup or teardown, tools like testrepository that have watched the stream will think a test failed, and on the next run try to explicitly run that ‘test’ – but that test doesn’t really exist, so it won’t run [unless an actual test that needs the layer is being run].
Openstack uses python coverage to gather coverage statistics during test runs. Each worker running tests needs to gather and return such statistics. The current subunit protocol has no space to hand this around, without it pretending to be a test [see a pattern here?]. And that has the same negative side effect – test runners like testrepository will try to run that ‘test’. While testrepository doesn’t want to know about coverage itself, it would be nice to be able to pass everything around and have a local hook handle the aggregation of that data.
The way TAP is reflected into subunit today is to mangle each tap ‘test’ into a subunit ‘test’, but for full benefits subunit tests have a higher bar – they are individually addressable and runnable. So a TAP test script is much more equivalent to a subunit test. A similar concept is landing in Python’s unittest soon – ‘subtests’ – which will give very lightweight additional assertions within a larger test concept. Many C test runners that emit individual tests as simple assertions have this property as well – there may be 5 or 10 executables each with dozens of assertions, but only the executables are individually addressable – there is no way to run just one assertion from an executable as a ‘test’. It would be nice to avoid the friction that currently exists when dealing with that situation.
Minimum requirements to support these
Layers can be supported via timestamped stdout output, or fake tests. Neither is compelling, as the former requires special casing in subunit processors to data mine it, and the latter confuses test runners. A way to record something that is structured like a test (has an id – the layer, an outcome – in progress / ok / failed, and attachment data for showing failure details) but isn’t a test would allow the data to flow around without causing confusion in the system.
TAP support could change to just show the entire output as progress on one test and then fail or not at the end. This would result in a cognitive mismatch for folk from the TAP world, as TAP runners report each assertion as a ‘test’, and this would be hidden from subunit. Having a way to record something that is associated with an actual test, and has a name, status, attachment content for the TAP comments field – that would let subunit processors report both the addressable tests (each TAP script) and the individual items, but know that only the overall scripts are runnable.
Python subtests could use a unique test for each subtest, but that has the same issue as layers. Python will ensure a top level test errors if a subtest errors, so strictly speaking we probably don’t need an associated-with concept, but we do need to be able to say that a test-like thing happened that isn’t actually addressable.
Coverage information could be about a single test, or even a subtest, or it could be about the entire work undertaken by the test process. I don’t think we need a single standardised format for Coverage data (though that might be an excellent project for someone to undertake). It is also possible to overthink things :). We have the idea of arbitrary attachments for tests. Perhaps arbitrary attachments outside of test scope would be better than specifying stdout/stderr as specific things. On the other hand stdout and stderr are well known things.
Proposal version 2
A packetised length prefixed binary protocol, with each packet containing a small signature, length, routing code, a binary timestamp in UTC, a set of UTF8 tags (active only, no negative tags), a content tag – one of (estimate + number, stdin, stdout, stderr, file, test), test-id, runnable, test-status (one of exists/inprogress/xfail/xsuccess/success/fail/skip), an attachment name, mime type, a last-block marker and a block of bytes.
The stdin/stdout/stderr content tags are gone, replaced with file. The names stdin, stdout, stderr can be placed in the attachment name field to signal those well known files, and any other files that the test process wants to hand over can be simply embedded. Processors that don’t expect them can just pass them on.
Runnable is a boolean, indicating whether this packet is describing a test that can be executed deliberately (vs an individual TAP assertion, Python sub-test etc). This permits describing things like zope layers which are top level test-like things (they start, stop and can error) though they cannot be run.. and it doesn’t explicitly model the setup/teardown aspect that they have. Should we do that?
Testid is for identifying tests. With the runnable flag to indicate whether a test really is a test, subtests can just be namespaced by the generator – reporters can choose whether to be naive and report every ‘test’, or whether to use simple string prefix-with-non-character-separator to infer child elements.
Impact on Python API
If we change the API to:
    class TestInfo(object):
        # The comments give each attribute's type; the values are placeholders.
        id = None        # unicode
        status = None    # one of: 'exists', 'inprogress', 'xfail', 'xsuccess',
                         # 'success', 'fail', 'error', 'skip'
        runnable = True  # boolean

    class StreamingResult(object):
        def startTestRun(self):
            pass

        def stopTestRun(self):
            pass

        def estimate(self, count, route_code=None, timestamp=None):
            pass

        def file(self, name, bytes, eof=False, mime=None, test_info=None,
                 route_code=None, timestamp=None):
            """Inform the result about the contents of an attachment."""

        def status(self, test_info, route_code=None, timestamp=None):
            """Inform the result about a test status with no attached data."""
This would permit the full semantics of a subunit stream to be represented I think, while being a narrow interface that should be easy to implement.
Please provide feedback! I’ll probably start implementing this soon.
Filed under: Uncategorized | 1 Comment
Tags: coverage, junit, Python, Subunit, TAP, testing, testrepository, unittest
Subunit is seven and a half years old now – Conrad Parker and I first sketched it up at a CodeCon – camping and coding, a brilliant combination – in mid 2005.
    revno: 1
    committer: Robert Collins <firstname.lastname@example.org>
    timestamp: Sat 2005-08-27 15:01:20 +1000
    message:
      design up a protocol with kfish
It has proved remarkably resilient as a protocol – the basic nature hasn’t changed at all, even though we’ve added tags, timestamps, support for attachments of arbitrary size.
However a growing number of irritations have been building up with it. I think it is time to design another iteration of the protocol, one that will retain the positive qualities of the current protocol, while helping it become suitable for the next 7 years. Ideally we can keep compatibility and make it possible for a single stream to be represented in any format.
The existing design is a mostly human readable line orientated protocol that can be sniffed out from the regular output of ‘make’ or other build systems. Binary attachments are done using HTTP chunking, and the parser has to maintain state about the current test, tags, timing data and test progression [a simple stack of progress counters]. How to arrange subunit output is undefined, how to select tests to run is undefined.
This makes writing a parser quite easy, and the tagging and timestamp facility allow multiplexing streams from two or more concurrent test runs into one with good fidelity – but also requires that state be buffered until the end of a test, as two tests cannot be executing at once.
Dealing with debuggers
The initial protocol was intended to support dropping into a debugger – just pass each line read through to stdout, and connect stdin to the test process, and voila, you have a working debugger connection. This works, but the current line based parsers make using it tedious – the line buffered nature of it makes feedback on what has been typed fiddly, and stdout tends to be buffered, leading to an inability to see print statements and the like. All in-principle fixable, right ?
When running two or more test processes, which test process should stdin be connected to? What if two or more drop into a debugger at once? What is being typed to which process is more luck than anything else.
We’ve added some idioms in testrepository that control test execution by a similar but different format – one test per line to list tests, and have runners permit listing and selecting by a list. This works well, but the inconsistency with subunit itself is a little annoying – you need two parsers, and two output formats.
The current protocol is extremely easy to implement for emitters, and the arbitrary attachments and tagging features have worked extremely well. There is a comprehensive Python parser which maps everything into Python unittest API calls (an extended version of the standard, with good backwards compatibility).
The debugging support was a total failure, and the way the parser depraminates its toys when a test process corrupts an outcome line is extremely frustrating (other tests execute, but the parser sees them as non-subunit chatter and passes the lines on through stdout).
Dealing with concurrency
The original design didn’t cater for concurrency. There are two concurrency issues – the corruption issue (see below for more detail) and multiplexing. Consider two levels of nested concurrency: A supervisor process such as testrepository starts 2 (or more but 2 is sufficient to reason about the issue) subsidiary worker processes (I1 and I2), each of which starts 2 subsidiary processes of their own (W1, W2, W3, W4). Each of the 4 leaf processes is outputting subunit which gets multiplexed in the 2 intermediary processes, and then again in the supervisor. Why would there be two layers? A concrete example is using testrepository to coordinate test runs on multiple machines at once, with each machine running a local testrepository to broker tests amongst the local CPUs. This could be done with 4 separate ssh sessions and no intermediaries, but that only removes a fraction of the issues. What issues?
Well, consider some stdout chatter that W1 outputs. That will get passed to I1 and from there to the supervisor and captured. But there is nothing marking the chatter as belonging to W1: there is no way to tell where it came from. If W1 happened to fail, and there was a diagnostic message printed, we’ve lost information. Or at best muddled it all up.
Secondly, buffering – imagine that a test on W1 hangs. I1 will know that W1 is running a test, but has no way to tell the supervisor (and thus the user) that this is the case, without writing to stdout [and causing a *lot* of noise if that happens a lot]. We could have I1 write to stdout only if W1’s test is taking more than 5 seconds or something – but this is a workaround for a limitation of the protocol. Adding to the confusion, the clock on W1 and W3 may be very skewed, so timestamps for everything have to be carefully synchronised by the multiplexer.
Thirdly, scheduling – if W1/W2 are on a faster machine than W3/W4 then a partition of equal-timed tests onto each machine will leave one idle before the other finishes. It would be nice to be able to pass tests to run to the faster machine when it goes idle, rather than having to start a new runner each time.
Lastly, what to do when W1 and W2 both wait for user input on stdin (e.g. passphrases, debugger input, $other). Naively connecting stdin to all processes doesn’t work well. A GUI supervisor could connect a separate fd to each of I1 and I2, but that doesn’t help when it is W1 and W2 reading from stdin.
So additional requirements over baseline subunit:
- make it possible for stdout and stderr output to be captured from W1 and routed through I1 to the supervisor without losing its origin. It might be chatter from a noisy test, or it might be build output. Either way, the user probably will benefit if we can capture it and show it to them later when they review the test run. The supervisor should probably show it immediately as well – the protocol doesn’t need to care about that, just make it possible.
- make it possible to pass information about tests that have not completed through one subunit stream while they are still incomplete.
- make it possible (but optional) to pass tests to run to a running process that accepts subunit.
- make it possible to route stdin to a specific currently running process like W1. This and point 3 suggest that we need a bidirectional protocol rather than the solely unidirectional protocol we have today. I don’t know of a reliable portable way to tell when some process is seeking such input, so that will be up to the user I think. (e.g. printing (pdb) to stdout might be a sufficiently good indicator.)
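One way to picture the first requirement (keeping the origin of output as it passes W1 -> I1 -> supervisor) is for each hop to label what it forwards. This is only a sketch under assumptions: the dict layout, the `wrap_chatter` and `forward` helpers, and the "/" separator are all invented here for illustration.

```python
def wrap_chatter(origin, data):
    """Wrap raw stdout bytes from a worker in an event recording its origin."""
    return {"route_code": origin, "stream": "stdout", "bytes": data}


def forward(event, hop):
    """Prepend this hop's label so the full path survives multiplexing."""
    routed = dict(event)
    routed["route_code"] = hop + "/" + event["route_code"]
    return routed


# W1 prints a diagnostic; I1 wraps it, then labels it before passing it up.
event = forward(wrap_chatter("W1", b"starting server\n"), "I1")
print(event["route_code"])
```

The same label read in the other direction would tell the supervisor which process a chunk of stdin should be delivered to.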
Dealing with corruption
Consider the following subunit fragment:
    test: foo
    starting serversuccess:foo
This is a classic example of corruption: the test ‘foo’ started a server and helpfully wrote to stdout explaining that it did that, but missed the newline. As a result the success message for the test wasn’t printed on a line of its own, and the subunit parser will believe that foo never completed. Every subsequent test is then ignored. This is usually easy to identify and fix, but it’s a head-scratcher when it happens. Another way it can happen is when a build tool like ‘make’ runs tests in parallel, and they output subunit onto the same stdout file handle. A third way is when a build tool like make runs two separate test scripts serially, and the first one starts a test but errors hard and doesn’t finish it. That looks like:
    test: foo
    test: bar
    success: bar
One way that this sort of corruption can be mitigated is to put subunit on its own file descriptor, but this has several caveats: it is harder to tunnel through things like ssh and it doesn’t solve the failing test script case.
I think it is unreasonable to require a protocol where arbitrary interleaving of bytes between different test runner streams will work – so the ‘make -j2’ case can be ignored at the wire level – though we should create a simple way to safely mux the output from such tests when they execute.
The root of the issue is that a dropped update leaves bad state in the parser and it never recovers. So some way to recover, or less state to carry in the parser, would neatly solve things. I favour reducing parser state as that should shift stateful complexity onto end nodes / complex processors, rather than being carried by every node in the transmission path.
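A toy parser makes the failure mode easy to see. This is deliberately not the real subunit parser: `naive_parse` is a hypothetical, stripped-down stand-in that keeps a single "current test" and treats everything it does not recognise as chatter.

```python
def naive_parse(lines):
    """Toy line-oriented parser with one piece of state: the current test.

    Returns (completed tests, chatter lines, test still open at the end).
    """
    current = None
    completed = []
    chatter = []
    for line in lines:
        if current is None and line.startswith("test: "):
            current = line[len("test: "):]
        elif (current is not None and line.startswith("success:")
              and line[len("success:"):].strip() == current):
            completed.append(current)
            current = None
        else:
            # Anything else, including other tests' lines, is passed through.
            chatter.append(line)
    return completed, chatter, current


clean = ["test: foo", "starting server", "success: foo",
         "test: bar", "success: bar"]
corrupt = ["test: foo", "starting serversuccess:foo",
           "test: bar", "success: bar"]

print(naive_parse(clean))
print(naive_parse(corrupt))
```

In the corrupted stream the parser never leaves `foo`, so `bar` is reported as chatter even though it ran and passed. That stuck state is what a lower-state protocol is meant to avoid.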
Various suggestions have been made – JSON, Protobufs, etc…
A key design goal of the first subunit was a low barrier to entry. We keep that by being backward compatible, but being easy to work with for the new revision is also a worthy goal.
High level proposal
A packetised length prefixed binary protocol, with each packet containing a small signature, length, routing code, a binary timestamp in UTC, a set of UTF8 tags (active only, no negative tags), a content tag – one of (estimate + number, stdin, stdout, stderr, test- + test id), test status (one of exists/inprogress/xfail/xsuccess/success/fail/skip), an attachment name, mime type, a last-block marker and a block of bytes.
The content tags:
- estimate – the stream is reporting how many tests are expected to run. It affects everything with the same routing code only, and replaces (doesn’t adjust) any current estimate for that routing code. An estimate packet of 0 can be used to say that a routing target has shut down and cannot run more tests. Routing codes can be used by a subunit aware runner to separate out separate threads in a single process, or even just separate ‘TestSuite’ objects within a single test run (though doing so means that they will need to process subunit and strip packets on stdin). This supersedes the stack of progress indicators that current subunit has. Estimates cannot have test status or attachments.
- stdin/stdout/stderr: a packet of data for one of these streams. The routing code identifies the test process that the data came from/should go to in the tree of test workers. These packets cannot have test status but should have a non-empty attachment block.
- test- + testid: a packet of data for a single test. test status may be included, as may attachment name, mime type, last-block and binary data.
Test status values are pretty ordinary. Exists is used to indicate a test that can be run when listing tests, and inprogress is used to report a test that has started but not necessarily completed.
Attachment names must be unique per routing code + testid.
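As a rough illustration of what "packetised and length prefixed" could mean in practice, here is a minimal sketch. The byte layout is entirely an assumption for the example (the 4-byte signature value, the field order and the field widths are all hypothetical, and tags, attachment name and mime type are left out to keep it short); the proposal above does not fix any of these details.

```python
import struct

# All of these constants are hypothetical choices for the sketch.
SIGNATURE = b"\xb3SU2"
HEADER = struct.Struct(">4sI")  # signature + total packet length in bytes
STATUSES = ("", "exists", "inprogress", "xfail", "xsuccess",
            "success", "fail", "skip")


def _short(data):
    """Prefix a byte string of at most 255 bytes with its one-byte length."""
    if len(data) > 255:
        raise ValueError("field too long for this sketch")
    return struct.pack(">B", len(data)) + data


def encode_packet(content, status="", route=b"", timestamp_us=0,
                  last_block=True, body=b""):
    """Serialise one packet; content is e.g. 'stdout' or 'test-foo'."""
    payload = (
        _short(route)
        + struct.pack(">Q", timestamp_us)  # microseconds since the epoch, UTC
        + _short(content.encode("utf-8"))
        + struct.pack(">BB", STATUSES.index(status), int(last_block))
        + body
    )
    return HEADER.pack(SIGNATURE, HEADER.size + len(payload)) + payload


def decode_packet(data):
    """Parse one packet from the front of data; return (fields, remainder)."""
    signature, length = HEADER.unpack_from(data, 0)
    if signature != SIGNATURE or len(data) < length:
        raise ValueError("not a complete packet")
    pos = HEADER.size
    route_len = data[pos]
    route = data[pos + 1:pos + 1 + route_len]
    pos += 1 + route_len
    (timestamp_us,) = struct.unpack_from(">Q", data, pos)
    pos += 8
    content_len = data[pos]
    content = data[pos + 1:pos + 1 + content_len].decode("utf-8")
    pos += 1 + content_len
    status_index, last_block = struct.unpack_from(">BB", data, pos)
    pos += 2
    fields = {
        "route": route,
        "timestamp_us": timestamp_us,
        "content": content,
        "status": STATUSES[status_index],
        "last_block": bool(last_block),
        "body": data[pos:length],
    }
    return fields, data[length:]
```

Because the total length sits right after the signature, a node that only forwards packets can skip over a packet without understanding any of the fields after the header.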
So how does this line up?
Interleaving and recovery
We could dispense with interleaving and say the streams are wholly binary, or we can say that packets can start either after a \n or directly after another packet. If we say that binary-only is the approach to take, it would be possible to write a filter that would apply the newline heuristic (or even look for packet headers at every byte offset). I think mandating adjacent to a packet or after \n is a small concession to make and will avoid tools like testrepository forcing users to always configure a heuristic filter. Non-subunit content can be wrapped in subunit for forwarding (the I1 in W1->I1->Supervisor chain would do the wrapping). This won’t eliminate corruption but it will localise it and permit the stream to recover: the test that was corrupted will show up as incomplete, or with incomplete attachment data.
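The "after a \n or directly after another packet" rule can be sketched as a small splitter. This is only an illustration under assumed details: the signature bytes, the 8-byte header and the helper names `make_packet` and `split_stream` are hypothetical, and the packet body is treated as opaque.

```python
import struct

SIGNATURE = b"\xb3SU2"          # hypothetical marker bytes
HEADER = struct.Struct(">4sI")  # signature + total packet length


def make_packet(body):
    """Wrap an opaque body in the sketch's header."""
    return HEADER.pack(SIGNATURE, HEADER.size + len(body)) + body


def split_stream(data):
    """Split mixed bytes into ("packet", body) and ("text", line) items.

    A packet is only looked for at the start of the stream, directly
    after another packet, or directly after a newline.
    """
    items = []
    pos = 0
    while pos < len(data):
        if data.startswith(SIGNATURE, pos) and pos + HEADER.size <= len(data):
            _, length = HEADER.unpack_from(data, pos)
            end = pos + length
            if length >= HEADER.size and end <= len(data):
                items.append(("packet", data[pos + HEADER.size:end]))
                pos = end
                continue
        # Not a packet here: consume plain text up to and including the
        # next newline, after which packets may start again.
        newline = data.find(b"\n", pos)
        end = len(data) if newline == -1 else newline + 1
        items.append(("text", data[pos:end]))
        pos = end
    return items


stream = (make_packet(b"A") + b"starting server" + make_packet(b"B")
          + b"\n" + make_packet(b"C"))
for kind, payload in split_stream(stream):
    print(kind, payload)
```

In this example packet B is glued onto unterminated text and is lost as part of that text line, but the damage stops at the next newline and packet C is parsed normally, which is the localised corruption described above.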
Test listing would emit many small non-timestamped packets. It may be useful to have a wrapper packet for bulk amounts of fine grained data like listing is, or for multiplexers with many input streams that will often have multiple data packets available to write at once.
Selecting tests to run
Same as for listing – while passing regexes down to the test runner to select groups of tests is a valid use case, that’s not something subunit needs to worry about: if the selection is not the result of the supervisor selecting by test id, then it is known at the start of the test run and can just be a command line parameter to the backend: subunit is relevant for passing instructions to a runner mid-execution. Because the supervisor cannot just hand out some tests and wait for the thing it ran to report that it can accept incremental tests on stdin, supervisor processes will need to be informed about that out of band.
Debugging is straightforward. The parser can read the first 4 or so bytes of a packet one at a time to determine if it is a packet or a line of stdout/stderr, and then either read to end of line, or the binary length of the packet. So, we combine a few things; non-subunit output should be wrapped and presented to the user. Subunit that is being multiplexed and forwarded should prepend a routing code to the packet (e.g. I1 would append ‘1’ or ‘2’ to select which of W1/W2 the content came from, and then forward the packet. S would append ‘1’ or ‘2’ to indicate I1/I2 – the routing code is a path through the tree of forwarding processes). The UI the user is using needs to supply some means to switch where stdin is attached. And stdin input should be routed via stdin packets. When there is no routing code left, the packet should be entirely unwrapped and presented as raw bytes to the process in question.
Forwarding multiplexed packets is also very straightforward – unwrap the outer layer of the packet, add or adjust the routing code, serialise a header + adjusted length + rest of packet as-is. No buffering is needed, so the supervisor can show in-progress tests (and how long they have been running for).
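A small sketch of that routing step, under assumptions of my own: the header layout (signature, total length, one-byte routing code length) and the function names are hypothetical, each hop is a single byte, and a forwarder prepends its hop so the routing code reads from the supervisor down to the worker.

```python
import struct

SIGNATURE = b"\xb3SU2"           # hypothetical marker bytes
HEADER = struct.Struct(">4sIB")  # signature, total length, routing code length


def make_packet(route, rest):
    """Serialise header + routing code + the rest of the packet as-is."""
    total = HEADER.size + len(route) + len(rest)
    return HEADER.pack(SIGNATURE, total, len(route)) + route + rest


def split_packet(packet):
    """Return (route, rest) for one complete packet."""
    signature, length, route_len = HEADER.unpack_from(packet)
    if signature != SIGNATURE or length != len(packet):
        raise ValueError("not exactly one packet")
    route_end = HEADER.size + route_len
    return packet[HEADER.size:route_end], packet[route_end:]


def forward_up(packet, hop):
    """Towards the supervisor: record which child the packet came from."""
    route, rest = split_packet(packet)
    return make_packet(hop + route, rest)


def forward_down(packet):
    """Towards a worker: strip one hop and say which child gets the data.

    Returns (hop, data). When no routing code is left after stripping,
    data is the unwrapped raw bytes for the process itself.
    """
    route, rest = split_packet(packet)
    hop, remaining = route[:1], route[1:]
    if remaining:
        return hop, make_packet(remaining, rest)
    return hop, rest


# W1 -> I1 -> S: I1 knows the packet came from child 1, S from child 2.
from_worker = make_packet(b"", b"(Pdb) ")
at_supervisor = forward_up(forward_up(from_worker, b"1"), b"2")
print(split_packet(at_supervisor))
```

Only the header and routing code are rewritten; the rest of the packet is copied untouched, so a forwarder needs no knowledge of the payload and no buffering beyond one packet.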
Parsing / representation in Python or other languages
The parser should be very simple to write. Once parsed, this will be fundamentally different to the existing Python TestCase->TestResult API that is in use today. However it should be easy to write two adapters: old-style <-> this new-style. old-style -> new-style is useful for running existing test suites and generating subunit, because that way the subunit generation is transparent. new-style->old-style is useful for using existing test reporting facilities (such as junitxml or html TestResult objects) with subunit streams.
Importantly though, a new TestResult style that supports the features of this protocol would enable some useful features for regular Python test suites:
- Concurrent tests (e.g. in multiprocessing) wouldn’t need multiplexers and special adapters – a regular single testresult with a simple mutex around it would be able to handle concurrent execution of tests, and show hung tests etc.
- The routing of input to a particular debugger instance also applies to a simple python process running tests via multiprocessing, so the routing feature would help there.
- The listing facility and incrementally running tests would be useful too I think – we could go to running tests concurrently with test collection happening, but this would apply to other parts of unittest than just the TestResult.
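The first point, a single result object with a mutex around it, can be sketched with threads standing in for concurrent test execution. `ConcurrentResult` and `run_fake_test` are hypothetical names, and the "tests" here do nothing except report status.

```python
import threading


class ConcurrentResult:
    """One shared result object; a lock serialises the updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def test(self, test_id, status, route_code=None):
        with self._lock:
            self._status[test_id] = status

    def in_progress(self):
        """Tests that reported a start but no final status (e.g. hung)."""
        with self._lock:
            return sorted(test_id for test_id, status in self._status.items()
                          if status == "inprogress")


def run_fake_test(result, test_id, finishes):
    """Stand-in for running a test: report start, and maybe completion."""
    result.test(test_id, "inprogress")
    if finishes:
        result.test(test_id, "success")


result = ConcurrentResult()
threads = [
    threading.Thread(target=run_fake_test,
                     args=(result, "test_%d" % i, i != 3))
    for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(result.in_progress())
```

Because every event carries its own test id, the result never needs a notion of "the current test", so interleaved calls from several threads are harmless and a test that never finishes simply stays listed as in progress.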
The API might be something like:
class StreamingResult(object):

    def startTestRun(self):
        pass

    def stopTestRun(self):
        pass

    def estimate(self, count, route_code=None):
        pass

    def stdin(self, bytes, route_code=None):
        pass

    def stdout(self, bytes, route_code=None):
        pass

    def test(self, test_id, status, attachment_name=None,
             attachment_mime=None, attachment_eof=None,
             attachment_bytes=None):
        pass
This would support just-in-time debugging by wiring up pdb to the stdin/stdout handlers of the result object, rather than actual stdin/stdout of the process – a simple matter once written. Alternatively, the test runner could replace sys.stdin/stdout etc with thunk file-like objects, which might be a good idea anyway to capture spurious output happening during a test run. That would permit pdb to Just Work (even if the test process is internally running concurrent tests.. until it has two pdb objects running concurrently :)
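The thunk idea for stdout could be sketched like this. `RecordingResult` is a hypothetical stand-in for a result object with a stdout handler, and `StdoutThunk` is an illustrative name; a real runner would also need matching thunks for stdin and stderr.

```python
import contextlib
import io
import sys


class RecordingResult:
    """Hypothetical stand-in for a result with a stdout() handler."""

    def __init__(self):
        self.events = []

    def stdout(self, data, route_code=None):
        self.events.append(("stdout", data, route_code))


class StdoutThunk(io.TextIOBase):
    """File-like object that forwards writes to result.stdout()."""

    def __init__(self, result, route_code=None):
        super().__init__()
        self._result = result
        self._route_code = route_code

    def writable(self):
        return True

    def write(self, text):
        if text:  # skip empty writes so they do not become empty events
            self._result.stdout(text.encode("utf-8"),
                                route_code=self._route_code)
        return len(text)


result = RecordingResult()
with contextlib.redirect_stdout(StdoutThunk(result)):
    # Spurious output from a test, with no trailing newline.
    sys.stdout.write("starting server")

print(result.events)
```

The stray write never reaches the real stdout, so it cannot corrupt the stream; it arrives at the result object as an ordinary stdout event instead.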
Generating new streams
Should be very easy in anything except shell. For shell, we can have a command line tool that when invoked outputs a subunit stream for one instruction. E.g. ‘test foo completed + some attachments’ or ‘test foo starting’.