subunit version 2 progress

Subunit V2 is coming along very well.

Current status:

  • I have a complete implementation of the StreamResult API up as a patch for testtools. That's 2K LOC including comprehensive tests.
  • Similarly, I have an implementation of a StreamResult parser and emitter for subunit. That's 1K new LOC including comprehensive tests, and another 500 lines of churn where I migrate all the subunit filters to v2.
  • pdb debugging works through subunit v2, so dropping into a debugger just works. Yay.

Remaining things to do:

  • Update the other language bindings – the C library in particular.
  • Teach testrepository to expect v2 input (and probably still store v1 for a while)
  • Teach testrepository to use pipes for the stdin of test runner backends, and some control mechanism to switch input between different backends.
  • Discuss the in-Python API with more folk.
  • Get code merged :)

Simpler is better – a single event type for StreamResult

StreamResult, covered in my last few blog posts, has panned out pretty well.

Until, that is, I sat down to do a serialised version of it. It became fairly clear that the wire protocol can be very simple – just one event type that has a bunch of optional fields – test id, routing code, file data, mime-type etc. It is up to the recipient at the far end of a stream to derive semantic meaning, which means that encoding a lot of rules (such as ‘a data packet may carry either a test status or file data, but not both’) into the wire protocol isn’t called for.

If the wire protocol doesn’t have those rules, Python parsers that convert a bytestream into StreamResult API calls would have to manually split packets that carry both status() and file() data… and, conversely, many legitimate bytestreams would be impossible to create via the normal StreamResult API.

That seems to be an unnecessary restriction, and thinking about it, having a very simple ‘here is an event about a test run’ API that carries any information we have and maps down to a very simple wire protocol should be about as easy to work with as the current file and status APIs.

Most combinations of file+status parameters are trivially interpretable, but there is one that had no prior definition – a test_status with no test id specified. Files with no test id are easily considered as ‘global scope’ for their source, so perhaps test_status should be treated the same way? [Feedback in comments or email please]. For now I’m going to leave the meaning undefined and unconstrained.
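
To make the splitting problem concrete, here is an illustration of a single event carrying both a status and attached file data. The dict is just a stand-in for ‘one event type with optional fields’, not the real packet layout, and the split-API calls in the comment use guessed signatures:

# Illustrative only: a dict standing in for one wire event with optional
# fields; this is not the actual subunit v2 packet layout.
event = {
    "test_id": "myproject.tests.test_foo.TestFoo.test_bar",
    "test_status": "fail",
    "route_code": "0",
    "file_name": "traceback",
    "mime_type": "text/plain;charset=utf8",
    "file_bytes": b"Traceback (most recent call last): ...",
    "eof": True,
}

# With separate status() and file() methods a parser has to split this one
# event into two calls, roughly:
#   result.status(event["test_id"], event["test_status"])
#   result.file(event["file_name"], event["file_bytes"],
#               mime_type=event["mime_type"], eof=event["eof"])
# and there is no way to produce the combined event through that split API.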

So I’m preparing a change to my patchset for StreamResult to:

  • Drop the file() method altogether.
  • Add file_bytes, mime_type and eof parameters to status().
  • Make the test_id and test_status parameters to status() optional.

This will make the API trivially serialisable (to JSON or protobufs or whatever, or to the custom binary format I’m considering for subunit), and equally trivially parsable, which I think is a good thing.
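
As a rough sketch of why this serialises so easily – the parameter names here are my guesses at the shape of the revised call, not the final API – an event is just the keyword arguments, so dumping it to JSON (or packing it into a binary frame) is mechanical:

import base64
import json


class SketchStreamResult(object):
    """Toy sink that records each event as JSON; not the real testtools API."""

    def __init__(self):
        self.events = []

    def status(self, test_id=None, test_status=None, file_name=None,
               file_bytes=None, mime_type=None, eof=False, route_code=None):
        event = {
            "test_id": test_id,
            "test_status": test_status,
            "file_name": file_name,
            "mime_type": mime_type,
            "eof": eof,
            "route_code": route_code,
            # JSON needs an encoding for binary payloads; a binary wire
            # format would carry file_bytes directly.
            "file_bytes": (base64.b64encode(file_bytes).decode("ascii")
                           if file_bytes else None),
        }
        self.events.append(json.dumps(event))


result = SketchStreamResult()
result.status(test_id="myproject.tests.test_foo", test_status="success")
result.status(file_name="stdout", file_bytes=b"hello\n", eof=True,
              mime_type="text/plain;charset=utf8")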

First experience implementing StreamResult

My last two blog posts were largely about the needs of subunit, but a key test of any protocol is how easy it is to work with from a high-level language.

Over the weekend and evenings I’ve done an implementation of a new set of classes – StreamResult and friends – that provides the following (there is a small sketch of how the pieces compose after this list):

  • Adaptation to and from the existing TestResult APIs (the 2.6 and below API, the 2.7 API, and the testtools extended API).
  • Multiplexing multiple streams together.
  • Adding timing data to a stream if it is absent.
  • Summarising a stream.
  • Copying a stream to multiple outputs.
  • A split out API for instructing a test run to stop.
  • A simple test-at-a-time stream processor that makes it easy to just deal with tests rather than the innate complexities of an event based interface.
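
To give a feel for the shape of this, here is a purely illustrative sketch of three of those pieces – timestamp filling, copying to multiple outputs, and a logging sink – composed together. The class names are placeholders I invented for this post, not necessarily the names in the patch:

from datetime import datetime


class LoggingStreamResult(object):
    """Records every event it receives (stands in for a real output)."""

    def __init__(self):
        self.events = []

    def status(self, **event):
        self.events.append(event)


class TimestampFillingFilter(object):
    """Adds a timestamp to events that lack one, then forwards them."""

    def __init__(self, target):
        self.target = target

    def status(self, **event):
        event.setdefault("timestamp", datetime.utcnow())
        self.target.status(**event)


class CopyingStreamResult(object):
    """Copies every event to several targets."""

    def __init__(self, targets):
        self.targets = targets

    def status(self, **event):
        for target in self.targets:
            target.status(**event)


log1, log2 = LoggingStreamResult(), LoggingStreamResult()
sink = TimestampFillingFilter(CopyingStreamResult([log1, log2]))
sink.status(test_id="t1", test_status="inprogress")
sink.status(test_id="t1", test_status="success")
assert len(log1.events) == len(log2.events) == 2

The point is just that everything speaks the same single-method interface, so filters stack freely.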

So far the code has been uniformly simple to write. I started with an API that included an ‘estimate’ function, which I’ve since removed – I don’t believe the complexity is justified; enumeration is not significantly more expensive than counting, and runners that want to be efficient can either not enumerate or remember the enumeration from prior runs.

The documentation in the linked pull request is a good place to start to get a handle on the API; I’d love feedback.

Next steps for me are to do a subunit protocol revision that maps to the Python API – both parser and generator – and see how it feels. One wrinkle there is that the reason for doing this is to fix intrinsic limits in the existing protocol, so aiming for forward and backward wire-protocol compatibility would defeat the point. However… we can make the output side explicitly choose a protocol version, and if we can autodetect the protocol version in the parser, then even if we cannot handle mixed streams we get the benefits of the new protocol once its data has been detected. That said, I think we can start without autodetection during prototyping and add it later. Without autodetection, programs like testrepository will need configuration options to control which protocol variant to expect. This could be done by requiring the new protocol and providing a stream filter that can be deployed when needed.
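
For the autodetection idea, my working assumption is that v2 packets will begin with a fixed, non-ASCII signature byte while v1 streams are line-oriented ASCII, so a parser can dispatch on the first byte. A rough sketch (the signature value is a placeholder, not a committed part of the format):

def detect_subunit_version(prefix_bytes):
    """Guess whether a stream is subunit v1 or v2 from its first byte.

    Illustrative sketch: assumes v2 packets start with a fixed non-ASCII
    signature byte (0xb3 is a placeholder here) while v1 streams are
    line-oriented ASCII text.
    """
    return 2 if prefix_bytes[:1] == b"\xb3" else 1


assert detect_subunit_version(b"\xb3...binary packet...") == 2
assert detect_subunit_version(b"test: foo\n") == 1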

Multi-machine parallel testing of nova with testrepository

I recently added a formal interface to testrepository to enable cross-machine scaling of test runs. As testrepository is still a static scheduler this isn’t perfect, but it’s quite a minimal interface, which makes it easy to implement. I will likely evolve it in reaction to feedback and experience.

In the long term I’d love to have a super generic tool that matches that interface, so the project VCS copy of .testr.conf can just call out to it. I don’t have that yet, but I do have a simple by-hand implementation that I use to run nova’s tests across my personal laptop, desktop and work laptop.

Testr models this by assuming each test-running process can be mapped to a single ‘instance id’ (which could be a chroot, a VM, a cloud instance, …) and then running one or more commands in the instance before disposing of it.

This by-hand implementation consists of four things:

  1. A tiny script to rsync my source directory to the relevant places before I run tests. (This takes <2 seconds on my home wifi.)
  2. A script to allocate instance ids (I just use ints)
  3. A script to discard them
  4. And a script to copy tempfiles onto the target machine and run a given command.

I do my testing in lxc containers, because I like my primary environment to be free of project-specific quirks and workarounds. lxc is not required, though, if you don’t want it.

So, to set this up for yourself:

  1. on each host, make an lxc container (e.g. following http://wiki.openstack.org/DependsOnUbuntu)
  2. start them all (lxc-start -n nova -d)
  3. Make SSH config entries for the lxc containers, so you can get at them remotely. (Make sure your Host * rules are at the end of the file, otherwise the master overrides won’t work [and you might not notice for some time…]):
    Host desktop-nova.lxc
    # lxc addresses may be present on localhost too, so namespace the control
    # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      hostname 10.0.3.19
      ProxyCommand ssh 192.168.1.106 nc -q0 %h %p
    
    Host hplaptop-nova.lxc
    # lxc addresses may be present on localhost too, so namespace the control
    # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      hostname 10.0.3.244
      ProxyCommand ssh 192.168.1.116 nc -q0 %h %p
  4. make a script to copy your nova source tree to each test location. I called mine ‘sync’
    #!/bin/bash           
    cd $(dirname $0)
    echo syncing in $(pwd) 
    (rsync -a . desktop-nova.lxc:source/openstack/nova --delete-after && echo dell done) &
    (rsync -a . hplaptop-nova.lxc:source/openstack/nova --delete-after && echo hp done)
  5. Make sure you have the base directory on each location
    ssh desktop-nova.lxc mkdir -p source/openstack
    ssh hplaptop-nova.lxc mkdir -p source/openstack
  6. Sync your code over.
    ./sync
  7. And check tests run by running a few.
    ssh desktop-nova.lxc "cd source/openstack/nova && ./run_tests.sh compute"
    ssh hplaptop-nova.lxc "cd source/openstack/nova && ./run_tests.sh compute"

    This will check the test environment: we’re not going to be running tests on each node via run-tests or even testr (because it gets immediately meta), but if this fails, later attempts won’t work. Your test virtualenv is inside the source tree, so it is copied implicitly by the sync.

  8. Decide what concurrency you want. I picked 12: I have a desktop i7 with 4 cores and two laptops with 2 cores each, with hyperthreading enabled on all of them, so 12 sits between the core count (8) and the thread count (16); I may rebalance it in future. A higher number assumes less contention between ALUs and other elements of the core pipeline, and I expect quite some contention because most of nova’s unit tests are CPU bound, not I/O bound. If the test servers are not busy, I can always raise it later.
  9. Create scripts to create / dispose / execute logical worker threads.
  10. Creation. I call this ‘instance-provision’ and all it does is find the lowest ints not currently allocated and return them.
    #!/usr/bin/env python
    import os
    import sys
    
    if not os.path.isdir('.instances'):
        os.mkdir('.instances')
    
    running_ids = os.listdir('.instances')
    count = int(sys.argv[1])
    top = count + len(running_ids)
    ids = [str(i) for i in range(top)]
    new = set(ids) - set(running_ids)
    for id in new:
        open('.instances/%s' % id, 'w').close()
    print(' '.join(new))
  11. Disposal is easy: remove the file marking the instance as in-use.
    #!/bin/bash
    echo freeing $@
    cd .instances
    rm $@
  12. Execution is a little trickier. We need to run some commands locally, and others remotely: copying in any temp files that testr has set up, sshing to the remote machine, cd’ing to the right directory, sourcing the virtualenv, and finally running the command.
    #!/bin/bash
    instance="$(($1 % 4))"
    case $instance in
    [0]) node=
         local="true"
         ;;
    [1]) node=hplaptop-nova.lxc
         local=""
         ;;
    [2-3]) node=desktop-nova.lxc
         local=""
         ;;
    *)   echo "Unknown instance $instance" >&2
         exit 1
         ;;
    esac
    shift
    files=
    # accumulate files to copy
    while [ "--" != "$1" ]; do
        files="$files $1"
        shift
    done
    shift
    if [ -n "$files" -a -z "$local" ]; then
        echo copying $files to node.
        for f in $files; do
            rsync $f $node:$(dirname $f) ;
        done
    fi  
    if [ -n "$local" ]; then
        eval $@
    else
        echo ssh to $node
        ssh $node "cd source/openstack/nova && . .venv/bin/activate && $@"
    fi
  13. Finally, tell testr how to use this. (Don’t commit this change to nova, as it would break other people.) Add this to your .testr.conf:
    test_run_concurrency=echo 12
    instance_provision=./instance-provision $INSTANCE_COUNT
    instance_execute=./instance-execute $INSTANCE_ID $FILES -- $COMMAND
    instance_dispose=./instance-dispose $INSTANCE_IDS

Now, when you run testr run --parallel, it will run across your machines. Just do a ./sync before running tests to get the code out there. It is possible to wrap all of this up via automation (or to include just-in-time provisioned cloud instances), but I like the result of these still-rough scripts – they strike a good balance between effort, reliability and performance.

Edit: I spent a bit of time poking at my config – it turns out that my laptop (coming up on 3 years old now) has relatively less grunt, so I’m now running mod 8, with 0 on my laptop, 1-2 on my work laptop and 3-7 on my desktop; interestingly, by running a proportionately overloaded set of tests I get a time reduction.

time testr run --parallel --concurrency=16
...
real 2m34.950s

Announcing testrepository

For a while now I’ve been using subunit as part of my regular development workflow. I would pipe test results to a file, use subunit to report on failures from that file, and be able to inspect all the failures at my leisure without rerunning tests or copying and pasting from far back in my history.

However this is a bit ad hoc, and it’s not trivial to get good pipelines together – while it’s not hard, it’s not obvious either. And commands like tee are less readily available for Windows users.

So during my holidays I started a small project to automate this workflow. I didn’t get all that much done due to a combination of travel and coming down with a nasty bug near the end of my holidays – which I’m now recovering from. Yay health returning + medicines. If only we had medichines :).

However, I managed to get a reasonable first release out the door this evening. Grab it from launchpad or pypi.

Testrepository has a few deps – all listed in INSTALL.txt. Folk on Ubuntu Lucid should be able to just apt-get them all (sudo apt-get install subunit will be enough to run testrepository). If you’re not on Lucid you can grab the debs manually, or use the subunit PPA (sudo add-apt-repository ppa:subunit), though I’ve noticed just today that the karmic subunit build there only works with Python 2.5, not the default of 2.6 – I will fix that at some point.

Using Testrepository is easy if you are developing python code:

$ testr init
$ python -m subunit.run project.tests.test_suite | testr load
id: 0 tests: 114

This will report any failures that occur. To see them again:

$ testr last
id: 0 tests: 114

The actual subunit streams are stored in .testrepository in sequentially numbered files (for now at least), so it’s very easy to get at them (for instance, subunit-stats < .testrepository/12).

If you are not using Python, you can still use subunit easily if you are using shunit, ‘check’ or ‘cppunit’: subunit ships with bindings for shunit and cppunit, and check uses libsubunit with the CK_SUBUNIT output mode. TAP users can use tap2subunit to get a subunit stream from a TAP-based testsuite.

It’s still early days, but I’m finding this much nicer than the ad hoc subunit management I was doing before.

Various releases

Recently I’ve been working on the Python unittest API in my spare time, with a long term goal of making it possible to safely and sensibly glue many different plugins together into the core.

Two important components of that goal are being able to extend the data included in a test result, and being able to change how a test is run (such as adding new exceptions that should be treated as specific outcomes – python unittest uses exceptions to signal outcomes).

In testtools 0.9.2 we have an answer to both of those issues. I’m really happy with the API for including data in outcomes, ‘TestCase.addDetail’. The API for extending outcomes works, but only addresses part of that issue for now.

Subunit 0.0.4, which is available for older Ubuntu releases in the Subunit releases PPA now, and mostly built on Debian (so it will propagate through to Lucid in due course), has support for the addDetail API. Subunit now depends on testtools, reducing the non-protocol-related code and generally making things simpler.

Using those two together, bzr’s parallelised test suite has been improved as well, allowing it to include the log file for tests run in separate processes (previously it was silently discarded). The branch to do this will be merged soon; it’s just waiting on some sysadmin love to get these new versions into its merge-test environment. This change also provides complete capture of the log when users want to supply a subunit log containing failed tests. The Python code to do this is pretty simple:

def setUp(self):
    super(TestCase, self).setUp()
    self.addDetail("log", content.Content(
        content.ContentType("text", "plain", {"charset": "utf8"}),
        lambda: [self._get_log(keep_log_file=True)]))

I’ve made a couple of point releases to python-junitxml recently, fixing some minor bugs. I need to figure out how to add the extra data that addDetail permits to the XML output. I suspect it’s a strict superset and so I’ll have to filter stuff down. If anyone knows about similar extensions done to JUnit’s XML format before, please leave a comment :)

Python unittest API : Time to fix it

So, for ages now I’ve been saying that unittest is, at its core, pretty sound. I incited a talk to this effect.

I have a vision; I dream of a python testing library that:

  1. Is in the python core
  2. Is simple
  3. Is extensible
  4. Has tests take care of testing
  5. Has results take care of reporting
  6. Aids communication from test to test reader

Hopefully those are pretty modest and agreeable things to want.

However we don’t have this: nose is lovely but not in the core [and has a moderately complex API]. py.test is also not in the core, and has previously tripped my too-much-magic alerts; I must admit to not having checked whether this is fixed yet. unittest itself is in the core but has some cruft which we should clean up; more importantly, it is not extensible enough, which leads to extensions such as the zope testrunner having to muddy the waters between testing and reporting.

The point “Aids communication from test to test reader” is worth expanding on: automated testing is something that doesn’t need observation… until the unexpected happens. At that point some poor schmuck such as you or I ends up trying to guess what went wrong. The more data we gather and communicate about the event, the greater the chance it can be corrected without needing a repeat run under a debugger or, worse, single-stepping through the code.

There is a problem with ‘assertFoo’ methods in unittest, something that I’m not going to cram into this blog post. I will say that if you find the tendency of such methods to pile up on the base class frustrating, you should look at hamcrest – it and similar things have been very successful in the Java unit-testing world; we can learn from them.
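
For readers who have not met the matcher style, here is a tiny sketch of the idea using invented names (not hamcrest’s or any library’s actual API): the single assertion entry point stays fixed, and the comparison logic lives in ordinary matcher objects, so new comparisons don’t require new assertFoo methods on a base class.

class Equals(object):
    """A matcher: knows how to compare and how to describe a mismatch."""

    def __init__(self, expected):
        self.expected = expected

    def match(self, actual):
        if actual != self.expected:
            return "%r != %r" % (actual, self.expected)
        return None  # no mismatch


def assert_that(actual, matcher):
    """The one assertion entry point; everything else is matcher objects."""
    mismatch = matcher.match(actual)
    if mismatch is not None:
        raise AssertionError(mismatch)


assert_that(2 + 2, Equals(4))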

Going back to my vision, we need to make unittest more powerfully extensible to allow projects like nose to do all the cool things they want to while still being unittest compatible. I don’t mean that nose can’t run unittest tests; I mean that unittest can’t run nose tests: nose has had to expand the contract, not simply add implementations that do more.

To that end I have a number of bugs which I need to file. Solving them piecemeal will create a fractured API – particularly if this is done over more than one release. So I am planning on prototyping in other projects, discussing like mad on the testing-in-python list, and when it all starts to come together writing up a PEP.

The bugs I have are:

  1. streams nicely: countTestCases must die/be made optional. This function is inherently incompatible with generative tests or anything beyond the simplest lightweight environments
  2. no way to wrap code around a single test. This would permit profiling, debugging, tracing, and I’m sure other things more cleanly.  (At the moment, one must ‘turn on’ the profiler in startTestCase, and turn it off in stopTestCase. This is much more awkward than simply being in the call stack). Some care will be needed here, particularly for generative tests.
  3. code that isn’t part of the implementation in the core needs to be able to work with the reporting code; allowing an optionally wider API permits extensions to be debuggable. This needs thought: do we allow direct access to TestResults? Do we come up with some added level of indirection and ‘events’? I don’t know.
  4. More data than just the backtrace needs to be included when an outcome is reported. I’ve started a discussion on the testing in python list about this. I’m proposing that we use a dict of named content objects, and use the HTTP content-type abstraction to make the content objects introspectable and reliably handleable without tying the unittest object protocol to any given wire format – loose coupling is good! (There is a small sketch of this idea after the list.)
  5. The way we signal outcomes between TestCase and TestResult – the addFailure etc. methods – is concerning: there are many grades of outcome that users of the framework may usefully wish to represent; in fact there are more than we probably want to put in the core. Finding a way to decouple the intent of a particular outcome from how it’s signalled would allow users more control while still being able to use the core framework. One particular issue in this area is that it’s possible with the current API to have a single test object succeed multiple times, or fail (addFailure) and then succeed (addSuccess). This causes no end of confusion, as test counts can mismatch failure counts, and so on.
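
To make point 4 concrete, here is a rough sketch of what a dict of named content objects keyed by an HTTP-style content type could look like when a failure is reported. The classes are minimal stand-ins in the spirit of the testtools content module mentioned earlier in this post, and the addFailure signature at the end is hypothetical:

class ContentType(object):
    """A MIME-style content type, e.g. text/plain;charset=utf8."""

    def __init__(self, primary, sub, parameters=None):
        self.primary = primary
        self.sub = sub
        self.parameters = parameters or {}


class Content(object):
    """Lazily produced bytes plus the content type describing them."""

    def __init__(self, content_type, get_bytes):
        self.content_type = content_type
        self.get_bytes = get_bytes  # callable returning an iterable of bytes


# What a result could receive instead of a bare backtrace string:
details = {
    "traceback": Content(
        ContentType("text", "x-traceback", {"charset": "utf8"}),
        lambda: [b"Traceback (most recent call last): ..."]),
    "log": Content(
        ContentType("text", "plain", {"charset": "utf8"}),
        lambda: [b"server started\nserver crashed\n"]),
}
# result.addFailure(test, details=details)  # hypothetical signature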

I’ve got some ideas about these bugs, but I’m approaching a kiloword already, and I hope this post has enough to provoke some serious thought about how we can fix these five bugs, compatibly, and end up with a significantly better unittest module. We’ll have been successful if projects like Trial, nose and the zope testrunner are able to remove all their code that duplicates standard-library functionality or otherwise works around these bugs, and can instead focus on adding the specific test support needed by their environments (in the Trial and zope cases), or on UI and plug-n-play (for nose).