Testrepository roadmap 2015/16

Testrepository has been moderately successful – it’s very good at some of the things it aspired to (e.g. debugging sporadic test failures in parallel environments), but other angles have not really been explored.

I’ve set some time aside to correct this, in large part to facilitate some important features for tempest (which has its concurrency currently built on the meta-runner included in testrepository – and I’d like to enable the tempest authors to avoid having to write gnarly concurrency code :))

So my plan is to tackle a few things in the lead up to, and perhaps just after the Tokyo OpenStack summit. I wanted to socialise the proposed changes though, and thus this blog post.

Profiles

Firstly, a long standing issue is that when one tests several different configurations, testrepository is poor at reporting failures that are configuration specific. For instance, imagine that your test suite is run with both Python 2.7 and 3.4, and both results are loaded into your repository. If a given test ‘X’ fails in the first run, and not the second… after the second run is loaded, it will be reported as ‘passing’.

My proposed fix for this is to call the name of each such run a ‘profile’ and use tags to differentiate between the two runs. So you’d tag the 2.7 run perhaps ‘py27’ and the second ‘py34’, and then tell testrepository that the ‘py27’ and ‘py34’ tags are being used to identify profiles. After that testrepository will only consider two test results to apply to the same test if their profile tags match. Tags that are not specified as being for profiles (e.g. the worker-N tags that the testrepository runner adds to track the backends that tests ran in) won’t be considered in that comparison. This will then allow testrepository to track that each run was separate and that the results are not meant to replace each other. The use of tags allows for test matrices too, in principle – consider Python version as one dimension, operating system version as another, and database engine as a third – it would be up to the user. I don’t plan to directly implement a matrix system in the first iteration. A different, more dynamic model is in principle possible: don’t tag things, just log events that will give clues and correlate later – that’s not precluded by this tag based approach, and we can always add such a thing later.
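
As a rough illustration of the idea (this is not testrepository’s actual data model, just a sketch of keying results by profile), consider failures held in a dictionary keyed by (profile, test id):

# Sketch only: shows why keying results by (profile, test_id) stops a py34
# pass from masking a py27 failure. Not testrepository's real storage code.
results = {}

def record(profile, test_id, status):
    results[(profile, test_id)] = status

record('py27', 'X', 'fail')
record('py34', 'X', 'success')

failing = [key for key, status in results.items() if status == 'fail']
print(failing)  # [('py27', 'X')] - the py27 failure is still visible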

The output for the queries of the datastore needs to be updated though – we don’t currently report tags in e.g. ‘testr failing --list’. This is a little tricky: the listing format is intended to be a mix of nice-for-humans and machine consumption. Another approach we considered was to namespace the tests with the profile. This has a couple of disadvantages: it may break an unknown number of deployments if the chosen separator is already in use by people, and secondly, it mixes structured and free-form data in a lossy way. One example of that is that we’d start interpreting all test ids to see whether they are – or are not – namespaced with a profile: that’s likely to be fragile, at best. On the other hand it would fit very easily into the list format – which is why it was appealing. On balance though, the fragility and conflation would just add technical debt. Instead, we’ll do the following:

  1. Anything that needs to output a flat list of tests will output that for just one profile. An option will be added to allow querying which profiles results may be reported for. The default will be to error, listing the available profiles, if more than one profile has been specified.
  2. We’ll define a minimal JSON schema for reporting multiple profiles in such places. The excellent jq tool can be used to manipulate that in shell command lines. A command line option will opt into receiving this.
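
To make the JSON idea concrete, here is a purely hypothetical shape for such output – the schema has not been defined yet, so the field names below are assumptions rather than the real format:

# Hypothetical output shape only; the real schema is still to be designed.
import json

listing = {
    "profiles": {
        "py27": {"tests": ["pkg.tests.test_x.TestX.test_one"]},
        "py34": {"tests": ["pkg.tests.test_x.TestX.test_one",
                           "pkg.tests.test_y.TestY.test_two"]},
    }
}
print(json.dumps(listing, indent=2))
# With output like this, jq '.profiles.py27.tests[]' would pull out one
# profile's tests on a shell command line.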

Testrepository has two very related programs inside itself. There is the data store and the various queries it can do – e.g. ‘testr load’ and ‘testr failing’. Then there is the meta-runner, which knows how to run some test processes to execute tests. While strictly speaking this is optional, it’s been very convenient for working with Python tests to have the meta-runner connected to testr and able to do in-process querying.

The meta-runner will benefit from being updated as well. My intent is to make it capable of running all the tests from all the profiles the user specifies, storing that as one single run in the datastore. Two commands in particular need to change here – `testr list-tests` needs to change in line with the test listing above, and `testr run --load-list` needs to be taught how to deal with multiple profiles. I plan to add a command line option to tell it that JSON is being used, and to select tests across all profiles when a simple list or a test regex is given. Finally, the command line can benefit from an option to select one or more profiles.

Scheduling

The meta-runner has a crude scheduler – it balances based on historic performance prior to running any backend. An online scheduler will give much greater performance in both the unseeded and skewed-data cases – e.g. if many long tests fail due to a bug, the run after that will often have some workers finishing well before others, leading to slow overall test times.

The plan here is to finish the implementation of bidirectional channels to test backends, and then dispatch work to them incrementally.
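
To illustrate the difference, here is a minimal sketch of incremental dispatch. It is not the meta-runner’s actual code, just the shape of an online scheduler that hands the next test to whichever backend goes idle first:

# Toy online scheduler: each worker pulls the next test when it becomes idle,
# rather than receiving a pre-computed partition up front.
import queue
import threading

def run_online(pending_tests, backends, run_one_test):
    work = queue.Queue()
    for test_id in pending_tests:
        work.put(test_id)

    def worker(backend):
        while True:
            try:
                test_id = work.get_nowait()
            except queue.Empty:
                return  # nothing left; the other workers drain the queue too
            run_one_test(backend, test_id)

    threads = [threading.Thread(target=worker, args=(backend,)) for backend in backends]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()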

Concurrency plans

Tempest wants to be able to run some tests completely independently, while others can run together arbitrarily. To facilitate this, the online scheduler will be extended to permit describing an overall plan to run through – e.g. a list of segments, where each segment describes one or more tests that can be run together. The UI to supply that to the scheduler will probably start out as a JSON file listing exact test ids, and we can iterate from there based on the tempest authors’ experience.
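
For concreteness, a plan file might look something like the following. This is a guess at the shape – the format is explicitly something to iterate on with the tempest authors, and the test ids are placeholders:

# Hypothetical plan shape; the real file format is yet to be designed.
import json

plan = {
    "segments": [
        # run entirely on its own
        {"tests": ["proj.tests.test_a.TestA.test_isolated"]},
        # these can run together, arbitrarily interleaved
        {"tests": ["proj.tests.test_b.TestB.test_one",
                   "proj.tests.test_b.TestB.test_two"]},
    ]
}
print(json.dumps(plan, indent=2))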

Revisiting the Fixture API – handling leaky resources

Fixtures are one of the innovations I’m most happy with.

A Fixture is an enhanced context manager. The enhancements are:

  • There’s an API for gathering debugging information from the fixture (rather than depending on side effects such as the logging module or stdout). This makes it easy to attach log files from servers (for instance rabbitfixture does this).
  • There is glue to support composing other fixtures while still exposing errors from any fixture in the composed set.

OpenStack’s Neutron has been using fixtures in its test suite for some time, but is finding that writing correct fixtures is hard. In particular, they were leaking processes when a fixture would fail during setUp / __enter__ – and then not be cleaned up by the testtools / fixtures useFixture function.

There are several things we can do to improve the situation.

  • We could make the convenience APIs like useFixture add a try:/finally: and call cleanUp() when setUp fails. This involves making cleanUp() be callable in more situations than it is today.
  • We could make setUp itself do that, advising users to override a different function; this would hide the failure interactions internally, but wouldn’t benefit existing fixtures until they are rewritten to not override setUp.
  • We could provide a decorator that folk with fragile setUps (e.g. those that involve IO) could use to robustify their fixtures.

The highest leverage change is the first, but is it safe and suitable? Let’s look at PEP-343.

In PEP-343 we see the following translation of with expressions:

with EXPR as VAR:
    BLOCK
is translated into:
mgr = (EXPR)
exit = type(mgr).__exit__
value = type(mgr).__enter__(mgr)
exc = True
try:
    try:
        VAR = value
        BLOCK
    except:
        exc = False
        if not exit(mgr, *sys.exc_info()):
            raise
finally:
    if exc:
        exit(mgr, None, None, None)

This means that using a Fixture which may leak external resources when setUp fails is unsafe via with: if __enter__ (which maps to setUp) raises, __exit__ is never called, so cleanup added only in useFixture would not help with-statement users at all. Therefore we can’t use the first solution.
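
A tiny standalone demonstration of that Python behaviour (nothing fixture-specific here):

# Plain Python semantics: __exit__ is never called when __enter__ raises.
class Leaky:
    def __enter__(self):
        print("allocating external resource")
        raise RuntimeError("setUp failed")

    def __exit__(self, *exc_info):
        print("cleaning up")  # never reached if __enter__ raised
        return False

try:
    with Leaky():
        pass
except RuntimeError:
    pass
# "cleaning up" was never printed, so whatever __enter__ allocated has leaked.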

Decorators are nice, but somewhat noisy and opt-in. Both decorators and a different setUp in the base class will require extending the protocol to specify more precisely when cleanUp can be called.

If we make the documentation advise users to override a different method, and have setUp itself clean up in the event of failure, I think we’ll have somewhat more uptake. So that’s the route I’m going to head down.
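
A minimal sketch of that shape – the hook name _setUp and the exact cleanup contract here are my assumptions for illustration, not the settled fixtures API:

# Sketch only: setUp guards an overridable hook and cleans up on failure.
class RobustFixture:
    def setUp(self):
        self._cleanups = []
        try:
            self._setUp()      # subclasses override this, not setUp
        except Exception:
            self.cleanUp()     # run whatever cleanups were registered so far
            raise

    def _setUp(self):
        """Override me; register undo actions with addCleanup as you go."""

    def addCleanup(self, callback, *args, **kwargs):
        self._cleanups.append((callback, args, kwargs))

    def cleanUp(self):
        while self._cleanups:
            callback, args, kwargs = self._cleanups.pop()
            callback(*args, **kwargs)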

There’s one more thing to consider, which is access to debugging information of failures in setUp. Since the object will have been cleaned up, accessing logs etc will be hard. I think if we raise an additional exception into the MultiException with the details objects, it will be possible for fixtures to provide those details, though they will need buffering in memory (or some sophisticated lazy-delete logic such as holding a reference to an unlinked fd).

Subunit and subtests

Python 3 recently introduced a nice feature – subtests. When I was putting subunit version 2 together I tried to cater for this via a heuristic approach – permitting the already-known concept that some reported tests are not runnable, combined with substring matching, to identify subtests.

However that has panned out poorly: when I went to integrate this with testr, the code started to get fugly.

So, I’m going to extend the StreamResult API to know about subtests, and issue a subunit protocol bump – to 2.1 – to add a new field for labelling subtest events. My plan is to make this build a recursive tree structure – that is, given test “test_foo” with subtest “i=3”, which the Python subtest code would identify as “test_foo (i=3)”, they should be identified in StreamResult as test_id “test_foo (i=3)” and parent_test_id “test_foo”. This can then nest arbitrarily deep if test runners decide to do that, and individual runnability becomes up to the test runner, not testrepository / subunit / StreamResult.
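
As a hedged sketch of what that could look like at the API level (parent_test_id is the proposed new field described above, not something StreamResult or subunit carries today), a stub recipient keeps the example self-contained:

# Hypothetical call shape for the proposed subtest support.
class FakeStreamResult:
    """Stand-in for a StreamResult so the sketch runs on its own."""
    def status(self, **event):
        print(event)

result = FakeStreamResult()
result.status(
    test_id='test_foo (i=3)',
    parent_test_id='test_foo',  # proposed field linking the subtest to its parent
    test_status='fail',
)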

subunit version 2 progress

Subunit V2 is coming along very well.

Current status:

  • I have a complete implementation of the StreamResult API up as a patch for testtools. That’s 2K LOC including comprehensive tests.
  • Similarly, I have an implementation of a StreamResult parser and emitter for subunit. That’s 1K new LOC including comprehensive tests, and another 500 lines of churn where I migrate all the subunit filters to v2.
  • pdb debugging works through subunit v2: dropping into an interactive debugger mid-run now works. Yay.

Remaining things to do:

  • Update the other language bindings – the C library in particular.
  • Teach testrepository to expect v2 input (and probably still store v1 for a while).
  • Teach testrepository to use pipes for the stdin of test runner backends, and some control mechanism to switch input between different backends.
  • Discuss the in-Python API with more folk.
  • Get code merged 🙂

Simpler is better – a single event type for StreamResult

StreamResult, covered in my last few blog posts, has panned out pretty well.

Until, that is, I sat down to do a serialised version of it. It became fairly clear that the wire protocol can be very simple – just one event type that has a bunch of optional fields – test ids, routing code, file data, mime-type etc. It is up to the recipient at the far end of a stream to derive semantic meaning, which means that encoding a lot of rules (such as ‘a data packet can have either a test status or file data’) into the wire protocol isn’t called for.

If the wire protocol doesn’t have those rules but the Python API keeps separate status() and file() methods, then parsers that convert a bytestream into StreamResult API calls will have to manually split packets that carry both status and file data… and, going the other way, it would be impossible to create many legitimate bytestreams via the normal StreamResult API.

That seems to be an unnecessary restriction, and thinking about it, having a very simple ‘here is an event about a test run’ API that carries any information we have and maps down to a very simple wire protocol should be about as easy to work with as the current file or status API.

Most combinations of file+status parameters are trivially interpretable, but there is one that had no prior definition – a test_status with no test id specified. Files with no test id are easily considered as ‘global scope’ for their source, so perhaps test_status should be treated the same way? [Feedback in comments or email please]. For now I’m going to leave the meaning undefined and unconstrained.

So I’m preparing a change to my patchset for StreamResult to:

  • Drop the file() method altogether.
  • Add file_bytes, mime_type and eof parameters to status().
  • Make the test_id and test_status parameters to status() optional.

This will make the API trivially serialisable (whether to JSON, protobufs, or the custom binary format I’m considering for subunit), and equally trivially parsable, which I think is a good thing.
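
To show what the merged call might look like in practice, here is a sketch of the single-method shape. The parameter names follow the plan above (file_name is assumed to carry over from the old file() call), and the stub class just makes the example self-contained:

# Sketch of the proposed single-event API; not a final signature.
class PrintingStreamResult:
    def status(self, test_id=None, test_status=None, file_name=None,
               file_bytes=None, mime_type=None, eof=False):
        print(test_id, test_status, file_name, mime_type, eof, file_bytes)

result = PrintingStreamResult()
# A failure and its attached log in one event - previously two separate calls.
result.status(
    test_id='pkg.tests.test_x.TestX.test_one',
    test_status='fail',
    file_name='traceback',
    file_bytes=b'Traceback (most recent call last): ...',
    mime_type='text/plain;charset=utf8',
    eof=True,
)
# A 'global scope' file event: no test_id at all, which is now legal.
result.status(file_name='stdout', file_bytes=b'server booted\n', eof=True)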

First experience implementing StreamResult

My last two blog posts were largely about the needs of subunit, but a key result of any protocol is how easy working with it in a high level language is.

In the weekend and evenings I’ve done an implementation of a new set of classes – StreamResult and friends – that provides:

  • Adaptation to and from the existing TestResult APIs (the 2.6 and below API, 2.7 API, and the testtools extended API).
  • Multiplexing multiple streams together.
  • Adding timing data to a stream if it is absent.
  • Summarising a stream.
  • Copying a stream to multiple outputs.
  • A split out API for instructing a test run to stop.
  • A simple test-at-a-time stream processor that makes it easy to just deal with tests rather than the innate complexities of an event based interface.
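
A hedged usage sketch of a few of those pieces, assuming the class names from the linked branch (StreamSummary, StreamToDict, CopyStreamResult); treat the names and signatures as illustrative rather than final:

# Illustrative only: names are those proposed in the branch, not a stable API.
from testtools import CopyStreamResult, StreamSummary, StreamToDict

def show(test):
    print(test['id'], test['status'])

summary = StreamSummary()                       # summarising a stream
by_test = StreamToDict(show)                    # a test-at-a-time processor
result = CopyStreamResult([summary, by_test])   # copying to multiple outputs

result.startTestRun()
result.status(test_id='pkg.tests.test_x.TestX.test_one', test_status='success')
result.stopTestRun()
print(summary.wasSuccessful())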

So far the code has been uniformly simple to write. I started with an API that included an ‘estimate’ function, which I’ve since removed – I don’t believe the complexity is justified; enumeration is not significantly more expensive than counting, and runners that want to be efficient can either not enumerate or remember the enumeration from prior runs.

The documentation in the linked pull request is a good place to start to get a handle on the API; I’d love feedback.

Next steps for me are to do a subunit protocol revision that maps to the Python API, both parser and generator and see how it feels. One wrinkle there is that the reason for doing this is to fix intrinsic limits in the existing protocol – so doing forward and backward wire protocol compatibility would defeat the point. However… we can make the output side explicitly choose a protocol version, and if we can autodetect the protocol version in the parser, even if we cannot handle mixed streams we can get the benefits of the new protocol once data has been detected. That said, I think we can start without autodetection during prototyping, and add it later. Without autodetection, programs like TestRepository will need configuration options to control what protocol variant to expect. This could be done by requiring this new protocol and providing a stream filter that can be deployed when needed.

Less SPOFs: pyjunitxml, testscenarios

I’ve made the Testtools committers team own both the project and the trunk branch for both pyjunitxml and testscenarios. This removes me as a SPOF if anything needs doing in those projects – any Testtools committer can now do it. (Including code review and landing). If you are a testtools committer and need PyPI release rights, ping me and I’ll add you. (I wish PyPI had group management).

testrepository iteration for python projects

Testrepository has a really nice workflow for fixing a set of failing tests:

  1. Tell it about the failing tests (e.g. by doing a full test run, or running a single known failing test)
  2. Run just the known failing tests (testr run --failing)
  3. Make a change
  4. Goto step 2

As you fix up the tests testr will just give your test runner a smaller and smaller list of tests to run.

However I haven’t been able to use that feature when developing (most) Python programs.

Today though, I added the necessary support to testtools, and as a result subunit (which inherits its thin test runner shim from testtools) now supports --load-list. With this a simple .testr.conf can support this lovely workflow. This is the one used in testrepository itself: it runs the testrepository tests, which are regular unittest tests, using subunit.run – this gives it subunit output, and tells testrepository how to run a subset of tests.

[DEFAULT]
test_command=python -m subunit.run $IDOPTION testrepository.tests.test_suite
test_id_option=--load-list $IDFILE

Subunit and nose

Looks like someone has come up with a nose plugin for subunit – excellent! http://www.liucougar.net/blog/projects/nose-subunit

In their post the author notes that subunit is not easy_installable at the moment. It will be shortly. Thanks to Tres Seaver there is a setup.py for the python component of Subunit, and he has offered to maintain that going forward. His patch is in trunk, and the next release will include a pypi upload.

The next subunit release should be pretty soon too – the unicode support in testtools has been overhauled thanks to Martin[gz], and so we’re in much better shape on Python 2.x than we were before. Python3 for testtools is trouble free in this area because confused strings don’t exist there 🙂

Maintainable pyunit test suites

There’s a test code maintenance issue I’ve been grappling with, and watching others grapple with, for a while now. I’ve blogged about some infrastructural things related to it before, but now I think it’s time to talk about the problem itself. The problem shows up as soon as you start writing setUp functions, or custom assertThing functions. And the problem is – where do you put this code?

If you have a single TestCase, it’s easy. But as soon as you have two test classes it becomes more difficult. If you choose either class, the other class cannot use your setUp or assertion code. If you create a base class for your tests and put the code there, you end up with a huge base class, with every test paying the total overhead of all your tests’ needs rather than just the overhead needed to test the particular system you want to test – or with a large and growing list of assertions, most of which are irrelevant for most tests.
The reason the choices have to be made is that test code is just code, and all the normal issues there – separation of concerns, composition often being better than inheritance, do-one-thing-well – apply to our test code too. These issues are exacerbated by pyunit (that is, the Python ‘unittest’ module included with the standard library and extended by various projects).
Let’s look at some of the concerns involved in a test environment: test execution, fixture management, and outcome decision making. I’m using slightly abstract terms here because I don’t want to bind the discussion to an existing implementation. The downside, however, is that I need to define these terms a little.
Test execution – by this I mean the basic machinery of running a single test: the test framework calling into user code and receiving back an outcome with details. E.g. in pyunit your test_method() code is called, success is determined by it returning successfully, and other outcomes by raising specific exceptions. Other languages without exceptions might do this by returning an outcome object, or by passing some object into the user code to be called by the test.
Fixture management – the non-trivial code that prepares a situation where you can make assertions. On the small end, creating a few object instances and gluing them together; on the large end, loading data into a database (and creating the database instance at the same time). Isolation issues such as masking out environment variables and creating temp directories are included in this category in my opinion.
Outcome decision making – possibly the most obtuse label I’ve ever given anything; I’m referring to the process of deciding *what* outcome you wish to have happen. This takes different forms depending on your testing framework. For instance, in Python’s doctest:
>>> x
45
provides a specification – the test framework takes the repr of x and compares that to the string ’45’. In pyunit assertions are typically used:
self.assertEqual(45, x)
This will call 45 == x and, if the result is not True, raise an exception indicating a Failure has occurred. Unexpected exceptions cause Errors, and in the most recent pyunit, and some extensions, other exceptions can signal that a test should not be run, or should have failed.
So, those are the three concerns that we have when testing; where should each be expressed (in pyunit)? Pragmatically the test execution code is the hardest to separate out: it’s partly outside of ‘user control’, in that the contract is with the test framework. So let’s start by saying that this core facility, which we should very rarely need to change, should be in TestCase.
That leaves fixture management and outcome decision making. Let’s tackle decision making… if you consider the earlier doctest and assertion examples, I think it’s fairly clear that there are multiple discrete components at play. Two in particular I’d like to highlight are: matching and signalling. In the doctest example the matching is done by string matching – the reference object(s) are stringified and compared to an example the test writer provides. In the pyunit example the matching is done by the __eq__ protocol. The signalling in the doctest example is done inside the test framework (so we don’t see any evidence of it at all). In the pyunit example the signalling is done by the assertion method calling self.fail(), that being the defined contract for causing a failure. Now for a more complex example: testing a float. In doctest:
>>> "%0.3f" % x
'0.123'
In pyunit:
self.assertAlmostEqual(0.123, x, places=3)
This very simple check – that a floating point number is effectively 0.123 – exposes two problems immediately. The first, in doctest, is that literal string comparisons are extremely limited. A regex or other language would be much more powerful (and there are some extensions to doctest; the point remains though – the ... operator is not enough). The second problem is in pyunit. It is that the contract of assertEqual and assertAlmostEqual are different: you cannot substitute one in where the other was expected without partial function application – something that, while powerful, is not the most obvious thing to reach for, or to read in code. The JUnit folk came up with a nice way to address this: they decoupled /matching/ and /deciding/ with a new assertion called ‘assertThat’ and a language for matching – expressed as classes. The initial matcher library, hamcrest, is pretty ugly in Python; I don’t use it because it tries too hard to be ‘english like’ rather than being honest about being code. (Aside: what would ‘is_()’ in a Python library mean to you? Unless you’ve read the hamcrest code, or are not a Python programmer, you’ll probably get it wrong.) However the concept is totally sound. So, ‘outcome decision making’ should be done by using a matching language totally separate from testing, and a small bit of glue for your test framework. In ‘testtools’ that glue is ‘assertThat’, and the matching language is a narrow Matcher contract (in testtools.matchers) which I’m going to describe here, in case you cannot or don’t want to use the testtools one.
class Matcher:
    def __str__(self):
        "Describe this matcher."""
    def match(self, something):
        """Determine if something is matched.
        :param something: Something to match.
        :return: None if something matched, or a Mismatch object otherwise.
        """
class Mismatch:
    def describe(self):
        """Describe a mismatch that has occured."""
This permits composition and inheritance within your matching code in a pretty clean way. Using == only permits this if you can simultaneously define an __eq__ for your objects that matches with arbitrary sensitivity (e.g. you might not want to be examining the process_id value for a process a test ran, but do want to check other fields).
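As an illustration of that contract in use (my own toy example, not something shipped in testtools.matchers), here is a matcher that checks a process record’s command line while deliberately ignoring its process_id:
class HasCommandLine:
    """Match a process record by command line, ignoring its pid."""
    def __init__(self, expected_cmdline):
        self.expected_cmdline = expected_cmdline
    def __str__(self):
        return "HasCommandLine(%r)" % (self.expected_cmdline,)
    def match(self, process):
        if process.cmdline == self.expected_cmdline:
            return None
        return CmdlineMismatch(self.expected_cmdline, process.cmdline)
class CmdlineMismatch:
    def __init__(self, expected, actual):
        self.expected = expected
        self.actual = actual
    def describe(self):
        return "command line %r is not %r" % (self.actual, self.expected)
self.assertThat(process, HasCommandLine(['myserver', '--port=0'])) would then fail with that description whatever pid the test happened to get, and the matcher composes happily with others.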
Now for fixture management. This one is pretty simple really: stop using setUp (and other similar on-TestCase methods). If you use them, you will end up with a hierarchy like this:
BaseTestCase1
 +TestCase1
 +TestCase2
 +BaseTestCase2
   +TestCase3
   +TestCase4
   +BaseTestCase3
     +TestCase5
     ...
That is, you’ll have a tree of base classes, and hanging off them actual test cases. Instead, write on your base TestCase a single glue method – e.g.
def useFixture(self, fixture):
    fixture.setUp()
    self.addCleanup(fixture.tearDown)
    return fixture
And then rather than having a setUp function which performs complex operations, define a ‘fixture’ – an object with a setUp and a tearDown method. Use this in tests that need that code:
def test_foo(self):
    server = self.useFixture(NewServerWithUsers())
    self.assertThat(server, HasUser('fred'))
Note that there are some things around that offer this sort of convention already: that’s all it is – convention. Pick one, and run with it. But please don’t use setUp; it was a conflated idea in the first place and is a concrete problem. Something like testresources or testscenarios may fit your needs – if it does, great! However they are not the last word – they aren’t convenient enough to replace just calling a simple helper like the one I’ve presented here.
To conclude, the short story is:
  • use assertThat and have a separate hierarchy of composable matchers
  • use or create a fixture/resource framework rather than setUp/tearDown
  • any old TestCase that has the outcomes you want should do at this point (but I love testtools).