Testrepository roadmap 2015/16

Testrepository has been moderately successful – its very good at some of the things it aspired to (e.g. debugging sporadic test failures in parallel environments), but other angles have not really been explored.

I’ve set some time aside to correct this, in large part to facilitate some important features for tempest (which has its concurrency currently built on the meta-runner included in testrepository – and I’d like to enable the tempest authors to avoid having to write gnarly concurrency code :))

So my plan is to tackle a few things in the lead up to, and perhaps just after the Tokyo OpenStack summit. I wanted to socialise the proposed changes though, and thus this blog post.


Firstly, a long standing issue is that when one tests several different configurations, testrepository is poor at reporting failures that are configuration specific. For instance, imagine that your test suite is run with both Python 2.7 and 3.4, and both results are loaded into your repository. If a given test ‘X’ fails in the first run, and not the second… after the second run is loaded, it will be reported as ‘passing’.

My proposed fix for this is to call the name of each such run a ‘profile’ and use tags to differentiate between the two runs. So you’d tag the 2.7 run perhaps ‘py27’ and the second ‘py34’, and then tell testrepository that the ‘py27’ and ‘py34’ tags are being used to identify profiles. After that testrepository will only consider two test to apply to the same test if the tags match. Tags that are not specified as being for profiles (e.g. the worker-N tags that the testrepository runner adds to track backends that tests run in) won’t be considered in that comparison. This well then allow testrepository to track that each run was separate and the results are not meant to replace each other. The use of tags allows for test matrices too, in principle– consider python version as one dimension, operating system version as another, and database engine as a third — it would be up to the user. I don’t plan to directly implement a matrix system in the first iteration. A different, more dynamic model is in principle possible: don’t tag things, just log events that will give clues and correlate later – thats not precluded by this tag based approach, and we can always add such a thing later.

The output for the queries of the datastore need to be updated though – we don’t currently report tags in e.g. ‘testr failing –list’. This is a little tricky: the listing format is intended to be a mix of nice-for-humans, and machine consumption. Another approach we considered was to namespace the tests with the profile. This has a couple of disadvantages: it may break an unknown number of deployments if the chosen separator is already in use by people, and secondly, it mixes structured and free-form data in a lossy way. One example of that would be that we’d start interpreting all test ids to see if they are – or are not – namespaced with a profile : thats likely to be fragile, at best. On the other hand it would very easily fit into the list format – which is why it was appealing. On balance though, the fragility and conflation would just add technical debt. Instead, we’ll do the following:

  1. Anything that needs to output a flat list of tests will output that for just one profile. An option will be added to allow querying the profiles for which results might be given. The default will start erroring with a list of available profiles if more than one profile has been specified.
  2. We’ll define a minimal JSON schema for reporting multiple profiles in such places. The excellent jq tool can be used to manipulate that in shell command lines. A command line option will opt into receiving this.

Testrepository has two very related programs inside itself. There is the data store and the various queries it can do – e.g. ‘testr load’ and ‘testr failing’. Then there is the meta-runner, which knows how to run some test processes to execute tests. While strictly speaking this is optional, its been very convenient for working with Python tests to have the meta-runner connected to testr and able to do in-process querying.

The meta-runner will benefit from being updated as well. My intent is to make it capable of running all the tests from all the profiles the user specifies, storing that as one single run in the datastore. Two commands in particular need to change here – `testr list-tests` needs to change in line with the test listing above, and `testr run –load-list` needs to be taught how to deal with multiple profiles. I plan to add a command line option to tell it that JSON is being used, and to select tests across all profiles when a simple list or a test regex is given. Finally the command line can benefit from a command line option to select one or more profiles.


The meta-runner has a crude scheduler – it balances based on historic performance prior to running any backend. An online scheduler will give much greater performance in both unseeded, and skewed data cases- e.g.if many long tests fail due to a bug the run after that will often have some workers finishing well before others – leading to slow test times.

The plan here is to finish the implementation of bidirectional channels to test backends, and then dispatch work to them incrementally

Concurrency plans

Tempest wants to be able to run some tests completely independently, and then others can run together arbitrarily. To facilitate this, the online scheduler will be extended to permit describing an overall plan to run through – e.g. a list of segments, where each segment describes one or more tests that can be run together. The UI to supply that to the scheduler will probably start out as a JSON file listing exact test ids and we can iterate from there based on their experience.

The merits of (careful) impatience

The Python packaging ecosystem has long desired a overhaul and implementation of designed features, but it often stalls on adoption.

I think its time to propose a guiding principle for incremental change.

be carefully impatient

The cautious approach for delivering a new feature in the infrastructure looks like this:

  1. Design the change.
  2. Implement the change(s) needed, in a new major version (e.g Metadata-2.0).
  3. Wait for the new version to be the default everywhere.
  4. Tell users they can use it.

This is frankly terrible. Firstly, we cannot really identify ‘default everywhere’. We can identify ‘default in known distributions’, but behind the firewall setups may lag arbitrarily far behind. Secondly, it makes the cycle time for getting user feedback extraordinarily long: decade plus time windows. Thirdly, as a consequence, we run a large risk of running ahead of our users and delivering less good fixes and improvements than we might do if they were using our latest things and giving us feedback.

So here is how I think we should deliver things instead:

  1. Design the change with specific care that it fails closed and is opt-in.
  2. Implement the change(s) needed, in a new minor version of the tools.
  3. Tell users they can use it.

So, why do I think we can skip waiting for it to be a default?

pip, wheel and setuptools are just as able to be updated as any other Python component. If someone is installing (say) numpy via pip (or easy-install), then by definition they are willing to use things from PyPI, and pip and setuptools are in that category.

And if they are not installing via pip, then the Python packaging ecosystem does not affect them.

If we have opt-in as a design principle, then the adoption process will be bottom up: projects that are willing to say to their users ‘you need new versions of pip and setuptools’ can do so, and use the feature immediately. Projects that want to support users installing their packages with pip but aren’t willing to ask that they also upgrade their pip can hold off.

If we have fails-closed as a design principle, then when a project has opted in, and the user installing the package hasn’t upgraded their pip, things will at least fail rather than silently doing the wrong thing.

I had experience of this in Mock recently: the 1.1.0 and up releases depended on setuptools 17.1. The minimum setuptools we could have depended on (while publishing wheels) was still newer than that in Ubuntu Precise (not to mention RHEL!), so we were forcing an upgrade regardless.

This worked ok but we had two significant issues. Firstly, folk with incorrect Python paths can end up shadowing system installed packages, and for some reason ‘six’ triggered this for multiple users. Secondly, we had a number of different attempts to clearly signal the dependency, as the new features we were using did not fail closed: they were silently ignored by sufficiently old setuptools.

We ended up with a setup_requires="setuptools>17.1" clause in setup.py, which we’re hopeful will fail, or Just Work, consistently.

Revisiting the Fixture API – handling leaky resources

Fixtures are one of the innovations I’m most happy with.

A Fixture is an enhanced context manager. The enhancements are:

  • There’s an API for gathering debugging information from the fixture (rather than depending on side effects such as the logging module or stdout). This makes it easy to attach log files from servers (for instance rabbitfixture does this).
  • There is glue to support composing other fixtures while still exposing errors from any fixture in the composed set.

OpenStack’s Neutron has been using fixtures in its test suite for some time, but is finding that writing correct fixtures is hard. In particular, they were leaking processes when a fixture would fail during setUp / __enter__ – and then not be cleaned up by the testtools / fixtures useFixture function.

There are several things we can do to improve the situation.

  • We could make the convenience APIs like useFixture add a try:/finally: and call cleanUp() when setUp fails. This involves making cleanUp() be callable in more situations than it is today.
  • We could make setUp itself do that, advising users to override a different function; this would hide the failure interactions internally, but wouldn’t benefit existing fixtures until they are rewritten to not override setUp.
  • We could provide a decorator that folk with fragile setUp’s (e.g. those that involve IO) could use to robustify their fixtures.

The highest leverage change is the first, but is it safe and suitable? Lets look at PEP-343.

In PEP-343 we see the following translation of with expressions:

with EXPR as VAR:
mgr = (EXPR)
exit = type(mgr).__exit__
value = type(mgr).__enter__(mgr)
exc = True
        VAR = value
        exc = False
        if not exit(mgr, *sys.exc_info()):
    if exc:
        exit(mgr, None, None, None)

This means that using a Fixture which may leak external resources when setUp fails is unsafe via with. Therefore we can’t use the first solution.

Decorators are nice, but somewhat noisy and opt-in. Both decorators and a different setUp in the base class will require extending the protocol to specify when cleanUp can be called more precisely.

If we make the documentation advise users to override a specific method, and setUp does this in the event of failure, I think we’ll have somewhat more uptake. So – thats the route I’m going to head down.

There’s one more thing to consider, which is access to debugging information of failures in setUp. Since the object will have been cleaned up, accessing logs etc will be hard. I think if we raise an additional exception into the MultiException with the details objects, it will be possible for fixtures to provide those details, though they will need buffering in memory (or some sophisticated lazy-delete logic such as holding a reference to an unlinked fd).

Improving dependency handling upstream (for openstack)

This is, in part, a follow up to my post a few weeks ago.

I want to touch on the things we need to improve to have robust plumbing supporting openstack’s CI and devstack needs.


We want to be able to use ‘extras’ to declare the dependencies needed for different backends. This is a setuptools requirement syntax where a project can advertise additional dependencies for different use cases, which users (or other depending projects) can then trigger using '[]'. E.g. 'pip install requests[security]' says ‘install requests and the additional ‘security’ extras. We don’t know yet whether we will use 'nova[mysql]' or 'nova oslo.db[mysql]', but something like that. To use this we need to:

  1. teach pbr about reflecting requirements into the 'extras_require' keyword to setup (because while setuptools supports it in setup.py, we want a constant value setup.py with everything about individual projects declarative).  James Polley has a patch for pbr.
  2. Fix pip to handle 'pip install ./nova[mysql]'. This is issue 1236 – which has an open PR that may fix it. We should help review and test it.

Testing different setups may well need a similar facility, but its not clear yet how to best express that. We may need to standardise on using an extra called 'test' and just ensure our tox.ini knows to install that. That would be nice anyway, to get away from having to know about 'test_requirements.txt'.

pip dependency resolution

Currently pip has a very straight forward resolution algorithm: Only user supplied requirements can conflict at all, and the first mention of any distribution causes a distribution to be selected that matches that mention – all other mentions are simply ignored. This is issue 988, and its one of a cluster that affect OpenStack. The impact on OpenStack is that we have things install ok with pip, and then break in CI, because an incompatible version is installed. I have a patch up for this. Early adopters solicited!

incremental installations need dependency resolution

Say you’ve installed Neutron, which depends on oslo.db >=1.10. And you then install an older Nova which depends on oslo.db <1.10. What should happen? Ideally an error in this case, because the requirements are disjoint. And if they do overlap, the installed version should be adjusted to be compatible. Right now, no error occurs and oslo.db will be downgraded breaking Neutron. This is pip issue 2687. Currently no-one is working on this, and since it requires dependency resolution, fixing 988 first makes a lot of sense. It should be possible to at least make things error with a much more shallow patch though, if someone wished to work on it right now – or you could build on top of my resolver branch. This has also been a cause of numerous CI failures when we do releases, typically right around the time the servers branch. One thing that might be nice for us, since we know a full set of working packages, is to be able to say upfront to pip what versions are compatible, and then let only the needed things be brought in. pip issue 2731

PEP-426 environment markers need polish

PEP-426 introduced a micro-language for describing the situations when a particular dependency applies. For instance, to use argparse on Python < 2.7, you can say "python_version<'2.7'" as a marker for the argparse entry in your requirements. But there are some rough edges.

  • Some comparison operators are missing.
  • The documentation and user guidance needs improvement.
  • Environment markers can’t be used inside individual requirements, only as a filter on extra_requires. To express the argparse example above today (using a working operator), you need to pass the following to setup().
    extra_requires={':python_version=="2.6"': ['argparse']}

    It would be more straightforward to permit the syntax pip supports, where each requirement can be annotated with a marker.


    This might be setuptools issue 353.

  • pbr doesn’t reflect environment markers from its input files (requirements.txt etc) into setup keyword argument. James Polley has a patch for this (the same one enabling extras support in setup.cfg).

pip handling setup_requires

We run into setup_requires in two places in OpenStack; firstly we use that ourselves for pbr, but to avoid triggering easy_install we manually install pbr everywhere ourselves. Secondly, projects that are in the transitive dependencies of OpenStack use setup_requires, and we end up triggering easy_install for them. easy_install is a concern for us because of the decreased reliability and issues with corporate egress firewalls, and its security is not as robust as pips – and there’s no reason it should be, with pip being such a good tool.

However pip can’t handle setup_requires today. Doing so requires changes to setuptools and to pip.

  • setuptools needs some way to report to pip what the setup_requires are without triggering easy_install. Ronny Pfannschmidt has mentioned he may be working on this, but I’m not sure if there is a patch ready or not. A possible further enhancement would be to put the setup_requires in setup.cfg in a totally declarative fashion, but this may require environment marker support first, since the current procedural approach is very flexible and can take Python version and platform into account.
  • pip needs to be able to temporarily put things that it won’t be installing into the PYTHONPATH for packages it is building. The current internals are not suited for this (the target and source and needs of requirements being downloaded are all confounded). However once my resolver patch lands, there will be a nice cache layer that can deliver a ready-to-install directory for any requirement, which should make a simple recursive implementation quite reasonable. The resolver work will probably need further refactoring to make the resolver be decoupled from the user supplied requirements, but compared to the ground already covered, that should be straight forward. One thing folk tackling this should be aware of is an open question around location requirements. Say someone is installing foo from a git repository. And foo is also a setup requirement of some other package bar being installed at the same time. Should that foo from git be used for the setup of bar? I’m not sure of the answer (what if the version of foo is incompatible with version bar needs?) – but one is needed :).

So thats about it – if you’re interested in helping the plumbing that supports OpenStacks CI and devstack systems, please pick one of these issues and help out. Test patches, review code, write a patch, or just tell me why we don’t need to do something :)

Dealing with deps in OpenStack

We’ve got a problem in OpenStack.. dependency management.

In this post I explore it as input to the design summit session on this in Vancouver.


We have some goals that are broadly agreed:

  1. Guarantee co-installability of a single release of OpenStack
  2. Be able to deliver known-good installs of OpenStack at any point in time – e.g. ‘this is known to work’
  3. Deliver good, clear dependency metadata to redistributors
  4. Support CD deployments of OpenStack from git. Both production and devstack for developers to hack on/with
  5. Avoid firedrills in CI – both internal situations where we run incompatible things we produced, and external situations where some dependency releases a broken version, like the pycparsing one last week
  6. Deployments using the Python dependencies should be up to date and secure
  7. Support doing upgrades in the same Python environment


And we have some baseline assumptions:

  1. We cooperate with the Python ecosystem – publishing our libraries to PyPI for instance
  2. Every commit of server projects is a ‘release’ from the perspective of e.g. schema management
  3. Other things release when they release, not per-commit

The current approach uses a single global list of acceptable install-requires for all our projects, and then merges that into the git trees being tested during the test. Note in particular that this doesn’t take place for things not being tested, which we install from PyPI. We create a branch of that global list for each stable release, and we also create branches of nearly everything when we do the stable release, a system that has evolved in part due to the issues in CI when new releases would break stable releases. These new branches have tightly defined constraints – e.g. “DEP >= version-at-this-release < next-point-release”‘. The idea behind this is that if the transitive closure of deps is constrained, we can install from PyPI such a version, and it won’t bring in a different version. One of the reasons we needed that was PIP bug 988, where pip takes the first occurrence of a dependency, and so servers would depend on oslo.utils which would depend on an unversioned cliff or some such, and if cliff wasn’t already installed we’d get the next releases cliff. Now – semver says we’re keeping those things compatible, but mistakes happen, and for stable branches there’s really little reason to upgrade.


We have some practical issues with the current system:

  1. Just one dependency uncapped anywhere in the wider ecosystem (including packages outside of OpenStack) that depends on a dependency that we wanted to stay unchanged, and if that dep is encountered first by the pip scanner… game over. Worse, there are components out there that introspect the installed dependencies and fail hard if one is not listed as compatible, which takes a ‘testing with unexpected version’ situation and makes it a hard error
  2. We have to run stable branches for everything, even things like OpenStackClient which are intended for end users, and are aimed at a semver rather than branched release model
  3. Due to PIP bug 2687 each time we call pip may introduce the skew that breaks the gate
  4. We don’t deliver goal 1:- because we override the requirements at test time, the actual co-installability may be different, and we don’t know
  5. We deliver goal 2 but its hard to use:- you have to dig through a specific CI log, and if the CI system has pruned it, you’re toast
  6. We don’t avoid external firedrills:- because most of our external dependencies are broad, external releases break us trivially and frequently
  7. Lastly, our requirements are too tight to support upgrades: if bug 2687 was fixed, installing the first upgraded server component would error because its requirements are declared as being incompatible with the last release.

We do deliver goals 3,4 and 6 though, which is good.

So what can we do differently? In an ideal world, can we get all 6 goals?


I think we can. Here’s one way it could work:

  1. We fix the two pip bugs above (I’m working on that now)
  2. We teach pip about constraints *if* something is requested without actually requesting it
  3. We change our project overrides in CI to use a single constraints file rather than merging into each projects requirements
  4. The single constraints file would be exactly specified: “DEP == VERSION”, not semver or compatible matched.
  5. We make changes to the single constraints file by running a proposed set of constraints
  6. We find out that we should change the constraints file by having a periodic task which compares the constraints file to the published versions on  PyPI and proposes changes to the constraints repository automatically
  7. We loosen up the constraints in all our release branches to permit upgrade co-installability

And some optional bits…

  1. We could start testing new-library old-servers again
  2. We could potentially change our branching strategy for non-server components, but I don’t think it harms things – it may just be unnecessary
  3. We could add periodic jobs for testing with unreleased versions of dependencies

Working through each point. Bug 988 causes compatible requirements to be ignored – if we have one constraint of “X > 1.4” and another of “X > 1.3, !=1.5.1” but the “> 1.4” constraint is encountered first, we can end up with 1.5.1 installed, violating a known-bad constraint. Fixing this means that rather than having to have global knowledge of deps at the point where pip is being entered, we can have local knowledge about compatible versions in each package, and as long as the union of requirements is satisfiable, we’ll be ok. Bug 2687 causes the constraints that thing A had when it was installed by pip be ignored by the requirements checking for thing B. For instance, pip install python-openstackclient after pip install nova, will meet python-openstackclient’s requirements even if that means breaking nova’s requirements.

The reason we can’t just use a requirements file today, is that a requirements file specifies what needs to be installed as well as what versions are acceptable. We don’t want devstack, when configured for nova-network, to install neutron dependencies. But it would today unless we put in place a bunch of complex processing logic. Whereas pip could do this very easily internally.

Merging each requirement into things we’re installing from git fails when we install releases – e.g. of client libraries, in particular because of the interactions with bug 988 above. A single constraints file could include all known good versions of everything we might use, and would apply globally in concert with local project requirements. Best of both worlds, in theory :)

The use of inexact versions is a hard limitation today – we can’t upgrade multiple project trees local version needs atomically, and because we’re supplying all the version constraints in one place – the project’s merged install_requirements – they have to be broad enough to co-exist during changes to the requirements, and to remain co-installed during upgrades from release to release of OpenStack. But inexact versions leads to variation in CI – every single run becomes a gamble. The primary goal of CI is to tell  us whether a new commit X meets all of our quality criteria – change one thing at a time. Running with every new version of every dependency doesn’t tell us more about X, it tells us about ecosystem things. Using exact constraints will solve this: we’ll decouple ‘update dependencies’ or ‘pycparsing Y is broken’ from testing X – e.g. ‘improve nova cells’.

We need to be able to update those dependencies though, and the existing global requirements mechanisms are pretty much right, they just need to work with a constraints file instead of patching each repo at test time. We will still want to check that the local requirements are compatible with the global constraints file.

One of the big holes such approaches have is that we may miss out on important improvements – security, performance or just plain old features – if we don’t update our constraints. So we need to be on top of that. A small amount of automation can give us a lot of assistance on that. Just try the new versions and if they work – great. If they don’t, show a failing proposal where we can assess what to do.

As I mentioned earlier today we can’t actually upgrade: kilo’s version locks exclude liberty versions of our libraries, meaning that trying to upgrade nova/kilo to nova/liberty will bring in library versions that conflict with the version deps neutron expresses. We need to open up the project local requirements to avoid this – and we also need to make some guarantees about compatibility with our prior release in our library development (otherwise rebooting a server with only one component upgraded will be a gamble).

Making those guarantees will either require testing every commit against the prior server, or if we can find some way of doing it, testing proposed releases against the prior servers – which would allow more latitude during development of our libraries. The use of constraints files will give us hermetic insulation against bad releases though – we’ll be able to stay productive while we fix the issue and issue a new better release. The crucial thing is to have a tight feedback loop though – so I’m in favour of us either testing each commit against last-stable, or figuring out the ‘tests before releases’ logic (perhaps by removing direct tag access and instead having a thing we propose the intent to as a review).

All this might be enough that we choose to make less stable branches of libraries and go back to plain semver – but its not a requirement: thats something we can discuss in detail if people care, or just wait and see what the overheads and benefits of keeping those branches are.

Lastly, this new structure will make it possible, if we want to, to test that unreleased versions of external dependencies work with a given component, by using a periodic job. Why periodic? There are two sides to each dependencies, and neither side would want their gate to wedge if an accident breaks the other side. E.g. using two of our own components – oslo.messaging and nova. oslo.messaging releases must not break nova, but an individual oslo.messaging commit isn’t necessarily constrained (if we have the before-release testing described above). External dependencies are exactly the same, except even less closely aligned than intra-OpenStack components. So running tests with a git version of e.g. libvirt in a periodic job might give us (and libvirt) valuable prior warning about issues.

Subunit and subtests

Python 3 recently introduced a nice feature – subtests. When I was putting subunit version 2 together I tried to cater for this via a heuristic approach – permitting the already known requirement that some tests which are reported are not runnable be combined with substring matching to identify subtests.

However that has panned out poorly, when I went to integrate this with testr the code started to get fugly.

So, I’m going to extend the StreamResult API to know about subtests, and issue a subunit protocol bump – to 2.1 – to add a new field for labelling subtest events. My plan is to make this build a recursive tree structure – that is given test “test_foo” with subtest “i=3” which the Python subtest code would identify as “test_foo (i=3)”, they should be identified in StreamResult as test_id “test_foo (i=3)” and parent_test_id “test_foo”. This can then nest arbitrarily deep if test runners decide to do that, and the individual runnability becomes up to the test runner, not testrepository / subunit / StreamResult.

subunit version 2 progress

Subunit V2 is coming along very well.

Current status:

  • I have a complete implementation of the StreamResult API up as a patch for testtools. Thats 2K LOC including comeprehensive tests.
  • Similarly, I have an implementation of a StreamResult parser and emitter for subunit. Thats 1K new LOC including comprehensive tests, and another 500 lines of churn where I migrate all the subunit filters to v2.
  • pdb debugging works through subunit v2, permitting dropping into a debugger to work. Yay.

Remaining things to do:

  • Update the other language bindings – the C library in particular.
  • Teach testrepository to expect v2 input (and probably still store v1 for a while)
  • Teach testrepository to use pipes for the stdin of test runner backends, and some control mechanism to switch input between different backends.
  • Discuss the in-Python API with more folk.
  • Get code merged :)