diagnosing flaky tests

Victor (here, here) grabbed me on IRC yesterday for some testr help. He had a super frustrating bug in glance: about one in thirty unit test runs, a test would fail. He’d spent hours tracking it down and still couldn’t reliably reproduce it.

Part of that was due to the glance tests taking a few minutes to run, so each iteration was slow, but another part was a lack of familiarity with the test tooling we use in OpenStack which can give rich data to help analyse such things.

I helped him out – and this post is a step by step handbook of what I did so that I can point people at it 🙂


  1. start by duplicating the environment
  2. setup automation so you are only doing the interesting (or at least not time consuming) bits
  3. bisect and bisect and bisect

Firstly, I pulled down exactly the same code he was working on:

cd glance; git review -d 250083

This let me try to reproduce the thing. However, my normal reproduction facility couldn’t be used because the glance testr configuration depended on invoking testr within lockutils-wrapper. I’m still working through the implications, but for the short term I moved that to be testr’s problem.

So now I could make a python 34 venv and run testr directly:

tox -epy34 --notest; . .tox/py34/bin/activate; testr run --parallel

This is pretty important – it lets me get under the setup.py wrapper that projects use and now I have more control over what is happening. Plus I’m not dealing with tox recreating the venv or anything like that.

It turned out that only the unit tests had been ported to Python3, so I needed to filter down to just those tests. And because I didn’t want to sit here watching it, I set testr off to find a reproduction example on its own:

testr run --parallel --until-failure tests.unit

This runs the same set of tests – whatever you’ve specified in the normal way – in parallel, in a loop. It specifically reschedules and starts new backends (processes that are actually executing test code) each time around, so its very close to just scripting it in shell around testr, with only minor differences (such as not re-querying all the tests each time, because testr knows the full set already).

After an hour or so I had toggled back to look at the terminal, and there was a lovely backtrace and information on the failure. It looks something like this:

lockutils-wrapper \
${PYTHON:-python} -m subunit.run discover -t ./ ./glance/tests --load-list /tmp/tmpafpyzyd5
Ran 2 tests in 0.485s (+0.011s)
PASSED (id=1614)
lockutils-wrapper \
${PYTHON:-python} -m subunit.run discover -t ./ ./glance/tests --load-list /tmp/tmpafpyzyd5
Traceback (most recent call last):
glance.common.exception.NotFound: b'Image not found'
FAIL: glance.tests.unit.v1.test_api.TestGlanceAPI.test_upload_image_http_nonexistent_location_url
tags: worker-0
Traceback (most recent call last):
File "/home/robertc/work/openstack/glance/glance/tests/unit/v1/test_api.py", line 1149, in test_upload_image_http_nonexistent_location_url
self.assertEqual(404, res.status_int)
File "/home/robertc/work/openstack/glance/.tox/py34/lib/python3.4/site-packages/testtools/testcase.py", line 350, in assertEqual
self.assertThat(observed, matcher, message)
File "/home/robertc/work/openstack/glance/.tox/py34/lib/python3.4/site-packages/testtools/testcase.py", line 435, in assertThat
raise mismatch_error
testtools.matchers._impl.MismatchError: 404 != 201
Ran 2 tests in 0.419s (-0.061s)
FAILED (id=1615, failures=1 (+1))

Except that the id was much lower, there were multiple concurrent processes being run on each iteration, and the number of tests run much higher – 506 in fact. The important bit to me was the id, because with that we could get programmatic on the problem.

While testr has support for automatically isolating faults, this depends on deterministic behaviour- there isn’t [yet] any support for things that fail 1 time in N when doing bisection, looking for cross-test interactions and so forth. So normally, one would just run:

testr run --analyze-isolation

And testr would churn away and give a useful answer, in this case I needed to do it by hand. I think there is room for getting really smart about dealing with this sort of situation, but the simplest method is to just repeat each test (a run of some X tests looking for a failure) until some confidence level is reached, rather than assuming a pass is actually a pass.

To do that we needed to do two things: we needed a set of tests to run, and a way to reduce the set and repeat. We could use –until-failure as a way to repeat a given test, and stop it after a couple of hours if it hadn’t failed.

Extracting the tests that a given backend ran is straightforward, if not something you’d just luck upon. If the run id you want to investigate is 6, and the backend that you want to report on is worker-0 (see the test tag in the error report above):

cat .testr/6 | subunit-1to2 | subunit-filter -s --xfail --with-tag=worker-0 | subunit-ls > worker-0

This takes the subunit stream from the repository, which is in a legacy format 1, upgrades it to subunit v2, then includes successful tests and expected failures, but only includes tests run on worker-0, pulls out the test ids (thats the subunit-ls bit) and writes it into a file ‘worker-0’.

To run just those tests:

testr run –load-list worker-0

More interestingly though, lets start by not running all the tests that took place after our failure. Inside the file it looks like this:


Note that the test we’re interested in is in the middle there – though the file looks sorted, thats due to the test backend, what we have is the actual order tests executed in (and we don’t need to worry about concurrency because we pulled out just one backend process, and in Python unittest thats single-threaded).

First step then is to delete all the tests after the one we care about:


And then we don’t want to run all the tests: we’re assuming there is an interaction with a single other test leaving a stale process or something, so we want to run the one that failed, and half of the earlier tests; if they end up being reliable, we switch to the half we hadn’t been running, and then repeat the process – take half, run until we’re satisfied its reliable, repeat.

The way I do this by hand is to just edit the text file, making a new copy each step, so that I can backtrack easily. So delete half the preceeding lines, save to new file, then run:

testr run --load-list newfile --until-failure

Walk away, do something else, and then come back in a couple of hours.

This worked: from 506 tests on the worker, I then had about 300 that ran before the failing test after the first trim, so 150 was my first bisection which ran for a couple hours before failing. Then 75, then 35, then 18, then 9, then 5, then 2, then 1, then – 0 – thats right, eventually we found that the failing test could fail on its own!

And from that Victor was able to dig deep into what it was doing with confidence – he found a race condition in the test server setup stuff (I haven’t looked closely – I’m parroting what he said on IRC) – and is confident he’s found the bug. Yay!

I also ran one of the smaller sets overnight using Python 2.7, and that didn’t fail at all, so I suspect the failure is in some area that was masked by Python 2.7/eventlet on 2.7 handling of *something*. We saw a bug in subunit of that nature earlier this year, where a different (but legitimate) behaviour in eventlet on 3.4 led to subunit dropping writes silently. Thats fixed now in both eventlet and subunit 🙂