Maintainable pyunit test suites

There’s a test code maintenance issue I’ve been grappling with, and watching others grapple with, for a while now. I’ve blogged about some infrastructural things related to it before, but now I think it’s time to talk about the problem itself. The problem shows up as soon as you start writing setUp functions, or custom assertThing functions. And the problem is – where do you put this code?

If you have a single TestCase, it’s easy. But as soon as you have two test classes it becomes more difficult. If you put the code in either class, the other class cannot use your setUp or assertion code. If you create a base class for your tests and put the code there, you end up with a huge base class, with every test paying the total overhead of all your tests’ needs rather than just the overhead needed to test the particular system it cares about – or with a large and growing list of assertions, most of which are irrelevant to most tests.
The reason these choices have to be made is that test code is just code: all the normal concerns – separation of concerns, composition often being better than inheritance, do-one-thing-well – apply to our test code too. These issues are exacerbated by pyunit (that is, the Python ‘unittest’ module included in the standard library and extended by various projects).
Let’s look at some of the concerns involved in a test environment: test execution, fixture management, and outcome decision making. I’m using slightly abstract terms here because I don’t want to bind the discussion to an existing implementation. The downside is that I need to define these terms a little.
Test execution – by this I mean the basic machinery of running a single test: the test framework calling into user code and receiving back an outcome with details. E.g. in pyunit your test_method() code is called, success is determined by it returning successfully, and other outcomes by raising specific exceptions. Languages without exceptions might do this by returning an outcome object, or by passing some object into the user code for the test to call.
Fixture management – the non-trivial code that prepares a situation in which you can make assertions. On the small end, creating a few object instances and gluing them together; on the large end, loading data into a database (and creating the database instance at the same time). Isolation issues such as masking out environment variables and creating temp directories fall into this category too, in my opinion.
Outcome decision making – possibly the most obtuse label I’ve ever used: I’m referring to the process of deciding *what* outcome you wish to have happen. This takes different forms depending on your testing framework. For instance, in Python’s doctest:
>>> x
45
provides a specification – the test framework calls str(x) and then compares that to the string ’45’. In pyunit assertions are typically used:
self.assertEqual(45, x)
This will call 45 == x and, if the result is not True, raise an exception indicating a Failure has occurred. Unexpected exceptions cause Errors, and in the most recent pyunit (and some extensions) other exceptions can signal that a test should not be run, or should have failed.
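To make that mapping concrete, here is a minimal sketch using nothing but the standard unittest module; each method demonstrates one outcome (the names are purely illustrative):
import unittest

class OutcomeExamples(unittest.TestCase):

    def test_success(self):
        # Returning normally signals success.
        self.assertEqual(45, 45)

    def test_failure(self):
        # A failed assertion raises self.failureException (AssertionError
        # by default), which is reported as a Failure.
        self.assertEqual(45, 44)

    def test_error(self):
        # Any other unexpected exception is reported as an Error.
        raise ValueError("something unrelated broke")

    def test_skip(self):
        # In recent pyunit, raising SkipTest signals the test should not run.
        raise unittest.SkipTest("not applicable in this environment")

if __name__ == "__main__":
    unittest.main()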
So, those are the three concerns we have when testing; where should each be expressed (in pyunit)? Pragmatically, the test execution code is the hardest to separate out: it’s partly outside of ‘user control’, in that the contract is with the test framework. So let’s start by saying that this core facility, which we should very rarely need to change, should live in TestCase.
That leaves fixture management and outcome decision making. Let’s tackle decision making… if you consider the earlier doctest and assertion examples, I think it’s fairly clear that there are multiple discrete components at play. Two in particular I’d like to highlight are matching and signalling. In the doctest example the matching is done by string matching – the reference object(s) are stringified and compared to an example the test writer provides. In the pyunit example the matching is done by the __eq__ protocol. The signalling in the doctest example is done inside the test framework (so we don’t see any evidence of it at all). In the pyunit example the signalling is done by the assertion method calling self.fail(), that being the defined contract for causing a failure. Now for a more complex example: testing a float. In doctest:
>>> "%0.3f" % x
0.123
In pyunit:
self.assertAlmostEqual(0.123, x, places=3)
This very simple check – that a floating point number is effectively 0.123 – exposes two problems immediately. The first, in doctest, is that literal string comparisons are extremely limited. A regex or other language would be much more powerful (and there are some extensions to doctest; the point remains though – the … operator is not enough). The second problem is in pyunit: the contracts of assertEqual and assertAlmostEqual are different, so you cannot substitute one where the other was expected without partial function application – something that, while powerful, is not the most obvious thing to reach for, or to read in code. The JUnit folk came up with a nice way to address this: they decoupled /matching/ and /deciding/ with a new assertion called ‘assertThat’ and a language for matching – expressed as classes. The initial matcher library, hamcrest, is pretty ugly in Python; I don’t use it because it tries too hard to be ‘English-like’ rather than being honest about being code. (Aside: what would ‘is_()’ in a Python library mean to you? Unless you’ve read the hamcrest code, or are not a Python programmer, you’ll probably get it wrong.) However, the concept is totally sound. So, ‘outcome decision making’ should be done using a matching language totally separate from testing, plus a small bit of glue for your test framework. In ‘testtools’ that glue is ‘assertThat’, and the matching language is a narrow Matcher contract (in testtools.matchers) which I’m going to describe here, in case you cannot or don’t want to use the testtools one.
class Matcher:
    def __str__(self):
        """Describe this matcher."""
    def match(self, something):
        """Determine if something is matched.
        :param something: Something to match.
        :return: None if something matched, or a Mismatch object otherwise.
        """

class Mismatch:
    def describe(self):
        """Describe a mismatch that has occurred."""
This permits composition and inheritance within your matching code in a pretty clean way. Using == only permits this if you can simultaneously define an __eq__ for your objects that matches with arbitrary sensitivity (e.g. you might not want to examine the process_id value for a process a test ran, but do want to check other fields).
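To make that concrete, here is a small sketch written against the contract above; the names (SimpleMismatch, AlmostEquals, MatchesAll) are illustrative, not the matchers testtools actually ships:
class SimpleMismatch:
    """A trivial Mismatch carrying a pre-built description."""
    def __init__(self, description):
        self.description = description
    def describe(self):
        return self.description

class AlmostEquals:
    """Match a float to within `places` decimal places of `expected`."""
    def __init__(self, expected, places=3):
        self.expected = expected
        self.places = places
    def __str__(self):
        return "AlmostEquals(%r, places=%d)" % (self.expected, self.places)
    def match(self, something):
        if round(abs(something - self.expected), self.places) == 0:
            return None
        return SimpleMismatch("%r is not within %d places of %r" % (
            something, self.places, self.expected))

class MatchesAll:
    """Match only if every supplied matcher matches - simple composition."""
    def __init__(self, *matchers):
        self.matchers = matchers
    def __str__(self):
        return "MatchesAll(%s)" % ", ".join(str(m) for m in self.matchers)
    def match(self, something):
        for matcher in self.matchers:
            mismatch = matcher.match(something)
            if mismatch is not None:
                return mismatch
        return None
The assertThat glue then only needs to call matcher.match(observed) and, if it gets a mismatch back, fail the test using str(matcher) and mismatch.describe().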
Now for fixture management. This one is pretty simple really: stop using setUp (and other similar on-TestCase methods). If you use them, you will end up with a hierarchy like this:
BaseTestCase1
 +TestCase1
 +TestCase2
 +BaseTestCase2
   +TestCase3
   +TestCase4
   +BaseTestCase3
     +TestCase5
     ...
That is, you’ll have a tree of base classes, and hanging off them actual test cases. Instead, write on your base TestCase a single glue method – e.g.
def useFixture(self, fixture):
    fixture.setUp()
    self.addCleanup(fixture.tearDown)
    return fixture
And then, rather than having a setUp function which performs complex operations, define a ‘fixture’ – an object with a setUp and a tearDown method. Use it in tests that need that code:
def test_foo(self):
    server = self.useFixture(NewServerWithUsers())
    self.assertThat(server, HasUser('fred'))
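A fixture in this sense is nothing magical – just an object with a setUp and a tearDown. NewServerWithUsers above is hypothetical; a smaller, self-contained sketch might be:
import shutil
import tempfile

class TempDirFixture:
    """Create a temporary directory and remove it again on tearDown."""
    def setUp(self):
        self.path = tempfile.mkdtemp()
    def tearDown(self):
        shutil.rmtree(self.path, ignore_errors=True)
In a test, self.useFixture(TempDirFixture()).path then gives you an isolated directory that is cleaned up automatically.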
Note that there are some things around that offer this sort of convention already: that’s all it is – convention. Pick one, and run with it. But please don’t use setUp; it was a conflated idea in the first place and is a concrete problem. Something like testresources or testscenarios may fit your needs – if so, great! However, they are not the last word – they aren’t convenient enough to replace just calling a simple helper like the one I’ve presented here.
To conclude, the short story is:
  • use assertThat and have a separate hierarchy of composable matchers
  • use or create a fixture/resource framework rather than setUp/tearDown
  • any old TestCase that has the outcomes you want should do at this point (but I love testtools).

21 thoughts on “Maintainable pyunit test suites”

  1. As one who has written a large pyunit (well, Twisted Trial) test suite for one of $EMPLOYER’s products, I’m a little bemused. Maybe my test suite (which has four or five hundred tests and takes ten minutes to run) is still small enough that I just haven’t run into the problems your suggestions are trying to solve, but I’m honestly not sure what those problems are.

    You say “The second problem is in pyunit. It is that the contract of assertEqual and assertAlmostEqual are different: you cannot substitute one in where the other was expected” — well, yes. They have different names and different behaviours, I wouldn’t expect to be able to replace assertAlmostEqual with assertEqual any more than I would assertTrue or assertRaises.

    This “assertThat(x, HasFoo(y))” syntax… I guess that’s a solution to the near-ridiculous number of assert* (and assertNot* and failUnless* and failIf*) methods every TestCase must bear. That sounds like a good idea, but I’d be hesitant to use it in a codebase already filled with custom assert* methods, just for consistency reasons.

    You also say “That is, you’ll have a tree of base classes, and hanging off them actual test cases” as though I ought to be horrified. Isn’t that just proper code-reuse, moving common code into a base class? Are you claiming that with an extended hierarchy it will become difficult to figure out which base a given TestCase ought to inherit from? Is it really a net win to add half a dozen self.useFixture() calls to the beginning of each test in a TestCase, instead of just putting the common setup code in setUp where it’s meant to be?

    I’m not trying to troll; you have some genuinely interesting new testing ideas here that I haven’t heard of before, but if I’m going to use these techniques on any codebase large enough to matter, I have to be able to explain the benefits to my cow-orkers.

  2. So, about 500 tests in 10 minutes or ~1.2 seconds per test. As a comparison, bzr has about 20000 tests which take up to 60 minutes, or 0.18 seconds per test – and we feel that this is slow!

    So I’d say that your test suite is still pretty small, and even so it sounds slow – it would be interesting to analyse where the time is going.

    The assert signature stuff – compare ‘assertListEqual’ – if you have a list of floats you then need assertListAlmostEqual, or to pass in an assertion to (say) assertListTest(self, expected, actual, assertion=self.assertEqual) – and assertAlmostEqual can’t be passed in like that – you need to use a partial function application to get an assertion to be passed in. With a matching approach, you have a single SequenceMatcher and do
    assertThat(actual, SequenceMatcher(map(AlmostEqual, [0.12, 1.23])))

    Yes, you may want to be cautious about introducing a change; on the other hand, I wouldn’t say ‘no changes possible’ – do an experiment, decide if you like it, and if you do change your style and migrate as time goes by.

    The point about the hierarchy that forms is that subclassing is (generally) a single-shot event, so over time more and more things migrate up to the top – it becomes unwieldy. I’m proposing a *single* useFixture call in a test, not multiple calls.

    HTH, I don’t think you’re trolling; this is my first attempt to really communicate how these things interact, and I appreciate the questions you’re asking: they help me highlight what things I need to pay more attention to.

  3. Obligatory nitpickery: the doctest protocol uses repr(), not str(), and the 3-significant-digits example needs a “print” in front of the string expression.

    Incidentally, my unit test suite runs at 0.15 seconds per test, and I consider it to be slow-ish.

    I find your ideas interesting, although

    self.assertThat(actual, SequenceMatcher(map(AlmostEqual, [0.12, 1.23])))

    is hard for me to swallow. This is Python, not Lisp, after all. To me

    assertListAlmostEqual(actual, [0.12, 1.23])

    seems more right. I dislike PyUnit for making assertions into methods. Does anybody override self.fail? Why not import custom assertion functions from a helper module and hardcode the signalling protocol to “raise AssertionError(message)”?
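    For instance, a hypothetical helper module might contain nothing more than:

    # assertions.py - a hypothetical helper module
    def assert_almost_equal(actual, expected, places=7):
        if round(abs(actual - expected), places) != 0:
            raise AssertionError(
                "%r != %r within %d places" % (actual, expected, places))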

    addCleanup is the best thing that happened to unittest in years. I don’t know what I think about one fixture per test requirement, though. My fixtures tend to be ad-hoc and test-specific. I use helper methods to construct them, but a test often calls more than one of those.

  4. That’s a pretty nice summary of the space.

    Marius, I agree that it’s annoying that assert* and fail* methods need to be on the test case. It tends to mean things must be connected back to or defined on the test case when doing so is not otherwise necessary or desirable. It also seems pointless because the test case necessarily needs to handle an uncaught exception as a failure.

    Leaving that aside, the change here is between a choice of three contracts

    1- call TestCase.fail() if it doesn’t match
    2- raise an assertion if it doesn’t match (and assume fail() raises an exception)
    3- return a Mismatch object if it doesn’t match

    1 is what you’re supposed to do in Python today. 2 is arguably an abuse of the unittest interface, but I think highly likely to work. So the question is, why is 3 any better? Robert has alluded to it but not really answered it.

    One example is, suppose I’m implementing a vector of floats and I want to compare them. I can write

    def assertVectorsApproximatelyEqual(self, a, b):
        for x, y in zip(a, b):
            self.assertApproximatelyEqual(x, y)

    but this has the disadvantage that only the first one that differs will be reported and there will be little context. Alternatively I can reimplement assertApproximatelyEqual inline, or I can catch and rethrow the exceptions. The composition Robert’s referring to is something like

    class MatchVectors…
        def match(self, a, b):
            return any_match(MatchFloats, zip(a, b))

    more examples needed…
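    One way to flesh that out, reusing the AlmostEquals and SimpleMismatch sketch from the post (names still illustrative), so that every differing element is reported with its index:

    class MatchesVector:
        """Match a sequence of floats element-wise against `expected`."""
        def __init__(self, expected, places=3):
            self.matchers = [AlmostEquals(e, places) for e in expected]
        def __str__(self):
            return "MatchesVector(%s)" % ", ".join(str(m) for m in self.matchers)
        def match(self, actual):
            problems = []
            for index, (matcher, value) in enumerate(zip(self.matchers, actual)):
                mismatch = matcher.match(value)
                if mismatch is not None:
                    problems.append("[%d]: %s" % (index, mismatch.describe()))
            if problems:
                return SimpleMismatch("; ".join(problems))
            return None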

  5. One more thing: I like the useFixture concept but ‘fixture’ seems a bit like the wrong name. istm that in xunit the word ‘fixture’ means the whole setup in which the test runs. Perhaps it’s not worth distinguishing it.

  6. Hey Rob,

    Good article (although please use more headings next time).

    I’d like to know where you think displaying / rendering the test results fits in here, even if it’s just to dismiss the problem as solved 🙂

    Marius, I can definitely sympathize with the “It’s Python not Lisp” line of thinking (although sometimes my response is “therefore, we should fix Python”). Personally, I find:
    self.assertThat(actual, SequenceMatcher(map(AlmostEqual, [0.12, 1.23])))

    difficult to read.

    Just to see whether it is easier in Lisp:
    (assert-that actual (sequence-matcher (map almost-equal ‘(0.12 1.23))))

    Hmm. I must say I do find that easier. How about Haskell?
    assertThat actual (SequenceMatcher $ map AlmostEqual [0.12, 1.23])

    Certainly less backward-scanning than the Python version.

    Anyway, regardless of syntax, when you start to get a proliferation of methods that are all combinations of each other (assertEqual, assertListEqual, assertAlmostEqual, assertListAlmostEqual), then I think it’s time to start looking for ways to refactor – and damn being “Pythonic”.

    jml

    1. I don’t think display or rendering of test results has much impact on test suite maintainability: folk tend to do something a little weird, a little how-it-worked for us, and then it stays there indefinitely. I like what we’ve done in testtools, and I think addDetails goes a long way to removing the need to do those little tweaks altogether.

  7. Man, I’ve been meaning to reply to the replies to my comment for well over a week now; sorry about that.

    Our ~1.2 second-per-test suite is indeed somewhat slow; one of the more basic TestCase classes we use drops and recreates the database for every test instead of mocking it, and many of the tests communicate with a server over the network; hardly Best Practice, I know, but such is commercial software development I’m afraid. 😦

    I’m also not entirely convinced about .useFixture() being called exactly once per test; if you need a separate Fixture object for each set of fixtures, I think you’d pretty much wind up creating the same class hierarchy as before, but inheriting from Fixture rather than from TestCase.

    Incidentally, after I made my original comment, I had a discussion with a coworker about your Matcher idea; he was of the opinion that having special Matcher objects was a regrettably Java-esque notion, and that really any old callable should do. Personally, I think that while “composable assertions and descriptions” is a noble goal, classes feel too heavyweight and raw callables too lightweight for the job. Then again, Python seems to have withstood the callable-gymnastics involved with decorators, so maybe callables are the right tool after all:

    def equals(expected):
        def matcher(actual):
            if actual == expected:
                return None
    
            return "%r != %r" % (actual, expected)
    
        return matcher
    
    def almostEquals(expected, epsilon):
        def matcher(actual):
            if abs(expected - actual) < epsilon:
                return None
    
            return "%r not within %r of %r" % (actual, epsilon, expected)
    
        return matcher
    
    def eachItem(innerMatcher):
        def matcher(actual):
            res = [innerMatcher(item) for item in actual]
            res = [item for item in res if item]
    
            if res:
                return ", ".join(res)
    
            else:
                return None
    
        return matcher
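    A quick hypothetical usage of those callables:

    check = eachItem(almostEquals(0.5, 0.01))
    print(check([0.499, 0.502]))  # None - every item matched
    print(check([0.499, 0.7]))    # "0.7 not within 0.01 of 0.5"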
    

    Looking at that quick sketch, I’m not entirely happy with how mismatches in a list would be rendered, but I think the basic interface is OK.

    @Marius: Twisted’s PyUnit system doesn’t use AssertionError for its assert* methods; it has a special failure exception it raises. This means that any “assert” statements in the system-under-test cause “ERROR” results in the test suite, not just “FAIL” results. I feel this is a useful distinction to make – it draws a line between “the system under test was asked to do X but did Y instead”, versus “the system under test was asked to do X but exploded instead”.

  8. Man, that code block really screwed up the code. I hope you can figure out what I meant. 😦

  9. If you use square rather than angled brackets on code, it will format as I have changed your comment.

    Re: callables – I considered that, and even tried it: but you can’t [sensibly] print a callable – and it is desirable to be able to print out a description of what the matcher wants. So I think classes are the right protocol; however helpers to curry classes for one can make the pain less visible and have it feel lighter weight.

    As long as one is writing direct and to the point code, I don’t think classes are unpythonic :). Using classes to replace language features like generators is definitely odd 😉

  10. Oh, also regarding the hierarchy you end up with: it will be a very different hierarchy, because you can’t usefully compose TestCase classes in the same way. You may have as many classes, but they will be flatter and more composed – and you’ll likely find things don’t grow as fast. Am testing this at the moment.

  11. 2.5 years later, I still think needing to call a method on the test case is ugly and non-Pythonic: more typing, and it requires some code to be reachable from your testcase object.

    In other words I’d like to just say

    SomeMatcher().check(some_value)

    and have that raise an exception if they mismatch.

    But now I have more ammunition, from looking at the new unittest.mock module in the standard library, which follows this same pattern: you construct an object, call a method on it, and if there’s a problem it raises an exception.

    It seems to me that if code inside unittest itself is happy to just raise an exception and not call TestCase.fail(), testtools should be happy with that too.

    1. I think you’ve conflated two things there: the matcher API doesn’t avoid raising because raising is constrained by unittest; it avoids raising because raising isn’t a good interface for combining checks. Matchers have no dependencies on or links to the TestCase object [unless someone chooses to make a matcher that does that].

      assertThat could easily be made into a standalone function if that was desired, and that would address your stated issue, but still not look like your code 🙂

      Perhaps I’m missing your point?

      1. I thought you had previously told me that you believed the opposite, ie that signalling an error by raising was not in line with the unittest protocol. Although obviously it does work. If you don’t believe that, that’s great.

        I agree that the base “did it match or not” is better expressed as a return value not an exception.

        However, I don’t see why you couldn’t add a method on Matcher that is just

        def check(self, value):
            mismatch = self.match(value)
            if mismatch:
                raise MismatchError(mismatch)

        ie very similar to what is currently in assertThat, but without involving the testcase.

        I see in the current assertThat, the only thing it uses `self` for is handling details but it seems not impossible to just pull them out of the exception when it’s caught.

      2. The code in unittest that catches Failures (vs. Errors) is parameterised. If mock isn’t also parameterised, then even if the maintainer of unittest in Python wrote it, it’s incompatible. It’s not a matter of belief or opinion.

        Whether that parameterisation is useful is a whole different discussion.

        I don’t see why a method on Matcher is better than a standalone function? It would force subclassing or reimplementation.

        As far as details go, pulling them out of the exception without TestCase knowing that it’s doing a matcher check would need a generic protocol for pulling them out: doable, but I don’t understand why that’s better?

      3. The problem is that

        self.assertThat(haystack, Contains(needle))

        is longer compared to

        self.assertIn(needle, haystack)

        More typing, more visual noise. Coupling methods to objects where they have no essential connection creates cognitive noise.

        So it is a harder sell to get people to consistently use them, and if they won’t call them they are unlikely to write new matchers, and we will not get away from inline check or randomly scattered assertion methods.

        Letting arbitrary exceptions provide details to the test suite when they terminate it seems highly useful. I can certainly imagine exceptions from a subsystem providing some debug information about the state of that subsystem.

      4. I’m comparing
        Contains(needle).check(haystack)
        to
        check(haystack, Contains(needle))

        when I push back on Matcher.check as a method.

      5. Ah, that’s fine with me.

        Unfortunately the Google Python standard says you must not import names, only modules, so it’s not going to help me much in code there.

        unittest.check(haystack, Contains(needle))

        is not a lot better than

        self.assertThat(haystack, Contains(needle))

        and perhaps the latter reads better.

  12. Has there been some discussion about refactoring python unittest on python-dev or some other place? I’ve got some ideas and thoughts but I’d like to find where similar ideas might have been proposed and shot down before diving in, head first.

    quick outline of my thoughts:
    (1) refactor unittest.TestCase into AssertionlessTestCase and DefaultAssertionsMixin (names are only indicative of intent)
    (2) Create AssertThatMixin based on testtools matchers (http://testtools.readthedocs.io/en/latest/for-test-authors.html#matchers)
    (3) Re-implement DefaultAssertionsMixin to provide existing unittest API using assertThat

    This approach would potentially provide a path for addressing issues that want to add more assertFOO’s w/o expanding the stdlib. (see https://bugs.python.org/issue27152 and https://bugs.python.org/issue27198)
    Perhaps a git repo of mixins with some name conflict testing to prevent _accidental_ name collisions.

    1. We haven’t had any formal discussion about dealing with the standard library’s unittest. But I’d expect we can do most anything we want to as long as we don’t introduce egregious incompatibilities.

      Adding assertThat and expectThat to TestCase would be fine IMO – as the last assertions we ever add. It would be ideal if we could add them (expectThat specifically) as free functions, though I think we’re missing some infrastructure for that. A further issue there is that we haven’t yet introduced the details concept to the stdlib, which allows for extended, structured error information – http://testtools.readthedocs.io/en/latest/for-test-authors.html#writing-your-own-matchers.

      Refactoring existing assertions to depend on assertThat would be somewhat risky if assertThat and expectThat differ from existing implementations in the wild (such as the one in testtools), because that would result in incompatibilities in existing test code (vs incompatibilities when you’re using a new thing, which is OK). If the API is identical, it’s much less risky.

      Having a ticket in the Python issue tracker is probably a sensible first step if you’re interested in pursuing this – please +nosy me :).
