Multi-machine parallel testing of nova with testrepository

I recently added a formal interface to testrepository to enable cross-machine scaling of test runs. As testrepository is still a static scheduler, this isn’t perfect, but its quite a minimal interface, which makes it easy to implement. I will likely evolve it in reaction to feedback and experience.

In the long term I’d love to have a super generic tool that matches that interface, so the project VCS copy of .testr.conf can just call out to it. However I don’t yet have that, but I do have a simple by-hand implementation that I use to run nova’s tests across my personal laptop, desktop and work laptop.

Testr models this by assuming each test running process can be mapped to a single ‘instance id’ (which could be a chroot, vm, cloud instances, …) and then running one or more commands in the instance, before disposing of it.

This by hand implementation consists of 4 things:

  1. A tiny script to rsync my source directory to the relevant places before I run tests. (This takes <2seconds on my home wifi).
  2. A script to allocate instance ids (I just use ints)
  3. A script to discard them
  4. And a script to copy tempfiles onto the target machine and run a given command.

I do my testing in lxc containers, because I like my primary environment to be free of project-specific quirks and workarounds. lxc is not needed though, if you don’t want it.

So, to set this up for yourself:

  1. on each host, make an lxc container (e.g. following)
  2. start them all (lxc-start -n nova -d)
  3. Make SSH config entries for the lxc containers, so you can get at them remotely. (make sure your host * rules are at the end of the file otherwise the master overrides won’t work [and you might not notice for some time…]):
    Host desktop-nova.lxc
    # lxc addresses may be present on localhost too, so namespace the control
    # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      ProxyCommand ssh nc -q0 %h %p
    Host hplaptop-nova.lxc
    # lxc addresses may be present on localhost too, so namespace the control
    # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      ProxyCommand ssh nc -q0 %h %p
  4. make a script to copy your nova source tree to each test location. I called mine ‘sync’
    cd $(dirname $0)
    echo syncing in $(pwd) 
    (rsync -a . desktop-nova.lxc:source/openstack/nova --delete-after && echo dell done) &
    (rsync -a . hplaptop-nova.lxc:source/openstack/nova --delete-after && echo hp done)
  5. Make sure you have the base directory on each location
    ssh desktop-nova.lxc mkdir -p source/openstack
    ssh hplaptop-nova.lxc mkdir -p source/openstack
  6. Sync your code over.
  7. And check tests run by running a few.
    ssh hplaptop-nova.lxc "cd source/openstack/nova && ./ compute"
    ssh hplaptop-nova.lxc "cd source/openstack/nova && ./ compute"

    This will check the test environment: we’re not going to be running tests on each node via run-tests or even testr (because it gets immediately meta), but if this fails, later attempts won’t work. Your test virtualenv is inside the source tree, so it is copied implicitly by the sync.

  8. Decide what concurrency you want. For me, I picked 12: I have a desktop i7 with 4 cores, and two laptops with 2 cores each, and hyperthreads are on on all of them – I’m going to set a concurrency figure of 12 – between the cores (8) and threads (16) counts, and possibly balance it more in future. A higher number assumes less contention between ALU’s and other elements of the core pipeline, and I expect quite some contention because most of nova’s unittests are CPU bound not I/O. If the test servers are not busy, I can always raise it later.
  9. Create scripts to create / dispose / execute logical worker threads.
  10. Creation. I call this ‘instance-provision’ and all it does is find the lowest ints not currently allocated and return them.
    #!/usr/bin/env python
    import os.path
    import sys
    if not os.path.isdir('.instances'):
    running_ids = os.listdir('.instances')
    count = int(sys.argv[1])
    top = count + len(running_ids)
    ids = [str(i) for i in range(top)]
    new = set(ids) - set(running_ids)
    for id in new:
        file('.instances/%s' % id, 'w').close()
    print(' '.join(new))
  11. Disposal is easy: remove the file marking the instance as in-use.
    echo freeing $@
    cd .instances
    rm $@
  12. Execution is a little trickier. We need to run some commands locally, and other ones by copying in temp files that testr has setup to the machine sshing to the remote machine, cd’ing to the right directory, sourcing the virtual env, and finally running the command.
    instance="$(($1 % 4))"
    case $instance in
    [0]) node=
    [1]) node=hplaptop-nova.lxc
    [2-3]) node=desktop-nova.lxc
    *)   echo "Unknown instance $instance" >&2
         exit 1
    # accumulate files to copy
    while [ "--" != "$1" ]; do 
    files="$files $1"
    shift ; done 
    if [ -n "$files" -a -z "$local" ]; then
        echo copying $files to node.
        for f in $files; do
            rsync $f $node:$(dirname $f) ;
    if [ -n "$local" ]; then
        eval $@
        echo ssh to $node
        ssh $node "cd source/openstack/nova && . .venv/bin/activate && $@"
  13. Finally, tell testr how to use this. (Don’t commit this change to nova, as it would break other people). Add this to your .testr.conf.
    test_run_concurrency=echo 12
    instance_provision=./instance-provision $INSTANCE_COUNT
    instance_execute=./instance-execute $INSTANCE_ID $FILES -- $COMMAND
    instance_dispose=./instance-dispose $INSTANCE_IDS

Now, when you run testr run –parallel, it will run across your machines. Just do a ./sync before running tests to get the code out there. It is possible to wrap all of this up via automation (or to include just-in-time provisioned cloud instances), but I like the results of still rough scripts here – it strikes a good balance between effort, reliability and performance.

Edit: I spent a bit of time poking at my config – it turns out that my laptop (coming up on 3 years old now) has relatively less grunt – so I’m now running mod 8, with 0 my laptop, 1-2 my work laptop, 3-7 my desktop, and interestingly by running a proportionately overloaded set of tests I get a time reduction.

time testr run --parallel --concurrency=16
real 2m34.950s