Multi-machine parallel testing of nova with testrepository

14Jan13

I recently added a formal interface to testrepository to enable cross-machine scaling of test runs. As testrepository is still a static scheduler, this isn’t perfect, but it’s quite a minimal interface, which makes it easy to implement. I will likely evolve it in reaction to feedback and experience.

In the long term I’d love to have a super generic tool that matches that interface, so the project VCS copy of .testr.conf can just call out to it. I don’t have that yet, but I do have a simple by-hand implementation that I use to run nova’s tests across my personal laptop, desktop and work laptop.

Testr models this by assuming each test-running process can be mapped to a single ‘instance id’ (which could be a chroot, a VM, a cloud instance, …) and then running one or more commands in the instance before disposing of it.
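To make that lifecycle concrete, here is a minimal in-memory sketch of the provision / execute / dispose cycle. The `InstancePool` class and its method names are hypothetical illustrations, not part of testrepository: the real interface shells out to commands configured in .testr.conf, and nothing here actually runs a test.

```python
class InstancePool:
    """Hypothetical in-memory stand-in for the three instance commands."""

    def __init__(self):
        self.allocated = set()

    def provision(self, count):
        """Allocate `count` instance ids, lowest free integers first."""
        new = []
        candidate = 0
        while len(new) < count:
            if candidate not in self.allocated:
                new.append(candidate)
            candidate += 1
        self.allocated.update(new)
        return new

    def execute(self, instance_id, command):
        """Pretend to run `command` inside the given instance."""
        if instance_id not in self.allocated:
            raise ValueError("instance %r is not provisioned" % instance_id)
        return "instance %d ran: %s" % (instance_id, command)

    def dispose(self, instance_ids):
        """Release ids so a later provision call can reuse them."""
        self.allocated.difference_update(instance_ids)


pool = InstancePool()
ids = pool.provision(3)                  # allocate three workers
results = [pool.execute(i, "run tests") for i in ids]
pool.dispose(ids[:1])                    # free the first one
reused = pool.provision(1)               # the freed id is handed out again
```

The point of the sketch is only the ordering: ids are handed out, used for one or more commands, and then released for reuse.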

This by-hand implementation consists of four things:

  1. A tiny script to rsync my source directory to the relevant places before I run tests. (This takes under 2 seconds on my home wifi.)
  2. A script to allocate instance ids (I just use ints).
  3. A script to discard them.
  4. A script to copy temp files onto the target machine and run a given command.

I do my testing in lxc containers, because I like my primary environment to be free of project-specific quirks and workarounds. lxc is not needed though, if you don’t want it.

So, to set this up for yourself:

  1. On each host, make an lxc container (e.g. following http://wiki.openstack.org/DependsOnUbuntu).
  2. Start them all (lxc-start -n nova -d).
  3. Make SSH config entries for the lxc containers, so you can get at them remotely. (Make sure your Host * rules are at the end of the file, otherwise the master overrides won’t work [and you might not notice for some time…]):
    Host desktop-nova.lxc
    # lxc addresses may be present on localhost too, so namespace the control
    # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      hostname 10.0.3.19
      ProxyCommand ssh 192.168.1.106 nc -q0 %h %p
    
    Host hplaptop-nova.lxc
    # lxc addresses may be present on localhost too, so namespace the control
    # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      hostname 10.0.3.244
      ProxyCommand ssh 192.168.1.116 nc -q0 %h %p
  4. Make a script to copy your nova source tree to each test location. I called mine ‘sync’:
    #!/bin/bash
    cd "$(dirname "$0")"
    echo syncing in "$(pwd)"
    # Sync to both containers in parallel.
    (rsync -a . desktop-nova.lxc:source/openstack/nova --delete-after && echo dell done) &
    (rsync -a . hplaptop-nova.lxc:source/openstack/nova --delete-after && echo hp done)
    # Don't exit until the backgrounded rsync has finished too.
    wait
  5. Make sure you have the base directory on each location
    ssh desktop-nova.lxc mkdir -p source/openstack
    ssh hplaptop-nova.lxc mkdir -p source/openstack
  6. Sync your code over.
    ./sync
  7. Then check that tests run on each node by running a few.
    ssh desktop-nova.lxc "cd source/openstack/nova && ./run_tests.sh compute"
    ssh hplaptop-nova.lxc "cd source/openstack/nova && ./run_tests.sh compute"

    This will check the test environment: we’re not going to be running tests on each node via run_tests.sh or even testr (because it gets immediately meta), but if this fails, later attempts won’t work. Your test virtualenv is inside the source tree, so it is copied implicitly by the sync.

  8. Decide what concurrency you want. I have a desktop i7 with 4 cores and two laptops with 2 cores each, with hyperthreading on for all of them, so I picked 12: between the core count (8) and the thread count (16). I may balance it more in future. A higher number assumes less contention between ALUs and other elements of the core pipeline, and I expect quite some contention because most of nova’s unit tests are CPU bound, not I/O bound. If the test servers are not busy, I can always raise it later.
  9. Create scripts to create / dispose / execute logical worker threads.
  10. Creation. I call this ‘instance-provision’, and all it does is find the lowest ints not currently allocated and return them.
    #!/usr/bin/env python
    import os
    import sys
    
    if not os.path.isdir('.instances'):
        os.mkdir('.instances')
    
    running_ids = set(os.listdir('.instances'))
    count = int(sys.argv[1])
    # Among the first count + len(running_ids) ints, at least count are free.
    top = count + len(running_ids)
    free = [str(i) for i in range(top) if str(i) not in running_ids]
    # Take only the lowest `count` free ids, in order.
    new = free[:count]
    for id in new:
        # Touch a marker file so the id counts as allocated.
        open('.instances/%s' % id, 'w').close()
    print(' '.join(new))
  11. Disposal is easy: remove the file marking the instance as in-use. I call this ‘instance-dispose’.
    #!/bin/bash
    echo freeing "$@"
    cd .instances || exit 1
    rm -- "$@"
  12. Execution is a little trickier. I call this ‘instance-execute’. We need to run some commands locally and others remotely. For the remote ones, that means copying in the temp files that testr has set up, sshing to the remote machine, cd’ing to the right directory, sourcing the virtualenv, and finally running the command.
    #!/bin/bash
    instance="$(($1 % 4))"
    case $instance in
    [0]) node=
         local="true"
         ;;
    [1]) node=hplaptop-nova.lxc
         local=""
         ;;
    [2-3]) node=desktop-nova.lxc
         local=""
         ;;
    *)   echo "Unknown instance $instance" >&2
         exit 1
         ;;
    esac
    shift
    files=
    # accumulate files to copy
    while [ "--" != "$1" ]; do
        files="$files $1"
        shift
    done
    shift
    if [ -n "$files" -a -z "$local" ]; then
        echo copying $files to $node
        for f in $files; do
            rsync $f $node:$(dirname $f) ;
        done
    fi  
    if [ -n "$local" ]; then
        eval $@
    else
        echo ssh to $node
        ssh $node "cd source/openstack/nova && . .venv/bin/activate && $@"
    fi
  13. Finally, tell testr how to use this by adding the following to your .testr.conf. (Don’t commit this change to nova, as it would break things for other people.)
    test_run_concurrency=echo 12
    instance_provision=./instance-provision $INSTANCE_COUNT
    instance_execute=./instance-execute $INSTANCE_ID $FILES -- $COMMAND
    instance_dispose=./instance-dispose $INSTANCE_IDS
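
As a rough sketch of the arithmetic behind the concurrency figure chosen in step 8: the `machines` table and the midpoint rule below are illustrative assumptions, not something testr computes, and the labels are not real host names.

```python
# Physical cores per machine as described in step 8. Hyperthreading gives
# two hardware threads per core on each of them.
machines = {"desktop": 4, "laptop": 2, "work laptop": 2}
THREADS_PER_CORE = 2

cores = sum(machines.values())
threads = cores * THREADS_PER_CORE

# Taking the midpoint between the two bounds assumes some, but not total,
# contention between hyperthreads when the tests are CPU bound.
concurrency = (cores + threads) // 2
print(cores, threads, concurrency)
```

The resulting figure is what `test_run_concurrency` echoes in the config above; it is a starting point to tune, not a measured optimum.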

Now, when you run testr run --parallel, it will run across your machines. Just do a ./sync before running tests to get the code out there. It is possible to wrap all of this up via automation (or to include just-in-time provisioned cloud instances), but I like the results of these still-rough scripts: they strike a good balance between effort, reliability and performance.

Edit: I spent a bit of time poking at my config. It turns out that my laptop (coming up on 3 years old now) has relatively less grunt, so I’m now running mod 8, with 0 on my laptop, 1-2 on my work laptop and 3-7 on my desktop. Interestingly, by running a proportionately overloaded set of tests I get a time reduction:

time testr run --parallel --concurrency=16
...
real 2m34.950s
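
The weighted mod-8 split can be sketched as a lookup table. This is an illustration only: `build_table` and `node_for` are hypothetical helpers, the node labels are not real host names, and it assumes the provisioning script hands out ids 0 through 15 when nothing else is allocated.

```python
from collections import Counter


def build_table(weights):
    """Expand (node, slots) pairs into a list indexed by id % len(table)."""
    table = []
    for node, slots in weights:
        table.extend([node] * slots)
    return table


def node_for(instance_id, table):
    """Map an instance id onto a node, wrapping around the table."""
    return table[instance_id % len(table)]


# Slot counts from the edit: 0 -> laptop, 1-2 -> work laptop, 3-7 -> desktop.
weights = [("laptop", 1), ("work laptop", 2), ("desktop", 5)]
table = build_table(weights)

# With --concurrency=16, instance ids 0..15 are spread like this:
load = Counter(node_for(i, table) for i in range(16))
print(dict(load))
```

Changing the balance is then a matter of editing the slot counts rather than the case patterns in the execute script.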



One Response to “Multi-machine parallel testing of nova with testrepository”

  1. Sandip Dey

    Hi Robert

    I was just trying out the above with a test file (test_dummy.py), to verify whether the tests were being distributed across multiple hosts.

    This was my .testr.conf

    [DEFAULT]
    test_command=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
    OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
    OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
    ${PYTHON:-python} -m subunit.run discover -t ./ ./discovery $LISTOPT $IDOPTION
    test_id_option=--load-list $IDFILE
    test_list_option=--list
    test_run_concurrency=echo 12
    instance_provision=./instance-provision $INSTANCE_COUNT
    instance_execute=./instance-execute $INSTANCE_ID $FILES -- $COMMAND
    instance_dispose=./instance-dispose $INSTANCE_IDS
    group_regex=([^\.]+\.)+

    I think the following from ‘instance-execute’ is getting executed, but I don’t see the log line ‘ssh to nodea’ in the standard output:
    echo ssh to $node
    ssh $node "cd source/openstack/nova && . .venv/bin/activate && $@"

    This is the output I am getting:

    api-venv)root@nodea11:~/contrail-test/scripts# testr run --parallel
    running=./instance-provision 12
    running=./instance-execute 11 -- OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
    OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
    OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
    ${PYTHON:-python} -m subunit.run discover -t ./ ./discovery --list
    running=./instance-execute 11 /tmp/tmphItivo -- OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
    OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
    OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} \
    ${PYTHON:-python} -m subunit.run discover -t ./ ./discovery --load-list /tmp/tmphItivo
    Ran 10 tests in 21.108s (+21.106s)
    PASSED (id=40)
    running=./instance-dispose 0 1 10 11 2 3 4 5 6 7 8 9
    freeing 0 1 10 11 2 3 4 5 6 7 8 9

    So where are those print statements getting logged?

