SkyDNS in Kubernetes 1.3 local clusters

If you want to run kubernetes locally – not in a VM – then you’ll probably also want DNS service integration to work. That’s fine, except by default it doesn’t work :(. This may be due to DNS being a built-in add-on now, but the current docs around that are inconsistent – they reference the deleted 1.2 dns addon docs :/.

I’ve put a pull request up to fix the errors I encountered trying to use the local-up-cluster script per the current in-tree documentation in build. You also need to run it slightly differently than the basic docs suggest. The basic setup (sensibly) doesn’t listen on 0.0.0.0, avoiding exposing your insecure cluster to the world. But since you’re going to be partitioning your machine off into containers, and the kube-dns component which handles DNS integration needs to talk to the kubernetes API, you need to override that.

sudo KUBE_ENABLE_CLUSTER_DNS=true API_HOST_IP=0.0.0.0 hack/local-up-cluster.sh

This will run a local cluster for you with DNS happily working, assuming the other preconditions (like not using 10.0.0.0/8 yourself) needed to run a local cluster are true. You can start with no environment variables set at all to check that that works – kubernetes itself runs happily with no DNS integration. Note, though, that if you have DNS enabled it has to work, or the kubernetes API itself will fail to register endpoints and then get itself firewalled off.
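One quick sanity check for that failure mode is whether the API has registered its own endpoint – plain kubectl, nothing specific to this setup:

$ cluster/kubectl.sh get ep kubernetes

If that comes back empty, everything DNS-dependent will be broken too.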

Some quick debugging things I found useful.

Find the pod

$ cluster/kubectl.sh --namespace kube-system get pods
NAME                 READY     STATUS    RESTARTS   AGE
kube-dns-v18-mi26o   3/3       Running   0          18m

Check it has registered endpoints successfully

$ cluster/kubectl.sh --namespace kube-system get ep
NAME       ENDPOINTS                     AGE
kube-dns   172.17.0.2:53,172.17.0.2:53   18m

Check its logs

$ cluster/kubectl.sh logs --namespace kube-system kube-dns-v18-mi26o -c kubedns
....

Deploy something and check both that it can use DNS and that it is listed in DNS

I made a trivial Ubuntu image with a little more in it:

$ cat rob/Dockerfile
FROM ubuntu

RUN apt-get update
RUN apt-get install -y iputils-ping curl openssh-client iproute2 dnsutils
RUN apt-get clean && rm -rf /var/lib/apt/lists/*
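Build and tag it so the name matches the image: field in the pod definition below – the ubuntu-debug tag is just my choice, nothing kubernetes mandates:

$ docker build -t ubuntu-debug rob/

Since the pod uses imagePullPolicy: IfNotPresent, the locally built image is used without consulting a registry.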

Which I then deploy via a trivial definition:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  namespace: default
spec:
  containers:
  - image: ubuntu-debug
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    name: ubuntu
  restartPolicy: Always

And a call to kubectl:

$ cluster/kubectl.sh create -f rob/ubuntu.yaml

And if successfully integrated with DNS, it will be registered with DNS under A-B-C-D.default.pod.cluster.local, where A-B-C-D is the pod’s IP address with the dots replaced by dashes.

$ cluster/kubectl.sh exec ubuntu -ti /bin/bash
root@ubuntu:/# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
48: eth0@if49: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.3/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:3/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever
root@ubuntu:/# ping 172-17-0-3.default.pod.cluster.local
PING 172-17-0-3.default.pod.cluster.local (172.17.0.3) 56(84) bytes of data.
64 bytes from ubuntu (172.17.0.3): icmp_seq=1 ttl=64 time=0.013 ms
^C
--- 172-17-0-3.default.pod.cluster.local ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.013/0.013/0.013/0.000 ms
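Pinging the pod’s generated name exercises the pod records; service records can be checked from the same shell as an extra step (dnsutils is in the image for exactly this):

root@ubuntu:/# nslookup kubernetes.default.svc.cluster.local

A successful lookup returns the cluster IP of the kubernetes API service.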

El cheapo 10Gbps networking

I’ve been hitting the limits of gigabit ethernet at home for quite a while now, and as I spend more time working with cloud technologies this started to frustrate me.

I’d heard of other folk getting good results with second hand Infiniband cards and decided to give it a go myself.

I bought two Voltaire dual-port Infiniband adapters – 4X SDR PCI-E x4 cards. Add in a 2 metre 8470 cable, and we’re in business.

There are other, more comprehensive guides around to setting this up – e.g. http://davidhunt.ie/wp/?p=2291 or http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html

On ubuntu the hardware was autodetected; all I needed to do was:

sudo modprobe ib_ipoib
sudo apt-get install opensm # on one machine
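If you want the module loaded at every boot, the standard Debian/Ubuntu mechanism is:

echo ib_ipoib | sudo tee -a /etc/modules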

And configure /etc/network/interfaces – e.g.:

iface ib1 inet static
address 192.168.2.3
netmask 255.255.255.0
network 192.168.2.0
up echo connected >`find /sys -name mode | grep ib1`
up echo 65520 >`find /sys -name mtu | grep ib1`
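Once the interface is up it’s worth sanity-checking that connected mode and the large MTU actually took – both are plain sysfs/iproute queries:

cat /sys/class/net/ib1/mode
ip link show ib1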

With no further tuning I was able to get 2Gbps doing linear file copies via Samba, which I suspect is rather pushing the limits of my circa-2007 home server – I’ll investigate further to identify where the bottlenecks are, but I suspect the networking itself is ok – netperf got me 6.7Gbps in a trivial test.
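The trivial netperf test is just the stock client/server pair with default options – something like:

netserver                # on one machine
netperf -H 192.168.2.3   # on the other, aimed at the IPoIB address configured above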

Multi-machine parallel testing of nova with testrepository

I recently added a formal interface to testrepository to enable cross-machine scaling of test runs. As testrepository still has a static scheduler this isn’t perfect, but it’s quite a minimal interface, which makes it easy to implement. I will likely evolve it in reaction to feedback and experience.

In the long term I’d love to have a super generic tool that matches that interface, so the project VCS copy of .testr.conf can just call out to it. I don’t have that yet, but I do have a simple by-hand implementation that I use to run nova’s tests across my personal laptop, desktop and work laptop.

Testr models this by assuming each test-running process can be mapped to a single ‘instance id’ (which could be a chroot, a vm, a cloud instance, …) and then running one or more commands in the instance, before disposing of it.

This by hand implementation consists of 4 things:

  1. A tiny script to rsync my source directory to the relevant places before I run tests. (This takes <2 seconds on my home wifi.)
  2. A script to allocate instance ids (I just use ints)
  3. A script to discard them
  4. And a script to copy tempfiles onto the target machine and run a given command.

I do my testing in lxc containers, because I like my primary environment to be free of project-specific quirks and workarounds. lxc isn’t needed, though, if you don’t want it.

So, to set this up for yourself:

  1. on each host, make an lxc container (e.g. following http://wiki.openstack.org/DependsOnUbuntu)
  2. start them all (lxc-start -n nova -d)
  3. Make SSH config entries for the lxc containers, so you can get at them remotely. (Make sure your Host * rules are at the end of the file, otherwise these overrides won’t take effect [and you might not notice for some time…]):
    Host desktop-nova.lxc
      # lxc addresses may be present on localhost too, so namespace the control
      # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      hostname 10.0.3.19
      ProxyCommand ssh 192.168.1.106 nc -q0 %h %p

    Host hplaptop-nova.lxc
      # lxc addresses may be present on localhost too, so namespace the control
      # path to avoid connecting to the wrong container.
      ControlPath ~/.ssh/master-lxc-%r@%h:%p
      hostname 10.0.3.244
      ProxyCommand ssh 192.168.1.116 nc -q0 %h %p
  4. make a script to copy your nova source tree to each test location. I called mine ‘sync’:
    #!/bin/bash
    cd $(dirname $0)
    echo syncing in $(pwd)
    (rsync -a . desktop-nova.lxc:source/openstack/nova --delete-after && echo dell done) &
    (rsync -a . hplaptop-nova.lxc:source/openstack/nova --delete-after && echo hp done) &
    wait
  5. Make sure you have the base directory on each location
    ssh desktop-nova.lxc mkdir -p source/openstack
    ssh hplaptop-nova.lxc mkdir -p source/openstack
  6. Sync your code over.
    ./sync
  7. And check tests run by running a few on each node.
    ssh desktop-nova.lxc "cd source/openstack/nova && ./run_tests.sh compute"
    ssh hplaptop-nova.lxc "cd source/openstack/nova && ./run_tests.sh compute"

    This will check the test environment: we’re not going to be running tests on each node via run-tests or even testr (because it gets immediately meta), but if this fails, later attempts won’t work. Your test virtualenv is inside the source tree, so it is copied implicitly by the sync.

  8. Decide what concurrency you want. I picked 12: I have a desktop i7 with 4 cores and two laptops with 2 cores each, all with hyperthreading on – so 12 sits between the total core count (8) and thread count (16), and I can rebalance it in future. A higher number assumes less contention between ALUs and other elements of the core pipeline, and I expect quite some contention because most of nova’s unittests are CPU bound, not I/O bound. If the test servers are not busy, I can always raise it later.
  9. Create scripts to create / dispose / execute logical worker threads.
  10. Creation. I call this ‘instance-provision’ and all it does is find the lowest ints not currently allocated, mark them allocated, and return them.
    #!/usr/bin/env python
    import os
    import sys

    if not os.path.isdir('.instances'):
        os.mkdir('.instances')

    running_ids = os.listdir('.instances')
    count = int(sys.argv[1])
    # The lowest count+len(running_ids) ints must contain at least
    # count free ids; take the lowest free ones.
    top = count + len(running_ids)
    ids = [str(i) for i in range(top)]
    new = sorted(set(ids) - set(running_ids), key=int)[:count]
    for id in new:
        open('.instances/%s' % id, 'w').close()
    print(' '.join(new))
  11. Disposal is easy: remove the file marking the instance as in-use.
    #!/bin/bash
    echo freeing $@
    cd .instances
    rm $@
  12. Execution is a little trickier. We need to run some commands locally, and others by copying the temp files testr has set up onto the target machine, sshing to the remote machine, cd’ing to the right directory, sourcing the virtualenv, and finally running the command.
    #!/bin/bash
    instance="$(($1 % 4))"
    case $instance in
    [0]) node=
         local="true"
         ;;
    [1]) node=hplaptop-nova.lxc
         local=""
         ;;
    [2-3]) node=desktop-nova.lxc
         local=""
         ;;
    *)   echo "Unknown instance $instance" >&2
         exit 1
         ;;
    esac
    shift
    files=
    # accumulate files to copy
    while [ "--" != "$1" ]; do
        files="$files $1"
        shift
    done
    shift
    if [ -n "$files" -a -z "$local" ]; then
        echo copying $files to node.
        for f in $files; do
            rsync $f $node:$(dirname $f) ;
        done
    fi  
    if [ -n "$local" ]; then
        eval $@
    else
        echo ssh to $node
        ssh $node "cd source/openstack/nova && . .venv/bin/activate && $@"
    fi
  13. Finally, tell testr how to use this. (Don’t commit this change to nova, as it would break other people.) Add this to your .testr.conf – a quick by-hand smoke test of the three scripts follows this list:
    test_run_concurrency=echo 12
    instance_provision=./instance-provision $INSTANCE_COUNT
    instance_execute=./instance-execute $INSTANCE_ID $FILES -- $COMMAND
    instance_dispose=./instance-dispose $INSTANCE_IDS
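Before handing them to testr, the three scripts can be smoke-tested by hand – an illustrative session (the ids printed will depend on what’s already allocated):

./sync
ids=$(./instance-provision 2)
for id in $ids; do ./instance-execute $id -- echo hello from instance $id; done
./instance-dispose $ids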

Now, when you run testr run --parallel, it will run across your machines. Just do a ./sync before running tests to get the code out there. It is possible to wrap all of this up via automation (or to include just-in-time provisioned cloud instances), but I like the still-rough scripts here – they strike a good balance between effort, reliability and performance.

Edit: I spent a bit of time poking at my config – it turns out that my laptop (coming up on 3 years old now) has relatively less grunt, so I’m now running mod 8, with 0 on my laptop, 1-2 on my work laptop, and 3-7 on my desktop – and interestingly, by running a proportionately overloaded set of tests I get a time reduction.

time testr run --parallel --concurrency=16
...
real 2m34.950s

Running juju against a private openstack instance.

My laptop has somewhat less than half the grunt of my desktop at home, but I prefer to work on it as I can go sit in the sun etc – very hard to do that with a mini tower case 🙂

However, running everything through ssh to another machine makes editing and iterating more clumsy; I need to do agent forwarding etc – not terribly hard, but not free either – and particularly when I travel, I need to remember to sync my source trees back to my laptop. So I prefer to live on my laptop and use my desktop for compute power.

I had a couple of Juju charms I wanted to investigate, but I needed enough compute power to make my laptop really quite warm – so I thought, it’s time to update my local cloud provider from Eucalyptus to Openstack. This was easy enough, until I came to run Juju. Turns out that Juju’s commands really want to talk to the public DNS name of the instance (in order to SSH tunnel a connection to Zookeeper).

But! Openstack returns DNS names like ‘Server-3’, and if you think about a home network, it’s fairly rare to have a local DNS server *anyway*, so putting a suffix on names like that won’t help at all: you either need to use a DNS naming provider (openstack ships with an LDAP provider, which adds even more complexity) and configure your clients to know how to find it, or you need to use the public IP addresses (which default to the FlatNetwork, which is routable within a home LAN by simply adding a route to 10.0.0.0/8 on your wifi interface). Adding to the confusion, some wifi routers fail to forward avahi messages, which is a) terrible and b) breaks the only obvious way of doing no-config local DNS :(.
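That route is a single command on each client machine – 192.168.1.10 here is just an example; substitute the LAN address of the machine hosting your openstack instance:

sudo ip route add 10.0.0.0/8 via 192.168.1.10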

So, I did some yak shaving this morning. Turns out other folk have already run into this and filed a Juju bug and a supporting txaws bug. The txaws bug was fixed, but just missed the release of Precise. Clint Byrum is going to SRU it this week though, so we’ll have it soon. I’ve put a patch up to address the Juju side, which is now pending review. Running the two together works very happily for me. \o/