Archive for March, 2014

Cancellation in Go, the Juju way

March 15, 2014

I read the recent Go blog post on pipelines and cancellation with interest because we have explored this area quite a bit where I work, in the Juju project.

Over the last couple of years, a pattern has emerged that works well for us, so I thought it might be illuminating to show an implementation of the same problem that uses that pattern.

The pattern centres around this simple interface.

type Worker interface {
    Kill()
    Wait() error
}

To implement some code that runs independently using its own goroutines, we define a type representing the task that the code performs, and implement Kill and Wait methods on it.

The interface contract is almost trivially simple: the Kill method asks the worker to die, but does not actually wait for it to die. The Wait method waits for a worker to die, either from natural causes (because it has completed its task or encountered an unrecoverable error), or because it was killed. In either case, it returns any error encountered before it shut down.

Both methods are idemopotent – it is ok to kill a worker many times, and Wait will always return the same thing. To stop a worker, we kill it and wait for it to quit:

func Stop(w Worker) error {
    w.Kill()
    return w.Wait()
}

An useful ingredient when implementing a Worker is Gustavo Niemeyer’s tomb package, which encapsulates some of the logic implemented in the Go pipelines blog post. In particular, it keeps track of the first error encountered and allows goroutines to monitor the current “liveness” status of a worker.

Here is a version of the pipelines blog post’s parallel.go code. In this example, the fileDigester type is our Worker. Just like the original code, we digest each file concurrently, sending the results on a channel. Unlike the original, though, the tomb takes on some of the burden of error propagation – when part of the worker dies unexpectedly, all the other workers will see that the tomb is dying and stop what they’re doing.

I have chosen to use Keith Rarick’s fs package instead of filepath.Walk because I think it makes the control flow more straightforward. The actual code to digest the file now gets its own function (filedigester.sumFile) and digests the file without reading the entire contents into memory. I have declared some methods on fileDigester as exported, even though fileDigester itself is not exported, to make it clear that they represent the “public” interface to the type.

Much of the teardown logic is encapsulated in the following code, which actually runs the main loop:

go func() {
    d.tomb.Kill(d.run(root))
    d.wg.Wait()
    close(d.out)
    d.tomb.Done()
}()

It runs the main loop and kills the tomb with the error that’s returned (this idiom means it’s easy to return in multiple places from the run method without needing to kill the tomb in each place). We then wait for all the outstanding digest goroutines to finish, close the results channel. The very last thing we do is to signal that the worker is done.

The bounded-concurrency version is here. Again, I have kept the basic approach as the original, with a pool of goroutines doing the digesting, reading paths from a channel. Even though we’re using a tomb, we’re still free to use other methods to shut down parts of the worker – in this case we close the paths channel to shut down the digester goroutines.

The code is somewhat longer, but I think that the structure is helpful and makes the code easier to reason about.

There is another advantage to making all workers implement a common interface. It enables us to write higher level functionality that manages workers. An example of this is worker.Runner, an API that makes it straightforward to manage a set of long-lived workers, automatically restarting them when they fail.