Friday 25 June 2010

HARC back

It is tempting, when writing a 'Research and Development' blog posting, to focus on tasks that have been completed. It is even more tempting to focus on those that have been completed successfully.

But, it is often said that...
If we knew what it was we were doing, it wouldn't be called research, would it?
(Pages scattered across The Internet attribute this quote to Einstein but are somewhat vague about when he said it and to whom.)

So this week's posting will focus on a service that we are trying to get working. Or to be more accurate, trying to get working again.

The service is HARC, the Highly Available Resource Co-allocator, and it was first deployed more than three years ago.

HARC provides a way to co-ordinate reservations of time on many separate computers and networks so they can all be used together. The technology was used by the Grid Enabled Neurosurgical Imaging Using Simulation project (aka GENIUS). You can read more about their work in the Real-time visualisation of blood flow through the brain case study on the NGS web site.

HARC uses the idea of 'Paxos consensus', which I am not going to embarrass myself by trying to explain. The distinguished computer scientist Leslie Lamport, who created it, first tried to explain it by analogy to a part-time parliament on an ancient Greek island. Apparently this confused more people than it helped, so his second attempt was called Paxos Made Simple.

As far as HARC is concerned, the important feature is that you pass on your request for time to a set of acceptors and these communicate with the resources on your behalf to make a reservation.

Through the power of Paxos, the acceptors can reliably respond to any request even if some of them disappear off the network while the request is being processed.
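To put a number on "some of them": the sketch below assumes the textbook majority-quorum rule from Paxos, where a request can complete as long as more than half of the acceptors are reachable. Whether HARC applies exactly this rule is an assumption on my part, and the function names are made up for illustration.

```python
def majority(n_acceptors: int) -> int:
    """Smallest number of acceptors that forms a majority (assumed quorum rule)."""
    return n_acceptors // 2 + 1


def tolerated_failures(n_acceptors: int) -> int:
    """How many acceptors can vanish while a majority is still reachable."""
    return n_acceptors - majority(n_acceptors)


if __name__ == "__main__":
    # Print the quorum size and failure tolerance for a few set sizes.
    for n in (1, 2, 3, 5):
        print(f"{n} acceptor(s): quorum {majority(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Under that assumption, one acceptor or two acceptors can lose none, three can lose one, and five can lose two. That is why a single acceptor keeps a service ticking over but cannot ride out a failure.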

When the NGS first deployed HARC, we used acceptors maintained and run by Louisiana State University's Centre for Computation and Technology.

These acceptors served us very well for many years but - as LSU staff moved on and the service was less heavily used - they slowly disappeared. Paxos allowed us to survive until the last acceptors started to fail. At this point, we went from a Highly Available Resource Co-allocator to a Hardly Available Resource Co-allocator.

We have to thank all those at LSU who provided the service and kept it going as long as it did, but it was clear that if the NGS wanted to provide a HARC service, we needed to have our own network of acceptors.

We were fortunate that one of the NGS staff at Manchester was heavily involved in the initial HARC development. He was able to deploy a single NGS acceptor - enough to keep a service ticking over but not enough to provide all the Paxos goodness.

The Oxford e-Research Centre stepped in and offered to host a second acceptor. With the help of Manchester, the software was deployed to Oxford.

Both acceptors worked in isolation but, when we tried to get them to cooperate, both disappeared off the network. That does not really fit the definition of 'highly available'.

And this is why we are researching and developing.

As a first step, Manchester are deploying a pair of acceptors locally to investigate whatever weird combination of factors made it all go so horribly wrong.

While they do, Oxford are hosting a standalone acceptor - available for anyone who needs HARC. Anyone with a copy of the HARC client software can point it at the NGS acceptor by putting the following in the harc.properties file:

# The global V2.0 acceptor set, good for co-allocating UK NGS,
harc.client.acceptorset=global2
### harc.client.acceptor.global2.vidar=https://harc.vidar.ngs.manchester.ac.uk:9877/harc-acceptor
harc.client.acceptor.global2.oerc=https://harc.oerc.ox.ac.uk:9877/harc-acceptor
(Note that Manchester's acceptor is commented out in that fragment.)
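If you want to check which acceptors a harc.properties fragment actually leaves switched on, something like the following would do it. This is only a sketch that reads the key layout shown above (an acceptorset name, then one harc.client.acceptor.<set>.<name> line per acceptor). The function name is hypothetical, and the real HARC client may well interpret the file differently.

```python
# The fragment from this post, copied verbatim for the example.
FRAGMENT = """\
# The global V2.0 acceptor set, good for co-allocating UK NGS,
harc.client.acceptorset=global2
### harc.client.acceptor.global2.vidar=https://harc.vidar.ngs.manchester.ac.uk:9877/harc-acceptor
harc.client.acceptor.global2.oerc=https://harc.oerc.ox.ac.uk:9877/harc-acceptor
"""


def active_acceptors(text: str) -> dict:
    """Return {name: url} for uncommented acceptors in the selected set.

    Hypothetical helper: assumes '#' starts a comment line and that
    acceptors are keyed as harc.client.acceptor.<set>.<name>.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()

    set_name = props.get("harc.client.acceptorset", "")
    prefix = f"harc.client.acceptor.{set_name}."
    return {
        key[len(prefix):]: url
        for key, url in props.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    for name, url in active_acceptors(FRAGMENT).items():
        print(name, url)
```

Run against the fragment above, only the Oxford entry comes back, because the Manchester line starts with a comment marker.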

When - and if - we solve the problem, we'll let you know.
