Monday 20 June 2011

Mind the gap

I can't say that I wasn't warned.

In an attempt to conserve the world's supply of three-letter acronyms - and support people studying the Sun and those studying society - we are trying to deploy grid software called ARC on a compute service called ARC.

The Nordugrid Advanced Resource Connector software is the only component within release 1 of the European Middleware Initiative's big bundle of grid stuff that does what we need:
  • accepting work requests from a Workload Management Server.
  • passing them on to HPC systems which may be running Oracle (formerly Sun) Grid Engine, Torque/PBS or SLURM batch systems. (Leeds is using Grid Engine, but if we are successful, it could be rolled out to other institutions).
The EMI are in an uncomfortable position. Their job is to take pieces of software from different places - pieces that are similar in intent and very different in design - and persuade them to work together. Sometimes, inevitably, things fall through the gaps.

One of these gaps is between gLite's BDII information service and ARC's own information services.

This is how it is meant to work...
  1. Information is kept in a BDII-friendly database and made available to the world via the LDAP protocol through an OpenLDAP 'slapd' service.
  2. On any given system, this information is generated as a set of LDAP 'LDIF' format records by programs called providers and plugins.
  3. A program called bdii-update takes the locally generated LDIF, processes it and passes it on to slapd.
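For the geeks, here is a minimal sketch of that three-step flow. It is not the real bdii-update code: the two provider functions, their sample records and the `update` helper are made up for illustration, and the LDIF parsing ignores line folding and base64 values.

```python
def split_ldif(text):
    """Split LDIF text into {dn: {attribute: [values]}}.

    Simplified: records are separated by blank lines; folded lines and
    base64 ('::') values are not handled.
    """
    entries = {}
    for block in text.strip().split("\n\n"):
        attrs = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue
            name, _, value = line.partition(":")
            attrs.setdefault(name.strip().lower(), []).append(value.strip())
        if "dn" in attrs:
            entries[attrs["dn"][0].lower()] = attrs
    return entries


# Hypothetical providers: each returns one LDIF record as text.
def provider_a():
    return "dn: o=grid\nobjectClass: organization\no: grid\n"


def provider_b():
    return "dn: mds-vo-name=local,o=grid\nobjectClass: MdsVo\nMds-Vo-name: local\n"


def update(providers):
    """Collect the LDIF from every provider and index it by DN.

    A real update step would then hand these entries to slapd.
    """
    combined = "\n".join(provider() for provider in providers)
    return split_ldif(combined)


entries = update([provider_a, provider_b])
```

Here `entries` ends up keyed by DN, which is roughly the shape an update step needs before it can add, modify or delete records in the directory.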
What was actually happening...
  1. The ARC information system was generating lots of LDIF.
  2. The bdii-update process was collating it and passing it on to slapd.
  3. slapd was refusing to accept it - complaining of an 'Object class violation'.
After digging into the inner workings of both the BDII and ARC, we've identified the cause. It is all down to a subtle difference between what Nordugrid expect and what gLite expect from their information services.

From this point on, this is going to be technical. Readers of a less geeky disposition can look away now, happy in the knowledge that we know what broke and how to fix it.

Geeks, grab your acronyms. Here we go...

Slapd relies on schema files to define what is acceptable: Nordugrid have their own Scandinavian-style nordugrid.schema; gLite use the GLUE schemas, including one called Glue-MDS.

Glue-MDS and nordugrid.schema both define an objectClass called 'Mds'. Both agree that it represents a collection of information but in GLUE, an Mds is defined as a STRUCTURAL class whereas Nordugrid defines it as an ABSTRACT class.

"So what?", as anyone who has managed to make it this far down the page might cry.

Well, in the LDAP world, objects of a STRUCTURAL class can exist in their own right, whereas an ABSTRACT class can only be used as a basis upon which other classes are defined. It's all very Object-Oriented-Programming.
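To make the distinction concrete, here is a toy model in Python. The two dictionaries are stand-ins for the schema files rather than their real contents, and `can_stand_alone` is a made-up helper that checks only one rule: an entry needs at least one STRUCTURAL class.

```python
# Toy stand-ins for how each schema classifies 'Mds' (not the real schema files).
GLUE_STYLE = {"Mds": "STRUCTURAL"}
NORDUGRID_STYLE = {"Mds": "ABSTRACT"}


def can_stand_alone(object_classes, schema):
    """True if at least one of the entry's classes is STRUCTURAL in this schema."""
    return any(schema.get(name) == "STRUCTURAL" for name in object_classes)


glue_ok = can_stand_alone(["Mds"], GLUE_STYLE)
nordugrid_ok = can_stand_alone(["Mds"], NORDUGRID_STYLE)
```

The same entry, with the same single objectClass, passes under one definition and fails under the other. A real LDAP server checks a good deal more than this, but this is the rule that matters here.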

ARC's information service generates 'MdsVo' objects, based on Mds objects, but properly STRUCTURAL. This is fine according to the nordugrid schema.

But bdii-update contained code that takes any object that is based on an Mds object and turns it into a plain, simple self-contained Mds object. This is closer to what GLUE expects.

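Slapd gets very confused: with the Nordugrid schema loaded, a plain Mds object is ABSTRACT and cannot exist on its own, hence the 'Object class violation'.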
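The clash can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not the actual bdii-update code: `flatten_to_mds` and `is_acceptable` are hypothetical names, the schema dictionary is a simplified stand-in for nordugrid.schema, and the sample DN is made up.

```python
# Simplified stand-in for the Nordugrid view: Mds is ABSTRACT, MdsVo builds on it.
NORDUGRID_STYLE = {
    "Mds": {"kind": "ABSTRACT", "sup": None},
    "MdsVo": {"kind": "STRUCTURAL", "sup": "Mds"},
}


def derives_from(name, ancestor, schema):
    """Walk the superclass chain from 'name' looking for 'ancestor'."""
    while name is not None:
        if name == ancestor:
            return True
        name = schema.get(name, {}).get("sup")
    return False


def flatten_to_mds(entry, schema):
    """Hypothetical rewrite: replace every Mds-derived class with plain Mds."""
    classes = [
        "Mds" if derives_from(cls, "Mds", schema) else cls
        for cls in entry["objectClass"]
    ]
    # Drop duplicates while keeping the original order.
    return dict(entry, objectClass=list(dict.fromkeys(classes)))


def is_acceptable(entry, schema):
    """Toy check: the entry needs at least one STRUCTURAL class."""
    return any(
        schema.get(cls, {}).get("kind") == "STRUCTURAL"
        for cls in entry["objectClass"]
    )


original = {"dn": "Mds-Vo-name=local,o=grid", "objectClass": ["MdsVo"]}
rewritten = flatten_to_mds(original, NORDUGRID_STYLE)
```

Under these assumptions `original` passes the check and `rewritten` does not, because the rewrite has stripped away the only STRUCTURAL class the entry had.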

A bug report has been raised and, after a bit of bug ping-pong between the BDII and ARC developers, it has been decided that bdii-update should, in future, leave Mds objects alone. For the moment, all that is needed is to remove the line in bdii-update that reads

  new_ldif = fix(new_dns, new_ldif)
