Sunday 21 November 2010

Failing more successfully - getting past Maradona and Condor

It has been nearly a month since the last progress report on Nagios. Which is a shame, because in that time we have made something that looks rather like progress.

The NGS's development Nagios server was at the point where it was throwing tests at NGS partner sites.

The simpler tests - for things such as service certificates reaching their expiry date - are working.
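As an illustration of what one of these simpler tests boils down to, here is a minimal Python sketch of an expiry check. It is not the probe the NGS actually runs: the function name and the 30-day and 7-day thresholds are assumptions made for the example. The only fixed part is the standard Nagios plugin convention of returning 0 for OK, 1 for WARNING and 2 for CRITICAL.

```python
from datetime import datetime, timezone

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_expiry(not_after, now, warn_days=30, crit_days=7):
    """Classify a certificate by the time left before its expiry date.

    not_after and now are timezone-aware datetimes. The thresholds are
    illustrative defaults, not values taken from any real probe.
    """
    if not_after <= now:
        return CRITICAL, "CRITICAL: certificate has expired"
    remaining = (not_after - now).days  # whole days left, rounded down
    if remaining < crit_days:
        return CRITICAL, f"CRITICAL: certificate expires in {remaining} days"
    if remaining < warn_days:
        return WARNING, f"WARNING: certificate expires in {remaining} days"
    return OK, f"OK: certificate expires in {remaining} days"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    expiry = datetime(2011, 11, 21, tzinfo=timezone.utc)  # made-up expiry date
    code, message = check_expiry(expiry, now)
    print(message)
    raise SystemExit(code)
```

A real probe would first have to fetch the certificate from the service and read its expiry date, which this sketch leaves out.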

We have had less success with the more sophisticated tests - like those that poke every nook and cranny of a Compute Element.

A few sites - notably those in Scotgrid - are accepting the tests and running them to completion, but we only see part of the results. For other sites we get the infamous
Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona.
error message.

In both cases, the same test - the CE-probe - is involved. This is thrown at all sites that advertise Compute Elements in the GOCDB database of all things griddy.

This test makes use of the Nagios concepts of active and passive tests. In an active test, the Nagios service runs some bit of code and expects that bit of code to provide a result. In a passive test, there is no explicit test code and results are fed in by whatever means necessary.
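One common "means necessary" for a passive result is the Nagios external command file, which accepts PROCESS_SERVICE_CHECK_RESULT lines. The Python sketch below only builds such a line; the host and service names in it are made up, and it is an illustration of the format rather than the way the CE-probe itself delivers results.

```python
import time


def passive_result_line(host, service, return_code, output, timestamp=None):
    """Format a passive service check result as a Nagios external command.

    return_code follows the usual plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    # An external command must fit on one line, so flatten the output.
    output = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )


if __name__ == "__main__":
    # Hypothetical host and service names, for illustration only.
    print(passive_result_line("ce.example.org", "example-passive-test", 0,
                              "OK: job ran to completion"))
```

Writing the resulting line, followed by a newline, to the command file named in the server's configuration is what feeds the result into Nagios.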

The CE-probe appears within Nagios as one active test and a whole raft of passive ones. The active test delivers a bundle of tests to the site - via a Workload Management service (WMS) - and checks on its progress. At various stages in the life of the bundle, the passive test results are updated.

Some passive test results are generated from the Nagios server itself; others are sent directly from the system under test via the next available Message Bus.

When the bundle of tests runs successfully, we see the results generated from within the Nagios server but not those coming from the message bus. This is because the development service uses a message broker that sits outside the core set of brokers used by WLCG. A workaround for this is coming any day now.

The Maradona message appears when the bundle of tests doesn't run at all.

It is a by-product of the script generated within the WMS and sent on to the site and, in particular, how this script handles 'Shallow' resubmission.

A shallow failure is one where the job is rejected and can be tried elsewhere. The WMS touts the job around the grid until it finds a system prepared to accept it. Acceptance is signified by the deletion of a marker file using GridFTP.

Which is all very well, as long as the machine on which the script is running has software that is able to delete a file using GridFTP.

gLite-based systems usually have something suitable; those using the NGS VDT-based installer do not. If this step fails, the script gives up early and prints the Maradona message.

A VDT-based system can be persuaded to run the WMS-generated script by installing the UberFTP tool using
  pacman -get http://vdt.cs.wisc.edu/vdt_181_cache:UberFTP

(Pick a different cache if you are using something other than the elderly version 1.8.1 of VDT.)
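A quick way to see whether a node has a usable client on its PATH, before or after the install, is sketched below in Python. The default list holds only uberftp, the command that UberFTP provides; the WMS-generated script may look for other tools as well, and their names would have to be added by hand. The function name is made up for this example.

```python
import shutil


def find_gridftp_client(candidates=("uberftp",)):
    """Return the full path of the first candidate found on PATH, or None.

    The candidate list is an assumption: only "uberftp" comes from the
    text above, so extend it to match what your job wrapper really uses.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    client = find_gridftp_client()
    if client is None:
        print("No GridFTP client found - expect the Maradona message")
    else:
        print(f"Found GridFTP client: {client}")
```

This only shows that the command exists. It says nothing about whether it can authenticate and delete the marker file, so treat a positive answer as necessary rather than sufficient.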

UberFTP provides enough GridFTP support to allow the bundle of tests to run - though we have yet to persuade them to run to completion. I would call that a more successful failure.

Anyone attending the HEPSYSMAN meeting in Birmingham on 22 November will have the opportunity to hear, and ask questions, about what we needed to do to persuade WLCG Nagios to work on the weirder bits of the NGS.

[Edit 2010-11-24 fixing typos]
