Friday 28 January 2011

Adding NGS tests to WLCG Nagios

When we last mentioned the NGS project to deploy WLCG Nagios, we had most of the basic WLCG tests running against 'classic' and 'CREAM' compute elements.

We are now extending the WLCG code with some NGS-specific tests.

In particular, we are adding to the set of tests that are run on individual worker nodes as part of the 'CE', and eventually the 'CREAM-CE' tests.

This is not exactly a common requirement, so documentation is understandably sparse. The best place to start seems to be https://twiki.cern.ch/twiki/bin/view/LCG/PracticalHintsForMigrating2Nagios

The test we will use to test the testing service is deliberately simple. It is a Nagios-style plugin that checks whether a site supports the 'Uniform Execution Environment' (UEE) conventions by looking for a /usr/ngs directory. If the directory is missing, this is an error; if it is empty, this warrants a warning; otherwise everything is OK.
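To make the three outcomes concrete, here is a minimal Python sketch of such a check. It is not the actual WN-uee script. The function name and message wording are made up for illustration, and only the /usr/ngs path and the error/warning/OK rules come from the description above. The exit codes are the standard Nagios plugin ones.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_uee(path="/usr/ngs"):
    """Return (exit_code, status_line) for the directory at `path`.

    Hypothetical helper: missing directory -> CRITICAL,
    empty directory -> WARNING, anything else -> OK.
    """
    if not os.path.isdir(path):
        return CRITICAL, "CRITICAL: %s is missing" % path
    try:
        entries = os.listdir(path)
    except OSError as err:
        # The directory exists but cannot be read, so we cannot judge it.
        return UNKNOWN, "UNKNOWN: cannot read %s (%s)" % (path, err)
    if not entries:
        return WARNING, "WARNING: %s is empty" % path
    return OK, "OK: %s contains %d entries" % (path, len(entries))


def main():
    """Print the one-line status Nagios expects and return the exit code."""
    code, message = check_uee()
    print(message)
    return code
```

A real plugin would finish with `sys.exit(main())`, so that Nagios sees both the status line and the matching exit code.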

We know that WLCG-Nagios uses a mixture of active and passive tests. Active tests deliver results immediately while the results of passive tests filter in slowly via the message broker.

Our initial plan was to extend the CE-probe tests. The CE-probe works by...
  • building a compressed tar file containing some nagios tests, a copy of nagios to run them, and bits of python to deliver the results to the message broker.
  • generating a JDL that describes how to fetch the tar file and run the tests within it.
The key is a script called nagrun.sh, which runs nagios on a remote machine, collects the test results and throws them at a message broker. The broker should deliver them to the main Nagios server, where they reappear as passive test results.
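For anyone who has not met passive checks before: stock Nagios accepts an externally produced result as a PROCESS_SERVICE_CHECK_RESULT external command. The sketch below only formats such a line. The host and service names in the example are invented, and it does not show how WLCG Nagios turns broker messages into passive results internally.

```python
import time


def passive_result_line(host, service, return_code, output, timestamp=None):
    """Format a Nagios PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    # An external command is a single line, so keep only the first
    # line of the plugin output.
    first_line = output.splitlines()[0] if output else ""
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        timestamp, host, service, return_code, first_line)


# Hypothetical worker node and service name, for illustration only.
print(passive_result_line("wn01.example.org", "uk.ac.ngs.WN-uee",
                          1, "WARNING: /usr/ngs is empty",
                          timestamp=1296172800))
```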

The CE-probe allows additional directory trees to be added to the tar file, as long as they look rather like...

/usr/libexec/grid-monitoring/probes
|
`-- uk.ac.ngs
    `-- wnjob
        |-- uk.ac.ngs
        |   |-- etc
        |   |   `-- wn.d
        |   |       `-- uk.ac.ngs
        |   |           |-- commands.cfg
        |   |           `-- services.cfg
        |   `-- probes
        |       `-- uk.ac.ngs
        |           `-- WN-uee
        `-- uk.ac.ngs.gridJob.jdl.template

This is mostly directories and subdirectories. The only real files are WN-uee, which is the test script; the *.cfg files, which are nagios configuration files describing how to run it; and the *.jdl.template file, which is used when writing the JDL.

Eagle-eyed readers may have noticed lots of uk.ac.ngs's scattered around.

This serves as a convenient namespace - it exists to stop files in this directory tree inadvertently overwriting those from another tree when the tar file is being created.
The convention used in WLCG Nagios is that the namespace should be your organisation written backwards. Argue not will I.
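As a small illustration of the convention, the sketch below derives a namespace from a domain name and shows that two organisations' identically named probes end up on different paths. The helper names and the example.org domain are hypothetical.

```python
import posixpath


def namespace_for(domain):
    """Reverse the dot-separated labels of a domain name."""
    labels = domain.strip(".").lower().split(".")
    return ".".join(reversed(labels))


def namespaced_path(domain, *parts):
    """Place a relative path underneath the organisation's namespace."""
    return posixpath.join(namespace_for(domain), *parts)


print(namespace_for("ngs.ac.uk"))  # -> uk.ac.ngs

# Two organisations can both ship a probe called WN-uee without clashing.
print(namespaced_path("ngs.ac.uk", "probes", "WN-uee"))
print(namespaced_path("example.org", "probes", "WN-uee"))
```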

Incorporating the new directories involves adding extra arguments to the CE-probe:

--add-wntar-nag-nosamcfg
--add-wntar-nag /usr/libexec/grid-monitoring/probes/uk.ac.ngs/wnjob/uk.ac.ngs
--jdl-templ /usr/libexec/grid-monitoring/probes/uk.ac.ngs/wnjob/uk.ac.ngs.gridJob.jdl.template

The first of these turns off the standard WLCG 'SAM' tests. The GridPP Nagios service is already checking those.
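Because both paths follow from the namespace, the extra arguments can be generated rather than typed by hand. The sketch below is only a convenience wrapper around the three options quoted above. The function name and the skip_sam parameter are hypothetical, and it assumes the probes live under /usr/libexec/grid-monitoring/probes as in the tree shown earlier.

```python
import posixpath

PROBES_ROOT = "/usr/libexec/grid-monitoring/probes"


def extra_ce_probe_args(namespace, skip_sam=True):
    """Build the extra CE-probe arguments for one namespaced probe tree."""
    wnjob = posixpath.join(PROBES_ROOT, namespace, "wnjob")
    args = []
    if skip_sam:
        # Leave the standard SAM tests out of the worker-node tar file.
        args.append("--add-wntar-nag-nosamcfg")
    # Directory tree to add to the tar file.
    args += ["--add-wntar-nag", posixpath.join(wnjob, namespace)]
    # Template used when writing the JDL.
    args += ["--jdl-templ",
             posixpath.join(wnjob, namespace + ".gridJob.jdl.template")]
    return args


print(" ".join(extra_ce_probe_args("uk.ac.ngs")))
```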

At the time of writing, a grand total of one site has passed (congratulations, Glasgow Scotgrid) and the test has flagged up a few sites that do not support the UEE conventions.

While we can claim one successful success and several successful failures, there are a lot of sites where the results have yet to arrive.

These laggards include all the old core NGS sites, all of which support UEE but use the Virtual Data Toolkit rather than gLite for grid software. We have tested the test-test on one of these sites and know it works. The next step is to find out why the results are getting lost on the way home.
