Friday 25 February 2011

See SPOT run

Just when you thought it was safe to go back to the Internet, up pops another Acronym.

Meet SPOT.

You won't - yet - find SPOT in the Grid Acronym Soup because it is not one of ours. It is an escapee from another of the great sources of Acronyms - the world of IT Business Systems.

SPOT stands for Single Point Of Truth.

A SPOT is not a single genuine fact that slipped into a Business Systems sales pitch. Nor does the name imply that Business Systems remind those involved of an inflamed, infected ball of pus. SPOTs are, in general, good.

If you have a SPOT, then you know there is one-and-only-one definitive source for any piece of information - whether it is a price, a name, a salary or an office number.

As anyone who has dealt with a large organisation will appreciate, we do not have as many SPOTs as we should.

On the Grid - which is large, diverse and dispersed by its nature - single points of truth are very hard to find.

Which makes deciding what we should test with Nagios... interesting.

The NCG configuration generator which writes the Nagios configuration needs to know:
  • What sites to test.
  • What services to test at those sites.
  • What tests to run for each service.
Perhaps the closest we have to a SPOT is the Grid Operations Centre Database or GOCDB - which lists every site on the European Grids, their official downtimes and some of the services they provide.

The 'some of' is there because the GOCDB defines services in terms of service endpoints - which represent a host within a site acting as, say, a Compute Element or a GSISSH server or a GridFTP server.

There are a comparatively small number of predefined endpoints and these will never cover everything a site can offer - you cannot, for example, advertise an iRODS service.

The GOCDB does not directly provide information about the Virtual Organisations that a service is prepared to support but it should point anyone wanting this information at a site information service willing and able to provide it.

For our first attempt at a WLCG-like Nagios service...
  • We collect the list of sites from the GOCDB - we take any site flagged as belonging to the NorthGrid, SouthGrid, ScotGrid or London Tier 2 subgrids within the UK and Ireland Region.
  • We only test services for which GOCDB service endpoints are defined.
  • We define the tests for each endpoint within the Perl code of NCG. There is a 'standard' set of tests defined within a Perl module called NCG::LocalMetrics::Hash which forms part of the NCG package.
    We modified the module to include local changes from an NCG::LocalMetrics::Hash_local module - a change that has been adopted by the NCG maintainers.
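
For the curious, the three steps above can be sketched in a few lines. This is an illustration only, written in Python rather than NCG's Perl, and it is not how NCG is actually implemented. The site names, host names, test names and record layout are all made up, and the rule that a local entry replaces the standard entry for the same service type is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    site: str          # site name (hypothetical)
    host: str          # host acting as the service (hypothetical)
    service_type: str  # e.g. "CE", "GridFTP", "GSISSH"


# Subgrids we want to test; the exact labels are assumed for illustration.
WANTED_SUBGRIDS = {"NorthGrid", "SouthGrid", "ScotGrid", "London Tier 2"}

# Made-up stand-ins for what a GOCDB query might return.
SITES = [
    {"name": "SITE-A", "subgrid": "NorthGrid"},
    {"name": "SITE-B", "subgrid": "ScotGrid"},
    {"name": "SITE-C", "subgrid": "SomewhereElse"},
]
ENDPOINTS = [
    Endpoint("SITE-A", "ce.site-a.example", "CE"),
    Endpoint("SITE-A", "gsissh.site-a.example", "GSISSH"),
    Endpoint("SITE-B", "ftp.site-b.example", "GridFTP"),
    Endpoint("SITE-C", "ce.site-c.example", "CE"),
]

# Stand-ins for the 'standard' and local test tables; test names are invented.
STANDARD_TESTS = {"CE": ["check-job-submit"], "GridFTP": ["check-transfer"]}
LOCAL_TESTS = {"GSISSH": ["check-login"]}


def select_sites(sites, subgrids):
    """Step 1: keep only the sites flagged as belonging to a wanted subgrid."""
    return [s["name"] for s in sites if s["subgrid"] in subgrids]


def merge_tests(standard, local):
    """Step 3: overlay local definitions on the standard set (assumed rule)."""
    return {**standard, **local}


def build_plan(sites, endpoints, subgrids, tests):
    """Step 2: only endpoints at selected sites, with known tests, get tested."""
    wanted = set(select_sites(sites, subgrids))
    plan = {}
    for ep in endpoints:
        if ep.site in wanted and ep.service_type in tests:
            plan[(ep.site, ep.host, ep.service_type)] = list(tests[ep.service_type])
    return plan


selected = select_sites(SITES, WANTED_SUBGRIDS)
all_tests = merge_tests(STANDARD_TESTS, LOCAL_TESTS)
plan = build_plan(SITES, ENDPOINTS, WANTED_SUBGRIDS, all_tests)

for (site, host, service_type), tests in sorted(plan.items()):
    print(site, host, service_type, tests)
```

The point of the sketch is that everything hangs off the endpoint list: a service with no endpoint never appears in the plan, however many tests are defined for it.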
As an approach, it works well enough for Nagios tests but not for the friendly-front-end MyEGI.

MyEGI gets its truth from elsewhere: from the Aggregated Topology Provider (ATP). The ATP is a sort of single point of single points of truth. It swallows data from the Metric Description Database (MDDB) and from Virtual Organisation feeds. It is scarily complicated in places - as you might be able to gather by looking at the MDDB and ATP database schemas.

The European Grid Initiative exists to make the grid work better, in part by giving us nicer SPOTs, and is encouraging development of the ATP and friends. The curious can find out more on the SAM and Nagios wiki pages at CERN.

The Single Point of Truth is Out There....
