Friday 15 October 2010

It's alive

If you work at one of those institutions that let users choose names for their own computers, it is inevitable that, sooner or later, someone will claim the name 'Elvis'.

Occasionally, this will be because they like his music. In most cases, it is because they want to be able to run 'ping elvis' and be told that, despite the events of August 16 1977:
 elvis is alive [*]

'Ping', the friendly name of an ICMP echo request packet, was invented as a way of testing network connectivity. The original idea was that if a machine on the Internet was working and it was pinged, it should send back the contents of the 'ping' to the sender as an ICMP echo reply.
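For the curious, here is a rough Python sketch of what that exchange looks like on the wire. It only builds the bytes of an echo request and the matching reply; it does not send anything, since that needs a raw socket and usually root. The function names are my own invention for illustration, and the type values (8 for a request, 0 for a reply) and the checksum are as described in RFC 792.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type for an echo request (RFC 792)
ICMP_ECHO_REPLY = 0    # ICMP type for an echo reply


def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum used by ICMP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_icmp(icmp_type: int, identifier: int, sequence: int, payload: bytes) -> bytes:
    """Build an ICMP echo message (8-byte header plus payload) with a valid checksum."""
    # The checksum is computed with the checksum field itself set to zero.
    header = struct.pack("!BBHHH", icmp_type, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", icmp_type, 0, checksum, identifier, sequence)
    return header + payload


def echo_reply_for(request: bytes) -> bytes:
    """Return the reply a working host would send: same id, sequence and payload."""
    icmp_type, _code, _checksum, identifier, sequence = struct.unpack("!BBHHH", request[:8])
    if icmp_type != ICMP_ECHO_REQUEST:
        raise ValueError("not an echo request")
    return build_icmp(ICMP_ECHO_REPLY, identifier, sequence, request[8:])
```

Calling build_icmp(ICMP_ECHO_REQUEST, 0x1234, 1, b"elvis") and passing the result to echo_reply_for gives back a packet that carries the same payload, which is all 'alive' really means here.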

Ping dates from when the Internet was a smaller and nicer place. These days, pings are seen as a security issue and are frequently blocked at campus and departmental firewalls.

So, once Elvis has left the building, you may never know if he is still alive.

This is a real issue for the current WLCG Nagios deployment. The Nagios service is hosted at the Rutherford Appleton Laboratory but the sites that make up the NGS and GridPP are spread around the country behind many different firewalls. Some hosts can be pinged from off site; some can't.

Nagios has the concept of hosts and services provided by those hosts. It will only check services if the associated host is working. The usual way of testing a host is by sending a ping. If pings are not permitted, no service on that host is tested.
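The consequence is easier to see in a toy model. The Python below is a deliberately simplified sketch of the behaviour just described, not Nagios's actual scheduling code; run_checks, the status strings and the host names in the usage note are all made up for illustration.

```python
from typing import Callable, Dict, List


def run_checks(
    hosts: Dict[str, List[str]],
    host_is_up: Callable[[str], bool],
    check_service: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Check each host first, and only test its services if the host check passes."""
    results: Dict[str, str] = {}
    for host, services in hosts.items():
        if not host_is_up(host):
            # Host check failed (or the ping was blocked): services are skipped.
            results[host] = "DOWN"
            for service in services:
                results[f"{host}/{service}"] = "NOT CHECKED"
            continue
        results[host] = "UP"
        for service in services:
            ok = check_service(host, service)
            results[f"{host}/{service}"] = "OK" if ok else "CRITICAL"
    return results
```

If host_is_up is a ping and a firewall eats the ping, a perfectly healthy host such as the hypothetical firewalled.example.org is marked down and its services come back as 'NOT CHECKED', however well they are actually running.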

The NCG utility that generates Nagios configurations can use a dummy test in place of a ping test for all hosts. To enable this, the /etc/ncg/ncg.conf configuration file needs to be changed to include a line setting CHECK_HOSTS to zero:

<NCG::ConfigGen>

<Nagios>
...
# Disable 'ping' checks of hosts
CHECK_HOSTS=0

</Nagios>
</NCG::ConfigGen>
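Before rerunning ncg, it can be handy to confirm that the setting really is in the file and not commented out. The helper below is a quick sketch under my own assumptions: check_hosts_setting is a hypothetical name, it just scans lines for CHECK_HOSTS and pays no attention to which block the line sits in, so it is no substitute for ncg's own parser.

```python
import re
from typing import Optional


def check_hosts_setting(config_text: str) -> Optional[int]:
    """Return the last CHECK_HOSTS value found in the text, or None if it is not set."""
    value = None
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore comment lines
        match = re.match(r"CHECK_HOSTS\s*=\s*(\d+)\s*$", stripped)
        if match:
            value = int(match.group(1))
    return value


if __name__ == "__main__":
    # Path taken from the text above; adjust if your configuration lives elsewhere.
    with open("/etc/ncg/ncg.conf") as handle:
        print(check_hosts_setting(handle.read()))
```

A result of 0 means the dummy host test is in use; None means the line is missing or commented out.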

When ncg is run and a new Nagios configuration is built, all services on all hosts are tested. On the downside, if a host really has dropped off the network, Nagios will continue to test the services and generate alerts.

Now, if you will excuse me, I had better stop writing about Nagios and go back to configuring it. As someone once said: a little less conversation, a little more action please.

[*] For the pedants: the ping command on Linux runs until you stop it and then prints statistics. You will need to find a Solaris system or a Cisco switch to actually see this message.
