Friday 28 January 2011

Adding NGS tests to WLCG Nagios

When we last mentioned the NGS project to deploy WLCG Nagios, we had most of the basic WLCG tests running against 'classic' and 'CREAM' compute elements.

We are now extending the WLCG code with some NGS-specific tests.

In particular, we are adding to the set of tests that are run on individual worker nodes as part of the 'CE', and eventually the 'CREAM-CE' tests.

This is not exactly a common requirement, so documentation is understandably sparse. The best place to start seems to be https://twiki.cern.ch/twiki/bin/view/LCG/PracticalHintsForMigrating2Nagios

The test we will use to test the testing service is deliberately simple. It is a Nagios-style plugin that checks if a site supports the 'Uniform Execution Environment' conventions. It looks for a /usr/ngs directory. If the directory is missing, this is an error; if it is empty, this warrants a warning; otherwise everything is OK.
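For the curious, a check along these lines might look something like the sketch below. It is not the real WN-uee script: the function names and message wording are invented for illustration, and it assumes only the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
"""Nagios-style check for the UEE directory (illustrative sketch only)."""
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_uee(path="/usr/ngs"):
    """Return (exit_code, message) describing the directory at `path`."""
    if not os.path.isdir(path):
        return CRITICAL, "UEE CRITICAL - %s is missing" % path
    if not os.listdir(path):
        return WARNING, "UEE WARNING - %s is empty" % path
    return OK, "UEE OK - %s is present and populated" % path


def main():
    """Print the one-line status and hand back the exit code."""
    code, message = check_uee()
    print(message)
    return code
```

To use something like this as an actual plugin, the script would finish by passing the value returned by main() to sys.exit(), since Nagios reads the status from the exit code.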

We know that WLCG-Nagios uses a mixture of active and passive tests. Active tests deliver results immediately while the results of passive tests filter in slowly via the message broker.

Our initial plan was to extend the CE-probe tests. The CE-probe works by...
  • building a compressed tar file containing some nagios tests, a copy of nagios to run them, and bits of python to deliver the results to the message broker.
  • generating a JDL that describes how to fetch the tar file and run the tests within it.
The key is a script called nagrun.sh which runs nagios on a remote machine, collects the test results and throws them at a message broker. The broker should deliver them to the main nagios server, where they reappear as passive test results.

The CE-probe allows additional directory trees to be added to the tar file, as long as they look rather like...

/usr/libexec/grid-monitoring/probes
`-- uk.ac.ngs
    `-- wnjob
        |-- uk.ac.ngs
        |   |-- etc
        |   |   `-- wn.d
        |   |       `-- uk.ac.ngs
        |   |           |-- commands.cfg
        |   |           `-- services.cfg
        |   `-- probes
        |       `-- uk.ac.ngs
        |           `-- WN-uee
        `-- uk.ac.ngs.gridJob.jdl.template

This is mostly directories and subdirectories. The only real files are WN-uee, which is the test script; the *.cfg files, which are nagios configuration files describing how to run it; and the *.jdl.template file, which is used when writing the JDL.

Eagle-eyed readers may have noticed lots of uk.ac.ngs's scattered around.

This serves as a convenient namespace - it exists to stop files in this directory tree inadvertently overwriting those from another tree when the tar file is being created.
The convention used in WLCG Nagios is that the namespace should be your organisation written backwards. Argue not will I.
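As a throwaway illustration of the convention (the function name is made up for this example), reversing a domain name takes one line of Python:

```python
def reverse_domain(domain):
    """Turn a domain such as 'ngs.ac.uk' into the namespace 'uk.ac.ngs'."""
    return ".".join(reversed(domain.split(".")))


print(reverse_domain("ngs.ac.uk"))  # uk.ac.ngs
```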

Incorporating the new directories involves adding extra arguments to the CE-probe:

--add-wntar-nag-nosamcfg
--add-wntar-nag /usr/libexec/grid-monitoring/probes/uk.ac.ngs/wnjob/uk.ac.ngs
--jdl-templ /usr/libexec/grid-monitoring/probes/uk.ac.ngs/wnjob/uk.ac.ngs.gridJob.jdl.template

The first of these turns off the standard WLCG 'SAM' tests. The GridPP nagios service is already checking those.

At the time this blog post was being written, a grand total of one site had passed - congratulations Glasgow Scotgrid - and the test had flagged up a few sites that do not provide the UEE.

While we can claim one successful success and several successful failures, there are a lot of sites where the results have yet to arrive.

These laggards include all the old core NGS sites - all of which support UEE, but use Virtual Data Toolkit rather than gLite for grid software. We have tested the test-test on one of these sites and know it works. The next step is to find out why the results are getting lost on the way home.

Thursday 27 January 2011

Fixing the mangled XML in the SARoNGS service

A little technical note following from last week's posting on SARoNGS.

If you are averse to perl, XML and regular expressions, look away now. There will be a proper R+D blog post along shortly.

Someone at a recent NGS Surgery asked for the gory technical details of how we turned the corrupted XML that was breaking the SARoNGS service into something that once-again matched its cryptographic signature.

The SARoNGS web front-end at https://cts.ngs.ac.uk is written in Perl. It relies on Shibboleth to obtain user attributes from identity providers, encode them in Base64 and deliver them via a custom http header called 'Shib-Attributes'.

The Apache web server will eventually present this to our perl code in an environment variable called HTTP_SHIB_ATTRIBUTES.

We realised that, under some circumstances, an additional set of xmlns:xs and xmlns:xsi namespace declarations was being added to the <saml1p:Response ...> XML tag generated by newer versions of the Shibboleth idP.

These were always inserted at the end of the tag, before the final '>' and just after the ResponseID attribute. Removing them meant...
  • turning the base64 encoded data back into XML,
  • using a perl regular expression to remove the cruft and restore the XML to canonical form
  • turning the correctly canonicalised XML back into Base64.
or in perl...
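use MIME::Base64;

# Shibboleth delivers the Base64-encoded attributes in this variable
my $encodedData = $ENV{HTTP_SHIB_ATTRIBUTES};

...

# Turn the Base64-encoded data back into XML
my $shibAttrXML = MIME::Base64::decode_base64($encodedData);

# Drop whatever sits between the ResponseID attribute and the closing '>'
for ($shibAttrXML) {
    s{(<saml1p:Response.*?ResponseID="_[0-9a-f]+")(.*?)(>)}{$1$3}m;
}

# Re-encode as Base64; the empty second argument suppresses line breaks
my $encodedDataCanonical = MIME::Base64::encode_base64($shibAttrXML, '');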


It's a workaround, not a fix, but it is a workaround that works.
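For those who would rather not read Perl, the same three steps can be sketched in Python and tried out on a made-up message. Everything in the sample is an assumption for illustration: the ResponseID value, the payload element and the exact text of the extra declarations are invented, not captured from a real idP.

```python
import base64
import re

# Hypothetical stand-in for the XML a Shibboleth idP might send.
ORIGINAL = ('<saml1p:Response xmlns:saml1p="urn:oasis:names:tc:SAML:1.0:protocol"'
            ' ResponseID="_0123abcd"><payload/></saml1p:Response>')

# Hypothetical extra declarations of the kind described above.
EXTRA = (' xmlns:xs="http://www.w3.org/2001/XMLSchema"'
         ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"')

# Same pattern as the Perl: keep everything up to ResponseID, keep the '>'.
TAG = re.compile(r'(<saml1p:Response.*?ResponseID="_[0-9a-f]+")(.*?)(>)')


def strip_extra_declarations(encoded):
    """Decode Base64, remove text after ResponseID in the tag, re-encode."""
    xml = base64.b64decode(encoded).decode("utf-8")
    repaired = TAG.sub(r"\1\3", xml, count=1)
    return base64.b64encode(repaired.encode("utf-8")).decode("ascii")


# Simulate the corruption, then undo it.
corrupted = ORIGINAL.replace('ResponseID="_0123abcd"',
                             'ResponseID="_0123abcd"' + EXTRA)
encoded = base64.b64encode(corrupted.encode("utf-8")).decode("ascii")
repaired = base64.b64decode(strip_extra_declarations(encoded)).decode("utf-8")
print(repaired == ORIGINAL)
```

Like the Perl, this only works because the unwanted text always lands in the same place; it is pattern matching on a string, not XML processing.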

Tuesday 25 January 2011

Taverna and Cloud on the NGS website

In case you missed it (and we wouldn’t want that now!), a quick heads-up about some new articles on the NGS website.

First up is an article on running Taverna workflows on the NGS. Mike Jones (NGS) and Donal Fellow (RCS) from the University of Manchester have been tackling this issue and are making rapid progress. The success of this project could result in an increase in NGS usage by the large Taverna user community.

The second article is an update on the NGS cloud prototype which has been running since last summer when we asked for users willing to try this new service. We were inundated with volunteers and this article by David Fergusson (NGS, NeSC) looks at the user communities taking advantage of this prototype service.

Thursday 20 January 2011

Single Sign On - The Movie

Ladies and Gentlemen, take a seat, grab some popcorn and don't forget to turn off your mobile phone - because the NGS is going to the movies....

In our feature presentation, we join an intrepid explorer as he connects to the Grid using only his institutional credentials and some slightly-annoying background music.



At 3 minutes and 44 seconds, it is considerably shorter than Avatar, but if you can't wait (spoiler alert!), the plot is:

  • Our hero visits the NGS SARoNGS service.
  • He authenticates himself using Shibboleth and his institutional username and password.
  • He clicks a button or two and is rewarded with credentials that allow him ssh command line access to a grid enabled machine.
  • And they all live happily ever after.
In the future as seen by Project Moonshot, we will be able to use institutional credentials anywhere. We can already make it most of the way using existing technology - a sort of Project Apollo 13?

The Making Of...

No modern movie is complete without a 'The Making Of...' documentary to fill those extra bytes at the end of the DVD. So we will also let you see behind the movie magic...

When you click on 'Login' on the 'SARoNGS' service provider - https://cts.ngs.ac.uk - your web browser does the Shibboleth Shuffle: passing you via the 'Where Are You From' (WAYF) service to your home institution's 'Identity Provider' (idP) and then back to http://cts.ngs.ac.uk.

In the last step of the shuffle, a blob of XML is delivered that means `we at the University of Nether Wallop do solemnly swear that this is one of our users'.

Now that it knows that you are a reasonable member of society, the SARoNGS service and your local Identity Provider immediately start talking about you behind your back. In the chatter, your Identity Provider passes on one or more Shibboleth Attributes that describe who you are and what you do.

Shibboleth Attributes can be nearly-anonymous or as personal as names, email addresses or even photos so the UK Access Management Federation has strong recommendations for what can be revealed. Unless legal agreements are in place, an idP only need reveal your unique pseudonymous identifier and your role.

The Shibboleth Assertions are passed from cts.ngs.ac.uk to a separate authentication service based around a modified MyProxy server.

The authentication service only cares about the unique pseudonymous identifier - or eduPersonTargetedId - and creates and manages short-lived certificates on its behalf. These certificates have a distinguished name that looks like
/DC=uk/DC=ac/DC=ngs/DC=sarongs/CN=(a very long string of hexadecimal digits)
The very long string of hexadecimal digits is a cryptographic hash of the eduPersonTargetedId.
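As a rough illustration of the idea, the sketch below builds a name of that shape from an identifier. The choice of SHA-256, the function name and the sample identifier are all assumptions made for this example; which hash the real service uses is not stated here.

```python
import hashlib

DN_PREFIX = "/DC=uk/DC=ac/DC=ngs/DC=sarongs/CN="


def sarongs_style_dn(targeted_id):
    """Build a distinguished name whose CN is a hash of the identifier.

    SHA-256 is an assumption for illustration only.
    """
    digest = hashlib.sha256(targeted_id.encode("utf-8")).hexdigest()
    return DN_PREFIX + digest


# A made-up identifier, not a real eduPersonTargetedId.
print(sarongs_style_dn("example-opaque-identifier"))
```

The useful property is that the same identifier always gives the same name, while the name itself reveals nothing readable about the person behind it.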

The authorisation service sends the certificate back to cts.ngs.ac.uk where it is associated with one or more Virtual Organisations (VO).

The default VO, 'ukfederation.org.uk', represents anyone from an institution within the UK Access Management Federation. You can also sign up for an NGS account with a SARoNGS credential, at which point you will be eligible for membership of the 'ngs.ac.uk' VO.

The certificate and VO information is stored on the NGS's official MyProxy server myproxy.ngs.ac.uk under a unique username and a random password.

The SARoNGS service has done its duty. Now the MyProxy enabled Gsisshd (MEG) takes over.

MEG allows an ordinary ssh client to be used to access a grid-enabled service. It accepts a username, a MyProxy server and a password, uses these to download a (proxy) certificate, and uses that certificate to authenticate you.

ngs.leeds.ac.uk has a version of MEG running on port 2223. We have made some changes - described in technical detail in the bonfire-night R+D posting - to allow certificates with only ukfederation.org.uk membership to log on without being given full command line access.

The Out-takes...

The MEG service at Leeds has been running, and accepting SARoNGS and ukfederation.org.uk certificates, since early December 2010.

We have kept quiet about it not because we are naturally modest and unassuming, but because we would have looked like a bunch of bumbling idiots.

There were some places where the SARoNGS service resolutely refused to work. If you were based at one of the unfortunate institutions and tried to reproduce what you saw in the movie, you would have got to the end of the Shibboleth Shuffle and been rudely informed that:
MyProxy didn't like me
We have known why MyProxy is being so unfriendly since November. The XML representing the Shibboleth Attributes is digitally signed and, at some point on its journey, it is corrupted so the signature is no longer valid.

The fault seemed independent of the version of the idP software deployed but did depend on which attributes were released.

Earlier this week, we worked out why.

It is very subtle, very Shibboleth and another magnificent example of XML biting back.

Before it is signed at the idP, the XML of the Shibboleth Assertions is first converted to a canonical form, a process that needs to take XML namespaces into account.

When the attributes were reconstituted on cts.ngs.ac.uk ready to be passed to the authorization service, additional namespace declarations were inserted, scrambling the signature.

We are still not clear where or why this happens. It might be related to typos in the Shibboleth configuration which left certain Attributes missing a default XML namespace.

The typos are fixed in version 2.2.1 of the idP and, thanks to NeSC Glasgow, we can confirm that this version can send all the attributes it wants with no repercussions.

Working around the problem was trivial. The additional declarations always appeared in the same place - at the very end of a saml1p:Response tag - so we simply removed them again.

The Embarrassing thank-you speech...

Like an Oscar winner, we have a large number of people to thank for their contributions.

These include the people at the NGS partner sites at RAL and Manchester and those people at Glasgow, UCL and Sussex that helped identify and debug the SARoNGS problems.

We would particularly like to thank John Watt from what used to be NeSC Glasgow for taking the time at last week's NeISS meeting to help generate test cases.

The inevitable sequel...?

SARoNGS is built around elderly and currently unsupported versions of Shibboleth and MyProxy.

The web user interface is seen as confusing by less experienced users.

If it is to continue running, it will need further development.

SARoNGS is unique in that it makes the Grid available to people who cannot or will not use browser-based certificates - and that makes it the real star of the movie.

Tuesday 18 January 2011

Last chance for the NGS user survey

This is it. The closing date for the NGS user survey is nigh - well this Monday anyway!

We've had a great response to our survey this year and I would like to really thank everyone for completing the form and giving us all your feedback. However I would like a couple more users to complete the survey just to make this one our best ever.

Once the survey closes on Monday I then have the enviable task of taking all your feedback and putting together a report which will be made available on the NGS website summarising all your responses and feedback. The report will also be used to feed into our bid for the continuation of the NGS which we will soon be formulating.

So if you would like your views and opinions to be incorporated into the user survey report and the next phase of the NGS, be sure to complete the user survey before the end of Monday 24th January!

PS and don't forget every completed survey which has an email address included is entered into a prize draw for one of THREE Amazon vouchers!

Friday 14 January 2011

A different kind of social network

Those of us involved in NGS support regularly email our users, sometimes phone them and occasionally see them pixelated in a window of an AccessGrid session - but we seldom get a chance to see what they do for a living.

When we do, it is a welcome reminder that we do the dull-but-useful stuff so others can do new and interesting research.

The latest welcome reminder came courtesy of the National e-Infrastructure for Social Simulation (NeISS) who met in Leeds earlier this week. The local NGS support took the opportunity to come along and listen.

NeISS has been described as an attempt to build a real-life version of SimCity - with the emphasis changed to making life better for real communities rather than tidying up Tokyo after a visit from Godzilla.

NeISS researchers study the behaviour of people. Their approaches range from statistical analysis of census data to the use of agent-based modelling: following virtual people as they go about their virtual lives, travel to virtual work, have virtual children and eventually virtually die.

The researchers involved have been long-time users of the NGS. Visit the case-studies section of the website and you will find how NGS resources were used in social simulations to estimate how the population of an area changes over time and to study patterns of criminal behaviour.

This is proper e-Research - an interdisciplinary collaboration between 8 UK institutions made possible by modern technology and fast computer networks.

They have a website and a software repository; they use applications like Taverna to automate data processing and publish the workflows they create to MyExperiment.

And they want to make their work available to those who plan and run our communities - or simply live in them.

They are developing web-based portals from which simulations can be launched. The idea is that a city planner would be able to log on and see the consequences of, say, building new houses on local traffic and schools.

Simulations are computationally expensive. The computer hardware that hosts the portals lacks the computational oomph to run the simulations, so this work needs to be offloaded to more powerful computers elsewhere and the data passed around.

It is all very grid... or very cloud... or very something-as-a-service.

That is what we are here for and we are doing what we can to help.

NeISS are using resources across the 8 institutions and using or evaluating NGS services including the workload management service to distribute jobs; the SARoNGS service as a means of authentication, and the cloud prototypes for development.

We are working with them to understand how we can improve our services, so that they can use them and get on with that whole `making the world a better place' business.

Tuesday 11 January 2011

Got a problem with your software?

If so then we know who can help! Today's blog post is courtesy of Simon Hettrick from the Software Sustainability Institute (SSI).

In 2010, a crack developer was asked to join the SSI. This man promptly set up as the Institute’s software architect. Today, still wanted by a number of projects, Steve survives as a developer of fortune. If you have a problem, if no one else can help, and if you can email him, maybe you can Ask Steve!

At the SSI, whenever we have a software problem we simply ask Steve. He’s our in-house software architect and all-round guru of code. Then we got to thinking: it’s selfish to keep such a valuable resource to ourselves, we should make Steve’s knowledge available to everyone. And that’s when the idea for the Ask Steve! blog was born.

So what is Ask Steve? The idea is that anyone can email Ask Steve! with a software trouble or query. Each week or so, Steve will work on a problem and post his answer to the blog. People can comment, try out the solution or simply get back to Steve with another question. Steve will sort through the questions he is posed and answer the ones that trouble the most people.
Next time you have a software problem, visit the Ask Steve! blog.

Friday 7 January 2011

Bitrot

One of my colleagues always signs his emails with the aphorism:

I'm not against progress, it's the change I do not like.
The origin of the quote is obscure. My colleague originally saw it attributed to Mark Twain but neither Google nor a rather battered copy of the Oxford Dictionary of Quotations is in a position to confirm or deny this.

Whoever said it, it summarises the life of someone in research IT support rather well.

As technology advances - we want the things that are broken to be fixed and we want the things that are trundling along happily, and which the users depend on, to not break.

Of course, just because you want it doesn't mean you get it... any piece of computer code will have a finite life before the inevitable onset of bitrot.

Infuriatingly, the more sensible the developer, the earlier the bitrot sets in. When someone has resisted the urge to reinvent the wheel - and used libraries and packages from elsewhere - any major, incompatible change to one of the dependencies will bring the whole thing crashing down.

Grid software - and academic research software in general - is funded from grants or written to fix a specific problem at a specific time. Development stops when the money does or when the people are needed elsewhere. Those who support the software have become very good at keeping it alive.

Take the NGS's own accounting client software - described in a blog post from last May and now available for download from its new home on Sourceforge. This consists of a set of Perl modules that crunch the accounting logs of local batch systems and deliver the results safely to our accounting database.

The process involves lots of boiler-plate XML thrown around using SOAP over SSL secured HTTP connections. The developer, very sensibly, decided to leave most of the nasty stuff to other perl modules - including
  • SOAP::Lite
  • Net::SSLeay
  • IO::Socket::SSL 
  • AppConfig
  • Template-Toolkit

All of these are available from the Comprehensive Perl Archive Network. The original documentation advised downloading the modules using the Perl cpan tool - which downloads, builds, installs and tests modules and their dependencies automatically.

The cpan tool is very powerful and very useful. I would advise avoiding it like the plague.

Why? Cpan can only cope with one version of a module within a particular version of Perl. Many of the modules have moved on since the client code was written and the versions that cpan would install are incompatible with our accounting client.

The older versions are available from the www.cpan.org archive. They should be downloaded and installed in a directory outside the local perl install. You can use the PERL5LIB environment variable to add your local installation to the front of the list of places Perl searches for modules.

The same approach works with Python and compiled libraries with judicious use of the PYTHONPATH and LD_LIBRARY_PATH or LD_RUN_PATH environment variables.
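The Python flavour of the trick is easy to demonstrate. The sketch below is a self-contained toy, not part of any NGS software: it invents a module called pinned_demo, writes two different 'versions' of it into temporary directories, and shows that the directory at the front of the search path wins. Entries from PYTHONPATH are placed ahead of the standard locations in sys.path, which is what inserting at position 0 imitates here.

```python
import importlib
import os
import sys
import tempfile


def write_module(directory, version):
    """Create a tiny module called pinned_demo that reports its version."""
    with open(os.path.join(directory, "pinned_demo.py"), "w") as handle:
        handle.write("VERSION = %r\n" % version)


system_dir = tempfile.mkdtemp()  # plays the part of the system-wide install
local_dir = tempfile.mkdtemp()   # plays the part of the private install
write_module(system_dir, "2.0-incompatible")
write_module(local_dir, "1.0-known-good")

# The newer version is findable, but the private directory is searched first.
sys.path.append(system_dir)
sys.path.insert(0, local_dir)

pinned_demo = importlib.import_module("pinned_demo")
print(pinned_demo.VERSION)  # 1.0-known-good
```

In real life the private directory would hold the older module versions you downloaded by hand, and you would set the environment variable once rather than editing sys.path in code.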

It is not a cure for bitrot, but you can delay the inevitable by many years. You can get the progress without the change.

Tuesday 4 January 2011

Happy new year from the NGS

Welcome back after the Christmas holidays and I hope you had a good break. NGS staff are starting back today so there will be some catching up on user helpdesk queries and the usual getting back up to speed.

A couple of quick reminders of NGS news -
  • the presentations from the NGS Innovation Forum are now available on the NGS website along with the posters from the event.
  • there is still a chance to win Amazon vouchers in the NGS user survey so if you would like some last minute sale shopping on us, make sure you fill in the user survey by the end of the month!
Happy new year!