Monday 16 May 2011

Stuck in the middle

To a casual observer, they look very similar: groups of largely male, occasionally slightly dishevelled individuals who spend far too much time staring at screens and who communicate almost exclusively in acronyms.

But those in the know recognise that your High Performance Computing (HPC) geeks and your Grid Computing geeks are very different creatures.

Should you find yourself talking to one - having somehow managed to side-step the awkward initial 'Who are you and how did you get into my server room?' conversation - it is very important to know which species they belong to.

By far the easiest way to find out is to ask one simple question: how does someone run a program?

If you have found a Grid geek, the answer will feature web services, UIs, WMSs, CEs of various kinds, certificates and, in extreme cases, XML.

The HPC geek will answer Ssh, Ssh, Ssh. Again and again and again.

That is SSH as in Secure Shell. If the person is saying Shh!, you've wandered into the library by mistake.

HPC is all about building the biggest, fastest computer that can fit in the room. HPC systems are designed to be self-contained, with fast disks and fast CPUs linked using fast networks. Users may connect to an HPC service from outside - via SSH - but everything they do from that point on stays within an HPC bubble.
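If you have never seen that bubble from the inside, a typical session looks something like this. The hostname, username and script name are invented for illustration, and the batch commands shown are SGE's - other clusters will have their own equivalents:

  # A rough sketch of life inside the HPC bubble (all names made up).
  ssh jbloggs@hpc.example.ac.uk        # the one and only way in

  # Everything after this point happens on the cluster itself:
  qsub my_simulation.sh                # hand the job to the batch system
  qstat -u jbloggs                     # check how it is getting on
  less my_simulation.sh.o12345         # read the output once it is done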

Grid is about connecting a disparate set of resources, spread far and wide, so they can do something useful together. There is very definitely more than one way to do it.

At Leeds, we are in the interesting position of trying to connect our HPC service to the grid.

Rather than trying to graft the full Grid software stack onto a very specialised and customised HPC environment, we are using a separate (virtual) machine to act as a relay - or perhaps a translator - between the two worlds. The HPC service is called ARC1, so it is only right that the relay will be called NGS.ARC1.

It will talk Grid to the world, but its only channels of communication to the HPC service will be the batch queuing system - SGE - and good old SSH. There is no shared disk space of any kind.

We are now able to submit jobs from NGS.ARC1 to the HPC service and monitor their progress.
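Under the bonnet that is nothing more exotic than SGE commands run over SSH. Roughly speaking - the account name, hostname and path below are placeholders, not our real configuration:

  # Submit from NGS.ARC1 to ARC1 and capture the SGE job ID
  # (-terse prints just the ID). All names here are placeholders.
  JOBID=$(ssh griduser@arc1 "qsub -terse /home/griduser/job.sh")

  # Monitor the job from the relay in the same way.
  ssh griduser@arc1 "qstat -j $JOBID"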

We have the ability to create separate SSH keys for every grid user. Our next step is to configure the HPC service to use these keys in 'scp' commands within 'prolog' and 'epilog' scripts. Data will be pulled onto ARC1 from NGS.ARC1 when a job starts and pushed back when the job is done.
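In outline, the prolog and epilog will do something like this - a sketch only, since the key names, hostnames and directory layout are still ours to invent:

  # prolog: run by SGE on ARC1 immediately before the job starts.
  # Pull the job's input sandbox across from the relay
  # (paths, hostname and key name are placeholders).
  scp -i ~/.ssh/grid_user_key -r \
      griduser@ngs.arc1:/sandbox/$JOB_ID/input/ $SGE_O_WORKDIR/

  # epilog: run by SGE when the job finishes. Push the results back.
  scp -i ~/.ssh/grid_user_key -r \
      $SGE_O_WORKDIR/output/ griduser@ngs.arc1:/sandbox/$JOB_ID/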

The latest set of documentation for the CREAM-CE - our choice for the grid side - says that you can set:

  SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN=LRMS
and let the Local Resource Management System (LRMS), the Grid's general term for batch services like SGE, do the donkey work.
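If that works as advertised, the grid side of the staging could come down to a single line of site configuration. We have not tried it yet, and the file name below is an assumption about a YAIM-configured CE rather than something from our own setup:

  # Assumed excerpt from the CE's YAIM configuration (e.g. site-info.def);
  # only the variable itself comes from the CREAM documentation.
  SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN=LRMS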

At the moment, we have no idea whether weakly linking the Grid and HPC worlds in this way will work.
We'll update you in a week or so's time.

And if you are passing through Leeds and want to ask questions, I will - of course - answer 'Who are you and what are you doing in my server room?'

2 comments:

Lee said...

I think this distinction will disappear one day. I certainly find myself working with all types of infrastructure nowadays: HPC, Grid, Cloud, GPU. Those who know me would say my background is definitely HPC.

Hiding it all behind a portal sounds like the way to go. I'm sure it will be successful. It may take more than a week to find out.

Ewan said...

This isn't an odd configuration at all; the common way to set CREAM up, even on fully dedicated PP grid clusters, is with no shared filesystems, staging things in and out using SSH controlled by the batch system - just like this.

Also, on the naming front: one of the alternative CE implementations is called ARC - if I were you, I'd call this something else, or you're going to confuse the hell out of anyone who doesn't already know the cluster.