Thoughts on Systems

Emil Sit

First Steps With the SunGrid

The SunGrid is an on-demand grid computing infrastructure: you pay per CPU-hour as you need it, and Sun provides the hardware. I recently got access to the SunGrid as part of a generous grant of CPU hours by Sun to my research lab, CSAIL, and I’m mostly quite pleased with it.

John Powers rightly notes that it is not trivial to adapt most applications to run on the SunGrid:

It appears to me that there’s a long hard hill to climb to get applications onto SunGrid, and until that problem is fixed, few will care if the price is a buck or a penny per CPU-hour, even if the racks are full of nice hardware.

However, it is pretty easy to get started with embarrassingly parallelizable problems (like parameter space exploration) that run the same basic code with different inputs. The machines are better equipped and faster than the ones I normally have access to. I’ve used several hundred CPU hours so far running simulations to explore some new research ideas and the degree of parallelism available is quite gratifying.
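For illustration, a parameter space exploration of this kind might be laid out as in the sketch below. It is not the code behind the simulations mentioned above: the parameter names (`nodes`, `churn`, `seed`), their values, and the worker count of 4 are all hypothetical. The idea is only to enumerate every combination of inputs and deal them out to independent workers that never need to talk to each other.

```python
import itertools


def parameter_grid(space):
    """Return every combination of values in `space` as a list of dicts."""
    names = sorted(space)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(space[n] for n in names))
    ]


def split_round_robin(tasks, workers):
    """Deal tasks out to `workers` buckets, one bucket per parallel process."""
    return [tasks[i::workers] for i in range(workers)]


# Hypothetical simulation parameters, for illustration only.
space = {"nodes": [100, 1000], "churn": [0.1, 0.5, 0.9], "seed": [1, 2]}
tasks = parameter_grid(space)          # 2 * 3 * 2 = 12 independent runs
buckets = split_round_robin(tasks, 4)  # 4 workers is an arbitrary choice

for i, bucket in enumerate(buckets):
    print(f"worker {i}: {len(bucket)} runs")
```

Because each run depends only on its own inputs, the buckets can be handed to as many CPUs as are available without any coordination between them.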

Access to the SunGrid is via a web interface. You package up your application and its data files in a set of compressed ZIP files (called “resources”) and upload them to the Grid. You create a job by selecting which of the resources to unpack and telling the Grid which executable or script to run. All of the nodes you use share a network file system that holds the freshly unpacked contents of the resources you picked for the job. After the job completes, the Grid collects any new files that were created and packages them into a new ZIP file for you to download. This sounds simple, and you can see it in action in Jon Udell’s SunGrid screencast.
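The packaging step can be scripted rather than done by hand. The sketch below uses Python's standard `zipfile` module to build one resource from a directory. The layout it zips (`run.sh`, `inputs/params.txt`) is made up for the demo, and nothing here is specific to the SunGrid beyond producing an ordinary ZIP file; any layout or naming requirements the Grid imposes on resources are not covered.

```python
import tempfile
import zipfile
from pathlib import Path


def build_resource(zip_path, root):
    """Zip every file under `root`, storing paths relative to `root`.

    `zip_path` should live outside `root` so the archive does not
    try to include itself.
    """
    root = Path(root)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(root).as_posix())
    return zip_path


# Demo on a throwaway directory; the file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    app = Path(tmp) / "app"
    (app / "inputs").mkdir(parents=True)
    (app / "run.sh").write_text("#!/bin/sh\n./simulate inputs/params.txt\n")
    (app / "inputs" / "params.txt").write_text("nodes=100\n")

    resource = build_resource(Path(tmp) / "app.zip", app)
    with zipfile.ZipFile(resource) as zf:
        names = zf.namelist()

print(names)
```

Storing paths relative to the application directory means the archive unpacks to the same tree the job script expects, wherever it is extracted.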

My application is written in C++ with no external library dependencies (e.g., no Boost or libasync or even the STL). This means it is easy to compile with g++ and ship over to the Grid. The trick, of course, is that I needed an x64 machine to build on. Fortunately, Sun will cleverly give you one almost for free; in my case, one was provided for me by the Infrastructure Group.

I have two gripes about SunGrid right now. First, there is no way to obtain job-specific status beyond the number of CPU hours consumed. If you have an infinite loop sitting in some seldom-exercised code path, you might not notice it until you’ve consumed quite a few hours. If you suspect a job has gone rogue, there is no way to inspect its state by logging in to a machine somewhere: you have to cancel the job and download the output. It would be much better if you could specify some sort of status to be displayed in the UI, much like the existing running count of CPU hours used. Even a single integer could be useful (e.g., number of sub-jobs remaining), though obviously a short text string would be more flexible.
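Until something like that exists, one possible stopgap for the runaway-job problem is to have the job's entry script enforce its own wall-clock budget. The sketch below is only that, a sketch: it assumes a Python interpreter is available wherever the job runs, which this post does not establish for the Grid's nodes, and `run_with_budget` is a hypothetical helper name. A child process that sleeps stands in for a simulation stuck in a loop.

```python
import subprocess
import sys


def run_with_budget(cmd, seconds):
    """Run `cmd`; return its exit code, or None if it was killed
    for exceeding `seconds` of wall-clock time."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return None


# A well-behaved command finishes inside its budget.
quick = run_with_budget([sys.executable, "-c", "pass"], 30)

# A sleeping child stands in for a simulation stuck in a loop.
stuck = run_with_budget(
    [sys.executable, "-c", "import time; time.sleep(60)"], 1
)

print("quick:", quick, "stuck:", stuck)
```

This does not give any visibility into a running job, but it does cap how many CPU hours a single stuck run can burn before anyone notices.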

Second, you must interact with the SunGrid via a JavaScript-heavy web interface. This is not always convenient: for example, you may generate large input files on a well-connected server while working remotely. In order to load this resource, you are forced to transfer the inputs to your local machine and then upload them to the Grid. I would much prefer some sort of API (e.g., XML-RPC over HTTPS) that would allow me to submit resources, define jobs, and manage runs. For larger corporations, it would take humans out of the loop for any periodic tasks.

That aside, my SunGrid experience has been rather enjoyable. If you are a grad student and find yourself needing CPU power for simulations, buying an x64 box from Sun and getting a SunGrid account for the heavy lifting is probably far cheaper than buying a cluster and powering it, not to mention having to maintain all those machines. It’s also probably less time-consuming than writing your own tools to cannibalize spare cycles on the workstations of your fellow students. Give it a try!