Thoughts on Systems

Emil Sit

Oct 25, 2006 - 3 minute read - Research grid-computing

Observations on SunGrid Customer Care

I haven’t used the SunGrid this week. In fact, no one has: there was a four-day outage from last Saturday morning through this morning. I received notification of it only last Wednesday evening. In compensation, Sun has credited me (and presumably everyone) with 100 additional free CPU hours, which was thoughtful. However, if the IPTPS submission deadline had been this week rather than next, the 100 free hours probably wouldn’t have done me much good.

What company gives three days of notice for a complete four-day outage of a utility? NStar, my electricity company, occasionally (though rarely) needs to shut down power to do infrastructure work. When they do, they send out a notice by mail at least two weeks in advance, call and leave a message on my phone, and schedule the work for a three- or four-hour window in the middle of the night. And people in the next town (or block) still have power. See the difference? If Sun wants the Grid to be used as a utility and seen as truly dependable, they should act more like one; the fact that they can afford a four-day outage plus $100 of credit per user suggests they are not yet ready to be a utility.

It turns out that Sun was consolidating some data centers, which I deduced from SSL error messages delivered by their login front-end over the weekend. This is not something you decide to do three days before the event. (Is it?) Send out a notice in August! And deploy one of those fancy Blackbox machines somewhere and run with reduced capacity for a few days. A more “Web 2.0” company would even tell you what they’re doing up front (maybe after a bit of prodding), and customers appreciate that a great deal. How open a company is about these details matters a lot when choosing any online service.

In the meantime, I did have an opportunity to interact with Sun’s customer service over e-mail, and they were very responsive and effective. I tried to submit a job on Friday evening before the outage and ran into some issues, which were acknowledged and handled very quickly, even though it was after business hours. (Maybe they had more people on call because of the outage, maybe not; the help pages indicate that normal hours exclude weekends and holidays.) I like responsive and competent customer service; too often, I understand the problem better than the first-line support does, and it takes forever to find someone who can actually observe and fix the problem. So far, Sun has had good customer service and open access to their engineers.

As an idea, the SunGrid is a fast and easy way to get parallelism and performance flexibly. But Sun has to continue to improve both the user interface (e.g., beyond the clever hack for job monitoring suggested to me by a Sun engineer) and the reliability of their infrastructure. Unless they do, people without CPU grants are going to start looking at alternatives like Amazon’s hosted EC2 or running their own Digipede.