Monday, 23 March 2009

Heads in the Clouds

I've been trying to understand Cloud Computing. No, let me rephrase that, I've been trying to understand what's different about Cloud Computing. Microsoft are busy prepping their Azure Services Platform in readiness for taking on Amazon Web Services, Sun, Google and the rest. Even in the worst economic climate since the 1930s, these companies are spending tens of millions of dollars on building massive data centres that will host "The Cloud." Why?

Microsoft have some Case Studies to help those of us having trouble with the whole Cloud thing get it. Here's an extract:

Ok. But is there really anything new here? What happens if we replace the word "Cloud" with something more prosaic and old-fashioned, something like the word "Internet":

You would be forgiven for thinking that these two paragraphs mean exactly the same thing. But they can't do. "Cloud Computing" must mean something, right?

Well, I'm not going to labour the point. Yes, it does mean something different: it means having your code run on someone else's hardware. It also means having your data stored on someone else's hardware. In the example above, Infosys don't have a SQL database running on an Internet server that they own; they have a SQL database running on an Internet server that Microsoft owns. Why would they do something like that? Because of the benefits that Microsoft and Amazon and everyone else claim for Cloud Computing: start-up costs are minimal - you don't have to buy a server infrastructure up front, just rent what you need and scale up (and up) as you need to; redundancy is massive - if the data centre hosting your application falls over, your application is instantly shifted to another data centre in another time zone, or in another country, or on another continent. As Microsoft say, it is about "having global-class credibility."

So the benefits are real. But that really is it. There's no new code, no new Internet. As Oracle's Larry Ellison said last year, "The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do." He went on to say, "The computer industry is the only industry that is more fashion-driven than women's fashion."

What Cloud Computing emphatically is not is "where thousands of computers cooperate through the Internet to compute a result. Google’s proprietary MapReduce* framework is the standard bearer for this..." This definition comes from a recent Intel primer on parallel programming. The author is either being idealistic or naive. Let's be generous: this is what we would like Cloud Computing to be - unlimited access to a hyper-computer where we only pay for the time we use. But it is not what is on offer. Yes, you can set up multiple virtual servers on AWS to form a cluster, but they might all be running on the same physical box! Never forgetting, of course, that the interconnect is the cluster.
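For anyone who hasn't met it, the map/reduce idea itself is simple enough to sketch in a few lines. This is a toy illustration of the programming pattern only - not Google's proprietary framework - and in a real deployment the map and reduce calls would be farmed out to thousands of machines over the network, which is precisely the part the Cloud vendors are not selling you:

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["the cloud", "the internet", "the cloud again"]
pairs = [p for doc in documents for p in map_phase(doc)]
print(reduce_phase(pairs))
# {'the': 3, 'cloud': 2, 'internet': 1, 'again': 1}
```

The point of the pattern is that both phases parallelize trivially across machines; running it on one box, as here, misses the whole point - which is rather the point being made above.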

Neither is Cloud Computing "where IT power is delivered over the Internet as you need it, rather than drawn from a desktop computer." At least, not yet. This is a more generalized and subtle, but just as idealistic, definition. Notice the word "desktop" - we're not talking about replacing web servers here, which is what Amazon and Microsoft are punting. In this definition the "standard bearer" is Google's Gmail. What is replaced is the mail program on your PC. But there is nothing remotely new in web applications like Gmail that requires the term "Cloud Computing." To have any force, this web applications paradigm would have to advance to the point where people were accessing not just their email, but the vast majority of their applications, from multiple consumer devices (like mobile phones, TVs and digital cameras) which didn't just supplement the PC, but actually subsumed it. This promise is implicit in Cloud Computing, but again it is not what's on offer - and just building data centres won't in itself make it happen.

This definition comes from The Guardian's article on GNU founder Richard Stallman's now famous attack on Cloud Computing. Stallman argued that Cloud Computing was "worse than stupidity" and "simply a trap aimed at forcing more people to buy into locked, proprietary systems that would cost them more and more over time." Crucially he argued that "One reason you should not use web applications to do your computing is that you lose control..." (My emphasis.) This issue of control has been heavily debated. (There is one example here.) I just want to comment about one aspect of control.

In their book Fire in the Valley, about the birth of the PC industry, Paul Freiberger and Michael Swaine talk about the feeling programmers, technicians and engineers had in the 1960s and 70s of being "locked out of the machine room." The Personal Computer changed all that; it was genuinely subversive, putting technology and computing power in the hands of anyone who wanted to use it. Those of us who build Beowulf clusters work in the same tradition. The danger of Cloud Computing is that once again the machine room door will be slammed in our faces.

Saturday, 21 March 2009

Two more nodes

I've added two additional nodes to my cluster. The new nodes have Asus rather than Abit motherboards and, unfortunately, CentOS 5.2 doesn't support the Nvidia nForce 630a chipset. So I was left with a choice of either trying to build the necessary drivers, or finding an alternative. In the end I decided to give CAOS Linux another try. There seems to be no link on the website anymore to download a VNFS image, but if you Google you can find the FTP site easily enough. I used version 1.0 rather than the RC1 version I had tried previously. It is still a "bare bones" distribution - it doesn't even come with vi - but it worked. I even managed to get Python installed from source using "make DESTDIR", which didn't happen before. There are a few quirks: I don't seem to be able to specify which eth device gets which MAC address, which means I can't use it on all the nodes, but for the new nodes it's fine.

I wasn't going to go on about performance results anymore, they are just a bit of fun after all. However, having added the two new nodes they had to be tested :-) The new nodes have Athlon 64 X2 5200+ (2700MHz) processors instead of 4200+ (2200MHz) processors, and 4GB RAM instead of 2GB RAM. So a big increase in performance, right? Er... no. There is an increase in performance - from 13.3 to 14.25 Gflops on a two-node test - but this modest increase in output is offset by a big drop in computational efficiency: down from 75.6% to 66%. What's going on?
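The efficiency figures fall out of a simple calculation. Theoretical peak for these chips is cores x clock x flops-per-cycle; the 2 flops/cycle figure is my assumption for the K8 core (one SSE2 double-precision add and multiply per cycle), not something reported by xhpl itself:

```python
def peak_gflops(nodes, cores_per_node, ghz, flops_per_cycle=2):
    # Theoretical peak: every core retiring flops_per_cycle
    # double-precision operations on every clock tick.
    return nodes * cores_per_node * ghz * flops_per_cycle

old_peak = peak_gflops(2, 2, 2.2)   # two 4200+ nodes: 17.6 Gflops
new_peak = peak_gflops(2, 2, 2.7)   # two 5200+ nodes: 21.6 Gflops

print(f"old nodes: {13.3 / old_peak:.1%}")   # 75.6%
print(f"new nodes: {14.25 / new_peak:.1%}")  # 66.0%
```

So the new nodes have a peak some 23% higher, but the measured output only went up by about 7% - hence the drop in efficiency.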

On the original nodes, with 2GB RAM, pretty much all I had to do to find the maximum performance was keep increasing the problem size (N) until xhpl ran out of memory and crashed. The bigger the problem size, the better the measured performance. With the nodes with 4GB RAM, the measured performance peaks before (way before) we run out of memory. What we see is an initial steep increase in performance as the problem size increases, leading to a peak, and then a slow decline.
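For anyone wanting to repeat this, the usual rule of thumb for sizing N is that the N x N matrix of 8-byte doubles should fill around 80% of total cluster memory, leaving the rest for the OS. The 80% fraction is the standard HPL folklore figure, not something I measured:

```python
import math

def max_problem_size(total_ram_bytes, fraction=0.8):
    # Largest N such that an N x N matrix of 8-byte doubles
    # occupies roughly `fraction` of total memory.
    return int(math.sqrt(fraction * total_ram_bytes / 8))

GB = 1024**3
print(max_problem_size(2 * 2 * GB))  # two 2GB nodes -> N around 20724
print(max_problem_size(2 * 4 * GB))  # two 4GB nodes -> N around 29308
```

On the original nodes that rule and the crash-hunting method agree; on the 4GB nodes the performance peak arrives at a much smaller N than the memory would allow, which is the puzzle.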

It's as if the processors can't make effective use of all that extra memory. Or perhaps we've hit the limit of what the bonded Gigabit Ethernet interconnect can deliver.