Monday, 23 March 2009

Heads in the Clouds

I've been trying to understand Cloud Computing. No, let me rephrase that, I've been trying to understand what's different about Cloud Computing. Microsoft are busy prepping their Azure Services Platform in readiness for taking on Amazon Web Services, Sun, Google and the rest. Even in the worst economic climate since the 1930s, these companies are spending tens of millions of dollars on building massive data centres that will host "The Cloud." Why?

Microsoft have some Case Studies to help those us having problems with the whole Cloud thing get it. Here's an extract


Ok. But is there really anything new here? What happens if we replace the word "Cloud" with something more prosaic and old-fashioned, something like the word "Internet":


You would be forgiven for thinking that these two paragraphs mean exactly the same thing. But they can't do. "Cloud Computing" must mean something, right?

Well, I'm not going to labour the point. Yes it does mean something different: it means having your code run on someone else's hardware. It also means having your data stored on someone else's hardware. In the example above, Infosys don't have a SQL database running on an Internet server that they own, they have a SQL database running on an Internet server that Microsoft owns. Why would they do something like that? Because of the benefits that Microsoft and Amazon and everyone else claims for Cloud Computing: start-up costs are minimal - you don't have to buy a server infrastructure up front, just rent what you need and scale up (and up) as you need to; redundancy is massive - if the data center hosting your application falls over, your application is instantly shifted to another data centre in another time zone, or in another country, or on another continent. As Microsoft says, it is about "having global-class credibility."

So the benefits are real. But that really is it. There's no new code, no new Internet. As Oracle's Larry Ellison said last year "The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do." He went on to say "The computer industry is the only industry that is more fashion-driven than women's fashion."

What Cloud Computing emphatically is not is "where thousands of computers cooperate through the Internet to compute a result. Google’s proprietary MapReduce* framework is the standard bearer for this..." This definition comes from a recent Intel primer on Parallel programming. The author is either being idealistic or naive. Let's be generous: this is what we would like Cloud Computing to be - unlimited access to a hyper-computer where we only pay for the time we use. But it is not what is on offer. Yes, you can set up multiple virtual servers on AWS to form a cluster, but they might all be running on the same physical box! Never forgetting, of course, that the interconnect is the cluster.

Neither is Cloud Computing "where IT power is delivered over the Internet as you need it, rather than drawn from a desktop computer." At least, not yet. This is a more generalized and subtle, but just as idealistic definition. Notice the word "desktop" - we're not talking about replacing web servers here, which is what Amazon and Microsoft are punting. In this definition the "standard bearer" is Google's Gmail. What is replaced is the mail program on your PC. But there is nothing remotely new just in web applications like Gmail that requires the term "Cloud Computing." To have any force, this web applications paradigm would have to advance to the point where people were accessing not just their email, but the vast majority of their applications from multiple consumer devices (like mobile phones, TVs and digital cameras) which didn't just supplement the PC, but actually subsumed it. This promise is implicit in Cloud Computing, but again it is not what's on offer - and just building data centers won't in itself make it happen.

This definition comes from The Guardian's article on GNU founder Richard Stallman's now famous attack on Cloud Computing. Stallman argued that Cloud Computing was "worse than stupidity" and "simply a trap aimed at forcing more people to buy into locked, proprietary systems that would cost them more and more over time." Crucially he argued that "One reason you should not use web applications to do your computing is that you lose control..." (My emphasis.) This issue of control has been heavily debated. (There is one example here.) I just want to comment about one aspect of control.

In their book Fire in the Valley about the birth of the PC industry, Paul Freiberger and Michael Swaine talk about the feeling programmers, technicians and enginneers had in the 1960s and 70s of being "locked out of the machine room." The Personal Computer changed all that; it was genuinely subversive, putting technology and computing power in the hands of anyone who wanted to use it. Those of us who build Beowulf clusters work in the same tradition. The danger of Cloud Computing is that once again the machine room door will be slammed in our faces.

Saturday, 21 March 2009

Two more nodes

I've added two additional nodes to my cluster. The new nodes have Asus rather than Abit motherboards and, unfortunately, CentOS 5.2 doesn't support the Nvidia nForce 630a chipset. So I was left with a choice of either trying to build the necessary drivers, or finding an alternative. In the end I decided to give CAOS Linux another try. There seems to be no link on the website anymore to download a VNFS image, but if you Google you can find the FTP site easily enough. I used version 1.0 rather than the RC1 version I had tried previously. It is still a "bare bones" distribution - it doesn't even come with vi - but it worked. I even managed to get Python installed from source using "make DESTDIR", which didn't happen before. There are a few quirks: I don't seem to be able to specify which eth device gets which MAC address, which means I can't use it on all the nodes, but for the new nodes it's fine.

I wasn't going to go on about performance results anymore, they are just a bit of fun after all. However, having added the two new nodes they had to be tested :-) The new nodes have Athlon 64 X2 5200+ (2700MHz) processors instead of 4200 (2200MHz) processors, and 4Gb RAM instead of 2Gb RAM. So a big increase in performance right? Er... no. There is an increase in performance - from 13.3 to 14.25 Gflops on a two node test, but this modest increase in output is offset by a big drop in computational efficiency: down from 75.6% to 66%. What's going on?

On the original nodes, with 2Gb RAM, pretty much all I had to do to find the maximum performance was keep increasing the problem size (N) until xhpl ran out of memory and crashed. The bigger the problem size, the better the measured performance. With the nodes with 4Gb RAM, the measured performance peaks before (way before) we run out of memory. What we see is an initial steep increase in performance as the problem size increases, leading to a peak, and then a slow decline.


It's as if the processors can't make effective use of all that extra memory. Or perhaps we've hit the limit of what the bonded Gigabit Ethernet interconnect can deliver.

Thursday, 19 February 2009

Supersize me

One simple way to improve network performance is to enable Jumbo packets. Jumbo packets - or frames - are, as their name suggests, bigger than normal network packets. This means that more data can be transferred across the network at once: up to 9000 bytes instead of 1500 bytes. Because the computer CPU is not being interupted as often for the same volume of data, CPU utilization is increased along with throughput. What's more, on Linux enabling Jumbo frames is just a matter of setting the MTU (Maximum Transmission Unit) size on the network device:

#ifconfig eth1 mtu 9200

To make this permenant, you just need to add MTU=9200 in the /etc/sysconfig/network-scripts/ifcfg-eth1 file. On Perceus this is done in the /etc/perceus/modules/ipaddr file. The only other thing you need is a switch that has Jumbo frame support enabled. That done, you can test everything is working by pinging with a Jumbo packet size:

ping -s 9000 -M do 192.168.4.16

Easy! So why haven't I done this before? I originally bought D-Link DGE-528T Gigabit cards for the cluster. These were cheap, but seemed to offer everything I needed - including Jumbo frames support. The maximum MTU was only 7200 bytes, not the more usual 9200 bytes, but I could ping the cards with packets of 7000 bytes. What I found, however, was that I could not run the HPL benchmark xhpl with Jumbo packets of any size enabled. I could start up the mpd ring, and run mpdtrace, but when I actually tried to run xhpl it crashed:

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x555d1c0, count=1, dtype=USER, dest=2, tag=2001, comm=0x84000001) failed

Setting MTU back to 1500 again fixed the problem straight away. So is this a driver problem or a xhpl problem? I'm guessing it is a driver problem, but life's too short...

I made the decision to junk the D-Link DGE-528T cards and go with Intel PRO/1000 MT cards - another eBay triumph! This time there were no problems with xhpl crashing with Jumbo frames enabled. Interestingly, you enable Jumbo frames on the bonded device, bond0, not on the slave devices, and the bonding driver does the rest. Very cool.

Output on a four node test rose from 24.08 to 24.86 Gflops (70.6% compEff.) Running over all six nodes of the cluster produced 36.13 Gflops (68.4% compEff.)

I'm going to leave it there, and not spend any more time tuning. It is time to put the cluster to work running distributed Erlang. What performace critera are most relevent to this task is another question altogether...

I think we're bonding...

Having doubled the measured efficiency of my cluster by linking HPL to ACML, I looked around for something else that could take me closer to the 75% efficiency achieved by "real" clusters. One obvious area to look at is the "interconnect" - the network that connects the nodes together to actually make the cluster. Top clusters use Infiniband for the interconnect, but - even though prices continue to fall - Infiniband is still out of reach for just about all Beowulf clusters like mine. Whether Infiniband will ever be widely used for these kind of clusters is open to debate: perhaps we'll all be using 10GE cards in a few years. For now though, we have to get the most from our 1 Gigabit network cards.

The highest rated Gigabit cluster on the Top 500 list comes in at number 78. It's computational efficiency is only 53.26%. The average computational efficiency of all Gigabit clusters in November 2008's Top500 list is even lower at just 50.52%. The most efficient is 63.04% efficient, the lowest at 40.34% - and that is with 5096 cores! That is a lot of wasted clock cycles. Next to these figures, the 60.3% efficiency I achieved on my four node cluster doesn't look too bad. However, the efficency was nearer 70% with two nodes, so as I add nodes it looks like overall efficency will fall. Reason enough to try and maximize performance.

One of the things I wanted to investigate from the very start was network card bonding, or - perhaps more strictly - network card aggregration. This is where you "bond" two (or more) network cards together to create a single logical network interface. In other words, two network cards share the work of one, thereby giving you an increase in network performance.

Linux support for network card bonding is definitive. Not only that, it is very easy to impliment. (There is a good article here.) From a perceus cluster point of view, you just need to add the lines

alias bond0 bonding
options bond0 mode=802.3ad miimon=100

to the /etc/modprobe.conf on the vnfs image, and then edit the /etc/perceus/modules/ipaddr file. (This is used by the perceus 50-ipaddr.pl perl script to automatically generate the the ifcfg* files in /etc/sysconfig/network-scripts/):

n00004 bond0(USERCTL=no&BOOTPROTO=none&ONBOOT=yes):192.168.4.14/255.255.255.0/192.168.4.1 eth0(HWADDR=00:18:37:07:FB:3A):[default]/[default] eth1(MASTER=bond0&SLAVE=yes): eth2(MASTER=bond0&SLAVE=yes):

This creates a bond0 interface from eth1 and eth2, the "slave" interfaces.

Linux, then, supports bonding extremely well - but Linux is only one half of the connection. At the other end of the wire is the switch. I was using a Dell PowerConnect 2724 switch which advertises itself as supporting "Link Aggregation." Unfortunately "link aggregation" can mean different things to different people. To me it means "using multiple network cables/ports in parallel to increase the link speed beyond the limits of any one single cable or port..." But it can also refer to using bonded network devices for load balancing and fail-over protection. It is in this latter sense that the PowerConnect 2724 switch supports "aggregation." It doesn't support cabling devices in parallel to increase performance. For that, you need LACP (the Link Aggregation Control Protocol) which the Dell 27xx switches do not support; switches that do are twice the price. It was a case on getting on eBay, where I found a "pre-enjoyed" PowerConnect 5324 which does support LACP.

So was it worth it? On a two node test, output went from 12.22 Gflops to 12.93 Gflops; that doesn't sound a lot, but it means an increase in computational efficiency from 69.3% to 73.5%. On a four node test, output went from 21.24 Gflops to 24.08 Gflops; this is an increase in computational efficiency from 60.3% to 68.4%. So the benefit of aggregation is much greater on the four node cluster. This makes sense: by using aggregation we are flattening the performance curve; we are alleviating the bottleneck effect of the additional interconnects.

Network cards are cheap - at least relative to the cost of memory or processors. If you have a switch that supports LACP, link bonding definitely seems worthwhile.

Friday, 13 February 2009

VLANs to the rescue

I've overcome the two issues that caused me so much frustration. The problem of node network devices picking up different names (eth0, eth1) between boots was fixed by specifying the MAC addresses in the /etc/perceus/modules/ipaddr file:

eth0(HWADDR=00:18:37:07:05:A8):[default]/[default]

My thanks go to Greg Kurtzer for helping me get this fixed.

The general network problems I was having with provisiond, amongst other things, were resolved by splitting my "management" network (with 192.168.3.x addresses) and "application" network (with 192.168.4.x addesses) over two VLANs. This was easy to do on the switch and really cleaned things up. Now the node will always boot from the network device I want it to boot from, and management traffic doesn't wander over the application network.

Sunday, 25 January 2009

DNS Disaster!

I needed to make a change to the configuration of one of the network cards on the "host" machine. (I tend to use the word "host" rather than "head node" because I don't really think of it as a node - it's not where any applications will run. I suppose I could also use the more Perceus-like "master".) Because I come from a Windows background I still tend to look for a GUI rather than use the command line, so I ran system-config-network. Big, big mistake. Suddenly I can no longer ping any of the cluster nodes by name.

When I set up Perceus I'm sure I did no more than follow the instructions in the User Guide and add

nameserver 127.0.0.1

to the /etc/resolv.conf file on the host. (I also had to put the host address in the /etc/resolv.conf file on the nodes, of course: if your nodes boot slowly, you probably forgot.) system-config-network wiped my resolv.conf file, so I added the nameserver line back in and then ran /etc/init.d/preceus reload. No good. Looking in /var/log/messages (via gnome-system-log, ofcourse!) showed this when I did the reload:

perceus-dnsmasq[5286]: ignoring nameserver 127.0.0.1 - local interface

But is this a problem? I read somewhere that it isn't, however, subsequently I've not seen this message repeated when things have worked.

It struck me that I was assigning static IP addresses to my nodes in /etc/perceus/modules/ipaddr. Does perceus-dnsmasq pick these up, or do I need to add them to the /etc/hosts file? I didn't need to before. It also struck me that the messages log was filling up with entries like

perceus[5295]: ERROR: Unknown node just tried to contact us! (NodeID=00:1C:F0:6E:C8:53)

This was despite the fact that for a few minutes after booting there were no errors and perceus node status showed the node as "ready" and regularly responding, only to fail to respond later. Despite the fact too, that other nodes which showed up as "init" were not generating "Unknown node" errors.

At this point I was just a bit confused...

Time to get back to some certainties. The first thing to do was to get DNS working. I use two "networks" for the cluster. The "management" traffic is sent over a network card assigned a static 192.168.3.x address. The application traffic is send over a second network card with a static 192.168.4.x address. How do you make sure the right card gets the right IP address? You run /sbin/ifconfig on the node to get the order of the network devices and edit /etc/perceus/modules/ipaddr accordingly.

Easy, right? Well no. I've found that which node network device gets which name, eth0, eth1, etc. can change between boots, or if the host has been rebooted, or if the vnfs image has been updated.

I edited the /etc/perceus/modules/ipaddr file so that the management card would get a "Default" address. Perceus first looks in the /etc/hosts file for the node name, if it doesn't find one it assigns a DHCP address. However, the DHCP address that is assigned is not the same as the address assigned on boot up! As a result, if I tied to ping the node, the node name was resolved to the boot up address again. This cannot be right. There must be something wrong with perceus-dnsmasq - and if there isn't there should be.

So I added the node name to the /etc/hosts file with its static 192.168.3.x address. Finally, things started to work. At least, I was able to ping the nodes by name. However, perceus node status was still not being updated and I was still getting "unknown node" errors. I will leave that investigation to another post.

With the node names being resolved to their 192.168.3.x addresses, I needed to change the way I launched mpi applications. Essentially this is just a case of saying which interface hostname to use. So in the mpd.hosts name file I added entries like this:

n00005 ifhn=192.168.4.25
n00004 ifhn=192.168.4.24

Then bring up the ring of mpds specifying the local interface hostname:

#mpdboot --ifhn=192.168.4.1 -n 3 -v -f mpd.hosts

I could then add the application network addresses I wanted to use to the machine file, just:

192.168.4.25
192.168.4.24

and run the application:

# mpirun -machinefile machines -n 4 ./xhpl

I hadn't needed to do any of this previously. I'm left with the uneasy feeling that I haven't got to the bottom of why the problem arose, or whether what I've done is really the solution or just a workaround. If I had 1000 nodes, or 10,000 nodes, would Perceus expect me add all those addresses to the /etc/hosts file? perceus-dnsmasq should handle that, shouldn't it?

Sunday, 18 January 2009

Installing Erlang

One of my key goals for my experimental cluster is to run (and program) distributed Erlang. The first task is simply to get Erlang installed on the nodes. Unfortunately, there is no CentOS rpm package for Erlang. This is a bit surprising: it means that there is no rpm package for RHEL either. It's not a big problem, we just need to install from the source code.

The first thing to do is download and unpack the source file. Do not do what I did and use File Roller, the GNOME archive manager. If you do you get a make error. Instead just follow the instructions in the readme file:

gunzip -c otp_src_R12B-5.tar.gz tar xf -
zcat otp_src_R12B-5.tar.gz tar xf -

To build Erlang I needed to install the ncurses and OpenSSL development libraries:

yum install ncurses-devel
yum install openssl-devel

That done, Erlang built without any problems. But that's only half the job. The nice thing about running the same OS on both the "host" machine (where the nodes are managed from) and on the nodes themselves, is that you can build software on the host and then just copy it to the nodes. The Perceus user guide states that you should be able to do something like this:

make DESTDIR=/mnt/centos-5.2-1.stateless.x86_64 install

I have to say, however, that I've never got this to work. Happily, Erlang's file structure is quite simple. The progam files are in /usr/local/lib/erlang (by default) and there are a bunch of links in /usr/local/bin. Once Erlang is installed on the host you can copy the files to the mounted vnfs:

cp -r /usr/local/bin/* /mnt/centos-5.2-1.stateless.x86_64/usr/local/bin
cp -r /usr/local/lib/erlang /mnt/centos-5.2-1.stateless.x86_64/usr/local/lib

I rebooted a node, connected over ssh, and typed erl. Everything looks fine.