Sunday, 25 January 2009

DNS Disaster!

I needed to make a change to the configuration of one of the network cards on the "host" machine. (I tend to use the word "host" rather than "head node" because I don't really think of it as a node - it's not where any applications will run. I suppose I could also use the more Perceus-like "master".) Because I come from a Windows background I still tend to look for a GUI rather than use the command line, so I ran system-config-network. Big, big mistake. Suddenly I can no longer ping any of the cluster nodes by name.

When I set up Perceus I'm sure I did no more than follow the instructions in the User Guide and add

nameserver 127.0.0.1

to the /etc/resolv.conf file on the host. (I also had to put the host's address in the /etc/resolv.conf file on the nodes, of course: if your nodes boot slowly, you probably forgot.) system-config-network wiped my resolv.conf file, so I added the nameserver line back in and then ran /etc/init.d/perceus reload. No good. Looking in /var/log/messages (via gnome-system-log, of course!) showed this when I did the reload:

perceus-dnsmasq[5286]: ignoring nameserver 127.0.0.1 - local interface

But is this a problem? I read somewhere that it isn't. On the other hand, I haven't seen this message repeated on the occasions when things have worked.

It struck me that I was assigning static IP addresses to my nodes in /etc/perceus/modules/ipaddr. Does perceus-dnsmasq pick these up, or do I need to add them to the /etc/hosts file? I didn't need to before. It also struck me that the messages log was filling up with entries like

perceus[5295]: ERROR: Unknown node just tried to contact us! (NodeID=00:1C:F0:6E:C8:53)

This was despite the fact that, for a few minutes after booting, there were no errors: perceus node status showed the node as "ready" and regularly responding, only for it to stop responding later. It was also despite the fact that other nodes which showed up as "init" were not generating "Unknown node" errors.

At this point I was just a bit confused...

Time to get back to some certainties. The first thing to do was to get DNS working. I use two "networks" for the cluster. The "management" traffic is sent over a network card assigned a static 192.168.3.x address. The application traffic is sent over a second network card with a static 192.168.4.x address. How do you make sure the right card gets the right IP address? You run /sbin/ifconfig on the node to get the order of the network devices and edit /etc/perceus/modules/ipaddr accordingly.

Easy, right? Well, no. I've found that which node network device gets which name (eth0, eth1, and so on) can change from one boot to the next, for example if the host has been rebooted or if the vnfs image has been updated.
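For reference, my entries in /etc/perceus/modules/ipaddr look roughly like this. This is a sketch from memory rather than a copy - check the exact device:address/netmask syntax against the Perceus documentation, and note that the management-side 192.168.3.x addresses here are just illustrative:

# nodename  device:ipaddr/netmask              device:ipaddr/netmask
n00004      eth0:192.168.3.24/255.255.255.0    eth1:192.168.4.24/255.255.255.0
n00005      eth0:192.168.3.25/255.255.255.0    eth1:192.168.4.25/255.255.255.0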

I edited the /etc/perceus/modules/ipaddr file so that the management card would get a "Default" address. Perceus first looks in the /etc/hosts file for the node name; if it doesn't find one, it assigns a DHCP address. However, the DHCP address that gets assigned is not the same as the address assigned at boot time! As a result, if I tried to ping the node, the node name was still resolved to the boot-up address. This cannot be right. There must be something wrong with perceus-dnsmasq - and if there isn't, there should be.

So I added the node name to the /etc/hosts file with its static 192.168.3.x address. Finally, things started to work. At least, I was able to ping the nodes by name. However, perceus node status was still not being updated and I was still getting "unknown node" errors. I will leave that investigation to another post.
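For the record, the /etc/hosts entries are just the usual name-to-address lines (the addresses shown are illustrative of my 192.168.3.x management network):

192.168.3.24   n00004
192.168.3.25   n00005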

With the node names being resolved to their 192.168.3.x addresses, I needed to change the way I launched MPI applications. Essentially this is just a case of saying which interface hostname to use. So in the mpd.hosts file I added entries like this:

n00005 ifhn=192.168.4.25
n00004 ifhn=192.168.4.24

Then bring up the ring of mpds specifying the local interface hostname:

# mpdboot --ifhn=192.168.4.1 -n 3 -v -f mpd.hosts
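Before going any further it's worth checking that the ring actually came up; mpdtrace lists the hosts that have joined it:

# mpdtrace -l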

I could then add the application network addresses I wanted to use to the machine file, which is just a list:

192.168.4.25
192.168.4.24

and run the application:

# mpirun -machinefile machines -n 4 ./xhpl

I hadn't needed to do any of this previously. I'm left with the uneasy feeling that I haven't got to the bottom of why the problem arose, or whether what I've done is really the solution or just a workaround. If I had 1,000 nodes, or 10,000 nodes, would Perceus expect me to add all those addresses to the /etc/hosts file? perceus-dnsmasq should handle that, shouldn't it?

Sunday, 18 January 2009

Installing Erlang

One of my key goals for my experimental cluster is to run (and program) distributed Erlang. The first task is simply to get Erlang installed on the nodes. Unfortunately, there is no CentOS rpm package for Erlang. This is a bit surprising: it means that there is no rpm package for RHEL either. It's not a big problem; we just need to install from the source code.

The first thing to do is download and unpack the source file. Do not do what I did and use File Roller, the GNOME archive manager. If you do, you get a make error. Instead, just follow the instructions in the README file (either of these commands will unpack the archive):

gunzip -c otp_src_R12B-5.tar.gz | tar xf -
zcat otp_src_R12B-5.tar.gz | tar xf -

To build Erlang I needed to install the ncurses and OpenSSL development libraries:

yum install ncurses-devel
yum install openssl-devel
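With those installed, the build itself is just the usual routine, run from the unpacked otp_src_R12B-5 directory (I kept the default /usr/local prefix):

cd otp_src_R12B-5
./configure
make
make install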

That done, Erlang built without any problems. But that's only half the job. The nice thing about running the same OS on both the "host" machine (where the nodes are managed from) and on the nodes themselves is that you can build software on the host and then just copy it to the nodes. The Perceus user guide states that you should be able to do something like this:

make DESTDIR=/mnt/centos-5.2-1.stateless.x86_64 install

I have to say, however, that I've never got this to work. Happily, Erlang's file structure is quite simple. The program files are in /usr/local/lib/erlang (by default) and there are a bunch of links in /usr/local/bin. Once Erlang is installed on the host you can copy the files to the mounted vnfs:

cp -r /usr/local/bin/* /mnt/centos-5.2-1.stateless.x86_64/usr/local/bin
cp -r /usr/local/lib/erlang /mnt/centos-5.2-1.stateless.x86_64/usr/local/lib

I rebooted a node, connected over ssh, and typed erl. Everything looks fine.
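For a slightly more convincing check than just seeing the shell start, you can ask the runtime which OTP release it is running (this is just a convenience one-liner of my own, not anything from the install instructions):

erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), init:stop().'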

Monday, 5 January 2009

HPL Tuning

You might think that the way to test the relative performance of different computer systems would be to give them all the same problem to solve, and then see which system solved the problem fastest. That is not how HPL works. Instead, the linear equations that get solved by the Linpack Benchmark are a means to an end, not an end in themselves: they are really just there to give the cluster's CPUs something to chew on. The aim is simply to run the CPUs as fast and as efficiently as possible. To this end, there are a number of parameters you can tweak in the HPL.dat configuration file to optimize or tune the way HPL runs.

There are 31 lines in the HPL.dat file, but some are definitely more important than others. In my, admittedly limited, experience the most crucial parameters are the "problem size" (N), the "block size" (NB), and the "process rows" and "process columns" (P and Q). Essentially, the "problem size" needs to fill up as much RAM as possible - without any of it being swapped out to disk, and bearing in mind that the operating system needs RAM too. 80% of total memory seems a good starting point; this is recommended in the HPL FAQs and seems to be borne out in practice. There's a nice graph on the Microwulf website showing how performance increases with problem size.
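As a rough back-of-the-envelope (my own rule of thumb, not something from the HPL documentation): each matrix element is a double-precision number taking 8 bytes, so

N ≈ sqrt(0.80 x total_RAM_in_bytes / 8)

For example, 8 GB of RAM across the cluster gives sqrt(0.80 x 8,000,000,000 / 8) ≈ 28000.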

The "block size" is best determined by testing, but I've found values between 160 and 256 to produce the best results, depending on the system. I have a feeling that a good block size is related to the size of the processor cache, but I've no evidence whatsoever...

The P and Q parameters specify the dimensions of the "process grid." Again, I've found the best values by trial and error - I've no theory.
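To make this concrete, here are the relevant lines of HPL.dat for the run described below (N=27648, NB=192, P=1, Q=8). This is only an excerpt; everything else in the file was left at the values shipped with HPL:

1            # of problems sizes (N)
27648        Ns
1            # of NBs
192          NBs
1            # of process grids (P x Q)
1            Ps
8            Qs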

So I fired up my cluster full of hope and expectation, only to find that the results were truly terrible! Running four nodes, the best I could do was just 1.94 Gflops (with N=20000, NB=120, PxQ=2x4). To put this in some sort of perspective, Microwulf - a four node cluster with 8 GB of RAM - is capable of delivering 26.25 Gflops. Earlier in the year, I ran HPL on Fedora 8 on my Dell D630 laptop, and achieved what I thought was an amazing 9.54 Gflops*. My laptop does have 4 GB of RAM, and the 2.20GHz Intel Mobile Core 2 Duo processor has a 4096 KB L2 cache, but that's nearly 10 Gflops on a laptop! I couldn't get 2 Gflops on the cluster... Clearly there was something badly wrong. But what?

To cut a very long story short, by experimenting with different values for NB, P and Q, and by carefully tuning N to the available RAM, I was able to increase the performance of 4 nodes running 8 processes to 10.43 Gflops. (WR00L2L4: N=27648, NB=192, P=1, Q=8.) This was a big improvement - and shows how important tuning HPL is - but it is still pretty poor.

You can measure the performance of a cluster by its Computational Efficiency. Computational Efficiency is simply the measured performance of the cluster (RMax) divided by its theoretical maximum peak performance (RPeak):

compEff = RMax/RPeak

RPeak can be estimated as the number of nodes in the cluster, multiplied by the number of cores per node, multiplied by the number of floating point units per core, multiplied by the clock speed. In the case of my test cluster, RPeak = 4 x 2 x 2 x 2.2 = 35.2 Gflops. With the cluster only delivering 10.43 Gflops, compEff = 10.43/35.2 = 29.6%. This is not a good number :-( As a comparison, "real" clusters like those in the TOP500 list are around 75% efficient. Microwulf runs at an exceptional 82%. My D630 comes in at 108% :-o But then it isn't a cluster...

I needed a "game changer." My cluster is built using AMD Athlon™ 64 X2 processors. I replaced the GotoBLAS library with the AMD Core Math Library (ACML), which is optimized for AMD processors. There is a really excellent article on using ACML with HPL here. The results were outstanding. Performance doubled to 21.24 Gflops. (WR00L2L2: N=27392, NB=192, P=2, Q=4.) Computational Efficiency doubled to 60.3%. All I had done was link HPL to a different library!
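For anyone wanting to do the same: the swap is made in HPL's architecture make file (Make.<arch> in the top-level HPL directory), where the LAdir/LAinc/LAlib variables point at the BLAS library. Something like the following - the ACML install path here is illustrative, so point it at wherever ACML actually lives on your system:

# In Make.<arch>: point the linear algebra section at ACML instead of GotoBLAS
LAdir        = /opt/acml/gfortran64
LAinc        =
LAlib        = $(LAdir)/lib/libacml.a

Then rebuild HPL with make arch=<arch>.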

The figures are still not brilliant, but at least they are approaching acceptability. Interestingly, Computational Efficiency is around 70% when running on any two nodes. There are still some things to try.

*WR00L2L2: N=18944, NB=256, P=1, Q=4. Time taken 475.22.