Tuesday, 23 December 2008

Show us yer FLOPS!

What's the first thing you want to know about your cluster? Or anyone else's cluster, come to that? You want to know how fast it is, right? The way to measure cluster performance is to count the FLOPS - FLoating point Operations Per Second - the cluster is capable of executing. Counting FLOPS is the biggest pissing competition in the world of computing. Right now the cluster that can piss higher up the wall than any other is Roadrunner, capable of 1.7 petaflops (that's a 1 with 15 noughts: a million billion FLOPS, or a million gigaflops).

The way to measure FLOPS is to run the High Performance Computing Linpack Benchmark - HPL. You can download HPL from Netlib. To build HPL you need two more things: an MPI implementation and a BLAS library. BLAS stands for Basic Linear Algebra Subprograms. GotoBLAS is recognized as a fast implementation of BLAS, so I downloaded that. I ran the GotoBLAS quickbuild.64bit script and everything seemed OK, so I left it there.

It is probably worth mentioning that HPL can use Fortran 77, so I installed the g77 "compat-gcc-34-g77-3.4.6-4" package.

Building HPL is a case of creating a makefile for your architecture. Fortunately, you can just edit one of the default files in the hpl "setup" subdirectory. I used the Make.Linux_PII_FBLAS file and set the MPI directory as follows:

MPdir = /opt/mpich2-install

and the Linear Algebra library like this:

LAdir = /home/David/Sources/GotoBLAS

Then it's just a case of calling make, specifying the right architecture:

make arch=Linux_PII

(Obviously the makefile name needs to match.)

The file you need to start benchmarking is xhpl, found (in my case) in the hpl/bin/Linux_PII_FBLAS/ subdirectory. We need to run this with mpirun. However, there's an issue. In order to run an application across multiple nodes, that application's binary (and any supporting libraries) needs to be on each node. Rebuilding a node image every time we want to run a new application is obviously out of the question, so what do we do? Fortunately, Perceus can come to our aid.

Perceus supports what is known as "Hybridization". Hybridization is essentially file sharing. The idea is that files or folders in the VNFS image are replaced by symbolic links pointing to network-based files or folders. Unfortunately, it is at this point that Perceus' careful abstraction of the node file system falls down. To specify which files or folders get redirected you have to get into the guts of how Perceus organizes things.

I want to create a shared directory where I can put the binaries I want to run with MPI. This directory is going to be /opt/mpirun. Importantly, this directory has to exist on the head node as well as on all the cluster nodes. The first step is to add /opt/mpirun to the hybridize configuration file located at /etc/perceus/vnfs/vnfs_capsule_name. However, the /opt/mpirun directory specified in the hybridize file is not the /opt/mpirun directory on the host machine (the head node). Oh no, this is the /opt/mpirun directory located on the physical representation of the VNFS file system that underlies the "virtual" VNFS file system of the nodes. In reality this is located at /var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun, and it is this directory that actually gets shared. So, because I want /opt/mpirun to exist on the head node as well, /opt/mpirun on the head node has to be a symbolic link back to /var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun. Not pretty. Finally, you need to mount the VNFS image and edit its /etc/fstab so that the nodes connect to the share.
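
Concretely, the pieces look something like this. This is a sketch only: "vnfs_capsule_name" and "head_node" are placeholders for your own capsule and head node names, and the fstab line assumes the rootfs directory is exported from the head node over NFS (adjust to whatever sharing mechanism your Perceus setup uses):

```
# Entry added to /etc/perceus/vnfs/vnfs_capsule_name/hybridize:
/opt/mpirun

# Symlink created on the head node:
#   ln -s /var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun /opt/mpirun

# Entry added to /etc/fstab inside the mounted VNFS image (assumes NFS export):
head_node:/var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun  /opt/mpirun  nfs  defaults  0 0
```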

That done, we are ready to start testing the performance of the cluster by running xhpl. (Well, almost: I first had to copy the libg2c.so.0 library to /usr/lib64/ on the nodes.) Copy xhpl and HPL.dat to /opt/mpirun and go...
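
Incidentally, HPL.dat is where the benchmark parameters live. The interesting lines look something like this - an annotated excerpt, not a complete file, and the values are illustrative starting points rather than anything tuned for a particular cluster:

```
1            # of problems sizes (N)
10000        Ns      (problem size: bigger uses more memory and runs longer)
1            # of NBs
128          NBs     (block size)
1            # of process grids (P x Q)
2            Ps
2            Qs      (P x Q must match the number of MPI processes you start)
```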

Monday, 22 December 2008

Installing MPI

MPI stands for Message Passing Interface. When different instances of a process - usually running on different nodes - need to talk to one another, they do so using MPI. It has become the de facto standard for this kind of communication.

There are various implementations of MPI. I'm using MPICH2. Building and installing MPICH2 on the head node is no problem - just follow the instructions in the "From A Standing Start to Running an MPI Program" section of the Installer's Guide. The only thing you need to keep in mind is that you will need to replicate the install on all the nodes. For this reason I installed to /opt/mpich2-install via configure --prefix. I could then copy the mpich2-install directory to the mounted VNFS image:

perceus vnfs mount centos-5.1-1.stateless.x86_64
cp -r /opt/mpich2-install /mnt/centos-5.1-1.stateless.x86_64/opt/
perceus vnfs umount centos-5.1-1.stateless.x86_64

The key to getting MPI working, however, is to be able to ssh onto a node from the head node, and onto the head node from a node, without needing to enter a password. You must be able to ssh both ways. Happily, Perceus sets up the connection from the head node to the node, so you don't have to do anything there. But to ssh onto the head node from a node - without needing a password - you have to do some work. On the node:

[root@node ~]# ssh-keygen -t rsa
[root@node ~]# cat .ssh/id_rsa.pub | ssh root@head_node 'cat >> .ssh/authorized_keys'

This generates a private/public key pair for root on the node and copies the public key to the head node so that root will no longer need to enter a password when using ssh to connect from the node. That's fine, except the cluster nodes are stateless - they don't have their own hard drives - so the next time the node reboots the configuration will be lost. The solution is to copy the keys and ssh settings back to the node VNFS image. If the VNFS image is mounted, you can do this:

[root@node ~]#scp -r ./.ssh head_node:/mnt/centos-5.1-1.stateless.x86_64/root

Keep in mind that a .ssh directory probably already exists, so you might want to get it out of the way first. But that's it - job done!
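
Moving the existing directory aside first could look something like this - a minimal sketch; the function name is mine, not anything Perceus provides, and the example path is just where my mounted image happens to live:

```shell
#!/bin/sh
# Move an existing .ssh directory out of the way (to .ssh.bak) before the
# new one gets copied in. Takes the home directory to operate on as its
# only argument.
backup_dotssh() {
    dir="$1"
    if [ -d "$dir/.ssh" ]; then
        mv "$dir/.ssh" "$dir/.ssh.bak"
    fi
}

# e.g. backup_dotssh /mnt/centos-5.1-1.stateless.x86_64/root
```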

There are a few things worth noting here. You only have to generate the public/private key pair once. The same keys work for all nodes - they are not tied to the host name of the node (which I thought they might be). This also means the keys work regardless of which NIC your MPI traffic is going over. (More of which later.) As for MPI itself, all you have to do is make sure that the files are in the same place on each system, and it does the rest. Very smart. After that it is just a case of starting an mpd on each node:

mpdboot -n 5 -f mpd.hosts

Running mpdtrace should now show you the names of your nodes.
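
The mpd.hosts file referred to above is just a list of the machines mpdboot should start an mpd on, one hostname per line. The names below are made up - use whatever your nodes are actually called. Note that the count given to -n includes the local machine, so -n 5 means the head node plus the four nodes listed:

```
n0000
n0001
n0002
n0003
```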

Let's go clustering...

The reason I need CentOS is because I want to experiment with clustering. I first looked at clustering earlier this year, building a small 4 node beowulf cluster by following the "Configuration Notes" for Joel Adams' inspirational Microwulf. However, the Microwulf model is not easily scalable: for example, you need to manually create and configure a file partition on the host for each of the nodes. I want to look at something more industrial. After reading this article on the Linux Magazine website, I thought Perceus sounded just what I needed.

Perceus is an "enterprise and cluster provisioning toolkit" and supersedes the older Warewulf provisioning tools.

Perceus turned out to be pretty easy to build and install. I ended up needing to download and build all of the dependencies from the Perceus website, but nothing too onerous. I then downloaded, and "imported" into Perceus, the Caos NSA 1 VNFS "capsule". Let me just unpack that :-) Caos is a high performance, lightweight distribution of Linux. (NSA stands for "Node, Server, Appliance".) VNFS stands for Virtual Node File System. The idea is that you package up an Operating System - like Caos - into a VNFS capsule which you can then easily distribute, run and manage on your cluster nodes with Perceus. In fact the VNFS system works really well.

My "head node", running Perceus on top of CentOS 5.2, has three network cards. One NIC talks to the outside world, while the other two talk to the cluster. To allow this, I opened up the firewall completely on the two internal network connections by adding the following lines to /etc/sysconfig/iptables (and then restarting the firewall with service iptables restart):

-A RH-Firewall-1-INPUT -i eth0 -j ACCEPT
-A RH-Firewall-1-INPUT -i eth1 -j ACCEPT

That done, I started a node with a monitor attached and watched it boot into Caos. Very cool. Except that's when my problems began...

I ran perceus node status on the head node to see what state Perceus thought my node was in. Unfortunately it showed "init" and not "ready". Then I remembered that I hadn't installed provisiond on the node image. provisiond is a client-side daemon that runs on each node and talks to perceus (running on the head node) to let it know what is going on with the node.

Following the instructions in the Perceus "User Guide" I spent the best part of two DAYS trying to install provisiond. It should be as easy as this:

rpm -ivh --root /mnt/caos-nsa-node-1.0-1.stateless.x86_64 \

The problem is that Caos is so high performance and so lightweight that it doesn't seem to have a working version of rpm - or any other package manager - installed. Most of those two days were spent trying to install rpm, or trying to work out what I'd missed in the User Guide. I hadn't missed anything: the User Guide is simply wrong.

Towards the end of the second day I thought I had better just check that provisiond wasn't already installed on the Caos image. It was. So the User Guide is doubly wrong: the wrong instructions for something that didn't need to be done in the first place. My node status problem had nothing to do with provisiond.

Nevertheless, the lack of a package manager is a big problem for me. One of the things most people will want to do is run MPI-based applications on their cluster. That means you have to install MPI on the nodes. MPICH2 depends on Python (its mpd process manager is written in Python). Python isn't installed on Caos, so how are you going to install it? Not with rpm or yum or apt-get, that's for sure. Want to build it from source? How are you going to install a compiler? Perhaps that isn't an issue; perhaps you can compile it on the head node and use ./configure --prefix or something to install it to the mounted VNFS image. Are you sure all the libraries are going to be there?

Maybe this problem doesn't arise if you are also running Caos on the head node - I don't know. I gave up on Caos and used the centos-5.1-genchroot.sh script in /usr/share/perceus/vnfs-scripts (not "vnfs-tools" as it says in the User Guide) to create a CentOS VNFS image. The script worked perfectly and provisiond installed instantly, first time.

My node status problem was down to a combination of two things: firstly, a DNS problem, fixed by pointing the nameserver entry in the nodes' /etc/resolv.conf at the head node; and secondly, the eth0 and eth1 device IP entries being in the wrong order in the /etc/perceus/modules/ipaddr configuration file. Easy when you know how...
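
For the DNS half of the fix, the nodes' resolv.conf just needs to point at the head node's internal address - something like this, where 10.0.0.1 is a made-up stand-in for whatever address your head node uses on the cluster network:

```
# /etc/resolv.conf on the nodes
nameserver 10.0.0.1
```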

Wednesday, 3 December 2008

All CentOS is theft... Or is it?

Up to now I have been running Ubuntu as my Linux server OS of choice: for me, the Fedora stack is updated far too frequently for it to be a viable server option. However, for reasons that will become apparent in future posts, I need to run a Red Hat type server. The problem is I don't want to pay for it... Wouldn't it be good if I could get hold of the Red Hat Enterprise code, without having to pay for support? Enter The Community Enterprise Operating System:

CentOS is an Enterprise-class Linux Distribution derived from sources freely provided to the public by a prominent North American Enterprise Linux vendor. CentOS conforms fully with the upstream vendor's redistribution policy and aims to be 100% binary compatible. (CentOS mainly changes packages to remove upstream vendor branding and artwork.)
That "prominent North American Enterprise Linux vendor" is, of course, Red Hat. The CentOS FAQs make the following points:
  • CentOS-x is NOT Red Hat® Linux, it is NOT Fedora™ Core. It is NOT Red Hat® Enterprise Linux. It is NOT RHEL.
  • CentOS-x does NOT contain Red Hat® Linux, Fedora™ Core, or Red Hat® Enterprise Linux.
  • CentOS is built from publicly available open source SRPMS.
And so, dear reader, we enter the bizarre world of Open Source licensing. I am not saying that the folks at CentOS are a bunch of liars and thieves - clearly what they are doing is perfectly acceptable in the Open Source world. But where else would it be acceptable? What if I go and get a can of Diet Coke, scrub off the printing, and then put my own label on it? In what sense would that NOT be Diet Coke? Would The Coca-Cola Company not sue my sorry ass if I tried to pass this stuff off as my own - even if I was giving it away for free?

Now any Open Sourcers out there reading this may well be saying you just don't get it. And you know what? I don't. I write software for a living. If someone took my code, changed the logos, and then passed it off unchanged as their own, I would have some issues with that. But Open Source software isn't like that, right? It is a community effort, right? RHEL a community effort? The creation of those SRPMS files a community effort? I don't think so. (I'd better stop there, I'm starting to sound like Lewis Black from The Daily Show...)

So I'm not going to use CentOS? Wrong! I'm definitely going to use CentOS. He who lives by the sword dies by the sword. If Red Hat want to play in a world where software has no value, that's up to them.

Going /home

With the release of Fedora 10 I thought I would take another look at running Linux on my laptop. I haven't changed my mind, however: I stick by my assertion that Linux is a great workbench, but a lousy desktop. I love working on Linux, but I'll be installing Fedora 10 on a separate hard drive.

One of the things that might have caused my previous installation of Fedora to break so badly was that I upgraded it: from version 7 to version 8. It turns out that just because you can upgrade a system doesn't mean that you should: the Release Notes state, "In general, fresh installations are recommended over upgrades." But if the OS is changing every 6 months, what do you do? Backing up your data (and all your settings) every time and then restoring it all to your new system is a serious pain. One answer to this is to create a separate /home partition. By doing this you can keep your stuff out of harm's way when the system and the applications get nuked by the installer.

Creating a separate /home partition is beautifully easy on Fedora 10. When the installer gets to the bit about "Select which drive(s) to use for this installation", you just need to check the "Review and modify partitioning layout" box. On the next page, click the "New" button and then add a new partition with /home as the mount point; the only other thing you need to do is set the size. I also checked "Encrypt" - meaning my personal data would be encrypted :-)

The Fedora installer does everything else for you - including resizing the other partitions so that your new /home partition fits.

Cool. Except before I could do this, I still needed to backup my data and settings from my old Fedora installation that didn't have a separate /home partition. To get this done, I created an archive of my old home directory:

#cd /home
#tar cfv David.tar David/

I then burnt David.tar to a DVD. Once Fedora 10 was installed I created my new "David" user account and let Fedora create a home directory. I then copied my David.tar file from the DVD to /home:

#cd /media
#cd "Personal Data, Dec 01, 2008"
#cp David.tar /home

I then deleted the "David" home directory that Fedora had created for me, and recreated it from the .tar file:

#cd /home
#rm -rf /home/David
#tar xvf ./David.tar

Finally, I needed to make my user account the owner of the directory.

#chown -hR David /home/David

I logged on, and it worked! My desktop appeared just as I remembered it. Maybe I didn't need to create a /home partition after all...

Footnote. Sadly, though, not everything worked. I had hoped that the version of Evolution on Fedora 10 would just pick up all my email. No chance. It starts some conversion process... and then crashes every time. What a piece of crap that application is. The lack of an enterprise class mail client is one of the biggest failures of the Linux desktop.