Tuesday, 23 December 2008

Show us yer FLOPS!

What's the first thing you want to know about your cluster? Or anyone else's cluster, come to that? You want to know how fast it is, right? The way to measure cluster performance is to count the FLOPS - the FLoating point Operations Per Second - the cluster is capable of executing. Counting FLOPS is the biggest pissing competition in the world of computing. Right now the cluster that can piss higher up the wall than any other is RoadRunner, capable of a peak of 1.7 petaflops (a petaflop is a 1 with 15 noughts - a million billion FLOPS, or a million gigaflops).
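Before benchmarking anything it's worth knowing your cluster's theoretical peak, since HPL results are usually quoted as a fraction of it. A rough back-of-the-envelope sketch - every number below is a made-up example value, not any real machine's spec:

```shell
# Rough theoretical peak for a hypothetical cluster.
# All values are illustrative placeholders.
NODES=4            # number of compute nodes
CORES=2            # cores per node
GHZ=2              # clock speed in GHz (integer, for shell arithmetic)
FLOPS_PER_CYCLE=4  # e.g. SSE2 can retire 4 double-precision FLOPs/cycle

# peak GFLOPS = nodes * cores * GHz * FLOPs-per-cycle
PEAK_GFLOPS=$((NODES * CORES * GHZ * FLOPS_PER_CYCLE))
echo "Theoretical peak: ${PEAK_GFLOPS} GFLOPS"
```

Whatever number HPL reports, it will be some percentage of this peak - the closer to it, the better tuned your build and your HPL.dat are.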

The way to measure FLOPS is to run the High-Performance Linpack benchmark - HPL. You can download HPL from Netlib. To build HPL you need two more things - an MPI implementation and a BLAS library. BLAS stands for Basic Linear Algebra Subprograms. GotoBLAS is recognized as a fast implementation of BLAS, so I downloaded that. I ran the GotoBLAS quickbuild.64bit script and everything seemed OK, so I left it there.
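For reference, the fetch-and-build steps looked roughly like this. The directory layout and archive name are from my setup and may differ for you - treat them as placeholders:

```shell
# Build GotoBLAS using its bundled 64-bit quick-build script
# (source directory is illustrative).
cd ~/Sources/GotoBLAS
./quickbuild.64bit

# Unpack the HPL tarball from Netlib alongside it
# (archive name is a placeholder for whatever you downloaded).
cd ~/Sources
tar xzf hpl.tgz
cd hpl
```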

It is probably worth mentioning that HPL can link against a Fortran 77 BLAS, so I installed the g77 "compat-gcc-34-g77-3.4.6-4" package.

Building HPL is a case of creating a makefile for your architecture. Fortunately, you can just edit one of the template files in HPL's "setup" subdirectory. I used the Make.Linux_PII_FBLAS file and set the MPI directory as follows:

MPdir = /opt/mpich2-install

and the Linear Algebra library like this:

LAdir = /home/David/Sources/GotoBLAS
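Putting those two together, the relevant section of my Make.Linux_PII_FBLAS ended up along these lines. The MPinc/MPlib/LAlib values are the stock ones from the template, adjusted to the paths above - your MPI and BLAS library file names may well differ:

```make
# ---- MPI (MPICH2 install prefix) ----
MPdir        = /opt/mpich2-install
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpich.a

# ---- Linear algebra (GotoBLAS) ----
LAdir        = /home/David/Sources/GotoBLAS
LAlib        = $(LAdir)/libgoto.a
```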

Then it's just a case of calling make, specifying the right architecture:

make arch=Linux_PII_FBLAS

(Obviously the arch value needs to match the suffix of your Make.<arch> file.)

The file you need to start benchmarking is xhpl, which ends up in (in my case) the hpl/bin/Linux_PII_FBLAS/ subdirectory. We need to run this with mpirun. However, there's an issue. In order to run an application across multiple nodes, that application's binary (and any supporting libraries) needs to be on each node. Rebuilding a node image every time we want to run a new application is obviously out of the question, so what do we do? Fortunately Perceus can come to our aid.

Perceus supports what is known as "Hybridization". Hybridization is essentially file sharing. The idea is that files or folders in the VNFS image are replaced by symbolic links pointing to network-based files or folders. Unfortunately, it is at this point that Perceus' careful abstraction of the node file system falls down. To specify which files or folders get redirected you have to get into the guts of how Perceus organizes things.

I want to create a shared directory where I can put the binaries I want to run with MPI. This directory is going to be /opt/mpirun. Importantly, this directory has to exist on the head node, as well as all the cluster nodes. The first step is to add /opt/mpirun to the hybridize configuration file located at /etc/perceus/vnfs/vnfs_capsule_name.

However, the /opt/mpirun directory specified in the hybridize file is not the /opt/mpirun directory on the host machine (the head node). Oh no, this is the /opt/mpirun directory located on the physical representation of the VNFS file system that actually underlies the "virtual" VNFS file system of the nodes. In reality this is located at /var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun. It is this directory that actually gets shared. So because I want /opt/mpirun to exist on the head node as well, /opt/mpirun on the head node has to be a symbolic link back to /var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun. Not pretty. Finally, you need to mount the VNFS image and edit /etc/fstab so that the node connects to the share.
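Condensed into commands, the whole hybridization dance looks something like this. The capsule name vnfs_capsule_name is a placeholder, the fstab line assumes an NFS export from a head node called "headnode", and the mount point Perceus uses for the VNFS image may differ on your system - check before copying any of this:

```shell
# 1. Add /opt/mpirun to the capsule's hybridize configuration
#    (path as described above; placeholder capsule name).
echo "/opt/mpirun" >> /etc/perceus/vnfs/vnfs_capsule_name

# 2. Create the real directory inside the capsule's root file system...
mkdir -p /var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun

# 3. ...and point the head node's /opt/mpirun back at it.
ln -s /var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun /opt/mpirun

# 4. Mount the VNFS image and add an entry to its etc/fstab so the
#    nodes mount the share at boot (server name and options are examples).
#    headnode:/var/lib/perceus/vnfs/vnfs_capsule_name/rootfs/opt/mpirun \
#        /opt/mpirun nfs defaults 0 0
```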

That done, we are ready to start testing the performance of the cluster by running xhpl. (Well, almost, I had to copy the libg2c.so.0 library to /usr/lib64/ on the nodes first.) Copy xhpl and HPL.dat to /opt/mpirun and go...
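The final run is then something along these lines. The process count is just an example - it should match the Ps x Qs grid you set in HPL.dat:

```shell
# Stage the benchmark binary and its input file into the shared directory
# (source path matches the build location above).
cp ~/Sources/hpl/bin/Linux_PII_FBLAS/xhpl    /opt/mpirun/
cp ~/Sources/hpl/bin/Linux_PII_FBLAS/HPL.dat /opt/mpirun/

# Launch across the nodes; -np 4 is an example process count.
cd /opt/mpirun
mpirun -np 4 ./xhpl
```

HPL prints a results table at the end; the Gflops column is the number you get to brag about.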
