Monday, 22 December 2008

Let's go clustering...

I need CentOS because I want to experiment with clustering. I first looked at clustering earlier this year, building a small four-node Beowulf cluster by following the "Configuration Notes" for Joel Adams' inspirational Microwulf. However, the Microwulf model does not scale easily: for example, you need to manually create and configure a file partition on the host for each of the nodes. I want to look at something more industrial. After reading this article on the Linux Magazine website, I thought Perceus sounded like just what I needed.

Perceus is an "enterprise and cluster provisioning toolkit" and supersedes the older Warewulf provisioning tools.

Perceus turned out to be pretty easy to build and install. I ended up needing to download and build all of the dependencies from the Perceus website, but nothing too onerous. I then downloaded, and "imported" into Perceus, the Caos NSA 1 VNFS "capsule". Let me just unpack that :-) Caos is a high performance, lightweight distribution of Linux. (NSA stands for "Node, Server, Appliance".) VNFS stands for Virtual Node File System. The idea is that you package up an Operating System - like Caos - into a VNFS capsule which you can then easily distribute, run and manage on your cluster nodes with Perceus. In fact the VNFS system works really well.

My "head node", running Perceus on top of CentOS 5.2, has three network cards. One NIC talks to the outside world, while the other two (eth0 and eth1) talk to the cluster. To allow this, I opened up the firewall completely on the two internal network connections by adding the following lines to /etc/sysconfig/iptables:

-A RH-Firewall-1-INPUT -i eth0 -j ACCEPT
-A RH-Firewall-1-INPUT -i eth1 -j ACCEPT
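
One thing worth checking is where the new lines sit in the file: on a stock CentOS 5 firewall the RH-Firewall-1-INPUT chain ends with a catch-all REJECT rule, and anything placed after it never matches. The helper below is a minimal sketch, not something from the Perceus documentation; the function name accept_rule_ok is made up for illustration, and it assumes the rules are written exactly as shown above.

```shell
# accept_rule_ok FILE IFACE   (hypothetical helper name)
# Succeeds if FILE contains the ACCEPT rule for IFACE in the
# RH-Firewall-1-INPUT chain and that rule appears before the
# chain's first REJECT rule. Fails otherwise.
accept_rule_ok() {
    awk -v want="-A RH-Firewall-1-INPUT -i $2 -j ACCEPT" '
        # Exact match on the rule line we are looking for.
        $0 == want { found = 1; exit }
        # Reaching the REJECT rule first means our rule is too late.
        /^-A RH-Firewall-1-INPUT .*-j REJECT/ { exit }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

Something like `accept_rule_ok /etc/sysconfig/iptables eth0 && echo "eth0 ok"` would confirm the placement, after which `service iptables restart` reloads the rules.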

That done, I started a node with a monitor attached and watched it boot into Caos. Very cool. Except that's when my problems began...

I ran perceus node status on the head node to see what state Perceus thought my node was in. Unfortunately it showed "init" and not "ready". Then I remembered that I hadn't installed provisiond on the node image. provisiond is a client-side daemon that runs on each node and talks to Perceus (running on the head node) to let it know what is going on with the node.

Following the instructions in the Perceus "User Guide" I spent the best part of two DAYS trying to install provisiond. It should be as easy as this:

rpm -ivh --root /mnt/caos-nsa-node-1.0-1.stateless.x86_64 \

The problem is that Caos is so high performance and so lightweight that it doesn't seem to have a working version of rpm - or any other package manager - installed. Most of those two days were spent trying to install rpm, or trying to work out what I'd missed in the User Guide. I hadn't missed anything: the User Guide is simply wrong.

Towards the end of the second day I thought I had better just check that provisiond wasn't already installed on the Caos image. It was. So the User Guide is doubly wrong: the wrong instructions for something that didn't need to be done in the first place. My node status problem had nothing to do with provisiond.
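
With hindsight, a thirty-second search of the mounted image would have saved me most of those two days. Here is a minimal sketch of that check. The function name find_in_image is my own invention, and it assumes the VNFS image is mounted at the path used in the rpm command above.

```shell
# find_in_image ROOT NAME   (hypothetical helper name)
# Prints every regular file under ROOT whose name starts with NAME.
# Returns non-zero if nothing matches.
find_in_image() {
    matches=$(find "$1" -type f -name "$2*" 2>/dev/null)
    [ -n "$matches" ] && printf '%s\n' "$matches"
}
```

For example, `find_in_image /mnt/caos-nsa-node-1.0-1.stateless.x86_64 provisiond` lists any provisiond files already in the image, and prints nothing (with a failing exit status) if there are none.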

Nevertheless, the lack of a package manager is a big problem for me. One of the things most people will want to do is run MPI-based applications on their cluster. That means you have to install MPI on the nodes. The MPI setup I have in mind depends on Python. Python isn't installed on Caos, so how are you going to install it? Not with rpm or yum or apt-get, that's for sure. Want to build it from source? How are you going to install a compiler? Perhaps that isn't an issue; perhaps you can compile it on the head node and use ./configure --prefix or something to install it to the mounted VNFS image. Are you sure all the libraries are going to be there?
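
That last question can at least be answered mechanically. The sketch below is an untested-on-Caos idea rather than a recipe: it takes a list of absolute library paths on standard input and prints the ones that do not exist under a given image root. The name missing_under_root is hypothetical, and it assumes the head node and the node image use the same library paths, which may well not hold when the two run different distributions.

```shell
# missing_under_root ROOT   (hypothetical helper name)
# Reads absolute paths, one per line, from standard input and
# prints each path that does not exist when prefixed with ROOT.
missing_under_root() {
    while IFS= read -r path; do
        [ -e "$1$path" ] || printf '%s\n' "$path"
    done
}
```

Feeding it from ldd on the head node would look something like `ldd ./some-binary | awk '$3 ~ /^\// { print $3 }' | missing_under_root /mnt/caos-nsa-node-1.0-1.stateless.x86_64`, where some-binary stands for whatever you have just built. For packages whose Makefile honours DESTDIR, `make install DESTDIR=<image root>` is usually a cleaner way to install into the image than changing --prefix, because the program still believes it lives under its normal prefix once the node boots.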

Maybe this problem doesn't arise if you are also running Caos on the head node - I don't know. I gave up on Caos and used the script in /usr/share/perceus/vnfs-scripts (not "vnfs-tools" as it says in the User Guide) to create a CentOS VNFS image. The script worked perfectly and provisiond installed first time.

My node status problem was down to a combination of two things. The first was a DNS problem, fixed by pointing the nameserver entry in /etc/resolv.conf on the nodes at the head node. The second was the order of the eth0 and eth1 device IP entries in the /etc/perceus/modules/ipaddr configuration file, which I had to get right. Easy when you know how...
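
The DNS half of that is easy to check from the head node before booting anything. This is a minimal sketch with a made-up function name (first_nameserver); the address 10.0.0.1 in the usage line is a placeholder for your head node's cluster-side IP, not a value from my setup.

```shell
# first_nameserver FILE   (hypothetical helper name)
# Prints the address on the first "nameserver" line of a
# resolv.conf-style file. Prints nothing if there is none.
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "$1"
}
```

Pointed at the resolv.conf inside the mounted VNFS image, a test such as `[ "$(first_nameserver <image root>/etc/resolv.conf)" = "10.0.0.1" ] || echo "nodes will not resolve via the head node"` flags the problem before a node ever reaches the "init" state.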

1 comment:

u92 said...

I've been working on a Caos cluster for a couple of days. I dumped my CentOS cluster to make the change to the capsule deployment. So far Caos has been fighting me all the way.
For starters, there is no quick partition manager for the install on the head node. I’m running an IBM eServer 345 with 6 hard drives in it.
Second is the sidekick mod for fast deployment. It simply doesn’t work properly. It's just a buggy script. Even the adduser scripts do not work.
As for the method of capsule deployment, I’ve had no problems. With Caos running on the head node I can open and compile into the image, then deploy it in minutes.
I would like to see a setup script with a simple rundown for deploying the image, one that would simply check off adduser, passwd and things like that, instead of building each node for clients. Not to mention running a cPanel for them.
Since it's on a VFS, it should be no problem for the developers to do.
Good luck.