Sunday, 25 January 2009

DNS Disaster!

I needed to change the configuration of one of the network cards on the "host" machine. (I tend to use the word "host" rather than "head node" because I don't really think of it as a node: it's not where any applications will run. I suppose I could also use the more Perceus-like "master".) Because I come from a Windows background, I still tend to look for a GUI rather than use the command line, so I ran system-config-network. Big, big mistake. Suddenly I could no longer ping any of the cluster nodes by name.

When I set up Perceus I'm sure I did no more than follow the instructions in the User Guide and add

nameserver 127.0.0.1

to the /etc/resolv.conf file on the host. (I also had to put the host address in the /etc/resolv.conf file on the nodes, of course; if your nodes boot slowly, you probably forgot.) system-config-network wiped my resolv.conf file, so I added the nameserver line back in and then ran /etc/init.d/perceus reload. No good. Looking in /var/log/messages (via gnome-system-log, of course!) showed this when I did the reload:

perceus-dnsmasq[5286]: ignoring nameserver 127.0.0.1 - local interface

But is this a problem? I read somewhere that it isn't. However, I haven't seen the message again on the occasions since when things have worked.
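Since system-config-network can rewrite /etc/resolv.conf behind your back, a small helper that puts the local nameserver line back might save some head-scratching. This is a minimal sketch, not part of Perceus, and the function name ensure_local_nameserver is hypothetical. It writes the line first because the resolver tries nameserver entries in the order they are listed and only moves on to the next one if a server does not respond.

```shell
#!/bin/sh
# Sketch (hypothetical helper): make sure "nameserver 127.0.0.1" is present
# in a resolv.conf-style file, adding it as the FIRST line if it is missing.
# If the line is already there (in any position) the file is left alone.
# The path is a parameter so it can be tried on a copy before the real file.
ensure_local_nameserver() {
    conf="${1:-/etc/resolv.conf}"
    # -q: quiet, -x: whole-line match, -F: fixed string (dots are literal)
    if ! grep -qxF 'nameserver 127.0.0.1' "$conf" 2>/dev/null; then
        # Local entry first, then any existing contents; replace the file
        # only if writing the temporary copy succeeded.
        { echo 'nameserver 127.0.0.1'; if [ -f "$conf" ]; then cat "$conf"; fi; } \
            > "$conf.tmp" && mv "$conf.tmp" "$conf"
    fi
}

# Usage (as root), followed by the reload described above:
#   ensure_local_nameserver /etc/resolv.conf
#   /etc/init.d/perceus reload
```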

It struck me that I was assigning static IP addresses to my nodes in /etc/perceus/modules/ipaddr. Does perceus-dnsmasq pick these up, or do I need to add them to the /etc/hosts file? I didn't need to before. I also noticed that the messages log was filling up with entries like

perceus[5295]: ERROR: Unknown node just tried to contact us! (NodeID=00:1C:F0:6E:C8:53)

This was despite the fact that, for a few minutes after booting, there were no errors and perceus node status showed the node as "ready" and responding regularly, only for it to stop responding later. It was also despite the fact that other nodes, which showed up as "init", were not generating "Unknown node" errors.
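To see which nodes are producing these errors, and how often, the log entries can be summarized by NodeID. A minimal sketch, assuming the message text matches the log line above exactly; the function name unknown_node_counts is made up for illustration.

```shell
#!/bin/sh
# Sketch (hypothetical helper): count "Unknown node" errors per NodeID in a
# syslog file, most frequent first. Defaults to /var/log/messages; pass
# another path to inspect a copy of the log.
unknown_node_counts() {
    log="${1:-/var/log/messages}"
    # Keep only the NodeID inside "(NodeID=...)", then count duplicates
    grep 'Unknown node just tried to contact us' "$log" \
        | sed -n 's/.*(NodeID=\([0-9A-Fa-f:]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (as root): unknown_node_counts /var/log/messages
```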

At this point I was just a bit confused...

Time to get back to some certainties. The first thing to do was to get DNS working. I use two "networks" for the cluster. The "management" traffic is sent over a network card assigned a static 192.168.3.x address. The application traffic is sent over a second network card with a static 192.168.4.x address. How do you make sure the right card gets the right IP address? You run /sbin/ifconfig on the node to find the order of the network devices and edit /etc/perceus/modules/ipaddr accordingly.

Easy, right? Well, no. I've found that the name each network device on a node gets (eth0, eth1, etc.) can change between boots, after the host has been rebooted, or after the VNFS image has been updated.
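Because a card's MAC address does not change when its ethN name does, recording the MAC addresses may be a more reliable way to tell the two cards apart than relying on the order ifconfig shows. A minimal sketch, assuming a Linux node with sysfs mounted at /sys; list_iface_macs is a made-up name, and ifconfig's HWaddr column shows the same information.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print "interface MAC" for every network
# device found under a sysfs-style directory (default /sys/class/net).
# The directory is a parameter so the function can be tested on a fake tree.
list_iface_macs() {
    root="${1:-/sys/class/net}"
    for dev in "$root"/*; do
        # Skip anything without a readable address file
        [ -r "$dev/address" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/address")"
    done
}

# Usage, on the node: list_iface_macs
```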

I edited the /etc/perceus/modules/ipaddr file so that the management card would get a "Default" address. Perceus first looks in the /etc/hosts file for the node name; if it doesn't find one, it assigns a DHCP address. However, the DHCP address that is assigned is not the same as the address assigned at boot! As a result, when I tried to ping the node, the node name was resolved to the boot-time address again. This cannot be right. There must be something wrong with perceus-dnsmasq, and if there isn't, there should be.

So I added the node name to the /etc/hosts file with its static 192.168.3.x address. Finally, things started to work. At least, I was able to ping the nodes by name. However, perceus node status was still not being updated and I was still getting "unknown node" errors. I will leave that investigation to another post.
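For more than a handful of nodes, typing these entries by hand gets tedious. A minimal sketch that generates them: gen_hosts is a hypothetical helper, and the addressing rule (last octet = offset + node number) is only an assumption based on the n00004 to 192.168.4.24 and n00005 to 192.168.4.25 pairs used with mpd.hosts in this post, not something Perceus defines. Note that a single 192.168.3.x network holds at most 254 host addresses, so a much larger cluster would need a different scheme.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print /etc/hosts lines "ADDRESS NODENAME"
# for nodes FIRST..LAST, named n00000-style as in this post.
# ASSUMPTION: last octet = OFFSET + node number; adjust to the addresses
# actually assigned in /etc/perceus/modules/ipaddr.
gen_hosts() {
    net="$1"     # first three octets, e.g. 192.168.3
    offset="$2"  # hypothetical offset added to the node number
    first="$3"
    last="$4"
    i="$first"
    while [ "$i" -le "$last" ]; do
        printf '%s.%d n%05d\n' "$net" $((offset + i)) "$i"
        i=$((i + 1))
    done
}

# Usage (check the output before appending it as root):
#   gen_hosts 192.168.3 20 4 5 >> /etc/hosts
```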

With the node names being resolved to their 192.168.3.x addresses, I needed to change the way I launched MPI applications. Essentially this is just a case of saying which interface hostname to use. So in the mpd.hosts file I added entries like this:

n00005 ifhn=192.168.4.25
n00004 ifhn=192.168.4.24
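Entries in this form can also be generated rather than typed. A minimal sketch: gen_mpd_hosts is a hypothetical helper, and the rule that the last octet is 20 plus the node number is only inferred from the two entries above.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print mpd.hosts lines "NODE ifhn=ADDRESS"
# for nodes FIRST..LAST, so each mpd uses the application-network address.
# ASSUMPTION: last octet = OFFSET + node number (matches the two examples).
gen_mpd_hosts() {
    net="$1"; offset="$2"; first="$3"; last="$4"
    i="$first"
    while [ "$i" -le "$last" ]; do
        printf 'n%05d ifhn=%s.%d\n' "$i" "$net" $((offset + i))
        i=$((i + 1))
    done
}

# Usage: gen_mpd_hosts 192.168.4 20 4 5 > mpd.hosts
```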

Then I brought up the ring of mpds, specifying the local interface hostname:

# mpdboot --ifhn=192.168.4.1 -n 3 -v -f mpd.hosts
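Here -n 3 is the two hosts in mpd.hosts plus the local machine: mpdboot counts the mpd on the host where it is run as part of the total. A minimal sketch (mpd_count is a made-up name) that derives the number from mpd.hosts, so it stays right as nodes are added:

```shell
#!/bin/sh
# Sketch (hypothetical helper): number of mpds for "mpdboot -n", i.e. the
# non-blank lines in the hosts file plus one for the local machine.
mpd_count() {
    hosts="${1:-mpd.hosts}"
    # grep -c counts lines containing at least one non-space character
    echo $(( $(grep -c '[^[:space:]]' "$hosts") + 1 ))
}

# Usage, with the local interface hostname used above:
#   mpdboot --ifhn=192.168.4.1 -n "$(mpd_count mpd.hosts)" -v -f mpd.hosts
```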

I could then put just the application network addresses I wanted to use in the machine file:

192.168.4.25
192.168.4.24

and run the application:

# mpirun -machinefile machines -n 4 ./xhpl
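Because the machine file simply repeats the ifhn= addresses from mpd.hosts, it can be derived from that file so the two lists stay in step. A minimal sketch using sed; machines_from_mpd_hosts is a hypothetical name, and it assumes each ifhn= value is followed by whitespace or the end of the line.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print the ifhn= address from each line of an
# mpd.hosts file, one per line; lines without ifhn= are skipped.
machines_from_mpd_hosts() {
    sed -n 's/.*ifhn=\([^[:space:]]*\).*/\1/p' "${1:-mpd.hosts}"
}

# Usage:
#   machines_from_mpd_hosts mpd.hosts > machines
#   mpirun -machinefile machines -n 4 ./xhpl
```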

I hadn't needed to do any of this previously. I'm left with the uneasy feeling that I haven't got to the bottom of why the problem arose, or whether what I've done is really the solution or just a workaround. If I had 1000 nodes, or 10,000 nodes, would Perceus expect me to add all those addresses to the /etc/hosts file? perceus-dnsmasq should handle that, shouldn't it?
