Thursday, 19 February 2009

Supersize me

One simple way to improve network performance is to enable Jumbo packets. Jumbo packets - or frames - are, as their name suggests, bigger than normal network packets. This means that more data can be transferred across the network at once: up to 9000 bytes instead of 1500 bytes. Because the CPU is not being interrupted as often for the same volume of data, CPU utilization falls while throughput rises. What's more, on Linux enabling Jumbo frames is just a matter of setting the MTU (Maximum Transmission Unit) size on the network device:

# ifconfig eth1 mtu 9200

To make this permanent, you just need to add MTU=9200 to the /etc/sysconfig/network-scripts/ifcfg-eth1 file. On Perceus this is done in the /etc/perceus/modules/ipaddr file. The only other thing you need is a switch that has Jumbo frame support enabled. That done, you can test everything is working by pinging with a Jumbo packet size:

ping -s 9000 -M do 192.168.4.16
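
(The -M do flag stops the kernel fragmenting the packet, so the ping only succeeds if the 9000 byte payload plus 28 bytes of ICMP and IP headers fits within the MTU - which it does with an MTU of 9200, but wouldn't with 9000.) For completeness, here is a sketch of the permanent setting in ifcfg-eth1 - the MTU line is the only addition, the rest of the file stays as it is:

DEVICE=eth1
MTU=9200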

Easy! So why haven't I done this before? I originally bought D-Link DGE-528T Gigabit cards for the cluster. These were cheap, but seemed to offer everything I needed - including Jumbo frames support. The maximum MTU was only 7200 bytes, not the more usual 9200 bytes, but I could ping the cards with packets of 7000 bytes. What I found, however, was that I could not run the HPL benchmark xhpl with Jumbo packets of any size enabled. I could start up the mpd ring, and run mpdtrace, but when I actually tried to run xhpl it crashed:

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x555d1c0, count=1, dtype=USER, dest=2, tag=2001, comm=0x84000001) failed

Setting the MTU back to 1500 again fixed the problem straight away. So is this a driver problem or an xhpl problem? I'm guessing it is a driver problem, but life's too short...

I made the decision to junk the D-Link DGE-528T cards and go with Intel PRO/1000 MT cards - another eBay triumph! This time there were no problems with xhpl crashing with Jumbo frames enabled. Interestingly, you enable Jumbo frames on the bonded device, bond0, not on the slave devices, and the bonding driver does the rest. Very cool.
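
In practice that just means applying the same MTU setting as before, but to the bond itself - a sketch, assuming the same 9200 byte MTU:

# ifconfig bond0 mtu 9200

or MTU=9200 in ifcfg-bond0 to make it stick; the bonding driver then pushes the setting down to eth1 and eth2.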

Output on a four node test rose from 24.08 to 24.86 Gflops (70.6% compEff). Running over all six nodes of the cluster produced 36.13 Gflops (68.4% compEff).

I'm going to leave it there, and not spend any more time tuning. It is time to put the cluster to work running distributed Erlang. What performance criteria are most relevant to this task is another question altogether...

I think we're bonding...

Having doubled the measured efficiency of my cluster by linking HPL to ACML, I looked around for something else that could take me closer to the 75% efficiency achieved by "real" clusters. One obvious area to look at is the "interconnect" - the network that connects the nodes together to actually make the cluster. Top clusters use Infiniband for the interconnect, but - even though prices continue to fall - Infiniband is still out of reach for just about all Beowulf clusters like mine. Whether Infiniband will ever be widely used for these kinds of clusters is open to debate: perhaps we'll all be using 10GE cards in a few years. For now though, we have to get the most from our 1 Gigabit network cards.

The highest rated Gigabit cluster on the Top 500 list comes in at number 78. Its computational efficiency is only 53.26%. The average computational efficiency of all Gigabit clusters in November 2008's Top500 list is even lower at just 50.52%. The most efficient manages 63.04%, the least efficient just 40.34% - and that is with 5096 cores! That is a lot of wasted clock cycles. Next to these figures, the 60.3% efficiency I achieved on my four node cluster doesn't look too bad. However, the efficiency was nearer 70% with two nodes, so as I add nodes it looks like overall efficiency will fall. Reason enough to try and maximize performance.
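
For anyone wondering where these percentages come from: computational efficiency in the Top500 sense is just measured output (Rmax) divided by the theoretical peak of the hardware (Rpeak). Working backwards from my own four node figures puts my theoretical peak at roughly 35.2 Gflops:

efficiency = Rmax / Rpeak
60.3% ≈ 21.24 Gflops / 35.2 Gflops

(The 35.2 Gflops is inferred from the measured numbers rather than calculated from processor specs.)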

One of the things I wanted to investigate from the very start was network card bonding, or - perhaps more strictly - network card aggregation. This is where you "bond" two (or more) network cards together to create a single logical network interface. In other words, two network cards share the work of one, thereby giving you an increase in network performance.

Linux support for network card bonding is definitive. Not only that, it is very easy to implement. (There is a good article here.) From a perceus cluster point of view, you just need to add the lines

alias bond0 bonding
options bond0 mode=802.3ad miimon=100

to the /etc/modprobe.conf on the vnfs image, and then edit the /etc/perceus/modules/ipaddr file. (This is used by the perceus 50-ipaddr.pl perl script to automatically generate the ifcfg* files in /etc/sysconfig/network-scripts/):

n00004 bond0(USERCTL=no&BOOTPROTO=none&ONBOOT=yes):192.168.4.14/255.255.255.0/192.168.4.1 eth0(HWADDR=00:18:37:07:FB:3A):[default]/[default] eth1(MASTER=bond0&SLAVE=yes): eth2(MASTER=bond0&SLAVE=yes):

This creates a bond0 interface from eth1 and eth2, the "slave" interfaces.
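
(mode=802.3ad selects LACP - dynamic link aggregation - and miimon=100 tells the driver to check link state every 100 milliseconds.) For anyone not running perceus, the end result is roughly equivalent to writing the ifcfg files by hand - a sketch of what they end up looking like, not the script's literal output:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.4.14
NETMASK=255.255.255.0
GATEWAY=192.168.4.1
ONBOOT=yes
BOOTPROTO=none
USERCTL=no

# /etc/sysconfig/network-scripts/ifcfg-eth1 (and the same again for eth2)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none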

Linux, then, supports bonding extremely well - but Linux is only one half of the connection. At the other end of the wire is the switch. I was using a Dell PowerConnect 2724 switch which advertises itself as supporting "Link Aggregation." Unfortunately "link aggregation" can mean different things to different people. To me it means "using multiple network cables/ports in parallel to increase the link speed beyond the limits of any one single cable or port..." But it can also refer to using bonded network devices for load balancing and fail-over protection. It is in this latter sense that the PowerConnect 2724 switch supports "aggregation." It doesn't support cabling devices in parallel to increase performance. For that, you need LACP (the Link Aggregation Control Protocol) which the Dell 27xx switches do not support; switches that do are twice the price. It was a case of getting on eBay, where I found a "pre-enjoyed" PowerConnect 5324 which does support LACP.
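
With the switch ports grouped into an LACP trunk, you can check from the Linux end that the 802.3ad negotiation actually happened:

cat /proc/net/bonding/bond0

The output should report the bonding mode as "IEEE 802.3ad Dynamic link aggregation" and show both slave interfaces up; if the switch isn't cooperating, this is where it shows.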

So was it worth it? On a two node test, output went from 12.22 Gflops to 12.93 Gflops; that doesn't sound a lot, but it means an increase in computational efficiency from 69.3% to 73.5%. On a four node test, output went from 21.24 Gflops to 24.08 Gflops; this is an increase in computational efficiency from 60.3% to 68.4%. So the benefit of aggregation is much greater on the four node cluster. This makes sense: aggregation flattens the performance curve by alleviating the interconnect bottleneck, which bites harder as nodes are added.

Network cards are cheap - at least relative to the cost of memory or processors. If you have a switch that supports LACP, link bonding definitely seems worthwhile.

Friday, 13 February 2009

VLANs to the rescue

I've overcome the two issues that caused me so much frustration. The problem of node network devices picking up different names (eth0, eth1) between boots was fixed by specifying the MAC addresses in the /etc/perceus/modules/ipaddr file:

eth0(HWADDR=00:18:37:07:05:A8):[default]/[default]
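
The generated ifcfg-eth0 then carries the MAC address, which is what ties the eth0 name to that particular card - roughly (the addressing lines are whatever [default] expands to):

DEVICE=eth0
HWADDR=00:18:37:07:05:A8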

My thanks go to Greg Kurtzer for helping me get this fixed.

The general network problems I was having with provisiond, amongst other things, were resolved by splitting my "management" network (with 192.168.3.x addresses) and "application" network (with 192.168.4.x addresses) over two VLANs. This was easy to do on the switch and really cleaned things up. Now the node will always boot from the network device I want it to boot from, and management traffic doesn't wander over the application network.
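
For the record, the layout now looks something like this (the VLAN numbers are only examples):

VLAN 10 - management:  192.168.3.x on eth0 - PXE boot and provisiond traffic
VLAN 20 - application: 192.168.4.x on bond0 (eth1 + eth2) - the MPI interconnect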