Thursday, 19 February 2009

I think we're bonding...

Having doubled the measured efficiency of my cluster by linking HPL to ACML, I looked around for something else that could take me closer to the 75% efficiency achieved by "real" clusters. One obvious area to look at is the "interconnect" - the network that connects the nodes together to actually make the cluster. Top clusters use Infiniband for the interconnect, but - even though prices continue to fall - Infiniband is still out of reach for just about all Beowulf clusters like mine. Whether Infiniband will ever be widely used for this kind of cluster is open to debate: perhaps we'll all be using 10GE cards in a few years. For now, though, we have to get the most from our 1 Gigabit network cards.

The highest rated Gigabit cluster on the Top500 list comes in at number 78, with a computational efficiency of only 53.26%. The average computational efficiency of all the Gigabit clusters in November 2008's Top500 list is even lower, at just 50.52%. The most efficient manages 63.04%; the least efficient just 40.34% - and that with 5096 cores, which is a lot of wasted clock cycles. Next to these figures, the 60.3% efficiency I achieved on my four node cluster doesn't look too bad. However, the efficiency was nearer 70% with two nodes, so it looks like overall efficiency will fall as I add nodes. Reason enough to try and maximize performance.

One of the things I wanted to investigate from the very start was network card bonding, or - perhaps more strictly - network card aggregation. This is where you "bond" two (or more) network cards together to create a single logical network interface. In other words, two network cards share the traffic that would otherwise go through one, thereby giving you an increase in network performance.

Linux support for network card bonding is excellent. Not only that, it is very easy to implement. (There is a good article here.) From a perceus cluster point of view, you just need to add the lines

alias bond0 bonding
options bond0 mode=802.3ad miimon=100

to /etc/modprobe.conf in the vnfs image, and then edit the /etc/perceus/modules/ipaddr file. (This file is used by the perceus 50-ipaddr.pl Perl script to automatically generate the ifcfg* files in /etc/sysconfig/network-scripts/):

n00004 bond0(USERCTL=no&BOOTPROTO=none&ONBOOT=yes):192.168.4.14/255.255.255.0/192.168.4.1 eth0(HWADDR=00:18:37:07:FB:3A):[default]/[default] eth1(MASTER=bond0&SLAVE=yes): eth2(MASTER=bond0&SLAVE=yes):

This creates a bond0 interface from eth1 and eth2, the "slave" interfaces.
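
For reference, the ifcfg* files that 50-ipaddr.pl should generate from that entry look something like the following - this is based on the standard Red Hat sysconfig conventions rather than the script's actual output, so treat it as a sketch:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.4.14
NETMASK=255.255.255.0
GATEWAY=192.168.4.1
BOOTPROTO=none
ONBOOT=yes
USERCTL=no

# /etc/sysconfig/network-scripts/ifcfg-eth1 (ifcfg-eth2 is identical apart from DEVICE)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

Once a node has booted, cat /proc/net/bonding/bond0 reports the bonding mode, the 802.3ad details and the state of each slave - the quickest way to confirm the bond has actually come up.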

Linux, then, supports bonding extremely well - but Linux is only one half of the connection. At the other end of the wire is the switch. I was using a Dell PowerConnect 2724 switch, which advertises itself as supporting "Link Aggregation." Unfortunately "link aggregation" can mean different things to different people. To me it means "using multiple network cables/ports in parallel to increase the link speed beyond the limits of any one single cable or port..." But it can also refer to using bonded network devices for load balancing and fail-over protection. It is in this latter sense that the PowerConnect 2724 supports "aggregation": it doesn't support cabling devices in parallel to increase performance. For that you need LACP (the Link Aggregation Control Protocol), which the Dell 27xx switches do not support; switches that do are twice the price. So it was a case of getting on eBay, where I found a "pre-enjoyed" PowerConnect 5324, which does support LACP.
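
Incidentally, the switch side of things on the 5324 is only a couple of commands from its Cisco-style CLI. Something along these lines should put two ports into an LACP channel group - the exact syntax may vary with firmware version, so check the PowerConnect manual rather than trusting my notes:

console# configure
console(config)# interface ethernet g1
console(config-if)# channel-group 1 mode auto
console(config-if)# exit
console(config)# interface ethernet g2
console(config-if)# channel-group 1 mode auto

"mode auto" tells the switch to negotiate the group using LACP; "mode on" would give you a static (non-LACP) trunk, which is not what the Linux 802.3ad bonding mode expects.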

So was it worth it? On a two node test, output went from 12.22 Gflops to 12.93 Gflops; that doesn't sound a lot, but it means an increase in computational efficiency from 69.3% to 73.5%. On a four node test, output went from 21.24 Gflops to 24.08 Gflops, an increase in computational efficiency from 60.3% to 68.4%. So the benefit of aggregation is much greater on the four node cluster. This makes sense: as nodes are added, the interconnect becomes more of a bottleneck, and by using aggregation we flatten out that part of the performance curve.
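
(For anyone checking the arithmetic: computational efficiency is just measured output divided by theoretical peak, so the four node figure is 24.08 Gflops against a theoretical peak of roughly 35.2 Gflops, or about 68.4%.)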

Network cards are cheap - at least relative to the cost of memory or processors. If you have a switch that supports LACP, link bonding definitely seems worthwhile.

1 comment:

Menacing said...

Very interesting... I would like to see your notes on configuring LACP on the 5324. I just acquired one myself and am going to configure LAGG on my FreeNAS.