Thursday, 19 February 2009

Supersize me

One simple way to improve network performance is to enable Jumbo packets. Jumbo packets - or frames - are, as their name suggests, bigger than normal network packets. This means that more data can be transferred across the network at once: up to 9000 bytes instead of 1500 bytes. Because the CPU is not interrupted as often for the same volume of data, CPU utilization goes down while throughput goes up. What's more, on Linux enabling Jumbo frames is just a matter of setting the MTU (Maximum Transmission Unit) size on the network device:

# ifconfig eth1 mtu 9200
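
A quick sanity check that the new MTU has taken effect is to look at the interface settings - ifconfig reports the MTU on its flags line:

# ifconfig eth1 | grep -i mtu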

To make this permanent, you just need to add MTU=9200 to the /etc/sysconfig/network-scripts/ifcfg-eth1 file. On Perceus this is done in the /etc/perceus/modules/ipaddr file. The only other thing you need is a switch that has Jumbo frame support enabled. That done, you can test that everything is working by pinging with a Jumbo packet size:

ping -s 9000 -M do 192.168.4.16
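
For reference, the ifcfg-eth1 file ends up looking something like this - the address lines are just placeholders for whatever the interface already uses, with MTU=9200 being the only addition:

DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.4.1
NETMASK=255.255.255.0
ONBOOT=yes
MTU=9200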

Easy! So why haven't I done this before? I originally bought D-Link DGE-528T Gigabit cards for the cluster. These were cheap, but seemed to offer everything I needed - including Jumbo frame support. The maximum MTU was only 7200 bytes, not the more usual 9200 bytes, but I could ping the cards with packets of 7000 bytes. What I found, however, was that I could not run the HPL benchmark xhpl with Jumbo frames of any size enabled. I could start up the mpd ring and run mpdtrace, but when I actually tried to run xhpl it crashed:

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x555d1c0, count=1, dtype=USER, dest=2, tag=2001, comm=0x84000001) failed

Setting the MTU back to 1500 fixed the problem straight away. So is this a driver problem or an xhpl problem? I'm guessing it is a driver problem, but life's too short...
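
Incidentally, a crude ping sweep is enough to find the largest unfragmented payload a card will pass - a rough sketch, using the same interface and target address as above, with the payload sizes picked purely for illustration:

for size in 1472 4000 7000 8972; do
    ping -c 1 -s $size -M do 192.168.4.16 > /dev/null && echo "$size bytes OK"
done

Of course, as the D-Link cards showed, a clean ping at a given size is no guarantee that MPI traffic will behave at that size.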

I decided to junk the D-Link DGE-528T cards and go with Intel PRO/1000 MT cards - another eBay triumph! This time xhpl ran without any problems with Jumbo frames enabled. Interestingly, you enable Jumbo frames on the bonded device, bond0, not on the slave devices, and the bonding driver does the rest. Very cool.
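
In practice that just means setting the MTU on bond0 itself, either on the fly or in its config file - a sketch, assuming the bonding setup is already in place:

# ifconfig bond0 mtu 9200

or, permanently, MTU=9200 in /etc/sysconfig/network-scripts/ifcfg-bond0; the bonding driver propagates the new MTU to the slave devices.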

Output on a four-node test rose from 24.08 to 24.86 Gflops (70.6% computational efficiency). Running over all six nodes of the cluster produced 36.13 Gflops (68.4% computational efficiency).

I'm going to leave it there and not spend any more time tuning. It is time to put the cluster to work running distributed Erlang. What performance criteria are most relevant to that task is another question altogether...
