Wednesday, 16 December 2009

F#, Erlang and GPUs


Christmas time in olde Rochester towne, and as is traditional there is a fair in the grounds of the castle. I like carousels, and so do my kids, but here’s the thing: you can’t just go up to the carousel and go for a ride. The first thing you have got to do is wait for the carousel to stop and wait for the other people to get off. Once you get on you still don’t go anywhere; the guy running the carousel doesn’t want to run it half empty, so you have to wait for it to fill up. Only once it’s filled up, or if no one else turns up, do you get to ride. Finally, everyone gets the same ride: the music plays, the cranks turn, the horses go up and down, the same for everyone.

At ForensiT we’re right at the start of planning a major new project. For the first time, we’re thinking seriously about F#. F# is now a fully featured .NET language and ships as part of Visual Studio 2010. But why even consider F# instead of, say, Erlang?

When I was a kid, a long time ago before the computer was personal, there used to be a lot of moaning about the Japanese. All they did was take the products that we had invented and make them cheaper... oh, and... er... better. From a purely financial point of view, it’s not such a bad business model. Microsoft have a reputation for doing much the same thing...

Here’s some F# code that Luca Bolognese demonstrates in his excellent PDC 2008 presentation:
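
(I’m paraphrasing from memory here - this is not the exact snippet, but the demo revolved around pipelines of this sort, the classic sum-of-squares example:)

let square x = x * x

let sumOfSquares nums =
    nums
    |> Seq.map square
    |> Seq.sum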


Here’s some Erlang code that does the same thing:
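
(Again a reconstruction rather than a verbatim copy - the same computation in Erlang:)

square(X) -> X * X.

sum_of_squares(Nums) ->
    lists:sum(lists:map(fun square/1, Nums)).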


Is it just me, or are there some similarities here? Except that this is where F# begins; it is not where it finishes. It is always vital to keep in mind that Microsoft has, to all intents and purposes, unlimited resources. If Microsoft are going to do a functional programming language then, eventually at least, they are going to come up with the most powerful functional programming language there is.

F# is a .NET language, which means that an F# programmer has complete access to all the .NET code that Microsoft and others have developed over the last decade. More importantly perhaps, an F# programmer will have access to all the future innovations in the .NET framework. What I’m thinking of here in particular is future support for multi-core processors and concurrency. The speed at which Microsoft can assimilate and support new hardware and software technologies is just way, way beyond what even well-supported projects like Erlang can manage. If you choose Erlang over F# you have to be aware that you are giving all this up.

One technology that is already pressing is the GPU and programming models like CUDA. There are already third-party .NET libraries for CUDA and OpenCL. Without doubt, native support will follow. For Erlang to support GPU processing would require the Erlang runtime to be rewritten. Can we really expect the Erlang developers to spend precious time and effort on a technology which may not stand the test of time?

Notwithstanding the general point about future innovations in the .NET framework, how desirable would GPU support in Erlang actually be? The beauty of Erlang is in its message passing: multiple processes (“agents” in F#) signal each other when they want something done, or when they have a response. Just as a thought experiment, let’s imagine a moderately sized Erlang application with many processes. Let’s also assume that the vast majority of these processes carry out the same task in response to the same message; in other words, they are different instances of the same code. The application runs and messages start flying around; realistically, the different processes receive messages at different times.

What happens when a process receives a message with some data that it has to process? Here’s the first point where GPU code might be useful. If the task was sufficiently complex, and sufficiently parallelizable, being able to call GPU code would be a big help. (I don’t really have a clear idea about what sort of task this would be, but the calculation of a Mandelbrot set is the kind of thing I have in mind.) You can of course call C functions from Erlang, so some CUDA interop is already possible, leaving aside the question of how efficient it would be.
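
For what it’s worth, the Erlang side of that kind of interop is not much code. Here is a minimal sketch using a port; it assumes a hypothetical external program, ./mandel_port, written in C/CUDA, which reads length-prefixed requests on stdin and writes a single result back on stdout - none of which exists, it is just to show the shape:

-module(gpu_port).
-export([compute/1]).

%% Hand a binary off to the (hypothetical) native program and wait for its reply.
compute(Data) when is_binary(Data) ->
    Port = open_port({spawn, "./mandel_port"},
                     [{packet, 2}, binary, exit_status]),
    port_command(Port, Data),
    receive
        {Port, {data, Result}} ->
            port_close(Port),
            {ok, Result};
        {Port, {exit_status, Status}} ->
            {error, {exited, Status}}
    after 5000 ->
        port_close(Port),
        {error, timeout}
    end.

Opening a port per call is obviously wasteful; a long-lived port owned by a single process would be the sensible version.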

But what if the task each process carries out is not complex? Could GPU code make the basic infrastructure of Erlang, the framework of message passing, more efficient? Let’s go back to our thought experiment. A message comes in with data that needs to be processed. GPUs work with batches of data; they are fundamentally SIMD (Single Instruction, Multiple Data) devices. To take advantage of the GPU, the process would need to hand the data off to some kind of scheduler and wait. The scheduler wouldn’t want to run the GPU half empty, so it would wait until either it had enough data to use the available GPU cores, or the waiting time was up. Once the ride was over, the GPU would then have to send a result back to each of the waiting processes.
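
Just to make the thought experiment concrete, here is a toy sketch of that scheduler in Erlang. Everything here is made up for illustration - the module, the message formats, and process_batch/1, which is simply a stand-in for wherever the batched GPU kernel launch would go:

-module(batcher).
-export([start/2, request/2]).

%% Start a scheduler that flushes a batch when it has BatchSize items,
%% or when MaxWait milliseconds pass with nothing new arriving.
start(BatchSize, MaxWait) ->
    spawn(fun() -> loop(BatchSize, MaxWait, []) end).

%% Called by a worker process: hand over the data, block until the result comes back.
request(Batcher, Data) ->
    Batcher ! {request, self(), Data},
    receive
        {result, Batcher, Result} -> Result
    end.

loop(BatchSize, MaxWait, Pending) ->
    receive
        {request, From, Data} ->
            Pending1 = [{From, Data} | Pending],
            case length(Pending1) >= BatchSize of
                true ->
                    flush(Pending1),
                    loop(BatchSize, MaxWait, []);
                false ->
                    loop(BatchSize, MaxWait, Pending1)
            end
    after MaxWait ->
        flush(Pending),
        loop(BatchSize, MaxWait, [])
    end.

flush([]) -> ok;
flush(Pending) ->
    {Callers, Inputs} = lists:unzip(lists:reverse(Pending)),
    Results = process_batch(Inputs),          %% the GPU ride happens here
    lists:foreach(fun({From, Result}) -> From ! {result, self(), Result} end,
                  lists:zip(Callers, Results)).

%% Stand-in for the real batched computation.
process_batch(Inputs) ->
    [Input || Input <- Inputs].

(A real scheduler would track a deadline rather than resetting the timer every time a message arrives, but the shape is the carousel again: get on, wait for it to fill up or for the operator to get bored, everyone rides together, everyone gets a result back.)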

The question, then, becomes whether it is faster to wait for the batching and unbatching of the GPU code, or faster just to wait for a time-slice on a CPU core. I certainly don’t have the answer. What we can say is that our thought experiment is very GPU-friendly. If you only have a small number of processes, or the processes are doing different things, a GPU is not going to give you any advantages.

The truth is that functional, message-passing languages and GPUs occupy two different parts of the parallel computing world. It is not as obvious as I first thought that they actually need each other.

Monday, 23 March 2009

Heads in the Clouds

I've been trying to understand Cloud Computing. No, let me rephrase that, I've been trying to understand what's different about Cloud Computing. Microsoft are busy prepping their Azure Services Platform in readiness for taking on Amazon Web Services, Sun, Google and the rest. Even in the worst economic climate since the 1930s, these companies are spending tens of millions of dollars on building massive data centres that will host "The Cloud." Why?

Microsoft have some Case Studies to help those of us having problems with the whole Cloud thing get it. Here's an extract:


Ok. But is there really anything new here? What happens if we replace the word "Cloud" with something more prosaic and old-fashioned, something like the word "Internet":


You would be forgiven for thinking that these two paragraphs mean exactly the same thing. But they can't do. "Cloud Computing" must mean something, right?

Well, I'm not going to labour the point. Yes it does mean something different: it means having your code run on someone else's hardware. It also means having your data stored on someone else's hardware. In the example above, Infosys don't have a SQL database running on an Internet server that they own, they have a SQL database running on an Internet server that Microsoft owns. Why would they do something like that? Because of the benefits that Microsoft and Amazon and everyone else claim for Cloud Computing: start-up costs are minimal - you don't have to buy a server infrastructure up front, just rent what you need and scale up (and up) as you need to; redundancy is massive - if the data centre hosting your application falls over, your application is instantly shifted to another data centre in another time zone, or in another country, or on another continent. As Microsoft says, it is about "having global-class credibility."

So the benefits are real. But that really is it. There's no new code, no new Internet. As Oracle's Larry Ellison said last year, "The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do." He went on to say, "The computer industry is the only industry that is more fashion-driven than women's fashion."

What Cloud Computing emphatically is not is "where thousands of computers cooperate through the Internet to compute a result. Google’s proprietary MapReduce* framework is the standard bearer for this..." This definition comes from a recent Intel primer on parallel programming. The author is either being idealistic or naive. Let's be generous: this is what we would like Cloud Computing to be - unlimited access to a hyper-computer where we only pay for the time we use. But it is not what is on offer. Yes, you can set up multiple virtual servers on AWS to form a cluster, but they might all be running on the same physical box! Never forgetting, of course, that the interconnect is the cluster.

Neither is Cloud Computing "where IT power is delivered over the Internet as you need it, rather than drawn from a desktop computer." At least, not yet. This is a more generalized and subtle, but just as idealistic, definition. Notice the word "desktop" - we're not talking about replacing web servers here, which is what Amazon and Microsoft are punting. In this definition the "standard bearer" is Google's Gmail. What is replaced is the mail program on your PC. But there is nothing remotely new in web applications like Gmail that requires the term "Cloud Computing." To have any force, this web applications paradigm would have to advance to the point where people were accessing not just their email, but the vast majority of their applications, from multiple consumer devices (like mobile phones, TVs and digital cameras) which didn't just supplement the PC, but actually subsumed it. This promise is implicit in Cloud Computing, but again it is not what's on offer - and just building data centres won't in itself make it happen.

This definition comes from The Guardian's article on GNU founder Richard Stallman's now famous attack on Cloud Computing. Stallman argued that Cloud Computing was "worse than stupidity" and "simply a trap aimed at forcing more people to buy into locked, proprietary systems that would cost them more and more over time." Crucially he argued that "One reason you should not use web applications to do your computing is that you lose control..." (My emphasis.) This issue of control has been heavily debated. (There is one example here.) I just want to comment about one aspect of control.

In their book Fire in the Valley, about the birth of the PC industry, Paul Freiberger and Michael Swaine talk about the feeling programmers, technicians and engineers had in the 1960s and 70s of being "locked out of the machine room." The Personal Computer changed all that; it was genuinely subversive, putting technology and computing power in the hands of anyone who wanted to use it. Those of us who build Beowulf clusters work in the same tradition. The danger of Cloud Computing is that once again the machine room door will be slammed in our faces.

Saturday, 21 March 2009

Two more nodes

I've added two additional nodes to my cluster. The new nodes have Asus rather than Abit motherboards and, unfortunately, CentOS 5.2 doesn't support the Nvidia nForce 630a chipset. So I was left with a choice of either trying to build the necessary drivers, or finding an alternative. In the end I decided to give CAOS Linux another try. There seems to be no link on the website anymore to download a VNFS image, but if you Google you can find the FTP site easily enough. I used version 1.0 rather than the RC1 version I had tried previously. It is still a "bare bones" distribution - it doesn't even come with vi - but it worked. I even managed to get Python installed from source using "make DESTDIR", which didn't happen before. There are a few quirks: I don't seem to be able to specify which eth device gets which MAC address, which means I can't use it on all the nodes, but for the new nodes it's fine.

I wasn't going to go on about performance results anymore - they are just a bit of fun after all. However, having added the two new nodes they had to be tested :-) The new nodes have Athlon 64 X2 5200+ (2700MHz) processors instead of 4200 (2200MHz) processors, and 4GB of RAM instead of 2GB. So a big increase in performance, right? Er... no. There is an increase in performance - from 13.3 to 14.25 Gflops on a two node test - but this modest increase in output is offset by a big drop in computational efficiency: down from 75.6% to 66%. What's going on?

On the original nodes, with 2GB of RAM, pretty much all I had to do to find the maximum performance was keep increasing the problem size (N) until xhpl ran out of memory and crashed. The bigger the problem size, the better the measured performance. With the 4GB nodes, the measured performance peaks before (way before) we run out of memory. What we see is an initial steep increase in performance as the problem size increases, leading to a peak, and then a slow decline.


It's as if the processors can't make effective use of all that extra memory. Or perhaps we've hit the limit of what the bonded Gigabit Ethernet interconnect can deliver.

Thursday, 19 February 2009

Supersize me

One simple way to improve network performance is to enable Jumbo packets. Jumbo packets - or frames - are, as their name suggests, bigger than normal network packets. This means that more data can be transferred across the network at once: up to 9000 bytes instead of 1500 bytes. Because the CPU is not being interrupted as often for the same volume of data, CPU overhead goes down and throughput goes up. What's more, on Linux enabling Jumbo frames is just a matter of setting the MTU (Maximum Transmission Unit) size on the network device:

#ifconfig eth1 mtu 9200

To make this permanent, you just need to add MTU=9200 to the /etc/sysconfig/network-scripts/ifcfg-eth1 file. On Perceus this is done in the /etc/perceus/modules/ipaddr file. The only other thing you need is a switch that has Jumbo frame support enabled. That done, you can test everything is working by pinging with a Jumbo packet size:

ping -s 9000 -M do 192.168.4.16
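
For completeness, the ifcfg-eth1 file ends up looking something like this - a minimal example, and the MTU line is the only one that matters here:

DEVICE=eth1
ONBOOT=yes
BOOTPROTO=none
MTU=9200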

Easy! So why haven't I done this before? I originally bought D-Link DGE-528T Gigabit cards for the cluster. These were cheap, but seemed to offer everything I needed - including Jumbo frames support. The maximum MTU was only 7200 bytes, not the more usual 9200 bytes, but I could ping the cards with packets of 7000 bytes. What I found, however, was that I could not run the HPL benchmark xhpl with Jumbo packets of any size enabled. I could start up the mpd ring, and run mpdtrace, but when I actually tried to run xhpl it crashed:

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=0x555d1c0, count=1, dtype=USER, dest=2, tag=2001, comm=0x84000001) failed

Setting the MTU back to 1500 again fixed the problem straight away. So is this a driver problem or an xhpl problem? I'm guessing it is a driver problem, but life's too short...

I made the decision to junk the D-Link DGE-528T cards and go with Intel PRO/1000 MT cards - another eBay triumph! This time there were no problems with xhpl crashing with Jumbo frames enabled. Interestingly, you enable Jumbo frames on the bonded device, bond0, not on the slave devices, and the bonding driver does the rest. Very cool.

Output on a four node test rose from 24.08 to 24.86 Gflops (70.6% compEff.) Running over all six nodes of the cluster produced 36.13 Gflops (68.4% compEff.)

I'm going to leave it there, and not spend any more time tuning. It is time to put the cluster to work running distributed Erlang. What performance criteria are most relevant to this task is another question altogether...

I think we're bonding...

Having doubled the measured efficiency of my cluster by linking HPL to ACML, I looked around for something else that could take me closer to the 75% efficiency achieved by "real" clusters. One obvious area to look at is the "interconnect" - the network that connects the nodes together to actually make the cluster. Top clusters use InfiniBand for the interconnect, but - even though prices continue to fall - InfiniBand is still out of reach for just about all Beowulf clusters like mine. Whether InfiniBand will ever be widely used for these kinds of clusters is open to debate: perhaps we'll all be using 10GbE cards in a few years. For now though, we have to get the most from our 1 Gigabit network cards.

The highest rated Gigabit cluster on the Top 500 list comes in at number 78. Its computational efficiency is only 53.26%. The average computational efficiency of all Gigabit clusters in November 2008's Top500 list is even lower at just 50.52%. The most efficient comes in at 63.04%, the least efficient at 40.34% - and that is with 5096 cores! That is a lot of wasted clock cycles. Next to these figures, the 60.3% efficiency I achieved on my four node cluster doesn't look too bad. However, the efficiency was nearer 70% with two nodes, so as I add nodes it looks like overall efficiency will fall. Reason enough to try and maximize performance.

One of the things I wanted to investigate from the very start was network card bonding, or - perhaps more strictly - network card aggregation. This is where you "bond" two (or more) network cards together to create a single logical network interface. In other words, two network cards share the work of one, thereby giving you an increase in network performance.

Linux support for network card bonding is excellent. Not only that, it is very easy to implement. (There is a good article here.) From a Perceus cluster point of view, you just need to add the lines

alias bond0 bonding
options bond0 mode=802.3ad miimon=100

to the /etc/modprobe.conf on the vnfs image, and then edit the /etc/perceus/modules/ipaddr file. (This is used by the Perceus 50-ipaddr.pl perl script to automatically generate the ifcfg* files in /etc/sysconfig/network-scripts/):

n00004 bond0(USERCTL=no&BOOTPROTO=none&ONBOOT=yes):192.168.4.14/255.255.255.0/192.168.4.1 eth0(HWADDR=00:18:37:07:FB:3A):[default]/[default] eth1(MASTER=bond0&SLAVE=yes): eth2(MASTER=bond0&SLAVE=yes):

This creates a bond0 interface from eth1 and eth2, the "slave" interfaces.
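
If you were setting this up by hand on a plain CentOS box, rather than letting Perceus generate it, the equivalent ifcfg files would look something like this (using the same addresses as the ipaddr line above):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
IPADDR=192.168.4.14
NETMASK=255.255.255.0

# /etc/sysconfig/network-scripts/ifcfg-eth1 (and the same for eth2)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes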

Linux, then, supports bonding extremely well - but Linux is only one half of the connection. At the other end of the wire is the switch. I was using a Dell PowerConnect 2724 switch which advertises itself as supporting "Link Aggregation." Unfortunately "link aggregation" can mean different things to different people. To me it means "using multiple network cables/ports in parallel to increase the link speed beyond the limits of any one single cable or port..." But it can also refer to using bonded network devices for load balancing and fail-over protection. It is in this latter sense that the PowerConnect 2724 supports "aggregation." It doesn't support cabling devices in parallel to increase performance. For that, you need LACP (the Link Aggregation Control Protocol), which the Dell 27xx switches do not support; switches that do are twice the price. It was a case of getting on eBay, where I found a "pre-enjoyed" PowerConnect 5324 which does support LACP.

So was it worth it? On a two node test, output went from 12.22 Gflops to 12.93 Gflops; that doesn't sound a lot, but it means an increase in computational efficiency from 69.3% to 73.5%. On a four node test, output went from 21.24 Gflops to 24.08 Gflops; this is an increase in computational efficiency from 60.3% to 68.4%. So the benefit of aggregation is much greater on the four node cluster. This makes sense: by using aggregation we are flattening the performance curve, alleviating the bottleneck effect of the interconnect as more nodes are added.

Network cards are cheap - at least relative to the cost of memory or processors. If you have a switch that supports LACP, link bonding definitely seems worthwhile.

Friday, 13 February 2009

VLANs to the rescue

I've overcome the two issues that caused me so much frustration. The problem of node network devices picking up different names (eth0, eth1) between boots was fixed by specifying the MAC addresses in the /etc/perceus/modules/ipaddr file:

eth0(HWADDR=00:18:37:07:05:A8):[default]/[default]

My thanks go to Greg Kurtzer for helping me get this fixed.

The general network problems I was having with provisiond, amongst other things, were resolved by splitting my "management" network (with 192.168.3.x addresses) and "application" network (with 192.168.4.x addresses) over two VLANs. This was easy to do on the switch and really cleaned things up. Now the node will always boot from the network device I want it to boot from, and management traffic doesn't wander over the application network.

Sunday, 25 January 2009

DNS Disaster!

I needed to make a change to the configuration of one of the network cards on the "host" machine. (I tend to use the word "host" rather than "head node" because I don't really think of it as a node - it's not where any applications will run. I suppose I could also use the more Perceus-like "master".) Because I come from a Windows background I still tend to look for a GUI rather than use the command line, so I ran system-config-network. Big, big mistake. Suddenly I can no longer ping any of the cluster nodes by name.

When I set up Perceus I'm sure I did no more than follow the instructions in the User Guide and add

nameserver 127.0.0.1

to the /etc/resolv.conf file on the host. (I also had to put the host address in the /etc/resolv.conf file on the nodes, of course: if your nodes boot slowly, you probably forgot.) system-config-network wiped my resolv.conf file, so I added the nameserver line back in and then ran /etc/init.d/perceus reload. No good. Looking in /var/log/messages (via gnome-system-log, of course!) showed this when I did the reload:

perceus-dnsmasq[5286]: ignoring nameserver 127.0.0.1 - local interface

But is this a problem? I read somewhere that it isn't; however, I've subsequently not seen this message when things have worked.

It struck me that I was assigning static IP addresses to my nodes in /etc/perceus/modules/ipaddr. Does perceus-dnsmasq pick these up, or do I need to add them to the /etc/hosts file? I didn't need to before. It also struck me that the messages log was filling up with entries like

perceus[5295]: ERROR: Unknown node just tried to contact us! (NodeID=00:1C:F0:6E:C8:53)

This was despite the fact that for a few minutes after booting there were no errors, and perceus node status showed the node as "ready" and regularly responding, only for it to stop responding later. It was also despite the fact that other nodes which showed up as "init" were not generating "Unknown node" errors.

At this point I was just a bit confused...

Time to get back to some certainties. The first thing to do was to get DNS working. I use two "networks" for the cluster. The "management" traffic is sent over a network card assigned a static 192.168.3.x address. The application traffic is sent over a second network card with a static 192.168.4.x address. How do you make sure the right card gets the right IP address? You run /sbin/ifconfig on the node to get the order of the network devices and edit /etc/perceus/modules/ipaddr accordingly.

Easy, right? Well no. I've found that which node network device gets which name, eth0, eth1, etc. can change between boots, or if the host has been rebooted, or if the vnfs image has been updated.

I edited the /etc/perceus/modules/ipaddr file so that the management card would get a "Default" address. Perceus first looks in the /etc/hosts file for the node name; if it doesn't find one it assigns a DHCP address. However, the DHCP address that is assigned is not the same as the address assigned on boot up! As a result, if I tried to ping the node, the node name was resolved to the boot-up address again. This cannot be right. There must be something wrong with perceus-dnsmasq - and if there isn't there should be.

So I added the node name to the /etc/hosts file with its static 192.168.3.x address. Finally, things started to work. At least, I was able to ping the nodes by name. However, perceus node status was still not being updated and I was still getting "unknown node" errors. I will leave that investigation to another post.
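
For what it's worth, the entries are just the usual hosts-file format; the 192.168.3.x addresses below are only illustrative, not necessarily the ones my nodes actually use:

192.168.3.24    n00004
192.168.3.25    n00005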

With the node names being resolved to their 192.168.3.x addresses, I needed to change the way I launched MPI applications. Essentially this is just a case of saying which interface hostname to use. So in the mpd.hosts file I added entries like this:

n00005 ifhn=192.168.4.25
n00004 ifhn=192.168.4.24

Then bring up the ring of mpds specifying the local interface hostname:

#mpdboot --ifhn=192.168.4.1 -n 3 -v -f mpd.hosts

I could then add the application network addresses I wanted to use to the machine file, just:

192.168.4.25
192.168.4.24

and run the application:

# mpirun -machinefile machines -n 4 ./xhpl

I hadn't needed to do any of this previously. I'm left with the uneasy feeling that I haven't got to the bottom of why the problem arose, or whether what I've done is really the solution or just a workaround. If I had 1,000 nodes, or 10,000 nodes, would Perceus expect me to add all those addresses to the /etc/hosts file? perceus-dnsmasq should handle that, shouldn't it?

Sunday, 18 January 2009

Installing Erlang

One of my key goals for my experimental cluster is to run (and program) distributed Erlang. The first task is simply to get Erlang installed on the nodes. Unfortunately, there is no CentOS rpm package for Erlang. This is a bit surprising: it means that there is no rpm package for RHEL either. It's not a big problem; we just need to install from source.

The first thing to do is download and unpack the source file. Do not do what I did and use File Roller, the GNOME archive manager. If you do you get a make error. Instead just follow the instructions in the readme file:

gunzip -c otp_src_R12B-5.tar.gz | tar xf -
zcat otp_src_R12B-5.tar.gz | tar xf -

To build Erlang I needed to install the ncurses and OpenSSL development libraries:

yum install ncurses-devel
yum install openssl-devel

That done, Erlang built without any problems. But that's only half the job. The nice thing about running the same OS on both the "host" machine (where the nodes are managed from) and on the nodes themselves, is that you can build software on the host and then just copy it to the nodes. The Perceus user guide states that you should be able to do something like this:

make DESTDIR=/mnt/centos-5.2-1.stateless.x86_64 install

I have to say, however, that I've never got this to work. Happily, Erlang's file structure is quite simple. The program files are in /usr/local/lib/erlang (by default) and there are a bunch of links in /usr/local/bin. Once Erlang is installed on the host you can copy the files to the mounted vnfs:

cp -r /usr/local/bin/* /mnt/centos-5.2-1.stateless.x86_64/usr/local/bin
cp -r /usr/local/lib/erlang /mnt/centos-5.2-1.stateless.x86_64/usr/local/lib

I rebooted a node, connected over ssh, and typed erl. Everything looks fine.
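
Since the whole point of this is distributed Erlang, the other quick sanity check is to make sure two nodes can see each other. The node names and cookie below are just examples. On the host:

erl -sname master -setcookie clustertest

and on n00004:

erl -sname worker -setcookie clustertest

Then, from the master's Erlang shell:

net_adm:ping(worker@n00004).

If the cookies match and the name resolves, you get pong back; otherwise you get pang and some head scratching.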

Monday, 5 January 2009

HPL Tuning

You might think that the way to test the relative performance of different computer systems would be to give them all the same problem to solve, and then see which system solved the problem fastest. That is not how HPL works. Instead, the linear equations that get solved by the Linpack Benchmark are a means to an end, not an end in themselves: they are really just there to give the cluster's CPUs something to chew on. The aim is simply to run the CPUs as fast and as efficiently as possible. To this end, there are a number of parameters you can tweak in the HPL.dat configuration file to optimize or tune the way HPL runs.

There are 31 lines in the HPL.dat file, but some are definitely more important than others. In my, admittedly limited, experience the three most crucial parameters are the "problem size" (N), the "block size" (NB), and the "process rows" and "process columns" (P and Q). Essentially, the "problem size" needs to fill up as much RAM as possible - without any of it being swapped out to disk, and bearing in mind that the Operating System needs RAM too. 80% of total memory seems a good starting point; this is recommended in the HPL FAQs and seems to be borne out in practice. There's a nice graph on the Microwulf website showing how performance increases with problem size.
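
A back-of-the-envelope way to pick a starting value for N (my own arithmetic, not anything from the HPL documentation): the N x N matrix is stored as 8-byte doubles, so N is roughly the square root of (0.8 x total RAM in bytes / 8). In Erlang, since that's what this cluster is for anyway:

-module(hpl_size).
-export([n/1, n/2]).

%% Rough starting point: an N x N matrix of 8-byte doubles filling ~80% of total RAM.
n(TotalRamBytes) ->
    trunc(math:sqrt(0.80 * TotalRamBytes / 8)).

%% The same, rounded down to a multiple of the block size NB.
n(TotalRamBytes, NB) ->
    (n(TotalRamBytes) div NB) * NB.

For four nodes with 2GB each that gives an N of roughly 29,000, which is in the same ballpark as the values that worked best for me below.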

The "block size" is best determined by testing, but I've found values between 160 and 256 to produce the best results, depending on the system. I have a feeling that a good block size is related to the size of the processor cache, but I've no evidence whatsoever...

The P and Q parameters specify the dimensions of the "process grid", and P x Q needs to match the number of MPI processes you start. Beyond that, I've found the best values by trial and error - I've no theory.

So I fired up my cluster full of hope and expectation, only to find that the results were truly terrible! Running four nodes, the best I could do was just 1.94 Gflops (with N=20000, NB=120, PxQ=2x4.) To put this in some sort of perspective, Microwulf - a four node cluster with 8GB of RAM - is capable of delivering 26.25 Gflops. Earlier in the year, I ran HPL on Fedora 8 on my Dell D630 laptop, and achieved what I thought was an amazing 9.54 Gflops*. My laptop does have 4GB of RAM, and the 2.20GHz Intel Mobile Core 2 Duo processor has a 4096 KB L2 cache, but that's nearly 10 Gflops on a laptop! I couldn't get 2 Gflops on the cluster... Clearly there was something badly wrong. But what?

To cut a very long story short, by experimenting with different values for NB, P and Q, and by carefully tuning N to the available RAM, I was able to increase the performance of 4 nodes running 8 processes to 10.43 Gflops. (WR00L2L4: N=27648, NB=192, P=1, Q=8.) This was a big improvement - and shows how important tuning HPL is - but it is still pretty poor.

You can measure the performance of a cluster by its Computational Efficiency. Computational Efficiency is simply the measured performance of the cluster (RMax) divided by its theoretical maximum peak performance (RPeak):

compEff = RMax/RPeak

RPeak can be estimated as the number of nodes in the cluster, multiplied by the number of cores per node, multiplied by the number of floating point operations each core can complete per clock, multiplied by the clock speed. In the case of my test cluster, RPeak = 4 x 2 x 2 x 2.2 = 35.2 Gflops. With the cluster only delivering 10.43 Gflops, compEff = 10.43/35.2 = 29.6%. This is not a good number :-( As a comparison, "real" clusters like those in the TOP500 list are around 75% efficient. Microwulf runs at an exceptional 82%. My D630 comes in at 108% :-o - presumably because the Core 2 can complete four floating point operations per clock, so the two-per-clock estimate understates its real peak. But then it isn't a cluster...

I needed a "game changer." My cluster is built using AMD Athlon™ 64 X2 processors. I replaced the GotoBLAS library with AMD Core Math Library (ACML) which is optimized for AMD processors. There is a really excellent article on using ACML with HPL here. The results were outstanding. Performance doubled to 21.24 Gflops. (WR00L2L2: N=27392, NB=192, P=2, Q=4.) Computational Efficiency doubled to 60.3%. All I had done was link HPL to a different library!
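
For anyone wanting to do the same, the change is confined to the linear algebra section of HPL's Make.<arch> file: point LAdir and LAlib at ACML instead of GotoBLAS. The path below is only an example - it depends on your ACML version and compiler build:

LAdir        = /opt/acml4.2.0/gfortran64
LAinc        =
LAlib        = $(LAdir)/lib/libacml.a

Depending on your compiler you may also need -lgfortran on the link line, since the gfortran64 build of libacml.a pulls in the gfortran runtime.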

The figures are still not brilliant, but at least they are approaching acceptability. Interestingly, Computational Efficiency is around 70% when running on any two nodes. There are still some things to try.

*WR00L2L2: N=18944, NB=256, P=1, Q=4. Time taken 475.22.