Monday, 5 January 2009

HPL Tuning

You might think that the way to test the relative performance of different computer systems would be to give them all the same problem to solve, and then see which system solved the problem fastest. That is not how HPL works. Instead, the linear equations that get solved by the Linpack Benchmark are a means to an end, not an end in themselves: they are really just there to give the cluster's CPUs something to chew on. The aim is simply to run the CPUs as fast and as efficiently as possible. To this end, there are a number of parameters you can tweak in the HPL.dat configuration file to optimize or tune the way HPL runs.

There are 31 lines in the HPL.dat file, but some are definitely more important than others. In my admittedly limited experience, the most crucial parameters are the "problem size" (N), the "block size" (NB), and the "process rows" and "process columns" (P and Q). Essentially, the problem size needs to fill up as much RAM as possible - without any of it being swapped out to disk, and bearing in mind that the operating system needs RAM too. 80% of total memory seems a good starting point; this is recommended in the HPL FAQs and seems to be borne out in practice. There's a nice graph on the Microwulf website showing how performance increases with problem size.
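The 80% rule can be turned into a problem size directly: HPL factors an N x N matrix of double-precision (8-byte) numbers, so N is roughly the square root of 0.8 x RAM / 8. The little helper below is just my own sketch of that arithmetic (the function name and the rounding-down to a multiple of NB are my assumptions, not anything HPL provides):

```python
import math

def problem_size(total_ram_bytes, fraction=0.80, nb=192):
    """Estimate the HPL problem size N so that the N x N matrix of
    8-byte doubles fills roughly `fraction` of total RAM, rounded
    down to a whole multiple of the block size NB."""
    n = math.sqrt(fraction * total_ram_bytes / 8)  # 8 bytes per double
    return int(n // nb) * nb

# Example: four nodes with 1 GiB each = 4 GiB of total RAM
print(problem_size(4 * 2**30))  # -> 20544
```

Rounding N down to a multiple of NB just keeps the matrix evenly divisible into blocks; HPL doesn't require it, but it is a common convention.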

The "block size" is best determined by testing, but I've found values between 160 and 256 to produce the best results, depending on the system. I have a feeling that a good block size is related to the size of the processor cache, but I've no evidence whatsoever...

The P and Q parameters specify the dimensions of the "process grid." Again, I've found the best values by trial and error - I've no theory.
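For reference, all four of these parameters sit together near the top of HPL.dat. An excerpt, with the layout and comment text of the stock file that ships with the benchmark, and the values from my best run, looks like this:

```
1            # of problems sizes (N)
27648        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
8            Qs
```

Each parameter can take a list of values (with the count on the line above it), so a single run can sweep several N, NB, or P x Q combinations.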

So I fired up my cluster full of hope and expectation, only to find that the results were truly terrible! Running four nodes, the best I could do was just 1.94 Gflops (with N=20000, NB=120, PxQ=2x4). To put this in some sort of perspective, Microwulf - a four-node cluster with 8 GB of RAM - is capable of delivering 26.25 Gflops. Earlier in the year, I ran HPL on Fedora 8 on my Dell D630 laptop, and achieved what I thought was an amazing 9.54 Gflops*. My laptop does have 4 GB of RAM, and the 2.20GHz Intel Mobile Core 2 Duo processor has a 4096 KB L2 cache, but that's nearly 10 Gflops on a laptop! I couldn't get 2 Gflops on the cluster... Clearly there was something badly wrong. But what?

To cut a very long story short, by experimenting with different values for NB and P and Q, and by carefully tuning N to the available RAM, I was able to increase performance to 10.43 Gflops with 4 nodes running 8 processes. (WR00L2L4: N=27648, NB=192, P=1, Q=8.) This was a big improvement - and shows how important tuning HPL is - but is still pretty poor.

You can measure the performance of a cluster by its Computational Efficiency. Computational Efficiency is simply the measured performance of the cluster (RMax) divided by its theoretical maximum peak performance (RPeak):

compEff = RMax/RPeak

RPeak can be estimated as the number of nodes in the cluster, multiplied by the number of cores per node, multiplied by the number of floating point units per core, multiplied by the clock speed. In the case of my test server, RPeak = 4 x 2 x 2 x 2.2 = 35.2 Gflops. With the cluster only delivering 10.43 Gflops, compEff = 10.43/35.2 = 29.6%. This is not a good number :-( As a comparison, "real" clusters like those in the TOP500 list are around 75% efficient. Microwulf runs at an exceptional 82%. My D630 comes in at 108% :-o But then it isn't a cluster...
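The arithmetic above is trivial, but it's handy to have it in one place when comparing runs. A small sketch (function names are my own, not from any HPL tool):

```python
def rpeak_gflops(nodes, cores_per_node, flops_per_cycle, ghz):
    """Theoretical peak: nodes x cores/node x FP ops/cycle x clock (GHz)."""
    return nodes * cores_per_node * flops_per_cycle * ghz

def comp_eff(rmax_gflops, rpeak_gflops):
    """Computational efficiency as a percentage: RMax / RPeak."""
    return 100.0 * rmax_gflops / rpeak_gflops

# The four-node Athlon 64 X2 cluster: 4 nodes, 2 cores, 2 FPUs, 2.2 GHz
rpeak = rpeak_gflops(4, 2, 2, 2.2)
print(rpeak)                              # -> 35.2
print(round(comp_eff(10.43, rpeak), 1))   # -> 29.6
```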

I needed a "game changer." My cluster is built using AMD Athlon™ 64 X2 processors. I replaced the GotoBLAS library with AMD Core Math Library (ACML) which is optimized for AMD processors. There is a really excellent article on using ACML with HPL here. The results were outstanding. Performance doubled to 21.24 Gflops. (WR00L2L2: N=27392, NB=192, P=2, Q=4.) Computational Efficiency doubled to 60.3%. All I had done was link HPL to a different library!
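For anyone wanting to try the same swap, the change is confined to the Make.&lt;arch&gt; file used to build HPL: point the linear-algebra section at ACML instead of GotoBLAS, then rebuild. Something like the following, where the install path and library name are illustrative and will depend on your ACML version and compiler:

```
# Linear-algebra library section of HPL's Make.<arch>
LAdir        = /opt/acml/gfortran64
LAinc        =
LAlib        = $(LAdir)/lib/libacml.a
```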

The figures are still not brilliant, but at least they are approaching acceptability. Interestingly, Computational Efficiency is around 70% when running on any two nodes. There are still some things to try.

*WR00L2L2: N=18944, NB=256, P=1, Q=4. Time taken 475.22.
