Performance of LU factorization algorithms
Figure 1: Performance of the different variants (unblocked and blocked) on an Intel Xeon (3.4GHz) processor, with a theoretical peak of 6.8 GFLOPS. For these experiments, the block size (
Questions to ask
Try it yourself
Run the above experiment on your machine:
- Install the libFLAME library.
- Download LU.tar.gz
- gunzip LU.tar.gz
- tar -xf LU.tar
- cd LU/FLAMEC
- Edit the 'makefile'. In particular, change the necessary paths and the "GFLOPS" inputs.
- make check
- View output.m with Matlab (or Octave, although with Octave the legend won't turn out quite right).
Try to optimize further
Try the following:
- Look at the graph and see which unblocked algorithm performs best. Change all blocked algorithms so that they call this best unblocked algorithm.
- In the code for blocked Variant 5, there are suggestions to call the libFLAME routine FLA_LU_nopiv instead of the unblocked routines. Try it!
- Change the algorithmic block size nb_alg by changing the appropriate parameter in the makefile.
- Try make nocheck to see whether disabling parameter checking improves performance.
- Link to a different BLAS library, e.g. Intel's MKL library or IBM's ESSL.
- Instead of calling the unblocked algorithm from the blocked algorithm, call the blocked algorithm, but with a smaller block size. Once the block size becomes small, call the unblocked algorithm.
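The last suggestion above, a blocked algorithm that calls itself with a smaller block size before falling back to the unblocked one, can be sketched in plain C. This is a minimal illustration and not the FLAMEC code: it assumes column-major storage, a right-looking algorithm without pivoting, and plain loops where a tuned implementation would call BLAS. The names lu_unb, lu_blk and NB_MIN, as well as the shrink factor of 4, are hypothetical choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: at or below this block size, use the unblocked code. */
#define NB_MIN 8

/* Element (i, j) of a column-major matrix a with leading dimension lda. */
#define A_(i, j) a[(size_t)(j) * (size_t)lda + (size_t)(i)]

/* Unblocked right-looking LU without pivoting; overwrites A with L\U
   (unit lower triangular L below the diagonal, U on and above it). */
void lu_unb(int n, double *a, int lda)
{
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++)
            A_(i, k) /= A_(k, k);                 /* column k of L */
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                A_(i, j) -= A_(i, k) * A_(k, j);  /* rank-1 update */
    }
}

/* Blocked right-looking LU without pivoting. The diagonal block is
   factored by a recursive call with block size nb / 4 (an assumption of
   this sketch); once nb <= NB_MIN the unblocked routine takes over. */
void lu_blk(int n, double *a, int lda, int nb)
{
    assert(n >= 0 && lda >= n && nb >= 1);
    if (nb <= NB_MIN) {
        lu_unb(n, a, lda);
        return;
    }
    for (int k = 0; k < n; k += nb) {
        int e = (k + nb < n) ? k + nb : n;  /* one past the diagonal block */

        /* A11 = L11 U11, using the smaller block size. */
        lu_blk(e - k, &A_(k, k), lda, nb / 4);

        /* A12 := inv(L11) A12 (forward substitution, unit diagonal). */
        for (int j = e; j < n; j++)
            for (int p = k; p < e; p++)
                for (int i = p + 1; i < e; i++)
                    A_(i, j) -= A_(i, p) * A_(p, j);

        /* A21 := A21 inv(U11) (triangular solve from the right). */
        for (int p = k; p < e; p++) {
            for (int q = k; q < p; q++)
                for (int i = e; i < n; i++)
                    A_(i, p) -= A_(i, q) * A_(q, p);
            for (int i = e; i < n; i++)
                A_(i, p) /= A_(p, p);
        }

        /* A22 := A22 - A21 A12 (where most of the flops are). */
        for (int j = e; j < n; j++)
            for (int p = k; p < e; p++)
                for (int i = e; i < n; i++)
                    A_(i, j) -= A_(i, p) * A_(p, j);
    }
}
```

Calling lu_blk(n, a, lda, 128) overwrites the n x n matrix with its factors, passing through block sizes 128, 32 and 8 on the way down. The cutoff and the shrink factor are the two knobs to experiment with.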
As you investigate the effects of these potential optimizations, revisit the Questions to ask.
Here is the best performance I achieved by playing around with the above options, on the same machine:
Figure 2: Performance of optimized variants (unblocked and blocked) on an Intel Xeon (3.4GHz) processor, with a theoretical peak of 6.8 GFLOPS. All implementations link to the GotoBLAS.
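The figure captions quote a theoretical peak of 6.8 GFLOPS for a 3.4GHz processor, which corresponds to 2 floating-point operations per cycle. As a rough sketch of how a measured time turns into a GFLOPS number on such a graph, the helpers below assume the usual leading-order count of 2n^3/3 floating-point operations for the LU factorization of an n x n matrix. The function names are hypothetical and are not taken from the FLAMEC driver, which may count operations differently.

```c
#include <assert.h>

/* Leading-order flop count for LU factorization of an n x n matrix
   (assumption of this sketch: 2/3 n^3, lower-order terms ignored). */
double lu_flops(int n)
{
    return 2.0 / 3.0 * (double)n * (double)n * (double)n;
}

/* Achieved rate in GFLOPS for a factorization that took `seconds`. */
double lu_gflops(int n, double seconds)
{
    assert(seconds > 0.0);
    return lu_flops(n) / seconds * 1e-9;
}

/* Theoretical peak in GFLOPS from the clock rate in GHz and the number
   of floating-point operations the core can complete per cycle. */
double peak_gflops(double ghz, double flops_per_cycle)
{
    return ghz * flops_per_cycle;
}
```

For example, peak_gflops(3.4, 2.0) reproduces the 6.8 GFLOPS in the captions, and a 3000 x 3000 factorization that takes 3 seconds comes out at roughly 6.0 GFLOPS, or about 88% of that peak.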

in the algorithms, 'nb_alg' in the FLAMEC code) was chosen to equal 128. The reference implementation computes the operation with simple indexed loops. All implementations link to the