Performance of LU factorization algorithms

Figure 1: Performance of the different variants (unblocked and blocked) on an Intel Xeon (3.4GHz) processor, with a theoretical peak of 6.8 GFLOPS. For these experiments, the block size ($b$ in the algorithms, 'nb_alg' in the FLAMEC code) was chosen to equal 128. The reference implementation computes the operation with simple indexed loops. All implementations link to the GotoBLAS 1.06.


Questions to ask

  • Why does the performance of the unblocked algorithms get worse as the problem size increases?
  • Why does the performance of blocked Variants improve as the problem size increases?
  • Why is the performance of the unblocked Variant 4 better than the other unblocked variants in Figure 2?

Try it yourself

Run the above experiment on your machine:

Try to optimize further

Try the following:

As you investigate the effects of these potential optimizations, revisit the Questions to ask.

Here is the best performance I achieved by playing around with the above options, on the same machine:

Figure 2: Performance of optimized variants (unblocked and blocked) on an Intel Xeon (3.4GHz) processor, with a theoretical peak of 6.8 GFLOPS. All implementations link to the GotoBLAS.


LinearAlgebraWiki: LU/Graphs (last edited 2007-10-17 00:49:41 by RobertVanDeGeijn)