Optimizing Matrix-Matrix Multiplication

This page leads one, step by step, through the optimization of matrix-matrix multiplication. For now, it is assumed that the reader has a Linux account on a Pentium4 based computer. We will use the gcc compiler as part of the exercise.

Set Up

Files

In the directory OptimizeGemm you will find the following files that you will use to systematically optimize the matrix-matrix multiplication operation:

These last routines allow one to use Matlab to plot the performance of two different implementations.

Which Compiler?

In this exercise, we use the gcc (Gnu C) compiler with optimization level -O2. (See the makefile.) This is neither the best compiler nor the best optimization level. The reason is that with this compiler and level of optimization, we have a certain level of control:

Initialization

Make sure that the makefile starts with the following lines:

This indicates that the performance of the version of matrix-matrix multiplication in MMult0.c is measured (by virtue of the statement OLD  :=0).

Next, to make sure that when plotting the graphs are properly scaled, set certain parameters in the file proc_parameters.m. See the comments in that file. (Setting these parameters will ensure that when plotting the y-axis ranges from 0 to the peak performance of the architecture.)

Execute

The performance graph will look something like

Notice that the two curves are right on top of each other because data for the same im plementation are being compared. From the fact that the top of the graph represents peak performance, it is obvious that this simple implementation achieves only a fraction of the ideal performance.

More performance graphs

Optimization 1

Change the first lines in the makefile to

This time the performance graph will look something like

Compare the two implementations in MMult0.c and MMult1.c :

The difference in performance can be attributed to the fact that the update to C( i,j ) is accumulated in a register, and the order of the loops indexed by i and j have been exchanged. (You may want to see what happens if you only make one of these changes, rather than both.)

It should be obvious now how setting OLD and NEW in the makefile allows one to compare different implementations.

Optimization 2-9: unrolling i

Optimization 2

Copy the contents of file MMult1.c into a file named MMult2.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that the loop has been unrolled do that four elements of C, all in the same column, are computed. This decreases the overhead for computing the loop index p.

Optimization 3

Copy the contents of file MMult2.c into a file named MMult3.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 4

Copy the contents of file MMult3.c into a file named MMult4.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 5

Copy the contents of file MMult4.c into a file named MMult5.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 6

Copy the contents of file MMult5.c into a file named MMult6.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 7

Copy the contents of file MMult6.c into a file named MMult7.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 8

Copy the contents of file MMult7.c into a file named MMult8.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimizing Optimization 8

By transposing the matrix, it turns out that a larger block size can be used. (The reason for this is that the block of A can be increased beyond the size of the L1 cache.) Try changing MB = 144 and KB = 144 in file MMult8.c and reexecute make run, make plot, and gv graph_7_vs_8.eps. On my machine this yields

Optimization 9

Copy the contents of file MMult8.c into a file named MMult9.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 10-16: unrolling j

Optimization 10

Copy the contents of file MMult2.c into a file named MMult10.c and change the contents:

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 11

Copy the contents of file MMult10.c into a file named MMult11.c and change the contents:

Compare also Optimization 3 (above) to this Optimization 11. (They are similar, except that the same optimizations are applied to Optization 2 and 10, respectively.)

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 12

Copy the contents of file MMult11.c into a file named MMult12.c and change the contents:

Compare also Optimization 4 (above) to this Optimization 12. (They are similar, except that the same optimizations are applied to Optization 3 and 11, respectively.)

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 13

Copy the contents of file MMult12.c into a file named MMult13.c and change the contents:

Compare also Optimization 5 (above) to this Optimization 13. (They are similar, except that the same optimizations are applied to Optization 4 and 12, respectively.)

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 14

Copy the contents of file MMult13.c into a file named MMult14.c and change the contents:

Compare also Optimization 7 (above) to this Optimization 14. (They are similar, except that the same optimizations are applied to Optization 4 and 13, respectively.)

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 15

Copy the contents of file MMult14.c into a file named MMult14.c and change the contents:

Compare also Optimization 8 (above) to this Optimization 15. (They are similar, except that the same optimizations are applied to Optization 7 and 14, respectively.)

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimizing Optimization 15

Again (as shown for Optimization 8), by transposing the matrix, it turns out that a larger block size can be used. (The reason for this is that the block of A can be increased beyond the size of the L1 cache.) Try changing MB = 144 and KB = 144 in file MMult15.c and reexecute make run, make plot, and gv graph_8_vs_15.eps. On my machine this yields

Optimization 16

Copy the contents of file MMult15.c into a file named MMult16.c and change the contents:

Compare also Optimization 9 (above) to this Optimization 16. (They are similar, except that the same optimizations are applied to Optization 8 and 15, respectively.)

Change the first lines in the makefile to

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

LinearAlgebraWiki: OptimizingGemm (last edited 2009-11-02 13:37:13 by RobertVanDeGeijn)