This page leads one, step by step, through the optimization of matrix-matrix multiplication. For now, it is assumed that the reader has a Linux account on a Pentium4 based computer. We will use the gcc compiler as part of the exercise.
Set Up
Download OptimizeGemm.tar.gz
Uncompress by executing "gunzip OptimizeGemm.tar.gz"
Expand the tar file by executing "tar OptimizeGemm.tar"
Change into the directory that is created by executing "cd OptimizeGemm"
Files
In the directory OptimizeGemm you will find the following files that you will use to systematically optimize the matrix-matrix multiplication operation:
makefile
The makefile that describes how to compile, link, and execute the driver/implementations.
Test driver
File that holds parameters that control what data is collected
driver routine that tests and times the different implementations. This routine executes a reference implementation and the current optimization to be timed. Parameters for this routine are initialized in parameters.h. In particular, in that file it is indicated how many times to repeat each experiment (problem size) and how each of the three dimensions (
are tied to the problem size being timed. Matrix multiplication implementations
Reference implementation used to check correctness
Version 0: simplest implementation
Optimization 1
Utility routines
Compares the contents of two matrices and returns the maximum absolute difference
Copies one matrix to another
Returns elapsed time in seconds
Generates a random matrix
Transposes the contents of one matrix into another matrix
Prints the contents of a matrix
Plotting the results
Plots graphs corresponding to the data in files output_old.m and output_new.m
File in which parameters about the architecture are given
These last routines allow one to use Matlab to plot the performance of two different implementations.
Which Compiler?
In this exercise, we use the gcc (Gnu C) compiler with optimization level -O2. (See the makefile.) This is neither the best compiler nor the best optimization level. The reason is that with this compiler and level of optimization, we have a certain level of control:
- Had we used the intel compiler, the "simple loops" in MMult0.c would probably have yielded the best performance: This compiler is very good at optimizing a triple loop.
- Had we used -O3 (optimization level 3), the gnu compiler would have more aggressively optimized, making the step-by-step optimizations we demonstrate much less predictable.
Initialization
Make sure that the makefile starts with the following lines:
OLD := 0 NEW := 0
This indicates that the performance of the version of matrix-matrix multiplication in MMult0.c is measured (by virtue of the statement OLD :=0).
Next, to make sure that when plotting the graphs are properly scaled, set certain parameters in the file proc_parameters.m. See the comments in that file. (Setting these parameters will ensure that when plotting the y-axis ranges from 0 to the peak performance of the architecture.)
Execute
make run
This will compile, link, and execute the test driver, linking to the implementation in MMult0.c. The performance data is saved in file output0.m.more output0.m
This will display the contents of the output file output0.m. It should look something likeMY_MMult = [ 20 4.953912e-01 0.000000e+00 40 6.276695e-01 0.000000e+00 60 6.324479e-01 0.000000e+00 80 6.295030e-01 0.000000e+00 [ lines deleted ] 360 3.791921e-01 0.000000e+00 380 3.228467e-01 0.000000e+00 400 2.622753e-01 0.000000e+00 ];
The first column equals the problem size. The second column the elapsed time (in seconds) when a matrix-matrix multiply with the indicated problem size (
is executed. The last column reports the maximum absolute difference encountered between the implementation in REF_MMult.c and MMult0.c. It should be close to 0.00000e+00 although as different optimizations are added the difference may not be perfectly zero. make plot
This will execute the commands in PlotAll using Matlab, creating a graph in the file graph_0_vs_0.eps.gv graph_0_vs_0.eps
This displays the graph.
The performance graph will look something like
Notice that the two curves are right on top of each other because data for the same im plementation are being compared. From the fact that the top of the graph represents peak performance, it is obvious that this simple implementation achieves only a fraction of the ideal performance.
Optimization 1
Change the first lines in the makefile to
OLD := 0 NEW := 1
make run
make plot
gv graph_0_vs_1.eps
This time the performance graph will look something like
Compare the two implementations in MMult0.c and MMult1.c :
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( i=0; i<m; i++ ){ for ( j=0; j<n; j++ ){ for ( p=0; p<k; p++ ){ C( i,j ) += A( i,p ) * B( p,j ); } } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i++ ){ register double c0=0.0; for ( p=0; p<k; p++ ){ c0 += A( i,p ) * B( p,j ); } C( i,j ) += c0; } } }
The difference in performance can be attributed to the fact that the update to C( i,j ) is accumulated in a register, and the order of the loops indexed by i and j have been exchanged. (You may want to see what happens if you only make one of these changes, rather than both.)
It should be obvious now how setting OLD and NEW in the makefile allows one to compare different implementations.
Optimization 2-9: unrolling i
Optimization 2
Copy the contents of file MMult1.c into a file named MMult2.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i++ ){ register double c0=0.0; for ( p=0; p<k; p++ ){ c0 += A( i,p ) * B( p,j ); } C( i,j ) += c0; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; for ( p=0; p<k; p++ ){ c0 += A( i,p ) * B( p,j ); c1 += A( i+1,p ) * B( p,j ); c2 += A( i+2,p ) * B( p,j ); c3 += A( i+3,p ) * B( p,j ); } C( i,j ) += c0; C( i+1,j ) += c1; C( i+2,j ) += c2; C( i+3,j ) += c3; } } }
Change the first lines in the makefile to
OLD := 1 NEW := 2
make run
make plot
gv graph_1_vs_2.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that the loop has been unrolled do that four elements of C, all in the same column, are computed. This decreases the overhead for computing the loop index p.
Optimization 3
Copy the contents of file MMult2.c into a file named MMult3.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; for ( p=0; p<k; p++ ){ c0 += A( i,p ) * B( p,j ); c1 += A( i+1,p ) * B( p,j ); c2 += A( i+2,p ) * B( p,j ); c3 += A( i+3,p ) * B( p,j ); } C( i,j ) += c0; C( i+1,j ) += c1; C( i+2,j ) += c2; C( i+3,j ) += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ c0 = c0 + ap[0] * *bp; c1 = c1 + ap[1] * *bp; c2 = c2 + ap[2] * *bp; c3 = c3 + ap[3] * *bp; bp++; ap += lda; } C( i,j ) += c0; C( i+1,j ) += c1; C( i+2,j ) += c2; C( i+3,j ) += c3; } } }
Change the first lines in the makefile to
OLD := 2 NEW := 3
make run
make plot
gv graph_2_vs_3.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 4
Copy the contents of file MMult3.c into a file named MMult4.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ c0 = c0 + ap[0] * *bp; c1 = c1 + ap[1] * *bp; c2 = c2 + ap[2] * *bp; c3 = c3 + ap[3] * *bp; bp++; ap += lda; } C( i,j ) += c0; C( i+1,j ) += c1; C( i+2,j ) += c2; C( i+3,j ) += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for cogmputing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + ap[0] * bpj; c1 = c1 + ap[1] * bpj; c2 = c2 + ap[2] * bpj; c3 = c3 + ap[3] * bpj; bp++; ap+=lda; } C( i,j ) += c0; C( i+1,j ) += c1; C( i+2,j ) += c2; C( i+3,j ) += c3; } } }
Change the first lines in the makefile to
OLD := 3 NEW := 4
make run
make plot
gv graph_3_vs_4.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 5
Copy the contents of file MMult4.c into a file named MMult5.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for cogmputing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + ap[0] * bpj; c1 = c1 + ap[1] * bpj; c2 = c2 + ap[2] * bpj; c3 = c3 + ap[3] * bpj; bp++; ap+=lda; } C( i,j ) += c0; C( i+1,j ) += c1; C( i+2,j ) += c2; C( i+3,j ) += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + ap[0] * bpj; c1 = c1 + ap[1] * bpj; c2 = c2 + ap[2] * bpj; c3 = c3 + ap[3] * bpj; bp++; ap+=lda; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }
Change the first lines in the makefile to
OLD := 4 NEW := 5
make run
make plot
gv graph_4_vs_5.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 6
Copy the contents of file MMult5.c into a file named MMult6.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + ap[0] * bpj; c1 = c1 + ap[1] * bpj; c2 = c2 + ap[2] * bpj; c3 = c3 + ap[3] * bpj; bp++; ap+=lda; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; register double r0, r1, r2; r0 = ap[ 0 ]; r1 = ap[ 1 ]; r2 = ap[ 2 ]; c0 = c0 + r0 * bpj; r0 = ap[ 3 ]; c1 = c1 + r1 * bpj; c2 = c2 + r2 * bpj; c3 = c3 + r0 * bpj; bp++; ap+=lda; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }
Change the first lines in the makefile to
OLD := 5 NEW := 6
make run
make plot
gv graph_5_vs_6.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 7
Copy the contents of file MMult6.c into a file named MMult7.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; register double r0, r1, r2; r0 = ap[ 0 ]; r1 = ap[ 1 ]; r2 = ap[ 2 ]; c0 = c0 + r0 * bpj; r0 = ap[ 3 ]; c1 = c1 + r1 * bpj; c2 = c2 + r2 * bpj; c3 = c3 + r0 * bpj; bp++; ap+=lda; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }/* Create macros so that the matrices are stored in column-major order */ #include "parameters.h" #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + ap[0] * bpj; c1 = c1 + ap[1] * bpj; c2 = c2 + ap[2] * bpj; c3 = c3 + ap[3] * bpj; bp++; ap+=lda; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }
Change the first lines in the makefile to
OLD := 6 NEW := 7
make run
make plot
gv graph_6_vs_7.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 8
Copy the contents of file MMult7.c into a file named MMult8.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #include "parameters.h" #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + ap[0] * bpj; c1 = c1 + ap[1] * bpj; c2 = c2 + ap[2] * bpj; c3 = c3 + ap[3] * bpj; bp++; ap+=lda; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }/* Create macros so that the matrices are stored in column-major order */ #include "parameters.h" #define A(i,j) a[ (j)*lda + (i) ] #define AT(i,j) at[ (j)*(k) + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; double at[ MB * KB ]; transpose_matrix( m, k, a, lda, at, k ); for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *a0p = &AT(0,i ), *a1p = &AT(0,i+1), *a2p = &AT(0,i+2), *a3p = &AT(0,i+3), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + *a0p++ * bpj; c1 = c1 + *a1p++ * bpj; c2 = c2 + *a2p++ * bpj; c3 = c3 + *a3p++ * bpj; bp++; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }
Change the first lines in the makefile to
OLD := 7 NEW := 8
make run
make plot
gv graph_7_vs_8.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimizing Optimization 8
By transposing the matrix, it turns out that a larger block size can be used. (The reason for this is that the block of A can be increased beyond the size of the L1 cache.) Try changing MB = 144 and KB = 144 in file MMult8.c and reexecute make run, make plot, and gv graph_7_vs_8.eps. On my machine this yields
Optimization 9
Copy the contents of file MMult8.c into a file named MMult9.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #include "parameters.h" #define A(i,j) a[ (j)*lda + (i) ] #define AT(i,j) at[ (j)*(k) + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; double at[ MB * KB ]; transpose_matrix( m, k, a, lda, at, k ); for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *a0p = &AT(0,i ), *a1p = &AT(0,i+1), *a2p = &AT(0,i+2), *a3p = &AT(0,i+3), *bp=&B(0,j); for ( p=0; p<k; p++ ){ register double bpj = *bp; c0 = c0 + *a0p++ * bpj; c1 = c1 + *a1p++ * bpj; c2 = c2 + *a2p++ * bpj; c3 = c3 + *a3p++ * bpj; bp++; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }/* Create macros so that the matrices are stored in column-major order */ #include "parameters.h" #define A(i,j) a[ (j)*lda + (i) ] #define AT(i,j) at[ (j)*(k) + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 144 #define KB 144 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; double at[ MB * KB ]; transpose_matrix( m, k, a, lda, at, k ); for ( j=0; j<n; j++ ){ double *cp = &C( 0, j ); for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *a0p = &AT(0,i), *a1p = &AT(0,i+1), *a2p = &AT(0,i+2), *a3p = &AT(0,i+3), *bp=&B(0,j); for ( p=0; p<k; p+=4 ){ register double bpj = bp[0]; c0 = c0 + a0p[0] * bpj; c1 = c1 + a1p[0] * bpj; c2 = c2 + a2p[0] * bpj; c3 = c3 + a3p[0] * bpj; bpj = bp[1]; c0 = c0 + a0p[1] * bpj; c1 = c1 + a1p[1] * bpj; c2 = c2 + a2p[1] * bpj; c3 = c3 + a3p[1] * bpj; bpj = bp[2]; c0 = c0 + a0p[2] * bpj; c1 = c1 + a1p[2] * bpj; c2 = c2 + a2p[2] * bpj; c3 = c3 + a3p[2] * bpj; bpj = bp[3]; c0 = c0 + a0p[3] * bpj; c1 = c1 + a1p[3] * bpj; c2 = c2 + a2p[3] * bpj; c3 = c3 + a3p[3] * bpj; bp+=4; a0p+=4; a1p+=4; a2p+=4; a3p+=4; } cp[ 0 ] += c0; cp[ 1 ] += c1; cp[ 2 ] += c2; cp[ 3 ] += c3; cp += 4; } } }
Change the first lines in the makefile to
OLD := 8 NEW := 9
make run
make plot
gv graph_8_vs_9.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 10-16: unrolling j
Optimization 10
Copy the contents of file MMult2.c into a file named MMult10.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j++ ){ for ( i=0; i<m; i+=4 ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; for ( p=0; p<k; p++ ){ c0 += A( i,p ) * B( p,j ); c1 += A( i+1,p ) * B( p,j ); c2 += A( i+2,p ) * B( p,j ); c3 += A( i+3,p ) * B( p,j ); } C( i,j ) += c0; C( i+1,j ) += c1; C( i+2,j ) += c2; C( i+3,j ) += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; for ( p=0; p<k; p++ ){ c0 += A( i,p ) * B( p,j ); c1 += A( i,p ) * B( p,j+1 ); c2 += A( i,p ) * B( p,j+2 ); c3 += A( i,p ) * B( p,j+3 ); } C( i,j ) += c0; C( i,j+1 ) += c1; C( i,j+2 ) += c2; C( i,j+3 ) += c3; } } }
Change the first lines in the makefile to
OLD := 2 NEW := 10
make run
make plot
gv graph_2_vs_10.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 11
Copy the contents of file MMult10.c into a file named MMult11.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; for ( p=0; p<k; p++ ){ c0 += A( i,p ) * B( p,j ); c1 += A( i,p ) * B( p,j+1 ); c2 += A( i,p ) * B( p,j+2 ); c3 += A( i,p ) * B( p,j+3 ); } C( i,j ) += c0; C( i,j+1 ) += c1; C( i,j+2 ) += c2; C( i,j+3 ) += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ c0 = c0 + *ap * *b0p++; c1 = c1 + *ap * *b1p++; c2 = c2 + *ap * *b2p++; c3 = c3 + *ap * *b3p++; ap += lda; } C( i,j ) += c0; C( i,j+1 ) += c1; C( i,j+2 ) += c2; C( i,j+3 ) += c3; } } }
Compare also Optimization 3 (above) to this Optimization 11. (They are similar, except that the same optimizations are applied to Optization 2 and 10, respectively.)
Change the first lines in the makefile to
OLD := 3 NEW := 11
make run
make plot
gv graph_3_vs_11.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 12
Copy the contents of file MMult11.c into a file named MMult12.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ c0 = c0 + *ap * *b0p++; c1 = c1 + *ap * *b1p++; c2 = c2 + *ap * *b2p++; c3 = c3 + *ap * *b3p++; ap += lda; } C( i,j ) += c0; C( i,j+1 ) += c1; C( i,j+2 ) += c2; C( i,j+3 ) += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap += lda; } C( i,j ) += c0; C( i,j+1 ) += c1; C( i,j+2 ) += c2; C( i,j+3 ) += c3; } } }
Compare also Optimization 4 (above) to this Optimization 12. (They are similar, except that the same optimizations are applied to Optization 3 and 11, respectively.)
Change the first lines in the makefile to
OLD := 4 NEW := 12
make run
make plot
gv graph_4_vs_12.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 13
Copy the contents of file MMult12.c into a file named MMult13.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap += lda; } C( i,j ) += c0; C( i,j+1 ) += c1; C( i,j+2 ) += c2; C( i,j+3 ) += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3); for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap += lda; } *c0p++ += c0; *c1p++ += c1; *c2p++ += c2; *c3p++ += c3; } } }
Compare also Optimization 5 (above) to this Optimization 13. (They are similar, except that the same optimizations are applied to Optization 4 and 12, respectively.)
Change the first lines in the makefile to
OLD := 5 NEW := 13
make run
make plot
gv graph_5_vs_13.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 14
Copy the contents of file MMult13.c into a file named MMult14.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3); for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap += lda; } *c0p++ += c0; *c1p++ += c1; *c2p++ += c2; *c3p++ += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3); for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap += lda; } *c0p++ += c0; *c1p++ += c1; *c2p++ += c2; *c3p++ += c3; } } }
Compare also Optimization 7 (above) to this Optimization 14. (They are similar, except that the same optimizations are applied to Optization 4 and 13, respectively.)
Change the first lines in the makefile to
OLD := 7 NEW := 14
make run
make plot
gv graph_7_vs_14.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimization 15
Copy the contents of file MMult14.c into a file named MMult14.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; for ( j=0; j<n; j+=4 ){ double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3); for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &A(i,0), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap += lda; } *c0p++ += c0; *c1p++ += c1; *c2p++ += c2; *c3p++ += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define AT(i,j) at[ (j)*(k) + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; double at[ MB * KB ]; transpose_matrix( m, k, a, lda, at, k ); for ( j=0; j<n; j+=4 ){ double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3); for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &AT(0,i), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap++; } *c0p++ += c0; *c1p++ += c1; *c2p++ += c2; *c3p++ += c3; } } }
Compare also Optimization 8 (above) to this Optimization 15. (They are similar, except that the same optimizations are applied to Optization 7 and 14, respectively.)
Change the first lines in the makefile to
OLD := 8 NEW := 15
make run
make plot
gv graph_8_vs_15.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
Optimizing Optimization 15
Again (as shown for Optimization 8), by transposing the matrix, it turns out that a larger block size can be used. (The reason for this is that the block of A can be increased beyond the size of the L1 cache.) Try changing MB = 144 and KB = 144 in file MMult15.c and reexecute make run, make plot, and gv graph_8_vs_15.eps. On my machine this yields
Optimization 16
Copy the contents of file MMult15.c into a file named MMult16.c and change the contents:
from
to
/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define AT(i,j) at[ (j)*(k) + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 48 #define KB 48 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; double at[ MB * KB ]; transpose_matrix( m, k, a, lda, at, k ); for ( j=0; j<n; j+=4 ){ double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3); for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &AT(0,i), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p++ ){ register double aip = *ap; c0 = c0 + aip * *b0p++; c1 = c1 + aip * *b1p++; c2 = c2 + aip * *b2p++; c3 = c3 + aip * *b3p++; ap++; } *c0p++ += c0; *c1p++ += c1; *c2p++ += c2; *c3p++ += c3; } } }/* Create macros so that the matrices are stored in column-major order */ #define A(i,j) a[ (j)*lda + (i) ] #define AT(i,j) at[ (j)*(k) + (i) ] #define B(i,j) b[ (j)*ldb + (i) ] #define C(i,j) c[ (j)*ldc + (i) ] #define min( x, y ) ( x < y ? x : y ) #define MB 144 #define KB 144 /* MY_MMult_inner prototype */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ); /* Routine for computing C = A * B + C */ void MY_MMult( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, p, ib, pb; for ( p=0; p<k; p+=KB ){ pb = min( k-p, KB ); for ( i=0; i<m; i+=MB ){ ib = min( m-i, MB ); MY_MMult_inner( ib, n, pb, &A( i, p ), lda, &B( p, 0 ), ldb, &C( i, 0 ), ldc ); } } } /* Inner kernel for computing C = A * B + C */ void MY_MMult_inner( int m, int n, int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) { int i, j, p; double at[ MB * KB ]; transpose_matrix( m, k, a, lda, at, k ); for ( j=0; j<n; j+=4 ){ double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3); for ( i=0; i<m; i++ ){ register double c0=0.0, c1=0.0, c2=0.0, c3=0.0; double *ap = &AT(0,i), *b0p=&B(0,j ), *b1p=&B(0,j+1), *b2p=&B(0,j+2), *b3p=&B(0,j+3); for ( p=0; p<k; p+=4 ){ register double aip; aip = *ap++; c0 = c0 + aip * b0p[0]; c1 = c1 + aip * b1p[0]; c2 = c2 + aip * b2p[0]; c3 = c3 + aip * b3p[0]; aip = *ap++; c0 = c0 + aip * b0p[1]; c1 = c1 + aip * b1p[1]; c2 = c2 + aip * b2p[1]; c3 = c3 + aip * b3p[1]; aip = *ap++; c0 = c0 + aip * b0p[2]; c1 = c1 + aip * b1p[2]; c2 = c2 + aip * b2p[2]; c3 = c3 + aip * b3p[2]; aip = *ap++; c0 = c0 + aip * b0p[3]; c1 = c1 + aip * b1p[3]; c2 = c2 + aip * b2p[3]; c3 = c3 + aip * b3p[3]; b0p+=4; b1p+=4; b2p+=4; b3p+=4; } *c0p++ += c0; *c1p++ += c1; *c2p++ += c2; *c3p++ += c3; } } }
Compare also Optimization 9 (above) to this Optimization 16. (They are similar, except that the same optimizations are applied to Optization 8 and 15, respectively.)
Change the first lines in the makefile to
OLD := 9 NEW := 16
make run
make plot
gv graph_9_vs_16.eps
This time the performance graph will look something like
The difference in performance can be attributed to the fact that ...
