Differences between revisions 46 and 47

Deletions are marked like this. Additions are marked like this.
Line 424: Line 424:
Copy the contents of file `MMult13.c` into a file named `MMult14.c` and change the contents: Copy the contents of file `MMult14.c` into a file named `MMult14.c` and change the contents:

Optimizing Matrix-Matrix Multiplication

This page leads one, step by step, through the optimization of matrix-matrix multiplication. For now, it is assumed that the reader has a Linux account on a Pentium4 based computer. We will use the gcc compiler as part of the exercise.

Set Up

  • Download OptimizeGemm.tar.gz

  • Uncompress by executing "gunzip OptimizeGemm.tar.gz"

  • Expand the tar file by executing "tar OptimizeGemm.tar"

  • Change into the directory that is created by executing "cd OptimizeGemm"

Files

In the directory OptimizeGemm you will find the following files that you will use to systematically optimize the matrix-matrix multiplication operation:

  • makefile

    makefile

    The makefile that describes how to compile, link, and execute the driver/implementations.

    Test driver

    parameters.h

    File that holds parameters that control what data is collected

    test_MMult.c

    driver routine that tests and times the different implementations. This routine executes a reference implementation and the current optimization to be timed. Parameters for this routine are initialized in parameters.h. In particular, in that file it is indicated how many times to repeat each experiment (problem size) and how each of the three dimensions ($ m $, $ n $, and $ k $ are tied to the problem size being timed.

    Matrix multiplication implementations

    REF_MMult.c

    Reference implementation used to check correctness

    MMult0.c

    Version 0: simplest implementation

    MMult1.c

    Optimization 1

    Utility routines

    compare_matrices.c

    Compares the contents of two matrices and returns the maximum absolute difference

    copy_matrix.c

    Copies one matrix to another

    dclock.c

    Returns elapsed time in seconds

    random_matrix.c

    Generates a random matrix

    transpose_matrix.c

    Transposes the contents of one matrix into another matrix

    print_matrix.c

    Prints the contents of a matrix

    Plotting the results

    PlotAll.m

    Plots graphs corresponding to the data in files output_old.m and output_new.m

    proc_parameters.m

    File in which parameters about the architecture are given

These last routines allow one to use Matlab to plot the performance of two different implementations.

Which Compiler?

In this exercise, we use the gcc (Gnu C) compiler with optimization level -O2. (See the makefile.) This is neither the best compiler nor the best optimization level. The reason is that with this compiler and level of optimization, we have a certain level of control:

  • Had we used the intel compiler, the "simple loops" in MMult0.c would probably have yielded the best performance: This compiler is very good at optimizing a triple loop.
  • Had we used -O3 (optimization level 3), the gnu compiler would have more aggressively optimized, making the step-by-step optimizations we demonstrate much less predictable.

Initialization

Make sure that the makefile starts with the following lines:

  • OLD  := 0
    NEW  := 0

This indicates that the performance of the version of matrix-matrix multiplication in MMult0.c is measured (by virtue of the statement OLD  :=0).

Next, to make sure that when plotting the graphs are properly scaled, set certain parameters in the file proc_parameters.m. See the comments in that file. (Setting these parameters will ensure that when plotting the y-axis ranges from 0 to the peak performance of the architecture.)

Execute

  • make run
    This will compile, link, and execute the test driver, linking to the implementation in MMult0.c. The performance data is saved in file output0.m.

  • more output0.m
    This will display the contents of the output file output0.m. It should look something like

    • MY_MMult = [
      20 4.953912e-01 0.000000e+00 
      40 6.276695e-01 0.000000e+00 
      60 6.324479e-01 0.000000e+00 
      80 6.295030e-01 0.000000e+00 
        [ lines deleted ]
      360 3.791921e-01 0.000000e+00 
      380 3.228467e-01 0.000000e+00 
      400 2.622753e-01 0.000000e+00 
      ];

    The first column equals the problem size. The second column the elapsed time (in seconds) when a matrix-matrix multiply with the indicated problem size ($m=n=k$ is executed. The last column reports the maximum absolute difference encountered between the implementation in REF_MMult.c and MMult0.c. It should be close to 0.00000e+00 although as different optimizations are added the difference may not be perfectly zero.

  • make plot
    This will execute the commands in PlotAll using Matlab, creating a graph in the file graph_0_vs_0.eps.

  • gv graph_0_vs_0.eps
    This displays the graph.

The performance graph will look something like

Notice that the two curves are right on top of each other because data for the same im plementation are being compared. From the fact that the top of the graph represents peak performance, it is obvious that this simple implementation achieves only a fraction of the ideal performance.

More performance graphs

Optimization 1

Change the first lines in the makefile to

  • OLD  := 0
    NEW  := 1
  • make run

  • make plot

  • gv graph_0_vs_1.eps

This time the performance graph will look something like

Compare the two implementations in MMult0.c and MMult1.c :

  • /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( i=0; i<m; i++ ){
        for ( j=0; j<n; j++ ){
          for ( p=0; p<k; p++ ){
            C( i,j ) += A( i,p ) * B( p,j );
          }
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0;
    
          for ( p=0; p<k; p++ ){
            c0 += A( i,p ) * B( p,j );
          }
          
          C( i,j ) += c0;
        }
      }
    }
    

The difference in performance can be attributed to the fact that the update to C( i,j ) is accumulated in a register, and the order of the loops indexed by i and j have been exchanged. (You may want to see what happens if you only make one of these changes, rather than both.)

It should be obvious now how setting OLD and NEW in the makefile allows one to compare different implementations.

Optimization 2-9: unrolling i

Optimization 2

Copy the contents of file MMult1.c into a file named MMult2.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0;
    
          for ( p=0; p<k; p++ ){
            c0 += A( i,p ) * B( p,j );
          }
          
          C( i,j ) += c0;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
    
          for ( p=0; p<k; p++ ){
            c0 += A( i,p ) * B( p,j );
            c1 += A( i+1,p ) * B( p,j );
            c2 += A( i+2,p ) * B( p,j );
            c3 += A( i+3,p ) * B( p,j );
          }
          
          C( i,j )   += c0;
          C( i+1,j ) += c1;
          C( i+2,j ) += c2;
          C( i+3,j ) += c3;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 1
    NEW  := 2
  • make run

  • make plot

  • gv graph_1_vs_2.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that the loop has been unrolled do that four elements of C, all in the same column, are computed. This decreases the overhead for computing the loop index p.

Optimization 3

Copy the contents of file MMult2.c into a file named MMult3.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
    
          for ( p=0; p<k; p++ ){
            c0 += A( i,p ) * B( p,j );
            c1 += A( i+1,p ) * B( p,j );
            c2 += A( i+2,p ) * B( p,j );
            c3 += A( i+3,p ) * B( p,j );
          }
          
          C( i,j )   += c0;
          C( i+1,j ) += c1;
          C( i+2,j ) += c2;
          C( i+3,j ) += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            c0 = c0 + ap[0] * *bp;
            c1 = c1 + ap[1] * *bp;
            c2 = c2 + ap[2] * *bp;
            c3 = c3 + ap[3] * *bp;
    
            bp++;
            ap += lda;
          }
          
          C( i,j )   += c0;
          C( i+1,j ) += c1;
          C( i+2,j ) += c2;
          C( i+3,j ) += c3;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 2
    NEW  := 3
  • make run

  • make plot

  • gv graph_2_vs_3.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 4

Copy the contents of file MMult3.c into a file named MMult4.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            c0 = c0 + ap[0] * *bp;
            c1 = c1 + ap[1] * *bp;
            c2 = c2 + ap[2] * *bp;
            c3 = c3 + ap[3] * *bp;
    
            bp++;
            ap += lda;
          }
          
          C( i,j )   += c0;
          C( i+1,j ) += c1;
          C( i+2,j ) += c2;
          C( i+3,j ) += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for cogmputing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 + ap[0] * bpj;
            c1 = c1 + ap[1] * bpj;
            c2 = c2 + ap[2] * bpj;
            c3 = c3 + ap[3] * bpj;
    
            bp++;
            ap+=lda;
          }
          
          C( i,j )   += c0;
          C( i+1,j ) += c1;
          C( i+2,j ) += c2;
          C( i+3,j ) += c3;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 3
    NEW  := 4
  • make run

  • make plot

  • gv graph_3_vs_4.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 5

Copy the contents of file MMult4.c into a file named MMult5.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for cogmputing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 + ap[0] * bpj;
            c1 = c1 + ap[1] * bpj;
            c2 = c2 + ap[2] * bpj;
            c3 = c3 + ap[3] * bpj;
    
            bp++;
            ap+=lda;
          }
          
          C( i,j )   += c0;
          C( i+1,j ) += c1;
          C( i+2,j ) += c2;
          C( i+3,j ) += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 +  ap[0] * bpj;
            c1 = c1 +  ap[1] * bpj;
            c2 = c2 +  ap[2] * bpj;
            c3 = c3 +  ap[3] * bpj;
    
            bp++;
            ap+=lda;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 4
    NEW  := 5
  • make run

  • make plot

  • gv graph_4_vs_5.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 6

Copy the contents of file MMult5.c into a file named MMult6.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 +  ap[0] * bpj;
            c1 = c1 +  ap[1] * bpj;
            c2 = c2 +  ap[2] * bpj;
            c3 = c3 +  ap[3] * bpj;
    
            bp++;
            ap+=lda;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
            register double r0, r1, r2;
    
            r0 = ap[ 0 ];
            r1 = ap[ 1 ];
            r2 = ap[ 2 ];
            c0 = c0 +  r0 * bpj;
    
            r0 = ap[ 3 ];
            c1 = c1 +  r1 * bpj;
            c2 = c2 +  r2 * bpj;
            c3 = c3 +  r0 * bpj;
    
            bp++;
            ap+=lda;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 5
    NEW  := 6
  • make run

  • make plot

  • gv graph_5_vs_6.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 7

Copy the contents of file MMult6.c into a file named MMult7.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
            register double r0, r1, r2;
    
            r0 = ap[ 0 ];
            r1 = ap[ 1 ];
            r2 = ap[ 2 ];
            c0 = c0 +  r0 * bpj;
    
            r0 = ap[ 3 ];
            c1 = c1 +  r1 * bpj;
            c2 = c2 +  r2 * bpj;
            c3 = c3 +  r0 * bpj;
    
            bp++;
            ap+=lda;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #include "parameters.h"
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
                                                                            
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 + ap[0] * bpj;
            c1 = c1 + ap[1] * bpj;
            c2 = c2 + ap[2] * bpj;
            c3 = c3 + ap[3] * bpj;
    
            bp++;
            ap+=lda;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 6
    NEW  := 7
  • make run

  • make plot

  • gv graph_6_vs_7.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 8

Copy the contents of file MMult7.c into a file named MMult8.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #include "parameters.h"
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
                                                                            
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 + ap[0] * bpj;
            c1 = c1 + ap[1] * bpj;
            c2 = c2 + ap[2] * bpj;
            c3 = c3 + ap[3] * bpj;
    
            bp++;
            ap+=lda;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #include "parameters.h"
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define AT(i,j) at[ (j)*(k) + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
                                                                            
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
      double at[ MB * KB ];
    
      transpose_matrix( m, k, a, lda, at, k );
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *a0p = &AT(0,i  ), *a1p = &AT(0,i+1), 
                 *a2p = &AT(0,i+2), *a3p = &AT(0,i+3), 
                 *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 +  *a0p++ * bpj;
            c1 = c1 +  *a1p++ * bpj;
            c2 = c2 +  *a2p++ * bpj;
            c3 = c3 +  *a3p++ * bpj;
    
            bp++;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 7
    NEW  := 8
  • make run

  • make plot

  • gv graph_7_vs_8.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimizing Optimization 8

By transposing the matrix, it turns out that a larger block size can be used. (The reason for this is that the block of A can be increased beyond the size of the L1 cache.) Try changing MB = 144 and KB = 144 in file MMult8.c and reexecute make run, make plot, and gv graph_7_vs_8.eps. On my machine this yields

Optimization 9

Copy the contents of file MMult8.c into a file named MMult9.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #include "parameters.h"
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define AT(i,j) at[ (j)*(k) + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
                                                                            
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
      double at[ MB * KB ];
    
      transpose_matrix( m, k, a, lda, at, k );
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *a0p = &AT(0,i  ), *a1p = &AT(0,i+1), 
                 *a2p = &AT(0,i+2), *a3p = &AT(0,i+3), 
                 *bp=&B(0,j);
    
          for ( p=0; p<k; p++ ){
            register double bpj = *bp;
    
            c0 = c0 +  *a0p++ * bpj;
            c1 = c1 +  *a1p++ * bpj;
            c2 = c2 +  *a2p++ * bpj;
            c3 = c3 +  *a3p++ * bpj;
    
            bp++;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #include "parameters.h"
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define AT(i,j) at[ (j)*(k) + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 144
    #define KB 144
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
                                                                            
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
      double at[ MB * KB ];
    
      transpose_matrix( m, k, a, lda, at, k );
    
      for ( j=0; j<n; j++ ){
        double *cp = &C( 0, j );
    
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *a0p = &AT(0,i), *a1p = &AT(0,i+1), 
                 *a2p = &AT(0,i+2), *a3p = &AT(0,i+3), 
                 *bp=&B(0,j);
    
          for ( p=0; p<k; p+=4 ){
            register double bpj = bp[0];
    
            c0 = c0 +  a0p[0] * bpj;
            c1 = c1 +  a1p[0] * bpj;
            c2 = c2 +  a2p[0] * bpj;
            c3 = c3 +  a3p[0] * bpj;
    
            bpj = bp[1];
    
            c0 = c0 +  a0p[1] * bpj;
            c1 = c1 +  a1p[1] * bpj;
            c2 = c2 +  a2p[1] * bpj;
            c3 = c3 +  a3p[1] * bpj;
    
            bpj = bp[2];
    
            c0 = c0 +  a0p[2] * bpj;
            c1 = c1 +  a1p[2] * bpj;
            c2 = c2 +  a2p[2] * bpj;
            c3 = c3 +  a3p[2] * bpj;
    
            bpj = bp[3];
    
            c0 = c0 +  a0p[3] * bpj;
            c1 = c1 +  a1p[3] * bpj;
            c2 = c2 +  a2p[3] * bpj;
            c3 = c3 +  a3p[3] * bpj;
    
            bp+=4;
    
            a0p+=4;
            a1p+=4;
            a2p+=4;
            a3p+=4;
          }
          
          cp[ 0 ] += c0;
          cp[ 1 ] += c1;
          cp[ 2 ] += c2;
          cp[ 3 ] += c3;
          
          cp += 4;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 8
    NEW  := 9
  • make run

  • make plot

  • gv graph_8_vs_9.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 10-16: unrolling j

Optimization 10

Copy the contents of file MMult2.c into a file named MMult10.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j++ ){
        for ( i=0; i<m; i+=4 ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
    
          for ( p=0; p<k; p++ ){
            c0 += A( i,p ) * B( p,j );
            c1 += A( i+1,p ) * B( p,j );
            c2 += A( i+2,p ) * B( p,j );
            c3 += A( i+3,p ) * B( p,j );
          }
          
          C( i,j )   += c0;
          C( i+1,j ) += c1;
          C( i+2,j ) += c2;
          C( i+3,j ) += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
    
          for ( p=0; p<k; p++ ){
            c0 += A( i,p ) * B( p,j );
            c1 += A( i,p ) * B( p,j+1 );
            c2 += A( i,p ) * B( p,j+2 );
            c3 += A( i,p ) * B( p,j+3 );
          }
          
          C( i,j )   += c0;
          C( i,j+1 ) += c1;
          C( i,j+2 ) += c2;
          C( i,j+3 ) += c3;
        }
      }
    }
    

Change the first lines in the makefile to

  • OLD  := 2
    NEW  := 10
  • make run

  • make plot

  • gv graph_2_vs_10.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 11

Copy the contents of file MMult10.c into a file named MMult11.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
    
          for ( p=0; p<k; p++ ){
            c0 += A( i,p ) * B( p,j );
            c1 += A( i,p ) * B( p,j+1 );
            c2 += A( i,p ) * B( p,j+2 );
            c3 += A( i,p ) * B( p,j+3 );
          }
          
          C( i,j )   += c0;
          C( i,j+1 ) += c1;
          C( i,j+2 ) += c2;
          C( i,j+3 ) += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            c0 = c0 + *ap * *b0p++;
            c1 = c1 + *ap * *b1p++;
            c2 = c2 + *ap * *b2p++;
            c3 = c3 + *ap * *b3p++;
    
            ap += lda;
          }
          
          C( i,j )   += c0;
          C( i,j+1 ) += c1;
          C( i,j+2 ) += c2;
          C( i,j+3 ) += c3;
        }
      }
    }
    

Compare also Optimization 3 (above) to this Optimization 11. (They are similar, except that the same optimizations are applied to Optization 2 and 10, respectively.)

Change the first lines in the makefile to

  • OLD  := 3
    NEW  := 11
  • make run

  • make plot

  • gv graph_3_vs_11.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 12

Copy the contents of file MMult11.c into a file named MMult12.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            c0 = c0 + *ap * *b0p++;
            c1 = c1 + *ap * *b1p++;
            c2 = c2 + *ap * *b2p++;
            c3 = c3 + *ap * *b3p++;
    
            ap += lda;
          }
          
          C( i,j )   += c0;
          C( i,j+1 ) += c1;
          C( i,j+2 ) += c2;
          C( i,j+3 ) += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap += lda;
          }
          
          C( i,j )   += c0;
          C( i,j+1 ) += c1;
          C( i,j+2 ) += c2;
          C( i,j+3 ) += c3;
        }
      }
    }
    

Compare also Optimization 4 (above) to this Optimization 12. (They are similar, except that the same optimizations are applied to Optization 3 and 11, respectively.)

Change the first lines in the makefile to

  • OLD  := 4
    NEW  := 12
  • make run

  • make plot

  • gv graph_4_vs_12.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 13

Copy the contents of file MMult12.c into a file named MMult13.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap += lda;
          }
          
          C( i,j )   += c0;
          C( i,j+1 ) += c1;
          C( i,j+2 ) += c2;
          C( i,j+3 ) += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        double *c0p = &C(0,j), *c1p = &C(0,j+1), 
               *c2p = &C(0,j+2), *c3p = &C(0,j+3);
    
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap += lda;
          }
          
          *c0p++ += c0;
          *c1p++ += c1;
          *c2p++ += c2;
          *c3p++ += c3;
        }
      }
    }
    

Compare also Optimization 5 (above) to this Optimization 13. (They are similar, except that the same optimizations are applied to Optization 4 and 12, respectively.)

Change the first lines in the makefile to

  • OLD  := 5
    NEW  := 13
  • make run

  • make plot

  • gv graph_5_vs_13.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 14

Copy the contents of file MMult13.c into a file named MMult14.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        double *c0p = &C(0,j), *c1p = &C(0,j+1), 
               *c2p = &C(0,j+2), *c3p = &C(0,j+3);
    
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap += lda;
          }
          
          *c0p++ += c0;
          *c1p++ += c1;
          *c2p++ += c2;
          *c3p++ += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
                                                                            
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                              double *b, int ldb,
                                              double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3);
    
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap += lda;
          }
          
          *c0p++ += c0;
          *c1p++ += c1;
          *c2p++ += c2;
          *c3p++ += c3;
        }
      }
    }
    

Compare also Optimization 7 (above) to this Optimization 14. (They are similar, except that the same optimizations are applied to Optization 4 and 13, respectively.)

Change the first lines in the makefile to

  • OLD  := 7
    NEW  := 14
  • make run

  • make plot

  • gv graph_7_vs_14.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimization 15

Copy the contents of file MMult14.c into a file named MMult14.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
                                                                            
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                              double *b, int ldb,
                                              double *c, int ldc )
    {
      int i, j, p;
    
      for ( j=0; j<n; j+=4 ){
        double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3);
    
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &A(i,0), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap += lda;
          }
          
          *c0p++ += c0;
          *c1p++ += c1;
          *c2p++ += c2;
          *c3p++ += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define AT(i,j) at[ (j)*(k) + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
    
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                              double *b, int ldb,
                                              double *c, int ldc )
    {
      int i, j, p;
      double at[ MB * KB ];
    
      transpose_matrix( m, k, a, lda, at, k );
    
      for ( j=0; j<n; j+=4 ){
        double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3);
    
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &AT(0,i), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap++;
          }
          
          *c0p++ += c0;
          *c1p++ += c1;
          *c2p++ += c2;
          *c3p++ += c3;
        }
      }
    }
    

Compare also Optimization 8 (above) to this Optimization 15. (They are similar, except that the same optimizations are applied to Optization 7 and 14, respectively.)

Change the first lines in the makefile to

  • OLD  := 8
    NEW  := 15
  • make run

  • make plot

  • gv graph_8_vs_15.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

Optimizing Optimization 15

Again (as shown for Optimization 8), by transposing the matrix, it turns out that a larger block size can be used. (The reason for this is that the block of A can be increased beyond the size of the L1 cache.) Try changing MB = 144 and KB = 144 in file MMult15.c and reexecute make run, make plot, and gv graph_8_vs_15.eps. On my machine this yields

Optimization 16

Copy the contents of file MMult15.c into a file named MMult16.c and change the contents:

  • from

    to

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define AT(i,j) at[ (j)*(k) + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 48
    #define KB 48
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
    
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                              double *b, int ldb,
                                              double *c, int ldc )
    {
      int i, j, p;
      double at[ MB * KB ];
    
      transpose_matrix( m, k, a, lda, at, k );
    
      for ( j=0; j<n; j+=4 ){
        double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3);
    
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &AT(0,i), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p++ ){
            register double aip = *ap;
    
            c0 = c0 + aip * *b0p++;
            c1 = c1 + aip * *b1p++;
            c2 = c2 + aip * *b2p++;
            c3 = c3 + aip * *b3p++;
    
            ap++;
          }
          
          *c0p++ += c0;
          *c1p++ += c1;
          *c2p++ += c2;
          *c3p++ += c3;
        }
      }
    }
    

    /* Create macros so that the matrices are stored in column-major order */
    
    #define A(i,j) a[ (j)*lda + (i) ]
    #define AT(i,j) at[ (j)*(k) + (i) ]
    #define B(i,j) b[ (j)*ldb + (i) ]
    #define C(i,j) c[ (j)*ldc + (i) ]
    
    #define min( x, y ) ( x < y ? x : y )
    
    #define MB 144
    #define KB 144
    
    /* MY_MMult_inner prototype */
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc );
    
    
    /* Routine for computing C = A * B + C */
    
    void MY_MMult( int m, int n, int k, double *a, int lda, 
                                        double *b, int ldb,
                                        double *c, int ldc )
    {
      int i, p, ib, pb;
    
      for ( p=0; p<k; p+=KB ){
        pb = min( k-p, KB );
        for ( i=0; i<m; i+=MB ){
          ib = min( m-i, MB );
    
          MY_MMult_inner( ib, n, pb, 
                          &A( i, p ), lda, 
                          &B( p, 0 ), ldb, 
                          &C( i, 0 ), ldc );
        }
      }
    }
    
    
      
    
    
    /* Inner kernel for computing C = A * B + C */
    
    void MY_MMult_inner( int m, int n, int k, double *a, int lda, 
                                              double *b, int ldb,
                                              double *c, int ldc )
    {
      int i, j, p;
      double at[ MB * KB ];
    
      transpose_matrix( m, k, a, lda, at, k );
    
      for ( j=0; j<n; j+=4 ){
        double *c0p = &C(0,j), *c1p = &C(0,j+1), *c2p = &C(0,j+2), *c3p = &C(0,j+3);
    
        for ( i=0; i<m; i++ ){
          register double c0=0.0, c1=0.0, c2=0.0, c3=0.0;
          double *ap = &AT(0,i), *b0p=&B(0,j  ), *b1p=&B(0,j+1), 
                                *b2p=&B(0,j+2), *b3p=&B(0,j+3);
    
          for ( p=0; p<k; p+=4 ){
            register double aip;
    
            aip = *ap++;
    
            c0 = c0 + aip * b0p[0];
            c1 = c1 + aip * b1p[0];
            c2 = c2 + aip * b2p[0];
            c3 = c3 + aip * b3p[0];
    
            aip = *ap++;
    
            c0 = c0 + aip * b0p[1];
            c1 = c1 + aip * b1p[1];
            c2 = c2 + aip * b2p[1];
            c3 = c3 + aip * b3p[1];
    
            aip = *ap++;
    
            c0 = c0 + aip * b0p[2];
            c1 = c1 + aip * b1p[2];
            c2 = c2 + aip * b2p[2];
            c3 = c3 + aip * b3p[2];
    
            aip = *ap++;
    
            c0 = c0 + aip * b0p[3];
            c1 = c1 + aip * b1p[3];
            c2 = c2 + aip * b2p[3];
            c3 = c3 + aip * b3p[3];
    
            b0p+=4;
            b1p+=4;
            b2p+=4;
            b3p+=4;
          }
          
          *c0p++ += c0;
          *c1p++ += c1;
          *c2p++ += c2;
          *c3p++ += c3;
        }
      }
    }
    

Compare also Optimization 9 (above) to this Optimization 16. (They are similar, except that the same optimizations are applied to Optization 8 and 15, respectively.)

Change the first lines in the makefile to

  • OLD  := 9
    NEW  := 16
  • make run

  • make plot

  • gv graph_9_vs_16.eps

This time the performance graph will look something like

The difference in performance can be attributed to the fact that ...

LinearAlgebraWiki: OptimizingGemm (last edited 2009-11-02 13:37:13 by RobertVanDeGeijn)