DM818 Mandatory Assignment 1: Optimize Matrix Multiplication
Problem
Your task is to optimize matrix multiplication code to run fast on a single processor core of NERSC Franklin cluster (please read the access instructions), and also Running Jobs on Franklin. Semantics of the matrix multiplication is:
for i = 1 to n
for j = 1 to n
for k = 1 to n
C[i,j] = C[i,j] + A[i,k] * B[k,j]
end
end
end
where A, B and C are n-by-n matrices. You need to write a dgemm.c that implements this semantics and exposes it through the following C interface:
void square_dgemm( int n, double *A, double *B, double *C ).
The matrices are stored in column-major order, i.e. entry Cij is stored at C[i+j*M]. The correctness of implementation is verified using an element-wise error bound that holds for the naive matrix multiply:
|square_dgemm(n,A,B,0) - A*B| < eps * n * |A| * |B|.
This bound also holds for any other algorithm that does the same floating point operations as the naive algorithm but in different order.
However, this may exclude using some other algorithms such as Strassen's. In the formula, eps = 2–52 = 2.2 * 10–16 is the machine epsilon.
We provide two simple implementations for you to start with: a naive three-loop implementation similar to shown above and a more cache-efficient blocked implementation.
The necessary files are in matmul.tar. Included are the following:
|
This code uses ACML library, so you may need to type "module load acml" before compiling the code on Franklin to set up the paths. See instructions in the Makefile on compiling the code with different compilers available on Franklin. A detailed help on using the programming environment can be found on the NERSC pages.
The target processor is Quad Core AMD Opteron. Its cores have separate 128-bit wide pipelines for addition and multiplication clocked at 2.3GHz. This yields 2 double precision (i.e. 64-bit) flops per pipeline * 2 pipelines * 2.3 GHz = 9.2 Gflop/s. The following graph shows the performance of the supplied naive and blocked implementations (if compiled using PGI compiler with -fastsse option) as a fraction of this peak (see Figure below). Your goal is to get a higher performance and understand how you did that. Ideally, you would match or surpass the performance of the BLAS implementation (please check how fast it is), but this is not required.

Useful References
Technical Publications:
- Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.
- Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196–259.
- Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.
- Lam, M. S., Rothberg, E. E, and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS'91, 63–74.
Technical Documentation:
- AMD64 Architecture Tech Docs.
- PGI C and C++ Compilers on Franklin
- Running Jobs on Franklin
- Using BLAS/LAPACK/ATLAS on Linux based systems
- Automatically Tuned Linear Algebra Software (ATLAS)
- A brief API reference for the ANSI/ISO C interface to the BLAS.
You are also welcome to learn from the source code of state-of-art BLAS implementations such as GotoBLAS and ATLAS. However, you should not reuse those codes in your submission.
Report / Submission
The assignment can be done alone, or in groups of two or three people (preferably groups of two). The submission of the source code (at least an optimized version of dgemm.c or dgemm.cpp and a Makefile has to be supplemented by a write-up (max. 10 pages), one write-up for the whole group). The write-up should contain
- the names of the people in your group,
- a statement of who contributed with what,
- the optimizations used or attempted,
- the results of those optimizations,
- the reason for any odd behavior (e.g., dips) in performance, and
- how the same kernel performs on a different hardware platform (and a small discussion of that),
- a figure fhat shows the performace similar than the figure above, but using all matrix sizes 2, 3, 4, ..., 769 for a single processor core of NERSC Franklin cluster (y-axis should say percentage of peak performace),
- a figure fhat shows the performace similar than the figure above, but using all matrix sizes 2, 3, 4, ..., 769 for a single processor of another system of your choice (using the same kernel). Explain how you dertermined the peak performace, and again, the y-axis should say percentage of peak performace.
For the last requirement, you may run your optimized implementation on another NERSC cluster, such as Bassi (POWER 5 processors) or DaVinci (Itanium 2 processors), on your laptop, on one of the IMADA pool machines, etc. Note that on the IMADA pool machines ATLAS (Automatically Tuned Linear Algebra Software) is installed.
Passing this assignment will mostly depend on two factors:
- performance sustained by your codes on Franklin,
- explanations of the performance features you observed.
For submitting the report and the sources proceed as follows: Create the following directory structure one of IMADAs pool machines (of course you have to copy the data from the NERSC filesystem to the IMADA system before doing that)
mandatory1/ mandatory1/report/ mandatory1/sources/franklin/ mandatory1/sources/other/ mandatory1/results/franklin/ mandatory1/results/other/
Put your report, sources, and results in the corresponding
directory. The source directory for the Franklin Cluster should include all files that are
necessary to produce an executable code on the NERSC resources (for
example there has to be a dgemm.cpp and
a Makefile in any case). It should be possible to copy
the complete data to the file system of the Franklin cluster and to produce
executable files via a simple make command. The report
direcory and the sources directories should not contain
more than 10MB of data. Each of the two result directories should contain a file called performance. These two
files have to have the format of the output from the benchmark program (see source code). Use all matrix sizes 2, 3, 4, ..., 768, 769. The file(s) should look similar to the following:
Size: 2 Mflop/s: 1466.75 Size: 3 Mflop/s: 1515.29 ... Size: 768 Mflop/s: 433.163 Size: 769 Mflop/s: 397.745
After putting all the data to the correct location, change to the directory mandatory1 and type aflever DM818.
Additionally, you shall hand in a printout of your report and of your source code. (Department secretaries office).
The strict deadline for submission (electronically and printouts) is Friday, September, 25th, 13:00 (due to delays in receiving acces to the Franklin Cluster, the original deadline was prolonged by one week from September, 18th to September 25th).
Frequently Asked Questions (FAQ)
- Are we allowed to work in groups?
Yes, but you should clearly state who contributed with what in the report. The mandatory assignments are part of your final exam, i.e., you have to fully understand all parts of what you have done.