DM818 Mandatory Assignment 3: DNS Algorithm (Deckel, Nassimi, Sahni)
Overview
The purpose of this assignment is to parallelize matrix-matrix multiplication with the DNS algorithm (see Chapter 8 of the course book).
Implement the DNS algorithm for dense matrix multiplication using MPI on the Franlin cluster. Use a virtual topology as a mechanism for naming the processes in the communicator (see for example here). The implementation has only to be functional for a cubic number p of processors and you can assume that the matrices of size n×n have a certain size, namely it holds n ≥ p1/3. Furthermore you can assume that the input matrices are squared. Note that your implementation has to handle the case that the matrix dimension n is not divisible by p1/3. For the multiplication of the submatrices you are required to use the ACML library (cmp. Mandatory Assignment 1).
Let the overall runtime of the matrix-matrix multiplication be defined as Pend-Pstart. Pstart refers to the time that corresponds to the situation (a) on Figure 8.4 page 350 of the course book (see below, i.e. the time after the submatrices have been scattered from one to n2 processors), Pend refers to the time when the All-to-One reduction step along the k axis of DNS is finished (i.e. the time before the submatrices are gathered to one processor which is identical to the time directly after the All-to-One reduction step along the k axis has finished in all n2 processors).

Source Code
You may (but you don't have to) start with the implementation template supplied below.
|
Although you do not need to use this template, your executable
should in any case be named dnsmat, and it should use exactly two parameters,
namely the number of processors and the matrix dimension (for example
the command ./dnsmat 8 240 will execute the DNS
matrix-matrix multiplication algorithm on two matrices of size
240×240 with 8 cores).
You are welcome to use any NERSC cluster in this assignment.
Resources
- The main source of information is Chapter 8 of the course book. Virtual topologies are (for example) explained here.
Report / Submission
The assignment can be done alone, or in groups of two or three people (preferably groups of two). The submission of the source code has to be supplemented by a write-up (max. 10 pages), one write-up for the whole group. The write-up should contain in any case:
- The names of the people in your group.
- A statement of who contributed with what.
- A description of the design choices that you tried and how they affected the performance.
- A description of how you used virtual topologies.
- In your report, all runtimes and measures that depend on runtimes, should be
based on average runtimes based on 10 test runs (compare
main.c). - A table and a figure similar to Figure 5.8 and Table 5.1 on page 211 of the course book with reasonably chosen values for the number of processors p and matrix dimension n. As matrix-matrix multiplication with a small number of processors can take very long for large matrices, you should omit certain combination of p and n. (You may restrict computation time via the #PBS -l walltime=XX:XX:XX command in the batch file.) However you have to include (p,n) ∈ {(125,25),(125,10000),(1000,100),(1000,10000)} and a column for p=1. For determining the speedup and the efficiency you shall not compare your measured parallel runtimes to actual runtimes using p=1, but you shall assume a 9.2 Gflop/s peak performance per processor and infer the sequential runtime for p=1 based on that assumption. (Therefore the column "p=1" will have efficiency values <1.0 that are identical to the fraction of peak performance based on the ACML runtime measurements from the first mandatory assignment.)
- A table with the same columns and
rows as the table given above, where the entries should give the
percentage of the overall runtime that was spend for computation
(use
MPI_Wtimefor measuring computation times between communication periods on one node only). - A table with the same columns and rows as the table given above,
where the entries should give the overall runtime in seconds
(use
MPI_Wtimefor measuring computation times between communication periods on one node only). - A discussion of your results and a discussion that compares your results to the isoefficiency function O(p (log p)3).
- A description of how you solved the problem when the matrix dimension n is not divisible by p1/3.
- Your matrices for multiplication should have randomly chosen elements between 0 and 1. The matrices have to be defined on one node and then the submatrices have to be distributed according to the DNS algorithm to the other nodes. Furthermore the data should be gathered in the end to one node (this is also necessary for the correctness check, see next paragraph). Note however how the overall runtime of the matrix-matrix multiplication is defined!
- Your source code has to include a method that checks for a randomly chosen element of the resulting matrix, that the outcome of the parallel matrix multiplication is identical to the result that is computed in O(n) time on one node (that is the difference is smaller that a reasonably chosen small value). The correctness check should not influence the overall runtime measurements. Your executable has to perform the correctness check on at least one element.
For submitting the report and the sources proceed as follows: Create the following directory structure one of IMADAs pool machines (of course you have to copy the data from the NERSC filesystem to the IMADA system before doing that, and you have to modify the directory names below if you use different systems than Franklin for your implementation).
mandatory3/ mandatory3/report/ mandatory3/mpi-franklin/sources/ mandatory3/mpi-franklin/results/
Put your report, sources, and results in the corresponding
directory. The source directory should include all files that are
necessary to produce an executable code on the corresponding NERSC
resources. The sources directory and
the results directory should also contain
a README file that describes details. It basically has to
be possible to copy the complete sources to the file system of the
system of your choice, and to produce executable files via a
simple make command. The report directory,
the sources directory, and the results
directory should not contain more than 10MB of data each.
After putting all the data to the correct location, change to the directory mandatory3 and type aflever DM818.
Additionally, you shall hand in a printout of your report and of your source code (10 pages limit applies only to the report, not to the source code). (Department secretaries office).
The strict deadline for submission (electronically and printouts) is Friday, November, 27th, 10:00am.
Frequently Asked Questions (FAQ)
- Are we allowed to work in groups?
Yes, but you should clearly state who contributed with what in the report. The mandatory assignments are part of your final exam, i.e., you have to fully understand all parts of what you have done.