Parallel Computing

DM818, Autumn 2009

Daniel Merkle

DM818 Mandatory Assignment 3: DNS Algorithm (Deckel, Nassimi, Sahni)

Overview

The purpose of this assignment is to parallelize matrix-matrix multiplication with the DNS algorithm (see Chapter 8 of the course book).

Implement the DNS algorithm for dense matrix multiplication using MPI on the Franlin cluster. Use a virtual topology as a mechanism for naming the processes in the communicator (see for example here). The implementation has only to be functional for a cubic number p of processors and you can assume that the matrices of size n×n have a certain size, namely it holds n ≥ p1/3. Furthermore you can assume that the input matrices are squared. Note that your implementation has to handle the case that the matrix dimension n is not divisible by p1/3. For the multiplication of the submatrices you are required to use the ACML library (cmp. Mandatory Assignment 1).

Let the overall runtime of the matrix-matrix multiplication be defined as Pend-Pstart. Pstart refers to the time that corresponds to the situation (a) on Figure 8.4 page 350 of the course book (see below, i.e. the time after the submatrices have been scattered from one to n2 processors), Pend refers to the time when the All-to-One reduction step along the k axis of DNS is finished (i.e. the time before the submatrices are gathered to one processor which is identical to the time directly after the All-to-One reduction step along the k axis has finished in all n2 processors).

Source Code

You may (but you don't have to) start with the implementation template supplied below.

comm.c
comm.h
for submatrix scattering and gathering and handling virtual topologies
matrix.c
matrix.h
(sub)matrix handling
main.c
main file (including averaging runtimes)
Makefile
a makefile that should work on the Franklin cluster,
job-dns8-240,
sample batch file to launch one job on Franklin (Please check how "Running Multiple Parallel Jobs Simultaneously" . Use qsub to submit on Franklin.
template.tar
all above files in one tarball.

Although you do not need to use this template, your executable should in any case be named dnsmat, and it should use exactly two parameters, namely the number of processors and the matrix dimension (for example the command ./dnsmat 8 240 will execute the DNS matrix-matrix multiplication algorithm on two matrices of size 240×240 with 8 cores).

You are welcome to use any NERSC cluster in this assignment.

Resources

  • The main source of information is Chapter 8 of the course book. Virtual topologies are (for example) explained here.

Report / Submission

The assignment can be done alone, or in groups of two or three people (preferably groups of two). The submission of the source code has to be supplemented by a write-up (max. 10 pages), one write-up for the whole group. The write-up should contain in any case:

  • The names of the people in your group.
  • A statement of who contributed with what.
  • A description of the design choices that you tried and how they affected the performance.
  • A description of how you used virtual topologies.
  • In your report, all runtimes and measures that depend on runtimes, should be based on average runtimes based on 10 test runs (compare main.c).
  • A table and a figure similar to Figure 5.8 and Table 5.1 on page 211 of the course book with reasonably chosen values for the number of processors p and matrix dimension n. As matrix-matrix multiplication with a small number of processors can take very long for large matrices, you should omit certain combination of p and n. (You may restrict computation time via the #PBS -l walltime=XX:XX:XX command in the batch file.) However you have to include (p,n) ∈ {(125,25),(125,10000),(1000,100),(1000,10000)} and a column for p=1. For determining the speedup and the efficiency you shall not compare your measured parallel runtimes to actual runtimes using p=1, but you shall assume a 9.2 Gflop/s peak performance per processor and infer the sequential runtime for p=1 based on that assumption. (Therefore the column "p=1" will have efficiency values <1.0 that are identical to the fraction of peak performance based on the ACML runtime measurements from the first mandatory assignment.)
  • A table with the same columns and rows as the table given above, where the entries should give the percentage of the overall runtime that was spend for computation (use MPI_Wtime for measuring computation times between communication periods on one node only).
  • A table with the same columns and rows as the table given above, where the entries should give the overall runtime in seconds (use MPI_Wtime for measuring computation times between communication periods on one node only).
  • A discussion of your results and a discussion that compares your results to the isoefficiency function O(p (log p)3).
  • A description of how you solved the problem when the matrix dimension n is not divisible by p1/3.
  • Your matrices for multiplication should have randomly chosen elements between 0 and 1. The matrices have to be defined on one node and then the submatrices have to be distributed according to the DNS algorithm to the other nodes. Furthermore the data should be gathered in the end to one node (this is also necessary for the correctness check, see next paragraph). Note however how the overall runtime of the matrix-matrix multiplication is defined!
  • Your source code has to include a method that checks for a randomly chosen element of the resulting matrix, that the outcome of the parallel matrix multiplication is identical to the result that is computed in O(n) time on one node (that is the difference is smaller that a reasonably chosen small value). The correctness check should not influence the overall runtime measurements. Your executable has to perform the correctness check on at least one element.

For submitting the report and the sources proceed as follows: Create the following directory structure one of IMADAs pool machines (of course you have to copy the data from the NERSC filesystem to the IMADA system before doing that, and you have to modify the directory names below if you use different systems than Franklin for your implementation).

mandatory3/
mandatory3/report/
mandatory3/mpi-franklin/sources/
mandatory3/mpi-franklin/results/

Put your report, sources, and results in the corresponding directory. The source directory should include all files that are necessary to produce an executable code on the corresponding NERSC resources. The sources directory and the results directory should also contain a README file that describes details. It basically has to be possible to copy the complete sources to the file system of the system of your choice, and to produce executable files via a simple make command. The report directory, the sources directory, and the results directory should not contain more than 10MB of data each.

After putting all the data to the correct location, change to the directory mandatory3 and type aflever DM818.

Additionally, you shall hand in a printout of your report and of your source code (10 pages limit applies only to the report, not to the source code). (Department secretaries office).

The strict deadline for submission (electronically and printouts) is Friday, November, 27th, 10:00am.

Frequently Asked Questions (FAQ)

  • Are we allowed to work in groups?
  • Yes, but you should clearly state who contributed with what in the report. The mandatory assignments are part of your final exam, i.e., you have to fully understand all parts of what you have done.