DM818 Mandatory Assignment 2: Parallelize Particle Simulation
Overview
The purpose of this assignment is an introduction to programming in shared and distributed memory models. (Particle simulations are also covered in the lecture.) You will be required to submit a distributed memory implementation based on MPI, and at least one shared memory implementation (based on pthreads or OpenMP).
Your goal is to parallelize a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behaviour shown in the following animation:
The range of interaction forces is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.
Suppose we have a code that runs in time T = O(n) on a single processor. Then we would hope it runs in time T/p on p processors. We'd like you to write parallel codes that approach this expectation. Don't forget to use the no-output option ( -no ) for the actual timing runs that go into your calculations and plots.
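For example, a timing run of the serial code might look like the following (assuming the -n flag for the particle count, as in the provided code; check -h for the exact options):

    ./serial -n 50000 -no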
Correctness and Performance
A simple correctness check is provided that computes the minimal distance between any 2 particles during the entire simulation. In a correct simulation, particles stay at a distance greater than 0.4 (of cutoff), with typical values between 0.7 and 0.8. In a simulation where particles do not interact correctly, the minimal distance drops below 0.4 (of cutoff), with typical values between 0.01 and 0.05. More details, as well as an average distance, are described in the source file.
The code we are providing performs this distance check based on the calls to the interaction function, but the full autograder performs the checks based on the output .txt file. We recommend keeping the correctness checker in your code (for the OpenMP and MPI codes the overhead is not significant), but depending on performance needs it can be removed, as long as the simulation can still output a correct .txt file.
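As a rough illustration, the check boils down to tracking the smallest pairwise separation, as a fraction of cutoff, over every interaction the force loop visits. A minimal sketch (illustrative names only, not the provided code):

    #include <algorithm>
    #include <cmath>

    // Illustrative only: track the smallest separation seen so far,
    // as a fraction of cutoff, across all visited interactions.
    void track_min_distance(double dx, double dy, double cutoff, double &min_ratio) {
        double r = std::sqrt(dx * dx + dy * dy);
        min_ratio = std::min(min_ratio, r / cutoff);
    }
    // After the final step, min_ratio > 0.4 indicates a correct simulation;
    // min_ratio < 0.4 (typically 0.01-0.05) indicates missed interactions.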
The performance of your code is determined by running the simulation several times with increasing particle numbers and comparing against the provided autograder.cpp. This can be done automatically with the auto-* scripts. However, make sure your code is fully tested before running the autograders at full scale (repeatedly running the autograders unmodified for testing can eat up your compute hours allocation quickly).
There are two types of scaling that are tested for your parallel codes:
- In strong scaling we keep the problem size constant but increase the number of processors.
- In weak scaling we increase the problem size proportionally to the number of processors, so the work per processor stays the same. For example, a weak scaling run might use 2000 particles per processor: n = 2000 on p = 1, n = 4000 on p = 2, and so on. (Note that for the purposes of this assignment we assume a linear scaling between work and processors.)
For more details on the options provided by each file you can use the -h flag on the compiled executables.
It may help to have a clean O(n) serial CPU implementation as a reference. Besides the simple correctness check provided, you can check the correctness of your algorithm by comparing your solution to the serial implementation after up to 100 timesteps. Your solution should agree to the tenth decimal place at that point in time with any order of summation, but will not agree if you have missed a particle interaction.
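The standard way to get from O(n^2) to O(n) here is spatial binning: partition the domain into square cells with side length at least the cutoff, so each particle only has to be checked against particles in its own and the eight neighbouring cells. A minimal serial sketch (the particle_t layout and apply_force signature are assumptions loosely based on the provided code):

    #include <vector>

    // Assumed particle layout, loosely following the provided common.h;
    // apply_force is the interaction routine from the provided code.
    struct particle_t { double x, y, vx, vy, ax, ay; };
    void apply_force(particle_t &p, particle_t &q);

    void compute_forces(std::vector<particle_t> &parts, double size, double cutoff) {
        int nbins = (int)(size / cutoff) + 1;            // cells of side >= cutoff
        std::vector<std::vector<particle_t*>> bins(nbins * nbins);

        for (auto &p : parts)                            // O(n) binning pass
            bins[(int)(p.x / cutoff) + nbins * (int)(p.y / cutoff)].push_back(&p);

        for (auto &p : parts) {                          // force pass: only the
            int bx = (int)(p.x / cutoff);                // 3x3 block of cells
            int by = (int)(p.y / cutoff);                // around each particle
            for (int i = bx - 1; i <= bx + 1; i++)
                for (int j = by - 1; j <= by + 1; j++) {
                    if (i < 0 || i >= nbins || j < 0 || j >= nbins) continue;
                    for (particle_t *q : bins[i + nbins * j])
                        if (q != &p)
                            apply_force(p, *q);
                }
        }
    }

Since the density keeps the expected number of particles per cell constant, both passes run in O(n) time; rebuilding the bins every step is itself only O(n) and cheap compared to the force computation.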
Important note for Performance:
While the scripts we are providing have small numbers of particles (500-1000) to allow for the O(n^2) algorithm to finish execution, the final codes should be tested with values 100 times larger (50000-100000) to better see their performance.
Source Code
You may start with the serial and parallel implementations supplied with the assignment. All of them run in O(n^2) time, which is unacceptably inefficient.
You are welcome to use any NERSC cluster for this assignment. If you wish to build the code on other systems, you might need a custom implementation of pthread barriers, such as: pthread_barrier.c, pthread_barrier.h.
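Alternatively, a barrier can be built from a mutex and a condition variable. A minimal sketch (no error handling; this is not the linked implementation):

    #include <pthread.h>

    // Minimal counting barrier; the generation counter guards against
    // spurious wakeups and lets the barrier be reused across steps.
    typedef struct {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        int count, waiting, generation;
    } barrier_t;

    void barrier_init(barrier_t *b, int count) {
        pthread_mutex_init(&b->mutex, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->count = count; b->waiting = 0; b->generation = 0;
    }

    void barrier_wait(barrier_t *b) {
        pthread_mutex_lock(&b->mutex);
        int gen = b->generation;
        if (++b->waiting == b->count) {       // last thread releases the rest
            b->waiting = 0;
            b->generation++;
            pthread_cond_broadcast(&b->cond);
        } else {
            while (gen == b->generation)      // wait for this generation to end
                pthread_cond_wait(&b->cond, &b->mutex);
        }
        pthread_mutex_unlock(&b->mutex);
    }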
You can use the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version (not tested since 2011).
Resources for the mandatory part
- Programming in shared and distributed memory models has been introduced in the lectures; the slides are available in the Blackboard system.
- Shared memory implementations may require using locks that are available as omp_lock_t in OpenMP (requires omp.h) and pthread_mutex_t in pthreads (requires pthread.h); see the lock sketch after this list.
- Alternatively, you may consider using atomic operations such as __fetch_and_swap in the IBM compiler (requires builtins.h) and __sync_lock_test_and_set in the newer versions of the GNU compiler.
- Distributed memory implementations may benefit from overlapping communication and computation, which nonblocking MPI routines such as MPI_Isend and MPI_Irecv provide; see the exchange sketch after this list.
- Other useful resources: pthreads tutorial, OpenMP tutorial, OpenMP specifications and MPI specifications.
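To make the lock and nonblocking-communication hints above concrete, here are two minimal sketches; all names (add_to_bin, exchange_ghosts, the buffers) are illustrative, not part of the provided code. First, serializing concurrent updates to a shared bin with an OpenMP lock:

    #include <omp.h>
    #include <vector>

    // Several threads append particle indices to a shared bin; the lock
    // serializes the concurrent push_back calls. In practice you would
    // keep one lock per bin rather than a single global one.
    void add_to_bin(std::vector<int> &bin, int idx, omp_lock_t *lock) {
        omp_set_lock(lock);
        bin.push_back(idx);
        omp_unset_lock(lock);
    }
    // Initialize each lock once with omp_init_lock() and release it
    // with omp_destroy_lock() when the simulation ends.

Second, overlapping communication and computation with MPI_Isend / MPI_Irecv in a 1D row decomposition, where each rank exchanges boundary particles with its left and right neighbours (for the first and last rank, pass MPI_PROC_NULL as the neighbour and the call becomes a no-op):

    #include <mpi.h>

    struct particle_t { double x, y, vx, vy, ax, ay; };

    // Post the receive and send first, do work that does not need the
    // incoming ghosts, and only then wait for completion.
    void exchange_ghosts(particle_t *send_buf, int n_send,
                         particle_t *recv_buf, int max_recv,
                         int left, int right) {
        MPI_Request reqs[2];
        MPI_Irecv(recv_buf, max_recv * (int)sizeof(particle_t), MPI_BYTE,
                  left, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_buf, n_send * (int)sizeof(particle_t), MPI_BYTE,
                  right, 0, MPI_COMM_WORLD, &reqs[1]);

        // ... compute forces on interior particles here ...

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // recv_buf is now safe to read; keep the receive status and use
        // MPI_Get_count if the number of incoming ghosts varies per step.
    }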
Report / Submission
The assignment can be done alone or in groups of two people. The submission of the source code has to be supplemented by a write-up (max. 10 pages), one write-up for the whole group. The write-up should contain:
- The names of the people in your group.
- A statement of who contributed with what.
- A plot in log-log scale showing that your serial and parallel codes run in O(n) time, and a description of the data structures that you used to achieve this. You can also use a plot in log-linear scale that shows your performance as a percentage of peak performance for different numbers of processors.
- A description of the synchronization you used in the shared memory implementation.
- A description of the communication you used in the distributed memory implementation.
- A description of the design choices that you tried and how they affected the performance.
- Speedup plots that show how closely your parallel codes approach the idealized p-times speedup and a discussion on whether it is possible to do better.
- Where does the time go? Consider breaking down the runtime into computation time, synchronization time and/or communication time. How do they scale with p? (See the timing sketch after this list.)
- A discussion on using pthreads, OpenMP and MPI for this assignment.
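For the timing breakdown, wall-clock timers around each phase are usually enough. A minimal MPI sketch (exchange_ghosts and compute_forces_and_move stand in for your own phases; all names are illustrative):

    #include <mpi.h>

    void exchange_ghosts();               // your communication phase
    void compute_forces_and_move();       // your computation phase

    void timed_loop(int nsteps) {
        double t_comm = 0.0;
        double t0 = MPI_Wtime();
        for (int step = 0; step < nsteps; step++) {
            double c0 = MPI_Wtime();
            exchange_ghosts();
            t_comm += MPI_Wtime() - c0;   // accumulate communication time
            compute_forces_and_move();
        }
        double t_total = MPI_Wtime() - t0;
        // The slowest rank determines the runtime, so report the maximum;
        // total minus communication approximates computation time.
        double t_comm_max, t_total_max;
        MPI_Reduce(&t_comm, &t_comm_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_total, &t_total_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    }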
For submitting the report and the sources, proceed as follows: create the following directory structure on one of IMADA's pool machines (of course, you have to copy the data from the NERSC filesystem to the IMADA system before doing that, and you have to modify the directory names below if you use systems other than Edison for the OpenMP, pthreads, and MPI implementations).
    mandatory2/
    mandatory2/report/
    mandatory2/openmp-edison/sources/
    mandatory2/openmp-edison/results/
    mandatory2/pthreads-edison/sources/
    mandatory2/pthreads-edison/results/
    mandatory2/mpi-edison/sources/
    mandatory2/mpi-edison/results/
Put your report, sources, and results in the corresponding directory. The source directories should include all files that are necessary to produce an executable on the corresponding NERSC resources. Each of the three sources directories and each of the three results directories should also contain a README file that describes the details. It has to be possible to copy the complete sources to the file system of the system of your choice and to produce executable files via a simple make command. The report directory, the sources directories, and the results directories should not contain more than 10 MB of data each. You should consider including a webpage (please use the filename index.html) in the results directories with animations that show the correctness of your simulation for a reasonably chosen number of particles.
The submission procedure for blackboard will be specified soon.
Additionally, you shall hand in a printout of your report and of your source code (the 10-page limit applies only to the report, not to the source code). Printouts can be delivered to my mailbox, to my office, or during the lecture.
The strict deadline for submission (electronically and as printouts) is 20.11.2017, 15:00.
Frequently Asked Questions (FAQ)
- Are we allowed to work in groups?
Yes, but you should clearly state who contributed with what in the report. The mandatory assignments are part of your final exam, i.e., you have to fully understand all parts of what you have done.