DM818 Mandatory Assignment 2: Parallelize Particle Simulation
Overview
The purpose of this assignment is an introduction to programming in shared and distributed memory models. (Particle simulations are also covered in the lecture.) You will be required to submit a distributed memory implementation based on MPI, and at least one shared memory implementation (based on pthreads or OpenMP).
Your goal is to parallelize a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behaviour shown in the following animation:
The range of interaction forces is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.
Suppose we have a code that runs in time T = O(n) on a single processor. Then we would hope it runs in time T/p on p processors. We'd like you to write parallel codes that approach this expectation. Don't forget to use the no-output option ( -no ) for the actual timing runs that go into your calculations and plots.
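For example, a timing run of the serial code might look like the following (assuming the -n flag for the particle count, as in the provided code; check -h for the exact options):

    ./serial -n 50000 -no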
Correctness and Performance
A simple correctness check is provided that computes the minimal distance between any 2 particles during the entire simulation. In a correct simulation, particles stay at a distance greater than 0.4 (of cutoff), with typical values between 0.7 and 0.8. In a simulation where particles do not interact correctly, the minimal distance drops below 0.4 (of cutoff), with typical values between 0.01 and 0.05. More details, as well as an average distance, are described in the source file.
The code we are providing performs this distance check based on the calls to the interaction function, but the full autograder performs the checks based on the output .txt file. We recommend keeping the correctness checker in your code (for the OpenMP and MPI codes the overhead is not significant), but depending on performance needs it can be removed, as long as the simulation can still output a correct .txt file.
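As a rough illustration, the check boils down to tracking the smallest pairwise separation, as a fraction of cutoff, over every interaction the force loop visits. A minimal sketch (illustrative names only, not the provided code):

    #include <algorithm>
    #include <cmath>

    // Illustrative only: track the smallest separation seen so far,
    // as a fraction of cutoff, across all visited interactions.
    void track_min_distance(double dx, double dy, double cutoff, double &min_ratio) {
        double r = std::sqrt(dx * dx + dy * dy);
        min_ratio = std::min(min_ratio, r / cutoff);
    }
    // After the final step, min_ratio > 0.4 indicates a correct simulation;
    // min_ratio < 0.4 (typically 0.01-0.05) indicates missed interactions.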
The performance of your code is determined by running the simulation several times with increasing particle numbers and comparing against the provided autograder.cpp. This can be done automatically with the auto-* scripts. However, make sure your code is fully tested before running the autograders at full scale (repeatedly running the autograders unmodified for testing can eat up your compute hours allocation quickly).
There are two types of scaling that are tested for your parallel codes:
- In strong scaling we keep the problem size constant but increase the number of processors.
- In weak scaling we increase the problem size proportionally to the number of processors, so the work per processor stays the same. For example, a weak scaling run might use 2000 particles per processor: n = 2000 on p = 1, n = 4000 on p = 2, and so on. (Note that for the purposes of this assignment we assume a linear scaling between work and processors.)
For more details on the options provided by each file you can use the -h flag on the compiled executables.
It may help to have a clean O(n) serial CPU implementation as a reference. Besides the simple correctness check provided, you can check the correctness of your algorithm by comparing your solution to the serial implementation after up to 100 timesteps. Your solution should agree to the tenth decimal place at that point in time with any order of summation, but will not agree if you have missed a particle interaction.
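The standard way to get from O(n^2) to O(n) here is spatial binning: partition the domain into square cells with side length at least the cutoff, so each particle only has to be checked against particles in its own and the eight neighbouring cells. A minimal serial sketch (the particle_t layout and apply_force signature are assumptions loosely based on the provided code):

    #include <vector>

    // Assumed particle layout, loosely following the provided common.h;
    // apply_force is the interaction routine from the provided code.
    struct particle_t { double x, y, vx, vy, ax, ay; };
    void apply_force(particle_t &p, particle_t &q);

    void compute_forces(std::vector<particle_t> &parts, double size, double cutoff) {
        int nbins = (int)(size / cutoff) + 1;            // cells of side >= cutoff
        std::vector<std::vector<particle_t*>> bins(nbins * nbins);

        for (auto &p : parts)                            // O(n) binning pass
            bins[(int)(p.x / cutoff) + nbins * (int)(p.y / cutoff)].push_back(&p);

        for (auto &p : parts) {                          // force pass: only the
            int bx = (int)(p.x / cutoff);                // 3x3 block of cells
            int by = (int)(p.y / cutoff);                // around each particle
            for (int i = bx - 1; i <= bx + 1; i++)
                for (int j = by - 1; j <= by + 1; j++) {
                    if (i < 0 || i >= nbins || j < 0 || j >= nbins) continue;
                    for (particle_t *q : bins[i + nbins * j])
                        if (q != &p)
                            apply_force(p, *q);
                }
        }
    }

Since the density keeps the expected number of particles per cell constant, both passes run in O(n) time; rebuilding the bins every step is itself only O(n) and cheap compared to the force computation.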
Important note for Performance:
While the scripts we are providing have small numbers of particles (500-1000) to allow for the O(n^2) algorithm to finish execution, the final codes should be tested with values 100 times larger (50000-100000) to better see their performance.
Source Code
You may start with the serial and parallel implementations supplied with the assignment. All of them run in O(n^2) time, which is unacceptably inefficient.
You are welcome to use any NERSC cluster for this assignment. If you wish to build the code on other systems, you might need a custom implementation of pthread barriers, such as: pthread_barrier.c, pthread_barrier.h.
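Alternatively, a barrier can be built from a mutex and a condition variable. A minimal sketch (no error handling; this is not the linked implementation):

    #include <pthread.h>

    // Minimal counting barrier; the generation counter guards against
    // spurious wakeups and lets the barrier be reused across steps.
    typedef struct {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        int count, waiting, generation;
    } barrier_t;

    void barrier_init(barrier_t *b, int count) {
        pthread_mutex_init(&b->mutex, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->count = count; b->waiting = 0; b->generation = 0;
    }

    void barrier_wait(barrier_t *b) {
        pthread_mutex_lock(&b->mutex);
        int gen = b->generation;
        if (++b->waiting == b->count) {       // last thread releases the rest
            b->waiting = 0;
            b->generation++;
            pthread_cond_broadcast(&b->cond);
        } else {
            while (gen == b->generation)      // wait for this generation to end
                pthread_cond_wait(&b->cond, &b->mutex);
        }
        pthread_mutex_unlock(&b->mutex);
    }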
You can use the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version (not tested since 2011).
Resources for the mandatory part
- Programming in shared and distributed memory models has been introduced in the lectures; the slides are available in the Blackboard system.
- Shared memory implementations may require using locks that are available as omp_lock_t in OpenMP (requires omp.h) and pthread_mutex_t in pthreads (requires pthread.h); see the lock sketch after this list.
- Alternatively, you may consider using atomic operations such as __fetch_and_swap in the IBM compiler (requires builtins.h) and __sync_lock_test_and_set in the newer versions of the GNU compiler.
- Distributed memory implementations may benefit from overlapping communication and computation, which nonblocking MPI routines such as MPI_Isend and MPI_Irecv provide; see the exchange sketch after this list.
- Other useful resources: pthreads tutorial, OpenMP tutorial, OpenMP specifications and MPI specifications.
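To make the lock and nonblocking-communication hints above concrete, here are two minimal sketches; all names (add_to_bin, exchange_ghosts, the buffers) are illustrative, not part of the provided code. First, serializing concurrent updates to a shared bin with an OpenMP lock:

    #include <omp.h>
    #include <vector>

    // Several threads append particle indices to a shared bin; the lock
    // serializes the concurrent push_back calls. In practice you would
    // keep one lock per bin rather than a single global one.
    void add_to_bin(std::vector<int> &bin, int idx, omp_lock_t *lock) {
        omp_set_lock(lock);
        bin.push_back(idx);
        omp_unset_lock(lock);
    }
    // Initialize each lock once with omp_init_lock() and release it
    // with omp_destroy_lock() when the simulation ends.

Second, overlapping communication and computation with MPI_Isend / MPI_Irecv in a 1D row decomposition, where each rank exchanges boundary particles with its left and right neighbours (for the first and last rank, pass MPI_PROC_NULL as the neighbour and the call becomes a no-op):

    #include <mpi.h>

    struct particle_t { double x, y, vx, vy, ax, ay; };

    // Post the receive and send first, do work that does not need the
    // incoming ghosts, and only then wait for completion.
    void exchange_ghosts(particle_t *send_buf, int n_send,
                         particle_t *recv_buf, int max_recv,
                         int left, int right) {
        MPI_Request reqs[2];
        MPI_Irecv(recv_buf, max_recv * (int)sizeof(particle_t), MPI_BYTE,
                  left, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_buf, n_send * (int)sizeof(particle_t), MPI_BYTE,
                  right, 0, MPI_COMM_WORLD, &reqs[1]);

        // ... compute forces on interior particles here ...

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // recv_buf is now safe to read; keep the receive status and use
        // MPI_Get_count if the number of incoming ghosts varies per step.
    }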
Report / Submission
The assignment can be done alone or in groups of two people. The submission of the source code has to be supplemented by a write-up (max. 10 pages), one write-up for the whole group. The write-up should contain:
- The names of the people in your group.
- A statement of who contributed with what.
- A plot in log-log scale showing that your serial and parallel codes run in O(n) time, and a description of the data structures that you used to achieve this. You can also use a plot in log-linear scale that shows your performance as a percentage of peak performance for different numbers of processors.
- A description of the synchronization you used in the shared memory implementation.
- A description of the communication you used in the distributed memory implementation.
- A description of the design choices that you tried and how they affected the performance.
- Speedup plots that show how closely your parallel codes approach the idealized p-times speedup and a discussion on whether it is possible to do better.
- Where does the time go? Consider breaking down the runtime into computation time, synchronization time and/or communication time. How do they scale with p? (See the timing sketch after this list.)
- A discussion on using pthreads, OpenMP and MPI for this assignment.
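For the timing breakdown, wall-clock timers around each phase are usually enough. A minimal MPI sketch (exchange_ghosts and compute_forces_and_move stand in for your own phases; all names are illustrative):

    #include <mpi.h>

    void exchange_ghosts();               // your communication phase
    void compute_forces_and_move();       // your computation phase

    void timed_loop(int nsteps) {
        double t_comm = 0.0;
        double t0 = MPI_Wtime();
        for (int step = 0; step < nsteps; step++) {
            double c0 = MPI_Wtime();
            exchange_ghosts();
            t_comm += MPI_Wtime() - c0;   // accumulate communication time
            compute_forces_and_move();
        }
        double t_total = MPI_Wtime() - t0;
        // The slowest rank determines the runtime, so report the maximum;
        // total minus communication approximates computation time.
        double t_comm_max, t_total_max;
        MPI_Reduce(&t_comm, &t_comm_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_total, &t_total_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    }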
For submitting the report and the sources, proceed as follows: create the following directory structure on one of IMADA's pool machines (of course, you have to copy the data from the NERSC filesystem to the IMADA system before doing that, and you have to modify the directory names below if you use systems other than Edison for the OpenMP, pthreads, and MPI implementations).
    mandatory2/
    mandatory2/report/
    mandatory2/openmp-edison/sources/
    mandatory2/openmp-edison/results/
    mandatory2/pthreads-edison/sources/
    mandatory2/pthreads-edison/results/
    mandatory2/mpi-edison/sources/
    mandatory2/mpi-edison/results/
Put your report, sources, and results in the corresponding directory. The source directories should include all files that are necessary to produce an executable on the corresponding NERSC resources. Each of the three sources directories and each of the three results directories should also contain a README file that describes the details. It has to be possible to copy the complete sources to the file system of the system of your choice and to produce executable files via a simple make command. The report directory, the sources directories, and the results directories should not contain more than 10 MB of data each. You should consider including a webpage (please use the filename index.html) in the results directories with animations that show the correctness of your simulation for a reasonably chosen number of particles.
The submission procedure for blackboard will be specified soon.
Additionally, you shall hand in a printout of your report and of your source code (the 10-page limit applies only to the report, not to the source code). Printouts can be delivered to my mailbox, to my office, or during the lecture.
The strict deadline for submission (electronically and as printouts) is 20.11.2017, 15:00.
Frequently Asked Questions (FAQ)
- Are we allowed to work in groups?
Yes, but you should clearly state who contributed with what in the report. The mandatory assignments are part of your final exam, i.e., you have to fully understand all parts of what you have done.