| Date | Description |
| 20-Jan-08 | The parameter order for the -lin option was incorrectly stated. The correct order is START END STEP |
| 20-Jan-08 | Output of process IDs is not garbled, as was previously stated. |
| 18-Jan-08 | Quantified notion of measurement consistency |
| 15-Jan-08 | Original posting |
This assignment will familiarize you with DataStar and the basics of MPI. You'll conduct experiments with provided MPI code and also make some modifications. All the code you need lives in the following publicly readable directory, which will be referred to henceforth as $(A2)
Before you begin, print out or bring up in your browser the Using DataStar document. You will find it handy.
You will run this code, but not make any changes. Copy the directory $(A2)/Basic to your home directory. Compile and run the two programs contained inside, using the supplied Makefile which defines various compiler and loader flags.
Your first program is a parallel "hello world" program. It prints "Hello World'' from each processor along with the rank (process ID). It also reports the total number of processors participating in the run. Run the program interactively on on between 3 and 8 processors. Here is some sample output:
# processes: 3
Hello world from process 2 Hello world from process 0 Hello world from process 1
Do the process ID's come in any particular order? Why or why not?
The second program, called message, sends 2 integers from each process to process 0, which prints out the received values. A common convention is to call process 0 the root. Run the program several times on 4 processes, noting any variation in message order.
Run interactively first, then submit a batch job using the batch submission script run_batch.sh (this script runs both programs). You can retrieve your output after you've received mail that the job has finished. Be sure to modify the file as instructed, to ensure the message is sent to you.
We discussed the α-β model of communication performance in class as well as the rendezvous model of communication. The Ring program will enable you to explore these models, and also note any interesting performance phenomena involving communication around a ring of processors. In the course of conducting your experiments, you’ll also make some changes to the code.
The program we've supplied you, ring, is found in $(A2)/Ring. This program treats the processors as if connected in a ring. Process 0 sends a message to successor process 1, and so on until process P-1, which then sends the message back to process 0, which is P-1's successor in modulo P arithmetic. The cycle then repeats. Messages of various sizes are passed around the ring, according to a progress schedule specified by the user.
Ring measures the total time taken to
pass messages around the ring, and reports the statistics based on these times
as a function of increasing message length. (See below for how
Ring collects
timing information.)
Ring accepts up to 3 command line arguments:
By default, NNN=1024,TTT=5 and message sizes form a logarithmic progression. Under the default settings Ring passes messages of 2, 4, 8, ..., 1K, ..., 256K five times around the processor ring. The -lin setting is useful for exploring small ranges of messages sizes that may exhibit interesting behavior.
Run ring on 4 processors on a single node, and experiment with message
sizes into
the megabyte range. Take measurements using the load leveler script
run_1x4.sh.
Note any differences in message passing times on differing
numbers of processors, and attempt to explain them.
You should collect timings in batch mode only.
Your timings will reflect
switch contention from other jobs. In a true dedicated environment,
you would be
the only user on the machine and contention would be
minimal (mostly from OS services); running times would be shorter and
more reproducible. Since it is
unrealistic to exclude other users from the machine.
Run several times (at least 3 times), taking the minimum
running time. This technique will help
filter out spurious timings. If you don't see consistent results
(that is, a consensus of at least 2 runs agree
to within 5% to 10%) conduct more runs until you
do.
Note any variations in the timings and include in your report.
To get accurate timings, you may need to increase the repetitions parameter for short messages below about 1KB in length. Use the message size parameter to restrict the maximum message length to 1KB, and successively double the repetitions parameter until the timings stabilize. For longer lengths (1KB to 16KB) you may be able to use a smaller repetition factor.
Next, run your job on 4 nodes using 1 process per node. Use the script run_4x1.sh Do you note a difference in the startup cost and bandwidth? Keep the runs short, sticking with the specified 5 minute job time limit; we are using only 1 of 8 CPUs on each node.
Using the results from experiment (2), determine an approximate value for n1/2. Using this estimate of the half power point, determine it more precisely with the help of the linear progression feature -lin. Explore message sizes from about one-half to double your estimate of the half power point, in increments of 128 bytes.
Now, provide the following information.
Plot the bandwidth as a function of the number of bytes in the message. On a single plot, display the 4-processor runs that were made on (a) a single node and (b) on 4 nodes, where the message sizes varied as powers of two into the megabyte range. Note any interesting behavior. When generating plots, merge the timings from several runs, selecting, for each given message size, the minimum value obtained across all the runs.
You should observe that this curve levels off after a certain point is reached, which is called the peak bandwidth . Why does this occur? Use logarithmic scales as necessary to improve readability.
Report the message startup time, which is the overhead of initiating a message. A good approximation to this time is simply the message passing time for the shortest messages. Pick the smallest time you observed as the startup time.
Plot the message passing time in the region of the eager limit, noting and commenting on the observed behavior.
Determine the half power point n1/2, where the achieved bandwidth is one half the peak.
You may have noticed that some of the output in the Ring program is garbled.
In particular, one or more process names appear later in the output than is
consistent with their appearance in the code. What do you think is causing this
behavior?
To make your plots, use your favorite plotting package, e.g. gnuplot, or plot in matlab. (For some documentation on GnuPlot, go to http://www.duke.edu/~hpgavin/gnuplot.html.)
You'll conduct two additional experiments that involve modifying the code. Message passing time depends on a number of factors, including the amount of copying needed to transmit and receive the data. Modify the Ring program to assess the effect of additional memory copying. In particular, copy each incoming message into a buffer before sending the data to the next process. You might also try a second copy and note any effects observed, but this is optional. A place-holder has been provided to enable you to specify message copying on the command line. See the file cmdLine.C, defines a variable to hold the copy count. You'll need to modify the code in ring.C.
Report the bandwidth as a function of message passing size into the megabyte range, as in experiment (1) earlier. (It is not necessary to explore finely detailed behavior, though you are invited to do so.)
DataStar is organized as a collection of 8-way SMP nodes which communicate over a switch. Communication performance is higher when it involves processors on the same node than on different nodes. In fact, MPI ranks are packed contiguously onto a node on DataStar. (You can see this by observing the mapping of hostnames to MPI ranks, which is reported in the output.)
Modify the ring program so that the links in the logical processor ring cross off-processor links only. To observe this effect, run on 2 nodes with 8 CPUs per node. (Use with the script run_2x8.sh). Thus, rank 0 will send to process 8, which will send to process 1, which will send to process 9, and so on, until process 15 completes the cycle by sending to process 0. You need only change the computation of the quantity nextnode, which is found on one line of ring.C.
Prior to collecting timings, ring "warms up" the machine by
passing messages around the ring once two times before it turns on the timer. This helps
to eliminate transient program behavior from your timing runs.
(To improve repeatability, you may need to increase the WARM_UP parameter specified in ring.h.)
The ring program
divides the timings by the number of repetitions around the ring, and reports
three quantities: message length, bandwidth( x106 bytes/sec), and
message time in microseconds (ms).
Ring takes timing measurements using the MPI wall-clock timer function MPI_Wtime(). To obtain an absolute time, take the difference in times sampled by two successive calls:
double t0 = MPI_Wtime();
// some code....
double t1 = MPI_Wtime();
double time = (t1-t0);
You can query the resolution of MPI_Wtime using MPI_Wtick().
Document your work in a well-written report which discusses your findings
carefully.
Discuss trends you observed
in the experimental data, any anomlous behavior
or other interesting behavior. Examine the possible causes
as best you can. [2/3/08]
Your report should present a clear evaluation of the design of
your code, including bottlenecks of the implementation, and describe any
special coding or tuned parameters. Provide code listings for files you
wrote, or that you modified based on versions we gave to you (there is no
need to provide listings of code we provided, that you did not modify).
Provide code fragments showing the modifcations you made to the code for the Additional Experiments.
[2/3/08]
Be
sure to include the plots described above, along with any plotted data in
tabular form.
Large tables of data (more than about 1/2 page)
or your raw data should
appear as files in the electronic turnin (see below). [2/3/08]
Your report should cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.
Turn in hard copy: you can put it in my mailbox on the 2nd floor if I'm not in. Also transmit the report electronically along with your code; instructions are discussed below. Your code will be run on DataStar, so be sure to report results from that machine.
Transmit an electronic copy of your report as a compressed archive file (e.g. .zip, .tgz, .tar.gz). Organize the source code and plots into subdirectories, and include a README file explaining the contents. If you like, include a .html file called index.html. From there you can then link to the README file and to the various components of your report, including the source code and any other files you want to include.
The name of your tar file should begin with your login ID, followed by the underscore character '_' followed by "hw2," i.e. mine would be baden_hw2.tar.gz. Your archive file should create a directory with the name of your login ID. Be sure and delete all .object (.o) files and executables before creating the archive file. Send the archive file to my CS email address baden (at)ucsd (dot) edu with the string cse260:hw2 in the subject: line.
Man pages for the MPI calls may be accessed with the man(1) command. They are also available on-line. More extensive documentation can be found at at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html. The calls used in the Ring program are as follows.
| MPI _Init | Initialize the MPI environment. Call this prior to making any MPI calls. |
| MPI_Finalize | Exit the MPI environment. Call this function before exiting your program. |
| MPI_Comm_size | Queries the total number of processes use in this program invocation. |
| MPI_Comm_rank | Queries this process's unique identifier, an integer called the rank |
| MPI_Send | Sends a message to another process, a return signifies that the message buffer may be reused. |
| MPI_Recv | Receives a message from another process, a return signifies that the message has been received. |
| MPI_Irecv | Immediate (non-blocking asynchronous). |
| MPI_Wait | Wait on a pending IRecv. |
| MPI_Wtime | Wall-clock timer |
| MPI_Barrier | Blocks calling process until all processes have checked in (all processes must call this routine) |
Copyright © 2008 Scott B. Baden. Last modified: 02/03/08 11:27 AM