CSE 260 Assignment 2

Due Friday, 1/25/08 at 5PM

Worth 20 points

Changelog

Date Description
20-Jan-08 The parameter order for the -lin option was incorrectly stated. The correct order is START END STEP
20-Jan-08 Output of process IDs is not garbled, as was previously stated.
18-Jan-08 Quantified notion of measurement consistency
15-Jan-08 Original posting

This assignment will familiarize you with DataStar and the basics of MPI. You'll conduct experiments with provided MPI code and also make some modifications. All the code you need lives in the following publicly readable directory, which will be referred to henceforth as $(A2)

/rmount/users02/ucsd/baden/cse260_wi08/A2

Before you begin, print out or bring up in your browser the Using DataStar document. You will find it handy.

Simplest MPI

You will run this code, but not make any changes. Copy the directory $(A2)/Basic to your home directory. Compile and run the two programs contained inside, using the supplied Makefile which defines various compiler and loader flags.

Your first program is a parallel "hello world" program. It prints "Hello World'' from each processor along with the rank (process ID). It also reports the total number of processors participating in the run.   Run the program interactively on on between 3 and 8 processors. Here is some sample output:

# processes: 3
Hello world from process 2
Hello world from process 0
Hello world from process 1

Do the process ID's come in any particular order? Why or why not?

The second program, called message, sends 2 integers from each process to process 0, which prints out the received values. A common convention is to call process 0 the root. Run the program several times on 4 processes, noting any variation in message order.

Run interactively first, then submit a batch job using the batch submission script run_batch.sh (this script runs both programs). You can retrieve your output after you've received mail that the job has finished. Be sure to modify the file as instructed, to ensure the message is sent to you.  

Ring

We discussed the α-β model of communication performance in class as well as the rendezvous model of communication. The Ring program will enable you to explore these models, and also note any interesting performance phenomena involving communication around a ring of processors. In the course of conducting your experiments, you’ll also make some changes to the code.

The program we've supplied you, ring, is found in $(A2)/Ring. This program treats the processors as if connected in a ring. Process 0 sends a message to successor process 1, and so on until process P-1, which then sends the message back to process  0, which is P-1's successor in modulo P arithmetic. The cycle then repeats. Messages of various sizes are passed around the ring, according to a progress schedule specified by the user.

Ring measures the total time taken to pass messages around the ring, and reports the statistics based on these times as a function of increasing message length.  (See below for how Ring collects timing information.)

Ring accepts up to 3 command line arguments:

  • -s NNN, where NNN is the maximum size message, in kilobytes, to be passed around the ring.
  • -t TTT, where the message will be passed TTT times around the ring.
  • -lin START END STEP specifies that message sizes should start at message size START, and progress linearly in steps of STEP to the final message size of END.
  • By default, NNN=1024,TTT=5 and message sizes form a logarithmic progression. Under the default settings Ring passes messages of 2, 4, 8, ..., 1K, ..., 256K five times around the processor ring.  The -lin setting is useful for exploring small ranges of messages sizes that may exhibit interesting behavior.

    Experiments

    1. Run ring on 4 processors on a single node, and experiment with message sizes into the megabyte range. Take measurements using the load leveler script run_1x4.sh. Note any differences in message passing times on differing numbers of processors, and attempt to explain them. You should collect timings in batch mode only. Your timings will reflect switch contention from other jobs. In a true dedicated environment, you would be the only user on the machine and contention would be minimal (mostly from OS services); running times would be shorter and more reproducible. Since it is unrealistic to exclude other users from the machine. Run several times (at least 3 times), taking the minimum running time. This technique will help filter out spurious timings. If you don't see consistent results (that is, a consensus of at least 2 runs agree to within 5% to 10%) conduct more runs until you do. Note any variations in the timings and include in your report.

      To get accurate timings, you may need to increase the repetitions parameter for short messages below about 1KB in length. Use the message size parameter to restrict the maximum message length to 1KB, and successively double the repetitions parameter until the timings stabilize. For longer lengths (1KB to 16KB) you may be able to use a smaller repetition factor.

    2. Next, run your job on 4 nodes using 1 process per node. Use the script run_4x1.sh Do you note a difference in the startup cost and bandwidth? Keep the runs short, sticking with the specified 5 minute job time limit; we are using only 1 of 8 CPUs on each node.

    3. Using the results from experiment (2), determine an approximate value for n1/2. Using this estimate of the half power point, determine it more precisely with the help of the linear progression feature -lin. Explore message sizes from about one-half to double your estimate of the half power point, in increments of 128 bytes.

    4. Explore message passing performance in the region of the eager limit, which is set at 64K in the load leveler scripts. Again, use the linear progression feature of ring to zoom in on fine details.

    Now, provide the following information.

    1. Plot the bandwidth as a function of the number of bytes in the message. On a single plot, display the 4-processor runs that were made on (a) a single node and (b) on 4 nodes, where the message sizes varied as powers of two into the megabyte range. Note any interesting behavior. When generating plots, merge the timings from several runs, selecting, for each given message size, the minimum value obtained across all the runs.

      You should observe that this curve levels off after a certain point is reached, which is called the peak bandwidth . Why does this occur? Use logarithmic scales as necessary to improve readability.

    2. Report the message startup time, which is the overhead of initiating a message. A good approximation to this time is simply the message passing time for the shortest messages. Pick the smallest time you observed as the startup time.

    3. Plot the message passing time in the region of the eager limit, noting and commenting on the observed behavior.

    4. Determine the half power point n1/2, where the achieved bandwidth is one half the peak.

    You may have noticed that some of the output in the Ring program is garbled. In particular, one or more process names appear later in the output than is consistent with their appearance in the code. What do you think is causing this behavior?

    To make your plots, use your favorite plotting package, e.g. gnuplot, or plot in matlab. (For some documentation on GnuPlot, go to http://www.duke.edu/~hpgavin/gnuplot.html.)

    Additional experiments

    You'll conduct two additional experiments that involve modifying the code. Message passing time depends on a number of factors, including the amount of copying needed to transmit and receive the data. Modify the Ring program to assess the effect of additional memory copying. In particular, copy each incoming message into a buffer before sending the data to the next process. You might also try a second copy and note any effects observed, but this is optional. A place-holder has been provided to enable you to specify message copying on the command line. See the file cmdLine.C, defines a variable to hold the copy count. You'll need to modify the code in ring.C.

    Report the bandwidth as a function of message passing size into the megabyte range, as in experiment (1) earlier. (It is not necessary to explore finely detailed behavior, though you are invited to do so.)

    DataStar is organized as a collection of 8-way SMP nodes which communicate over a switch. Communication performance is higher when it involves processors on the same node than on different nodes. In fact, MPI ranks are packed contiguously onto a node on DataStar. (You can see this by observing the mapping of hostnames to MPI ranks, which is reported in the output.)

    Modify the ring program so that the links in the logical processor ring cross off-processor links only. To observe this effect, run on 2 nodes with 8 CPUs per node. (Use with the script run_2x8.sh). Thus, rank 0 will send to process 8, which will send to process 1, which will send to process 9, and so on, until process 15 completes the cycle by sending to process 0. You need only change the computation of the quantity nextnode, which is found on one line of ring.C.


    How the ring program takes timings

    Prior to collecting timings, ring "warms up" the machine by passing messages around the ring once two times before it turns on the timer. This helps to eliminate transient program behavior from your timing runs. (To improve repeatability, you may need to increase the WARM_UP parameter specified in ring.h.) The ring program divides the timings by the number of repetitions around the ring, and reports three quantities: message length, bandwidth( x106 bytes/sec), and message time in microseconds (ms).

    Ring takes timing measurements using the MPI wall-clock timer function MPI_Wtime(). To obtain an absolute time, take the difference in times sampled by two successive calls:

            double t0   = MPI_Wtime(); 
            // some code....
            double t1   = MPI_Wtime();
            double time = (t1-t0);
    

    You can query the resolution of MPI_Wtime using MPI_Wtick().

     



     


    Things you should turn in

    Document your work in a well-written report which discusses your findings carefully. Discuss trends you observed in the experimental data, any anomlous behavior or other interesting behavior. Examine the possible causes as best you can. [2/3/08] Your report should present a clear evaluation of the design of your code, including bottlenecks of the implementation, and describe any special coding or tuned parameters. Provide code listings for files you wrote, or that you modified based on versions we gave to you (there is no need to provide listings of code we provided, that you did not modify). Provide code fragments showing the modifcations you made to the code for the Additional Experiments. [2/3/08] Be sure to include the plots described above, along with any plotted data in tabular form. Large tables of data (more than about 1/2 page) or your raw data should appear as files in the electronic turnin (see below). [2/3/08]

    Your report should cite any written works or software as appropriate in the text of the report. Citations should include any software you used that was written by someone else, or that was written by you for purposes other than this class.

    Turn in hard copy: you can put it in my mailbox on the 2nd floor if I'm not in.  Also transmit the report electronically along with your code; instructions are discussed below. Your code will be run on DataStar, so be sure to report results from that machine.

    Checklist

    Provide the following
    1. A plot displaying the message bandwidth as a function of the number of bytes in the message, comparing 1x4 and 4x1 geometries.
    2. A separate plot focusing in on the region of the eager limit (comparing both geometries), but in units of time.
    3. The message startup time, peak bandwidth, and half power point n1/2 (comparing both geometries)
    4. A plot showing the message copying cost. On the same plot, compare with the cost without copying (1).
    5. Bandwidth for interleaved MPI rank assignment, comparing with default assignment.

    Electronic turnin instructions

    Transmit an electronic copy of your report as a compressed archive file (e.g. .zip, .tgz, .tar.gz). Organize the source code and plots into subdirectories, and include a README file explaining the contents. If you like, include  a .html file called index.html. From there you can then link to the README file and to the various components of your report, including the source code and any other files you want to include.

    The name of your tar file should begin with your login ID, followed by the underscore character '_' followed by "hw2,"  i.e. mine would be baden_hw2.tar.gz. Your archive file should create a directory with the name of your login ID. Be sure and delete all .object (.o)  files and executables before creating the archive file. Send the archive file to my CS email address baden (at)ucsd (dot) edu with the string cse260:hw2 in the subject: line.


    Documentation

    Man pages for the MPI calls may be accessed with the man(1) command. They are also available on-line. More extensive documentation can be found at at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html. The calls used in the Ring program are as follows.

    MPI _Init Initialize the MPI environment. Call this prior to making any MPI calls.
    MPI_Finalize Exit the MPI environment. Call this function before exiting your program.
    MPI_Comm_size Queries the total number of processes use in this program invocation.
    MPI_Comm_rank Queries this process's unique identifier, an integer called the rank
    MPI_Send Sends a message to another process, a return signifies that the message buffer may be reused.
    MPI_Recv Receives a message from another process, a return signifies that the message has been received.
    MPI_Irecv Immediate (non-blocking asynchronous).
    MPI_Wait Wait on a pending IRecv.
    MPI_Wtime Wall-clock timer
    MPI_Barrier Blocks calling process until all processes have checked in (all processes must call this routine)


    Copyright © 2008  Scott B. Baden. Last modified: 02/03/08 11:27 AM