CSE.240b Mini-project
    
Current time to beat: 13.9s; speedup: 6.76x (baseline serial version: 94.0s)
    
Due: 11:59pm, Saturday, February 10th, 2007

The goal of the mini-project is to get some idea of the challenges inherent in paralallizing serial programs.  A few years ago, this would have been only peripherally related to architecture, but now, with the rise of CMPs, understanding why parallelizing code is hard is key to understanding the trade-offs in building a high performance computer system.

For the project you will be parallelizing equake, one of the Spec2000 benchmarks.  Equake is a floating-point intensive finite-element earthquake simulator.  The goal is to make it as fast as possible.  There will be a prize for the fastest implementation (measured on the large input set.  The current best time is listed above.).

The mini-project builds on the warmup, so keep the things you learned there in mind.  We will be using DataStar again.

Updates:

2/2/2006:  There is  a new version of helper.h in my equake directory (see below).  It contains a more interesting example of inline assemmbly.  Also note that the store contiditional instruction in PowerPC is 'stwcx.'  The "." is crucial, but that is not clear from the manual.


Provided Code

For this project, some helper code has been provided for the assignment in:

/rmount/users02/ucsd/swanson/equake

You can use cp to copy it to your local directory. This code is provided to help you with parts of the assignment.  Much of the code is the same as in the warm-up, but you should take a look to through to familiarize yourself with the changes.

The major differences are:
  1. The Makefile now builds equake.gcc.  The source code for equake is in quake.c.
  2. barrier.{h,c} contains a barrier implementation you may use (note that this is a synchronization primitive, not a "memory fence" instruction.)
  3. The Makefile also builds equake_diff.gcc.  This is a tool for comparing the output of your version of equake to the correct output.  See notes below for why this is necessary.
  4. There are 3 sets of input data for equake: small.input, medium.input, and large.input.  For each input, there is a file with the correct output:  small.output-correct, etc.

The Assignment

Collaboration Policy: For this assignment, you may not share code with other students. However, you may discuss the problems you are having abstractly, or inquire/suggest as to what function, macro or instruction you should use to overcome the problem. You may also supply weblinks for others to use to address an issue.

Deliverables: In the following parts of the assignment, certain parts are marked as "Deliverables".  These are the parts to be printed out and submitted. Please keep these concise and do not include excess
material.  Excessively unclear writeups will not be given full credit.


IFiguring out where to spend your time.  A key part of paralellizing an application is figuring out where it spends its time.  Remember Amdahl's Law.  Profiling is an excellent way to figure out where to focus your efforts, and gprof is a pretty good profiling tool.  To use gprof, you must build a gprof-enabled binary.  The Makefile contains a rule to do so.  The executable it generates is called equake.gcc.prof.  A man page for gprof is available on DataStar.

To collect profiling data run equake on the medium or large data set by doing

       equake.gcc.prof medium.input > medium.output

Verify that the output is correct by doing

       equake_diff medium.output medium.output-correct

It will say "Files are essentially the same" if the output is correct.

The program will run for a while and when it's finished, a file call gmon.out will appear in the directory.  View the profiling output by doing

       gprof equake.gcc.prof

Look at the man page to interpret the output.

Deliverables:  Answer the following questions:  Which functions take up most of the execution time?  If you focus your efforts on just those functions, what's the largest speedup you can hope to achieve?  What if you just focused on the single most time-consuming function?


IIParallelizing the code.  Now that you know where to focus your attention, you can start parallelizing the code.  You can use any of the facilities that pthreads provides.   In addition there is a barrier implementation in barrier.h and barrier.c (Documentation is barrier.h).  barrierExample.c contains an example of using the barrier.  You should also look for opportunities to use the load-reserved and store-conditional instructions that the PowerPC architecture provides for doing atomic update operations.  Make sure to include the appropriate 'sync' instructions when writing code for atomic update, if needed.  You will need to use inline assembly to use these speciallized instructions.  helper.h contains some very simple instances of inline assembly in the SYNC #define.  Documentation about using gcc's inline assembly to do more interesting things is here.

In parallelizing your code make sure you do a few things:
  1. Try pthread's facilities first.  They are a good way to get your feet wet.
  2. Attempt to use the load-reserved/store-conditional instructions.  If they don't help performance, describe why you think this is.
  3. Make sure your code produces reasonable results.  Use equake_diff.gcc early and often.  It is acceptable if your output does not exactly match the output of the serial version of equake.gcc.  Reordering floating point operations can change the result of some operations.  For the purposes of this assignment, if equake_diff.gcc says it's good enough, it is. (in the real world, the question of when floating point rounding errors are acceptable is much more complicated.)
  4. Provide a command line option for specifying the number of threads to use.  Does increasing the number of threads always increase performance?
Here are some things to consider in parallelizing your code
  1. Loops are excellent places to look for parallelism.
  2. Pay close attention to the presence and absence of dependencies between loop iterations.
Here are some things to consider if you want to get the best performance possible
  1. Data layout can have a strong effect on performance.
  2. Pthreads operations are expensive.
  3. The barrier implementation is not optimized (see #2).
Measuring performance:

To roughly measure the performance of your implementation you can run your program interactively, but for "official" measurements you will need to use the batch system.  To run equake.gcc on the batch use the run-equake script.  It takes 4 arguments:

       run-equake <executable> <input file> <thread count> <target queue>

and generates the job submission script you need to submit with llsubmit.  For instance

       run-equake equake.gcc large.input 1 express | llsubmit -

will run equake.gcc on large.input with 1 thread and put it into the express queue (turn around times seems to be on the order of 10s of minutes as opposed to hours).  Values of <target queue> are "express" and "normal".  You can only submit express jobs from dspoe.sdsc.edu.  Normal jobs can be submitted only from dslogin.sdsc.edu.  The script will pass the thread count as the second argument to equake.gcc.  It's up to you to make sure your version of equake implements it correctly.

Once the job is finished (you should receive an email about it), you'll have a file with a name like LL_out.221069 that contains the output of the program.  Each line starts with '   0:'.  You'll need to strip them off before you can run equake_diff.  This bit of perl will do that for you:

          perl -ne 's/\s+0://g; print' < LL_out.221069 > medium.out

A second file, LL_err.221069, will contain the rest of the program output, including the timing information.

Deliverables:  A copy of your parallelized code (please, submit only the parts of quake.c that are necessary to understand what you did.).  Include a description of what you did, why, and why it's correct.  If you refer to your code in your description, make it easy to find the relevant parts in your code listing.  Also include a brief description of things you tried and that did not work (Parallelizing code can be tough.  Our goal is for you to understand the issues involved.  Dead ends are a great, if frustating, way to learn.).  Also include the running time of your best implementation on large.input.  Email Steven a copy of the executable for your final version.

As you go, measure the performance of your implementation and send me the results.  I'll keep the web page updated with the current best time.



Since this assignment is newly created, please do not hesitate to post for clarifications on the assignment, subject to the restriction that you should not give away any answers. Students are free to respond to these clarification requests.


Resources

The SDSC DataStar Userguide contains a wealth of information about using and programming DataStar.

The GNU Make manual may be useful to understand the Makefile used in the provided code.

Power 4 Manuals: Book 1, Book 2, Book 3.