CSE 260 Homework Assignment #3: Connected Component Labeling

Due: Thursday 2/7/08

Worth 35 points

Changelog

Date Description
6-Mar-08 Added afternotes, reflections on the assigment
3-Feb-08 Added a reference to the "Tips for writeups"
28-Jan-08 Added discussion pertaining to teams of 3.
28-Jan-08 Removed discussion of SPRNG random number generator.
28-Jan-08 Original posting



In this laboratory you'll parallelize and experiment with an existing connected component labeling algorithm provided to you. You should also attempt to speed up single processor performance. Refer to the Leighton and Schroeder handouts for reference.

Connected component labeling

A serial implementation of the connected component labeling algorithm, ccl, is available on DataStar $(PUB)/A3. Be sure to look over the file README.txt for additional information about how to use the code.

Although ccl is a serial program, it contains some hooks to help get you started with the MPI implementation: broadcasting command line arguments and taking times with MPI_Wtime(). The hooks may be enabled by uncommenting the definition of MPI_ON within the Makefile. The code is configured to use the AIX built in random number generator, drand48. This is not ideal, as we want independent streams across processors, but it will suffice for the purpose of this assignment. If you are interested, look into the SPRNG random number generator, or the ESSL routine (on AIX systems like DataStar): pdurng()

The program provides various command line options which are parsed in the source file cmdLine.C. In particular, you may seed the random number generator with an arbitrary value via the -s flag; else you get a seed based on the time of day. The program outputs the seed so that you will be able to reproduce a given run:

Random # generator seed: 1201251902

The code considers points to be adjacent only if they are nearest neighbors on the Manhattan coordinate directions-- left, right, up, and down. This is different from the stencil used in the Schroeder excerpt, which includes corners. If you like, experiment with the 9-point stencil, but be sure to present results for the Manhattan stencil.

The assignment

Determining the criticality point

First determine pc, the critical value of the independent probability p, that maximizes the running time.   It may be easier to pinpoint pc by looking at tabular data; sample p more closely in the neighborhood of criticality so you can determine  pc to 3 decimal places. Include tabular data along with your report, and also plot the running time as a function of p.

Since the initial conditions are randomized you'll need to take steps to ensure that your timings are statistically significant. For each value of p, make several (5 to 10) runs each with a different random number seed. Plot the mean value with a curve, but also plot the maximum and minimum value for each value of p you measured.   You should also repeat a few of your runs using the same random number seed, to see if your timings are consistent. To do this, use the -s flag to set the random number seed to the value reported by the reference run you wish to reproduce.

To determine an appropriate value of N, experiment by starting with N=100 and doubling N until the run completes in about 5 to 10 seconds at criticality.

Performance Enhancements

Parallelize the code. Label components in two phases. In the first phase, each processor labels its assigned part of the domain. (You may use ghost cells to contain labels from cells in neighboring processes.)  In the second phase, processes re-label their clusters, using newly updated labels obtained from neighboring processes.  This second phase may take several iterations before all labels are finalized.   Try and make the code go as fast as you can. If you are working in a team of two or more, you must implement 2D partitioning, else 1D.

Experiments

Report weak scaling results, that is, let N2 grow linearly with P (This is linear growth and is scalable). Run on 4, 8, 16, and 32 processors. If you implemented 2D partitioning, than compare results across possible geometries for the numbers of processors. Explain your results.

Teams of 3

If you are working in a team of 3, I expect a more ambitious assignment. In particular, you must implement the 3D variant of the algorithm, determining the criticality point, and the optimal processor geometry for differing values of P. See me about the details if you are not sure.

Things you should turn in

You should document your work in a well-written report of 5 pages, not including code listings or plots. Turn in an electronic copy of your report and code as in the previous assignment.

Read over these Tips for your writeup. [2/3/08]

If you are working in a team, include a self-evaluation discussing the division of labor and other aspects of how you worked together as a team team. A copy of the form is available here.

Include two appendices. The first should contain performance data for the clustering algorithm on one processor, as well as your scaling experiments. Any plotted data should also be included in tabular form. In the second appendix, submit a listing of your software (only in electronic form, please).

Your writeup should discuss decisions you made in parallelizing the algorithm. Present a clear evaluation of performance, including bottlenecks of the implementation. Describe any special coding or tuned parameters. What factors limit performance?

Verify correctness, in particular, you must show that results are independent of the number of processors. To do this, you'll need to make several runs with different random number seeds for a fixed value of N, showing that the average number of clusters (and the average cluster size) is independent of P. Do not attempt to look at output, as labels may vary on different numbers of processors. (The provided code will output results for small clusters, but this is a convenience)

Afternotes

Reflections on the assignment.
  1. When interpreting weak scaling results, factor out asymptotic behavior of the running time, that depends on N. Thus, normalize with respect to the # iterations, and N2. Otherwise, the effect obscures the weak scaling results.
  2. Strong scaling can help explain sources of overheads. So can a study of average and total cluster size, as a function of the independent probability.
  3. Criticality:
  4. Look at number of iterations as well as running time
  5. Is the behavior of the curve different for the 9-point stencil?
  6. Practical matters
  7. To reduce the number of job submissions, modify the code to make repeated runs for the same parameters settings, and to also sweep over a range of parameter settings.
  8. You can verify results without the need for statistical analysis using the following strategy. Generate the input from a root processor, so that there is just 1 random number seed. Take care to avoid non scalable memory consumption. Using this strategy, the number of clusters and the average cluster size is independent of the number of processors, for a given value of N and random number seed.


Copyright © 2008 Scott B. Baden. [Sun Mar 9 21:20:38 PDT 2008]