
4.2 Execution Time Analysis

Figure 4b shows that nearly all of the benefit of additional processors comes from reduced execution times at the highest levels of the adaptive grid hierarchy. The lower levels have too little work to parallelize efficiently. We cannot simply remove the lower levels, however, because they play a vital role in the numerical convergence of our eigenvalue algorithm. We expect better scaling on more complicated problems, which place additional computational work at the highest levels.

Table 2 provides a detailed accounting of the parallel execution time on the Intel Paragon; results for the IBM SP2 are similar. We divide the execution time for the eigenvalue algorithm, including the time spent building the adaptive grid hierarchy, into seven categories: numerical computation, time lost to load imbalance, communication among grids at the same level of the hierarchy (intralevel), communication between levels (interlevel), error estimation, load balancing, and grid generation. Error estimation, load balancing, and grid generation together consume only a few percent of the total execution time. The vast majority of the time is spent in numerical computation, load imbalance, and communication (intralevel and interlevel). As computation times drop with additional processors, communication overheads become a dominant factor in overall performance. On 32 Paragon nodes, communication accounts for about half of the total execution time.
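As a rough illustration of how such an accounting might be tabulated, the following Python sketch converts per-category times into percentages of the total and sums the two communication categories. The function names and the timing values are hypothetical placeholders chosen only to make the arithmetic easy to follow; they are not the measurements reported in Table 2.

```python
# Minimal sketch of an execution time breakdown.
# All names and numbers here are illustrative assumptions, not measured data.

COMMUNICATION = ("intralevel communication", "interlevel communication")


def breakdown(times):
    """Return each category's share of the total time, in percent."""
    total = sum(times.values())
    if total <= 0:
        raise ValueError("total execution time must be positive")
    return {name: 100.0 * t / total for name, t in times.items()}


def communication_share(times):
    """Return the combined intralevel and interlevel share, in percent."""
    percentages = breakdown(times)
    return sum(percentages[name] for name in COMMUNICATION)


# Placeholder times in seconds (NOT taken from Table 2).
example_times = {
    "computation": 40.0,
    "load imbalance": 10.0,
    "intralevel communication": 30.0,
    "interlevel communication": 15.0,
    "error estimation": 2.0,
    "load balancing": 1.0,
    "grid generation": 2.0,
}

if __name__ == "__main__":
    for name, pct in breakdown(example_times).items():
        print(f"{name:26s} {pct:5.1f}%")
    print(f"communication total        {communication_share(example_times):5.1f}%")
```

Because each percentage is rounded when printed, the displayed column may not sum to exactly 100%.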

  
Table 2: Execution time breakdown for the eigenvalue calculation on the Intel Paragon. Times are in seconds. Percentages may not add up to 100% due to rounding. The execution time is dominated by computation, load imbalance, and communication (both intralevel and interlevel). The relative cost of communication increases with additional processors; communication overheads account for about half of the total execution time on 32 nodes.

It is difficult to assess adaptive mesh library overheads on parallel computers since we do not yet have detailed hardware performance analyzers such as those on the Cray C-90. Nor can we compare against a hand-coded baseline: developing a message-passing version of our code by hand (i.e., without the software support offered by our API) would be impractical because of the implementation complexity. We can assume that there is little library overhead in computation, since all numerical work is done in Fortran. The remaining source of overhead is interprocessor communication. Experiments indicate that perhaps half of the interprocessor communication time is due to overheads in the LPARX communication routines [15]; the remainder is spent in the operating system message routines. We are currently redesigning the LPARX communication libraries, which we believe will eliminate most of this additional overhead [12].
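The two rough figures quoted in this section can be combined into a back-of-the-envelope estimate of library overhead as a share of total execution time. The sketch below is only that: the function name is hypothetical, and the inputs are the approximate "about half" values from the text, not precise measurements.

```python
# Back-of-the-envelope estimate; the inputs are approximate figures from the text.

def library_overhead_fraction(comm_fraction, library_share_of_comm):
    """Estimate the fraction of total time attributable to library communication overhead.

    comm_fraction: fraction of total execution time spent in communication.
    library_share_of_comm: fraction of communication time due to library overhead.
    """
    for value in (comm_fraction, library_share_of_comm):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must lie between 0 and 1")
    return comm_fraction * library_share_of_comm


if __name__ == "__main__":
    # About half of the time is communication on 32 Paragon nodes,
    # and perhaps half of that is attributed to the LPARX routines.
    estimate = library_overhead_fraction(0.5, 0.5)
    print(f"estimated library overhead: {estimate:.0%} of total time")
```

Under these assumptions, the LPARX communication overhead would amount to roughly a quarter of the total execution time on 32 Paragon nodes.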




Scott R. Kohn and Scott B. Baden