Next: 4.1 Performance Comparison Up: A Parallel Software Infrastructure Previous: 3.3 Coarse-Grain Computation

4 Performance Analysis

Portability and performance are two vital considerations in the design and implementation of any numerical library. Parallel computers obsolesce at an alarming rate; portability ensures that numerical software will run on the most powerful and up-to-date computational resources available. Performance matters just as much: computational scientists will not use software libraries unless they deliver reasonable performance. In this section, we analyze the performance and overheads of our adaptive mesh library. We begin by comparing the performance of an Intel Paragon and an IBM SP2 against one processor of a Cray C-90; the succeeding section then presents a detailed breakdown of parallel execution times.

It is an open research question whether non-uniform refinement structures can be efficiently supported in a data parallel language. One implementation strategy for structured adaptive mesh methods in a data parallel language such as High Performance Fortran [14] would restrict all refinement patches to be the same size [3]. We therefore conclude this section with an analysis of the performance implications of requiring uniformly sized refinement regions.
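To make the cost of the uniform-size restriction concrete, the following Python sketch counts how many fixed-size cubic patches are needed to cover a rectangular refinement region and how many more cells that covering contains than the region itself. It is an illustration only, not the analysis or implementation used in this paper: the function name `uniform_patch_cover`, the assumption of cubic patches, and the region and patch sizes in the example are all hypothetical.

```python
import math

def uniform_patch_cover(region, patch):
    """Tile a rectangular region with cubic patches of a fixed edge length.

    region: (nx, ny, nz) extent of the area that needs refinement, in cells.
    patch:  edge length of every patch, in cells (assumed cubic here).
    Returns (patch_count, covered_cells, overhead), where overhead is the
    ratio of cells covered by the patches to cells actually needed.
    """
    # Each dimension needs enough whole patches to span the region.
    counts = [math.ceil(n / patch) for n in region]
    patch_count = math.prod(counts)
    covered_cells = patch_count * patch ** 3
    needed_cells = math.prod(region)
    return patch_count, covered_cells, covered_cells / needed_cells

if __name__ == "__main__":
    # Hypothetical 40 x 24 x 10 region covered by 16^3 patches.
    count, covered, overhead = uniform_patch_cover((40, 24, 10), 16)
    print(count, covered, round(overhead, 2))
```

In this made-up case six patches cover 24576 cells where only 9600 are needed, so the uniform-size rule refines about 2.56 times as many cells as a covering fitted exactly to the region would.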

The motivating application for our structured adaptive mesh library is the adaptive solution of nonlinear eigenvalue problems arising in materials design [6]. We present computational results for the calculation of the lowest eigenvalue and associated eigenvector of the 3D Hamiltonian for a ring of ten hydrogen ions located in the Z=0 plane. While this is a synthetic problem, its structure resembles real materials design applications of interest (e.g., ring structures). Our eigenvalue solver is based on the multilevel iterative algorithm of Mandel and McCormick [18].

The adaptive mesh hierarchy for this problem consists of eight levels with a total of 844x10^3 grid points. The first six levels are the usual uniform multigrid grids (with a mesh refinement ratio of two) and the next two are adaptively refined (with a mesh refinement ratio of four). The resolution on the finest level corresponds to a uniform mesh of size 512^3; thus, for this application, adaptivity reduced memory requirements by a factor of roughly 160 (844x10^3 grid points as compared to 512^3). In the following sections, we report the cost for one iteration of the eigenvalue algorithm over this grid hierarchy. Each complete iteration requires approximately 320 million floating point operations, or approximately 375 flops per grid point, spread out over about ten different numerical routines requiring intervening communication. (Our numerical kernels typically execute only about forty or fifty flops per grid point.) Table 1 summarizes software releases and compiler flags for all benchmarks. All floating point arithmetic used 64-bit numbers.
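The figures in this paragraph can be checked with a few lines of arithmetic. The Python sketch below recomputes the memory reduction factor and the flops per grid point from the quoted totals, and derives the per-level equivalent uniform resolutions. The function names are ours, and the level calculation assumes that each refinement ratio is measured relative to the next coarser level; that reading is an assumption, not something stated explicitly above.

```python
# Totals quoted in the text.
ADAPTIVE_POINTS = 844e3        # grid points in the adaptive hierarchy
UNIFORM_FINEST = 512 ** 3      # points in the equivalent uniform mesh
FLOPS_PER_ITERATION = 320e6    # approximate flops for one iteration

def memory_reduction(uniform_points, adaptive_points):
    """Ratio of uniform-mesh points to adaptive-hierarchy points."""
    return uniform_points / adaptive_points

def flops_per_point(flops, points):
    """Average floating point operations per grid point."""
    return flops / points

def level_resolutions(finest, ratios):
    """Equivalent uniform resolution per dimension, coarsest level first.

    ratios[i] is the assumed refinement ratio between level i and level i+1.
    """
    sizes = [finest]
    for r in reversed(ratios):
        sizes.append(sizes[-1] // r)
    return sizes[::-1]

if __name__ == "__main__":
    print(memory_reduction(UNIFORM_FINEST, ADAPTIVE_POINTS))       # about 159
    print(flops_per_point(FLOPS_PER_ITERATION, ADAPTIVE_POINTS))   # about 379
    # Six multigrid levels (five ratios of 2), then two adaptive levels (ratio 4).
    print(level_resolutions(512, [2, 2, 2, 2, 2, 4, 4]))
```

Both computed values agree with the rounded figures in the text (a factor of about 160 and roughly 375 flops per grid point). Under the stated assumption, the eight levels correspond to resolutions of 1, 2, 4, 8, 16, 32, 128 and 512 cells per dimension.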

  
Table 1: Software version numbers and compiler optimization flags. All benchmarks used release v2.0 of the LPARX system.






Scott R. Kohn and Scott B. Baden