A Parallel SoftwareInfrastructure for Structured AdaptiveMesh Methods Scott R. Kohn and Scott B. Baden Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92093-0114 Abstract Structured adaptive mesh algorithms dynamically allocate computational resources to accurately resolve interesting portions of a numerical calculation. Such methods are difficult to implement and parallelize because they rely on dynamic, irregular data structures. We have developed an efficient, portable, parallel software infrastructure for adaptive mesh methods; our software provides computational scientists with high-level facilities that hide low-level details of parallelism and resource management. We have applied our software infrastructure to the solution of adaptive eigenvalue problems arising in materials design. We describe our software infrastructure and analyze its performance. We also present computational results which indicate that the uniformity restrictions imposed by a data parallel Fortran implementation of a structured adaptive mesh application would significantly impact performance. 1 Introduction The accurate solution of many problems in science and engineering requires the resolution of unpredictable, localized physical phenomena. Examples include shock waves in computational fluid dynamics [1] and the near-singular atomic core potentials in materials design [6]. The key feature of these problems is that some portions of the problem domain---for example, regions containing the shock waves or the atomic nuclei---require higher resolution, and thus more computational effort, than other areas of the computational space. Structured adaptive numerical methods [2, 5, 19] dynamically place computational resources, such as CPU cycles and memory, only where needed to meet accuracy requirements; thus, they can achieve better accuracy for the same computational resources as compared to non-adaptive methods. Although structured adaptive mesh methods incur some overhead costs associated 1 with adaptivity, such as error estimation and data structure management, these overheads are insignificant when compared to the savings gained through selective refinement [3]. Adaptive mesh methods are difficult to implement on serial architectures---not to mention parallel machines---because they rely on dynamic, complicated data structures with irregular communication patterns. On parallel platforms, the programmer is burdened with the responsibility of managing data distributed across processor memories and orchestrating interprocessor communication and synchronization. Because adaptive mesh applications change in response to the dynamics of the problem, little can be known about the structure of the computation at compile-time. Thus, decisions about data decomposition, the assignment of work to processors, and the calculation of communication patterns must be made at run-time. It is an open research question whether the irregularity of an adaptive mesh application can be efficiently supported in a data parallel language such as High Performance Fortran [14]. In Section 4.3, we present computational results which show that the restrictions imposed by a data parallel implementation would significantly impact performance. We have developed an efficient, portable, parallel software infrastructure for structured adaptive mesh methods which hides these implementation details. It presents computational scientists with high-level tools which allow them to concentrate on the application and mathematics instead of low-level concerns of data distribution and interprocessor communication. Such support enables scientists to develop parallel, portable, efficient applications in a fraction of the time that would have been required if the application had been developed from scratch. Our structured adaptive mesh API (Application Programmer Interface), implemented as a collection of C++ classes, is built on the parallelization and communication mechanisms of the LPARX run-time system [16]. Our software runs on a variety of high-performance computer platforms, including the Cray C-90, IBM SP2, Intel Paragon, and networks of workstations under PVM. We have applied our adaptive software infrastructure to the solution of eigenvalue problems arising in materials design [6, 15]. By exploiting adaptivity in our applications, we have reduced memory consumption and computation time by more than two orders of magnitude over an equivalent uniform mesh method. This paper is organized as follows. Section 2 introduces the salient features of structured adaptive mesh algorithms to motivate the software facilities required by the adaptive mesh library. In Section 3, we describe the adaptive mesh API and its facilities. Section 4 analyzes in detail parallel performance and library overheads. We then review related research and conclude with an analysis and discussion. 2 *** See HTML or PostScript for Figure *** Figure 1: Three levels of a structured adaptive mesh hierarchy. The eight dark circles represent regions of high error, such as atomic nuclei in materials design applications. The mesh spacing of each level is half of the previous coarser level. 2 Adaptive Mesh Algorithms This section provides a high-level description of the structured adaptive mesh algorithm. We present the salient features of the method to motivate the abstractions described in Section 3. Further numerical details can be found elsewhere [2, 5, 15, 19]. Adaptive methods may be structured or unstructured, depending on how they represent the numerical solution to the problem. Unstructured adaptive methods [10, 11] store the solution using graph or tree representations; these methods are called "unstructured" because connectivity information must be stored for each unknown. Structured methods, such as adaptive mesh refinement [2] and structured multigrid algorithms [5, 19], employ a hierarchy of nested mesh levels in which each level consists of many simple, rectangular grids. Each rectangular grid in the hierarchy represents a structured block of many thousands of unknowns. Because of these dissimilar data representation strategies, structured adaptive methods require different software support and implementation approaches than unstructured methods. Here we consider only structured adaptive methods. Structured adaptive mesh methods solve partial differential equations using a hierarchy of nested, locally structured finite difference grids. The grid hierarchy can be thought of as a single composite grid in which the discretization is non-uniform. All grids at the same level of the hierarchy have the same mesh spacing, but successive levels have finer spacing than the ones preceding it, providing a more accurate representation of the solution. Adaptive mesh methods refine this discretization of space to accurately represent localized physical phenomena (see Figure 1). When creating a new level, the algorithm refines the hierarchy according to an error estimate calculated at run-time. These new higher-resolution grids, called refinement patches, are used only where necessary. In general, the location and size of refinement patches must be computed at run-time, as they cannot be predicted a priori. Adaptive mesh algorithms communicate information about the numerical solution between levels of the hierarchy and also among grids at the same level of the hierarchy. Around the boundary of each grid patch is a ghost cell region which locally caches data from adjacent grids or, where no neighboring grids exist, from the next coarser level of the hierarchy. Without the proper software support, managing these bookkeeping details can be difficult because of the irregular and unpredictable placement 3 of refinement patches [9]. Note that ghost cell regions and communication are required by the adaptive mesh method and are not simply artifacts of the parallel implementation. 3 Adaptive Mesh API The adaptive mesh algorithms of the previous section are quite difficult to implement on both sequential and parallel architectures. Refinement regions vary in size and location in the computational space, resulting in complicated geometries (see Figure 1). Communication patterns between grid patches and between grid levels are irregular and change as the hierarchy is modified. On message passing platforms, the programmer must explicitly manage grid data distributed across the processor memories and orchestrate interprocessor communication. Even shared memory multiprocessors require the explicit, low-level management of data locality and communication for reasonable performance [20]. These implementation difficulties soon become unmanageable and can obscure the mathematics behind the algorithms. The goal of our adaptive mesh API is to provide scientists with high-level support for adaptive mesh applications. Scientists using our library facilities can concentrate on their specific applications rather than being concerned with the underlying implementation details. Of course, our software is portable and efficient. Portability among high performance computing platforms guarantees that applications software will run on the most powerful and up-to-date computational resources available. Such a powerful software infrastructure is essential in developing sophisticated, reusable code. 3.1 Software Infrastructure Overview Our adaptive mesh software infrastructure consists of three primary components: numerical operations, grid management facilities, and display routines. The numerical routines define the elliptic partial differential equation to be solved. The display library contains some simple graphing and plotting facilities for visualizing data. The grid hierarchy management routines comprise the most complex and interesting portion of the adaptive mesh API. These facilities manage all aspects of the grid hierarchy: data structure bookkeeping, error estimation, grid generation, workload balancing and processor assignment, and communication. An important observation is that the grid management facilities are independent of the numerical details of a particular partial differential equation; the same routines may be used to solve a number of different numerical problems. One feature of our adaptive mesh library is that its facilities are independent of problem dimension. Scientists using the library see the same abstractions and interface whether they are working in two or three spatial dimensions. Numerical 4 details differ, but the interfaces for grid generation, error estimation, load balancing, and grid hierarchy management are identical. Dimension independence provides programmers the freedom to develop and debug simpler, faster 2d versions of their applications on workstations and then, when confident that the code is working, recompile for 3d on a supercomputer. In practice, we have found dimension independence particularly useful; the adaptive mesh libraries and a materials design application [15] were first developed on workstations in 2d. Our adaptive mesh software is layered on top of the LPARX software abstractions [16]. LPARX provides run-time parallel support such as distributed data management, coarse- grain parallel execution, interprocessor communication, and synchronization. The adaptive mesh API builds on this foundation and add facilities specifically tailored towards adaptive mesh applications. LPARX's concept of structural abstraction and its support for first-class data decompositions have been vital to our success. Structural abstraction enables the application to represent and manipulate the structure of data---the "floorplan" describing where data is located---separately from the data itself. For example, when adding a new level to the adaptive mesh hierarchy, refinement regions at the new level are represented as first-class, language-level objects. The structure of the new refinement level is determined by grid generation routines. Refinement patch descriptions are then manipulated by load balancing and processor assignment algorithms. Only then does the code actually allocate the data associated with the refinement patches. In contrast, languages such as HPF [14] provide very limited run-time control over the dynamic allocation and placement of irregular distributed data. Structural abstraction enables our API to represent and modify dynamic refinement structures at run-time. In the following sections, we describe grid hierarchy management and our model of coarse-grain parallel numerical computation. Further details can be found elsewhere [15]. 3.2 Grid Hierarchy Management Recall from Section 2 that structured adaptive mesh methods store data using a composite grid implemented as a hierarchy of levels (see Figure 2). To represent this structure, our API uses three data types: a Grid, an IrregularGrid, and a CompositeGrid. The Grid represents a single, logically rectangular array at one level of the composite grid hierarchy. A collection of Grids at one level of the hierarchy is an IrregularGrid, and a set of IrregularGrids organized into levels is a CompositeGrid. LPARX defines the basic building blocks for IrregularGrid and CompositeGrid, which provide specialized abstractions appropriate for adaptive mesh applications. Each object provides facilities appropriate for its role in the adaptive mesh hierarchy. For example, IrregularGrid defines communication operations only 5 *** See HTML or PostScript for Figure *** Figure 2: A composite grid is represented using a Grid, an IrregularGrid, and a CompositeGrid. A CompositeGrid consists of IrregularGrid objects organized into levels. Each IrregularGrid is a collection of Grids. Real grid hierarchies in multiple dimensions are vastly more complicated (see Figure 1). among Grids at the same level of refinement; an IrregularGrid has no notion of "level." Instead, communication between levels is managed by CompositeGrid. Communication among Grids in the hierarchy employs LPARX's copy-on-intersection operation, a high-level facility which copies data between the logically overlapping portions of two Grids. Data motion involves no explicit computations involving subscripts; all bookkeeping details and interprocessor communication are managed by the run-time system and are completely hidden from the user. The parallelism in our adaptive mesh application lies at the level of the IrregularGrid. We can parallelize across one level of the hierarchy, but there is little opportunity for parallelism across levels. Therefore, the Grids in an IrregularGrid are distributed across processors, and we compute over these Grids in parallel. Following the LPARX model, each Grid in an IrregularGrid is assigned to one processor. Of course, a single processor may be responsible for many Grids. 3.3 Coarse-Grain Computation The previous section described how a single level of the adaptive grid hierarchy---an IrregularGrid---is distributed across processors. In this section, we discuss parallel execution over such distributed structures. Parallel numerical computation is expressed using LPARX's forall construct, a coarse-grain data parallel loop that executes each iteration as if on its own virtual processor. Each iteration executes independently of all other iterations. For each Grid, the API calls a serial numerical kernel, typically written in Fortran, which executes on one processor. There are a number of advantages of separating parallel execution from serial, numerical computation. Numerical code may be optimized to take advantage of low-level node characteristics, such as vector units or multiple physical processors, without regard to the higher level parallelism. Existing serial code may not need to be re-implemented when parallelizing an application. Furthermore, we can leverage existing, mature sequential compiler technology. Figure 3 compares our model of coarse-grain parallelism with a fine-grain data parallel style [3, 17] Coarse-grain parallelism executes in parallel over the entire collection of grids. Each grid is assigned to one processor, and numerical computation on that grid is sequential. In contrast, fine-grain parallelism 6 Coarse-Grain Parallelism Fine-Grain Parallelism // Parallel loop over grids // Serial loop over grids forall i in U for i in U // Serial loop over elements // Parallel loop over elements for j,k,l in U(i) forall j,k,l in U(i) // Do numerical work // Do numerical work end for end forall end forall end for Figure 3: Coarse-grain data parallelism (left) expresses parallel execution over the entire collection of grids; computation on each individual grid is serial. In contrast, fine-grain data parallelism (right) expresses parallelism over the data elements of each grid, and the grids are handled sequentially. processes grids sequentially and expresses parallelism over the elements of a single grid. There are a number of advantages to coarse-grain parallelism. Because the numerical computation is serial, we may employ numerical methods on each grid which do not parallelize efficiently. For example, Gauss-Seidel relaxation works well as a smoother in multigrid, but it cannot be easily expressed in a fine-grain data parallel style. Coarse-grain parallelism also allows more asynchrony between processors and is therefore a better match to current coarse-grain message passing architectures. To improve the efficiency of the fine-grain model, Parsons and Quinlan [22] are developing run-time methods for automatically extracting coarse-grain tasks from fine-grain data parallel loops. Another model, processor subsets [8], combines the coarse-grain and fine-grain approaches; parallelism is expressed both over grids and within each grid. 4 Performance Analysis Portability and performance are two vital considerations in the design and implementation of any numerical library. Parallel computers obsolesce at an alarming rate; portability ensures that numerical software will run on the most powerful and up-to-date computational resources available. Computational scientists will not use software libraries unless they deliver reasonable performance. In this section, we analyze the performance and overheads of our adaptive mesh library. We begin with a performance comparison of an Intel Paragon and an IBM SP2 with one processor of a Cray C-90, and the succeeding section presents a detailed breakdown of parallel execution times. It is an open research question whether non-uniform refinement structures can be efficiently supported in a data parallel 7 *** See HTML or PostScript for Table *** Table 1: Software version numbers and compiler optimization flags. All benchmarks used release v2.0 of the LPARX system. language. One implementation strategy for structured adaptive mesh methods in a data parallel language such as High Performance Fortran [14] would restrict all refinement patches to be the same size [3]. We therefore conclude this section with an analysis of the performance implications of requiring uniformly sized refinement regions. The motivating application for our structured adaptive mesh library is the adaptive solution of nonlinear eigenvalue problems arising in materials design [6]. We present computational results for the calculation of the lowest eigenvalue and associated eigenvector of the 3d Hamiltonian for a ring of ten hydrogen ions located in the Z=0 plane. While this is a synthetic problem, its structure resembles real materials design applications of interest (e.g. ring structures). Our eigenvalue solver method is based on the multilevel iterative algorithm of Mandel and McCormick [18]. The adaptive mesh hierarchy for this problem consists of eight levels with a total of 844,000 grid points. The first six levels are the usual uniform multigrid grids (with a mesh refinement ratio of two) and the next two are adaptively refined (with a mesh refinement ratio of four). The resolution on the finest level corresponds to a uniform mesh of size 512^3; thus, for this application, adaptivity reduced memory requirements by a factor of 160 (844,000 as compared to 512^3). In the following sections, we report the cost for one iteration of the eigenvalue algorithm over this grid hierarchy. Each complete iteration requires approximately 320 million floating point operations, or approximately 375 flops per grid point spread out over about ten different numerical routines requiring intervening communication. (Our numerical kernels typically execute only about forty or fifty flops per grid point.) Table 1 summarizes software releases and compiler flags for all benchmarks. All floating point arithmetic used 64-bit numbers. 4.1 Performance Comparison Figure 4a compares the execution times for the IBM SP2, Intel Paragon, and one processor of a Cray C-90. Note that although the SP2 processors are approximately four times faster then the Paragon processors, its communication network is about half as fast. We ran the same applications code on all machines except that the Fortran kernels on the Cray C-90 are annotated to aid vectorization. The Paragon and the SP2 compare quite favorably against the C-90: for this application, four SP2 nodes or 32 Paragon nodes deliver the performance of one C-90 processor. Although 8 *** See HTML or PostScript for Figure *** Figure 4: (a) Adaptive eigenvalue solver performance results for the IBM SP2, Intel Paragon, and one processor of the Cray C-90. Times represent one iteration (averaged over ten) of the eigenvalue algorithm. The application would not run on fewer than four Paragon processors because of memory constraints. We report wallclock times for the SP2 and Paragon (processor nodes are not time-shared) and CPU times for the C-90. (b) A level-by-level accounting of the execution time for one iteration of the eigenvalue algorithm. The benefits of parallelism are limited to the highest levels of the hierarchy because lower levels have too little work for efficient parallelization. all Fortran numerical kernels of our code vectorize, hardware performance monitors on the C-90 report that our application achieves an aggregate rate of only 155 megaflops (million floating point operations per second) over the entire code and a peak rate of 290 megaflops. Our code realizes only a fraction of the Cray C-90's peak performance of 1000 megaflops due to short vector lengths in the Fortran routines (between 10 and 30). Of course, vector lengths are tied directly to grid size. We could achieve a higher megaflop rate and longer vector lengths by using larger grids and more memory. Note, however, that time to solution for a specified accuracy, not megaflop rate, is the important metric. Placing additional grid points in regions where they are not needed to improve resolution does not necessarily result in more accurate solutions. For example, we doubled the number of grid points used by the solver for this problem and yet achieved the same answer (to within 0.02%). The additional grid points were used to over-refine portions of the computational space where no further refinement was necessary. On the Cray C-90, our implementation using the adaptive mesh libraries would be comparable in performance to a Fortran code developed by hand without library support. Approximately 90% of the execution time of our application is spent on numerical computation in Fortran routines, 7% in transferring data between grids (which happens to be written in C++ but would also be required in an all-Fortran implementation), and the remaining 3% in miscellaneous routines. Even if we attribute the last 3% as all library overhead (which it is not), the ease of using an applications library and the benefits of portability to high-performance parallel architectures far outweighs the small loss in performance. 4.2 Execution Time Analysis Figure 4b illustrates that almost all of the benefit of additional processors is in the reduction of execution times at the highest levels of the adaptive grid hierarchy. Lower levels have too little work for efficient parallelization. Note 9 *** See HTML or PostScript for Table *** Table 2: Execution time breakdown for the eigenvalue calculation on the Intel Paragon. Times are in seconds. Percentages may not add up to 100% due to rounding. The execution time is dominated by computation, load imbalance, and communication (both intralevel and interlevel). The relative cost of communication increases with additional processors; communication overheads account for about half of the total execution time on 32 nodes. that we cannot simply remove the lower levels because they play a vital role in the numerical convergence of our eigenvalue algorithm. We can expect better scaling as we address more complicated problems, which place additional computational work at the highest levels. Tables 2 provides a detailed accounting for the parallel execution time on the Intel Paragon; results for the IBM SP2 are similar. We divide the execution time for the eigenvalue algorithm, including time spent building the adaptive grid hierarchy, into numerical computation, time lost to load imbalance, communication among grids at the same level of the hierarchy, communication between levels, error estimation, load balancing, and grid generation. Error estimation, load balancing, and grid generation consume only a few percent of the total execution time. The vast majority of the time is spent in numerical computation, load imbalance, and communication (intralevel and interlevel). As computation times drop with additional processors, communication overheads become a dominant factor in overall performance. On 32 Paragon nodes, communication accounts for about half of the total execution time. It is difficult to assess adaptive mesh library overheads on parallel computers since we do not yet have detailed hardware performance analyzers such as those on the Cray C-90. It would be impractical to develop a message passing version of our code by hand (i.e. without the software support offered by our API) because of the implementation complexity. We can assume that there is little library overhead in computation, since all numerical work is done in Fortran. The remaining contributor of overheads is interprocessor communication. Experiments indicate that perhaps half of the interprocessor communication time is due to overheads in the LPARX communication routines [15]; the remainder is spent in the operating system message routines. We are currently working on a re-design of the LPARX communication libraries which we believe will eliminate most of this additional overhead [12]. 4.3 Uniform Grid Patches Data parallel Fortran languages such as High Performance Fortran [7, 13, 14] do not readily support non-uniform grid structures. In their Connection Machine Fortran implementation of a 2d 10 *** See HTML or PostScript for Table *** Table 3: Uniform grid patches require additional memory resources as compared to non-uniform refinements because the grid generator does not have the freedom in selecting the optimal grid size needed to cover a particular region of space. Thus, uniform patches lead to over-refinement. Small patches reduce over-refinement; however, small grid patches introduce a large number of ghost cells relative to the number of unknowns. In this application, the ghost cells region is one cell thick. adaptive mesh refinement application on the CM-2, Berger and Saltzman [3] required that all refinement regions be the same size. To ascertain the performance implications of such a restriction, we have implemented a grid generation strategy identical to that used by Berger and Saltzman. One of the important trade-offs in a uniform refinement strategy is the selection of the appropriate patch size. The key is to find a grid size which is large enough to be computationally efficient on the target parallel architecture but yet small enough to limit over-refinement. Large patches typically refine more of the computational domain than what is needed and thus waste memory resources. Small patches may represent too little work to be efficient. Another consideration is the ratio of ghost cells, boundary points used to locally cache data from other grids, to interior grid points. The width of this boundary region depends on the numerical kernels of the application. Our eigenvalue solver uses a ghost cell width of one; Berger and Saltzman use four. For small patches, these ghost regions can represent a significant fraction of the total memory, especially in three dimensions. For example, the boundary cells for a 16^3 patch in 3d with a ghost cell width of four (24^3 total) represent 70% of the memory used to store the patch. Furthermore, boundary cells may introduce additional computational work for some numerical methods (e.g. flux correction for hyperbolic partial differential equations [1]). We compare the non-uniform refinement approach against four uniform grid sizes: 12^3, 16^3, 24^3, and 32^3. Each patch is augmented with a ghost cell region of width one. These four sizes bracket the range of useful patch sizes; for this particular application, 12^3 is too small and 32^3 is too large. Memory overheads are reported in Table 3. Uniform refinement with the smallest patch size requires only about 26% more memory than non-uniform refinement; the largest patch size uses almost three times more memory. Figure 5a presents the execution time for one iteration of the adaptive eigenvalue application on the Intel Paragon. Note that we cannot report the results for the 32^3 patch size on four processors; this problem would not run because of memory limitations. The 16^3 patch size gives the best performance for all numbers of processors, running between 40% and 60% slower 11 *** See HTML or PostScript for Figure *** Figure 5: These graphs illustrate the performance costs of requiring uniform grid patches as opposed to non-uniform patches. By default, the adaptive mesh libraries employ non-uniform patches. Results for the 32^3 patch size on four processors are not reported; the problem would not run due to memory limitations. (a) A comparison of execution times for a variety of uniform patch sizes. (b) Relative space-time (megabyte-hour) costs of uniform refinement patches as compared to non-uniform patches. than the non-uniform refinement method. Both memory usage and computation time are important computational resources for adaptive mesh applications. In fact, many accounting systems for parallel computers charge not only for CPU time but also for memory usage. To capture both resources, Figure 5b presents the relative space-time (i.e. megabyte-hour) cost of uniform refinement patches as compared to non-uniform patches. In this metric, uniform patches are between two and eight times more expensive. These results clearly show that uniform refinement patches are more expensive than the identical application using non-uniform patches. To solve a problem to a specified accuracy, structured adaptive mesh methods employing uniform refinement regions will require more computational resources (flops and memory) than one using non-uniform refinement regions. Likewise, given fixed resources, non-uniform refinements will solve a particular problem to higher accuracy. 5 Related Work Adaptive mesh refinement techniques for multiple spatial dimensions were first developed by Berger and Oliger [2] to solve time-dependent hyperbolic partial differential equations. These techniques are based on previous work on locally nested refinement structures in one spatial dimension by Bolstad [4]. Adaptive mesh refinement methods were later used by Berger and Colella to resolve shock waves in computational fluid dynamics [1]. Our work with adaptive mesh methods applies this same computational methodology to elliptic partial differential equations and adaptive eigenvalue problems [15]. Berger and Saltzman have implemented a parallel 2d adaptive mesh refinement code in Connection Machine Fortran for the CM-2 [3]. Their data parallel implementation required that all regions of refinement be the same size. As a result, the application over-refined some portions of the computational space, using 60% more memory than an equivalent implementation without the uniform size restriction. Our experiments indicate that uniform refinement regions also result in excessive 12 overheads in three dimensions (see Section 4.3). Because of compiler limitations, their code did not execute efficiently on the CM-5. Quinlan et al. have developed an adaptive mesh library called AMR++, based on the P++ data parallel C++ array class library [17]. P++ supports fine-grain data parallel operations on arrays distributed across collections of processors; it automatically manages data decomposition, interprocessor communication, and synchronization. In contrast to this fine-grain array parallelism, we employ a coarse-grain parallelism in which operations are applied in parallel to entire collections of arrays. An object oriented library for structured adaptive mesh refinement has been developed at Lawrence Livermore National Laboratory [9]. This software is intended to support hyperbolic gas dynamics applications running on vector supercomputers. The basic abstractions employed in are very similar to our own; in fact, Livermore's adaptive mesh refinement libraries are being parallelized using portions of the LPARX software. Parashar and Browne are developing a software infrastructure supporting parallel adaptive mesh refinement methods for black hole interactions [21]. Their method is based on a clever load balancing and processor mapping strategy that maps grids to processors through locality-preserving space filling curves. We plan to compare their grid generation and data decomposition strategies with ours in the near future. 6 Conclusions We have developed an efficient, portable, parallel software infrastructure for structured adaptive mesh algorithms. It provides computational scientists with high-level tools that hide implementation details of parallelism and resource management. Such powerful software support is essential for the timely development of quality reusable numerical software. We are currently applying our adaptive mesh infrastructure to the solution of adaptive eigenvalue problems arising in materials design [6]. We also plan employ our infrastructure to address the hyperbolic partial differential equations of computational fluid dynamics. Two distinguishing characteristics of our work are the concepts of structural abstraction and coarse-grain data parallelism, both borrowed from LPARX. Structural abstraction and first-class data decompositions enable our software to represent and manipulate refinement structures as language-level objects. In contrast, a language such as High Performance Fortran that supports compile-time data layout provides little freedom in the expression of irregular, run-time data distribution. Our model of coarse-grain data parallel numerical computation maps efficiently onto the current generation of message passing parallel architectures. 13 It is an open research question whether data parallel languages can efficiently support the irregular refinement structures employed by structured adaptive mesh algorithms. Previous implementations [3] have required uniform refinements to fit the fine-grain data parallel model. Our experiments in 3d indicate that such a restriction results in costly over- refinement and a significant loss in computational performance. Thus, the efficient, portable implementation of structured adaptive mesh methods remains an outstanding challenge for the data parallel community. Software Availability LPARX and a preliminary version of the adaptive mesh software libraries is available via the World Wide Web at address http://www-cse.ucsd.edu/users/skohn/lparx.html. The software is also available through the San Diego Supercomputer Center. Acknowledgements This work was supported by NSF contract ASC-9110793 and ONR contract N00014-93-1-0152. Intel Paragon and Cray C-90 time at the San Diego Supercomputer Center was provided by a UCSD School of Engineering Block Grant. Access to the IBM SP2 was provided by the Cornell Theory Center. We would like to thank Greg Cook, Stephen Fink, Chris Myers, and Charles Rendleman for their useful suggestions on how to improve LPARX and the adaptive mesh software. We would also like to thank Eric Bylaska, Alan Edelman, Ryoichi Kawai, Beth Ong, and John Weare for numerous valuable discussions on numerical methods in materials design. References [1] Marsha J. Berger and Phillip Colella. Local adaptive mesh refinement for shock hydrodynamics. Journal of Computational Physics, 82(1):64--84, May 1989. [2] Marsha J. Berger and Joseph Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484--512, March 1984. [3] Marsha J. Berger and Jeff Saltzman. Structured adaptive mesh refinement on the Connection Machine. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993. [4] J. Bolstad. PhD thesis, Stanford University, 1982. 14 [5] Achi Brandt. Multi-level adaptive solutions to boundary- value problems. Mathematics of Computation, 31(138):333-- 390, April 1977. [6] Eric J. Bylaska, Scott R. Kohn, Scott B. Baden, Alan Edelman, Ryoichi Kawai, M. Elizabeth G. Ong, and John H. Weare. Scalable parallel numerical methods and software tools for material design. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, California, February 1995. [7] Barbara Chapman, Piyush Mehrotra, Hans Moritsch, and Hans Zima. Dynamic data distribution in Vienna Fortran. In Proceedings of Supercomputing '93, November 1993. [8] Barbara Chapman, Piyush Mehrotra, and Hans Zima. Extending hpf for advanced data parallel applications. Technical Report 94-34, ICASE, May 1994. [9] William Y. Crutchfield and Michael L. Welcome. Object oriented implementation of adaptive mesh refinement algorithms. Scientific Programming, 2:145--156, Winter 1993. [10] R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy. The design and implementation of a parallel unstructured euler solver using software primitives. Technical Report 92-12, ICASE, Hampton, VA, March 1992. [11] Karen D. Devine and Joseph E. Flaherty. Dynamic load balancing for parallel finite element methods with h- and p-refinement. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, California, February 1995. [12] Stephen J. Fink, Scott B. Baden, and Scott R. Kohn. Flexible communication schedules for block structured applications. (in preparation), 1995. [13] G. Fox, S. Hiranandani, K. Kennedy, C Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report TR90-141, Dept. of Computer Science, Rice University, Houston, TX, December 1989. [14] High Performance Fortran Forum. High Performance Fortran Language Specification, November 1994. [15] Scott R. Kohn. A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations. PhD thesis, University of California at San Diego, June 1995. [16] Scott R. Kohn and Scott B. Baden. Irregular coarse-grain data parallelism under LPARX. Journal of Scientific Programming, to appear. 15 [17] M. Lemke and Daniel Quinlan. P++: A C++ virtual shared grids based programming environment for architecture- independent development of structured grid applications. In Lecture Notes in Computer Science. Springer-Verlag, September 1992. [18] Jan Mandel and Steve McCormick. Multilevel variational method for A*u=lambda*B*u on composite grids. Journal of Computational Physics, 80(2):442--452, February 1989. [19] Steve F. McCormick, editor. Multilevel Adaptive Methods for Partial Differential Equations. SIAM, Philadelphia, 1989. [20] Shubhendu S. Mukherjee, Shamik D. Sharman, Mark D. Hill, James R. Larus, Anne Rogers, and Joel Saltz. Efficient support for irregular applications on distributed memory machines. In to appear in Proceedings of the 1995 Symposium on Principles and Practice of Parallel Programming, 1995. [21] Manish Parashar and James C. Browne. An infrastructure for parallel adaptive mesh refinement techniques. (draft), 1995. [22] Rebecca Parsons and Daniel Quinlan. Run-time recognition of task parallelism within the P++ parallel array class library. In Scalable Libraries Conference, 1993. [23] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4):315-- 339, December 1990. Biographical Sketches Scott B. Baden Scott B. Baden received the B.S. degree (magna cum laude) in electrical engineering from Duke University in Durham, North Carolina in 1978, and the M.S. and Ph.D. degrees in computer science from the University of California, Berkeley, in 1982 and 1987, respectively. He is currently an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego and is also a Senior Fellow at the San Diego Supercomputer Center. Prior to this appointment, he was a postdoctoral researcher in the Mathematics Group at the University of California's Lawrence Berkeley Laboratory between 1987 and 1990, taking off ten months to travel the globe. Dr. Baden's current research interests are in the areas of parallel and scientific computation: programming methodology, algorithm design, load balancing, and computer architecture. Dr. Baden is a member of the IEEE Computer Society, SIAM, and the Tau Beta Pi Engineering Honor Society. 16 Scott R. Kohn Scott R. Kohn received his B.S. degree in electrical engineering, mathematics, and computer science from the University of Wisconsin---Madison in 1990 and M.S. and Ph.D. degrees in computer science from the University of California, San Diego (UCSD), in 1992 and 1995. He was awarded a 1995 NSF Postdoctoral Fellowship in Computational Science and Engineering and is currently working in the Department of Chemistry at UCSD on the development of numerical methods and software support for advanced materials design applications. His research interests are in the area of scientific computation, including numerical methods for adaptive and irregular problems, programming abstractions, and high-performance computer architectures. 17