Figure 4a compares the execution times for the IBM SP2, Intel Paragon, and one processor of a Cray C-90. (The IBM SP2 results were obtained on a pre-production machine at the Cornell Theory Center; these times should improve as the system is tuned and enters full production use.) Note that although the SP2 processors are approximately four times faster than the Paragon processors, the SP2's communication network is about half as fast as the Paragon's. We ran the same application code on all machines, except that the Fortran kernels on the Cray C-90 were annotated to aid vectorization.
Figure 4: (a) Adaptive eigenvalue solver performance results for the IBM
SP2, Intel Paragon, and one processor of the Cray C-90. Times represent
one iteration (averaged over ten) of the eigenvalue algorithm. The
application would not run on fewer than four Paragon processors because
of memory constraints. We report wallclock times for the SP2 and Paragon
(processor nodes are not time-shared) and CPU times for the C-90.
(b) A level-by-level accounting of the execution time for one
iteration of the eigenvalue algorithm. The benefits of parallelism
are limited to the highest levels of the hierarchy because lower
levels have too little work for efficient parallelization.
The Paragon and the SP2 compare quite favorably against the C-90: for this application, four SP2 nodes or 32 Paragon nodes deliver the performance of one C-90 processor. Although all Fortran numerical kernels of our code vectorize, hardware performance monitors on the C-90 report that our application achieves an aggregate rate of only 155 megaflops (million floating-point operations per second) over the entire code and a peak rate of 290 megaflops. Our code realizes only a fraction of the Cray C-90's peak performance of 1000 megaflops because the vector lengths in the Fortran routines are short (between 10 and 30). Of course, vector lengths are tied directly to grid size. We could achieve longer vector lengths and a higher megaflop rate by using larger grids and more memory. Note, however, that the important metric is time to solution for a specified accuracy, not megaflop rate. Placing additional grid points in regions where they are not needed to improve resolution does not necessarily result in more accurate solutions. For example, we doubled the number of grid points used by the solver for this problem and yet obtained the same answer (to within 0.02%). The additional grid points merely over-refined portions of the computational space where no further refinement was necessary.
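To make "only a fraction" concrete, the short Python sketch below divides the two measured rates by the quoted C-90 peak. The three numbers come from the text above; the function and variable names are illustrative assumptions and are not part of our code.

```python
# Rates quoted in the text, in megaflops.
C90_PEAK_MFLOPS = 1000.0      # quoted peak performance of one C-90 processor
AGGREGATE_MFLOPS = 155.0      # measured rate over the entire code
BEST_MEASURED_MFLOPS = 290.0  # measured peak rate of the application


def fraction_of_peak(rate_mflops, peak_mflops=C90_PEAK_MFLOPS):
    """Return an achieved rate as a fraction of the machine's peak rate."""
    if peak_mflops <= 0:
        raise ValueError("peak rate must be positive")
    return rate_mflops / peak_mflops


if __name__ == "__main__":
    for label, rate in [("aggregate", AGGREGATE_MFLOPS),
                        ("best measured", BEST_MEASURED_MFLOPS)]:
        print(f"{label}: {fraction_of_peak(rate):.1%} of C-90 peak")
```

Under these figures the application runs at roughly 15% of peak overall and at just under 30% of peak at best.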
On the Cray C-90, our implementation using the adaptive mesh libraries would be comparable in performance to a Fortran code developed by hand without library support. Approximately 90% of the execution time of our application is spent on numerical computation in Fortran routines, 7% on transferring data between grids (this code happens to be written in C++, but the transfers would also be required in an all-Fortran implementation), and the remaining 3% in miscellaneous routines. Even if we attribute all of the last 3% to library overhead (which it is not), the ease of using an applications library and the benefits of portability to high-performance parallel architectures far outweigh the small loss in performance.
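The size of this loss can be bounded with simple arithmetic. The Python sketch below uses the time breakdown reported above and assumes the worst case, in which a hand-written Fortran code would eliminate the miscellaneous 3% entirely. The names are illustrative assumptions, and the result is an upper bound rather than a measurement.

```python
# Fractions of execution time reported in the text.
BREAKDOWN = {
    "fortran_kernels": 0.90,     # numerical computation in Fortran routines
    "intergrid_transfer": 0.07,  # data transfer between grids (needed in any implementation)
    "miscellaneous": 0.03,       # worst case: treated entirely as library overhead
}


def max_speedup_without(overhead_fraction):
    """Upper bound on speedup if this fraction of the run time vanished entirely."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead fraction must be in [0, 1)")
    return 1.0 / (1.0 - overhead_fraction)


if __name__ == "__main__":
    bound = max_speedup_without(BREAKDOWN["miscellaneous"])
    print(f"hand-coded version could be at most {bound:.3f}x faster")
```

With a 3% overhead fraction the bound is 1/0.97, so even under this pessimistic assumption a hand-coded version would be only about 3% faster.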