CSE 160, Lecture 15 (11/7/06):

Data parallel languages

Note: these notes are based in part on on-line lecture notes written by Jim Demmel at UC Berkeley.

So far we have been programming with two kinds of SPMD programming models: message passing and shared memory. We will now look at a different kind of parallel programming model, called Data Parallelism.

With SPMD programming a single program text is executed by all processors. This is an explicit programming model in the sense that the programmer has to orchestrate data motion and synchronization, and to contend with system-dependent issues concerning the hardware. When evaluating any such programming model, we ask:

  • How is parallelism expressed?
  • How is communication between processors expressed?
  • How is synchronization between processors expressed?
  • What global/local data structures can be constructed?
  • How do you optimize for performance?

Data Parallel Programming

    In data parallel programming, the programmer has the illusion of a single program running in a single address space. Parallelism arises when operating on arrays (or sometimes lists or sequences), with each array element conceptually being processed in parallel on its own virtual processor. Any interprocessor communication is implicit in the array operation. This is a natural model for an SIMD computer, but it can also be implemented under MIMD.

    Data parallel languages go back many years to the days of the ILLIAC IV (1960s) and later saw a resurgence in the 1980s with Connection Machine Fortran, which was implemented on Thinking Machines' CM-2, an SIMD machine, and CM-5, an MIMD machine. Another data parallel language, High Performance Fortran (HPF), was later devised by a committee. Today, data parallelism arises in Fortran 90 and 95 and in Matlab. To see how it works, let us look at an example we've seen earlier in the course: the solution of an ordinary differential equation.

    Let U[ ] and UNew[ ] be arrays. The following array statement computes the difference between the two arrays at all points:

    dU = ( UNew - U )

    The arrays must be conformable, i.e. have the same size and shape, in order to be operated on together in this way. The above array expression has an implicit loop and is equivalent to the following forall loop.

    forall i = 0 to n+1
       dU[i] = UNew[i] - U[i]
    end

    Continuing, consider the convergence computation, the norm of the error:

    Err = sqrt ( sum ( dU*dU ) )

    As before, the computation dU*dU involves an implicit loop. The intrinsic function sum() is applied to the entire vector quantity dU*dU and reduces the vector to a scalar. Operations like sum(), prod(), max(), and min() are called reduction operations because they reduce arrays to scalars, and they are commonly provided in data parallel languages.
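
    As a concrete (non-HPF) illustration, the same computations can be written with NumPy, whose array operations behave much like the Fortran 90 array statements above. This is only a sketch; the array sizes and values are made up for the example:

    import numpy as np

    n = 8
    U    = np.linspace(0.0, 1.0, n + 2)    # points 0 .. n+1, as in the forall loop
    UNew = U + 0.01                        # stand-in for the updated solution

    dU  = UNew - U                         # elementwise difference, no explicit loop
    Err = np.sqrt(np.sum(dU * dU))         # sum() reduces the vector to a scalar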

    Array Sections. A portion of an array is specified by a triplet (start : end : stride) in each dimension. An array section may appear wherever a whole array may be used.

           A[3]              ! third element
           A[1:5]            ! first 5 elements
           A[:5]             ! same
           A[1:10:2]         ! odd elements in order
           A[10:2:-2]        ! even elements in reverse order
           A[1:2,3:5]        ! 2-by-3 subblock
           A[1,:]            ! first row
           A[:,j]            ! jth column
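
    For comparison, here is a rough NumPy analogue of these sections (illustrative only). NumPy indices are 0-based and the upper bound of a slice is exclusive, so the slices below are shifted relative to the 1-based triplets above:

    import numpy as np

    A = np.arange(1, 11)          # A = [1, 2, ..., 10]
    A[2]                          # third element
    A[0:5]                        # first 5 elements
    A[:5]                         # same
    A[0:10:2]                     # odd elements in order
    A[9::-2]                      # even elements in reverse order
    M = np.arange(1, 101).reshape(10, 10)
    M[0:2, 2:5]                   # 2-by-3 subblock
    M[0, :]                       # first row
    M[:, 3]                       # a column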
    

    We may use array sections to handle the (Jacobi) updates for our ODE solver:

    UNew[1:N] = ( U[2:N+1] + U[0:N-1] - 8*h^2 ) / 2
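
    A minimal NumPy sketch of this section-based update, assuming U holds the N interior points plus two boundary values (indices 0..N+1 in 0-based notation) and keeping the lecture's constant 8*h^2; the values of N and h here are placeholders:

    import numpy as np

    N = 8
    h = 1.0 / (N + 1)
    U = np.zeros(N + 2)                     # boundaries at indices 0 and N+1
    UNew = U.copy()
    UNew[1:N+1] = (U[2:N+2] + U[0:N] - 8 * h**2) / 2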

    Which form is more appropriate, the explicit loop or the array expression? That depends on the situation, and sometimes on the programmer's preference. However, array sections may place a burden on the compiler to generate efficient serial code. Consider the following 3 array assignment statements using array sections:

    X[1:N] = ( B[2:N+1] + C[0:N-1] ) / 2
    Y[1:N] = ( B[1:N] + C[1:N] ) / 2
    Z[2:N-1] = X[2:N-1] * Y[2:N-1]

    A straightforward translation of this code will generate 3 for loops:

    for i = 1 to N
          X[i] = ( B[i+1] + C[i-1] ) / 2
    end for
    for i = 1 to N
          Y[i] = ( B[i] + C[i] ) / 2
    end for
    for i = 2 to N-1
          Z[i] = X[i] * Y[i]
    end for

    This code can put a lot of pressure on the cache if N is large. We would like the compiler to fuse the loops, for example as follows:

    for i = 1 to N
          X[i] = ( B[i+1] + C[i-1] ) / 2
          Y[i] = ( B[i] + C[i] ) / 2
    end for
    for i = 2 to N-1
          Z[i] = X[i] * Y[i]
    end for

    Implicit Communication. Operations on conformable array sections may require interprocessor communication. In theory, each array element is assigned its own virtual processor, and so the expression

    B[ 1:10 ] + B[ 11:20 ]

    involves implicit communication between the elements of B[1:10], which live on virtual processors 1:10, and the elements of B[11:20], which live on virtual processors 11:20. However, as we will see below, virtual processors get mapped onto physical processors, and much of the implicit communication reduces to local memory accesses. Here are some further examples of operations that involve implicit data motion:

    A[ 1:10 ] = B[ 1:10 ] + B[ 11:20 ]
    C[ 1:5:2 ] = C[ 2:6:2 ]    ! shift noncontiguous sections
    D = D[ 10:1:-1 ]           ! permutation (reverse)
    A = [1,0,2,0,0,0,4]
    I = [1,3,7]
    B = A[I]                   ! B = [1,2,4], "gather"
    C[I] = B                   ! C = A, "scatter"; no duplicates allowed in I
    D = A[[ 1, 1, 3, 3 ]]      ! replication: D = [1,1,2,2]
    B = CSHIFT( A, 1 )         ! B = [4,1,0,2,0,0,0], circular shift
    B = TRANSPOSE( H )         ! transpose of a 2D array H
    B = SPREAD( A, 1, 3 )      ! if B is 3-by-7 then
                               ! B = [ 1,0,2,0,0,0,4 ]
                               !     [ 1,0,2,0,0,0,4 ]
                               !     [ 1,0,2,0,0,0,4 ]
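
    These data-motion operations have rough NumPy counterparts, shown below for illustration only (np.roll and np.tile stand in for CSHIFT and SPREAD; indices are 0-based):

    import numpy as np

    A = np.array([1, 0, 2, 0, 0, 0, 4])
    I = np.array([0, 2, 6])        # 0-based version of [1, 3, 7]
    B = A[I]                       # gather:  B = [1, 2, 4]
    C = np.zeros_like(A)
    C[I] = B                       # scatter (no duplicates in I)
    D = A[[0, 0, 2, 2]]            # replication: D = [1, 1, 2, 2]
    S = np.roll(A, 1)              # circular shift: S = [4, 1, 0, 2, 0, 0, 0]
    T = np.tile(A, (3, 1))         # 3-by-7 array, each row a copy of A
    M = T.T                        # matrix transpose: 7-by-3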

    Conditional Operation. There is also a where statement, which is like a conditional. where statements may not be nested, and only assignment is permitted in the body of a where. Consider the following loop, where we must protect against illegal square roots and division by zero:

    for i = 1 to N
          if (X[i] <= 0) then
              X[i] = sqrt(abs(X[i]))
          else
              X[i] = 1.0/X[i]
          end if
    end for

    This may be written more succinctly using the where clause:

           where (X <= 0)
               X = sqrt(abs(X))
           elsewhere
               X = 1.0/X
           end where
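
    The same guarded update can be written with NumPy boolean masks, which play the role of the where / elsewhere branches (a sketch with made-up values):

    import numpy as np

    X = np.array([-4.0, 0.0, 2.0, 9.0])
    neg = X <= 0
    X[neg]  = np.sqrt(np.abs(X[neg]))      # "where" branch
    X[~neg] = 1.0 / X[~neg]                # "elsewhere" branch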
    

    Data Layout. Whether or not a statement will be executed in parallel depends on how the data it accesses is laid out on the machine. An array spread across all p processors can enjoy p-fold parallelism in operations on it, whereas operations on arrays that are not distributed run sequentially.

    Briefly, scalar data and arrays which do not participate in any array operations (like A=B+C) will be replicated on all processors. All processors will perform identical operations on such scalar data, which is equivalent to serial execution. We need compiler directives to deal with various issues in balancing the work evenly over the processors, and also to help ensure that arrays are laid out the way the user wants them to be. We'll use the following syntax to distribute arrays, using pragmas, which are non-executable declarations.

    #pragma distribute A[ BLOCK, : ], B[ BLOCK, : ], C[ :, BLOCK ]
    float A[64,8], B[64,8], C[64,8]
    A = B
    A = C

    In this code fragment, the distribute pragma tells the compiler where to store the arrays A, B, and C. Suppose we have a 64-processor machine. Here are two natural ways to store a 64-by-8 array:

        Mem 0     Mem 1            ...                Mem 63
    
        A[1,1]    A[2,1]           ...                A[64,1]
        A[1,2]    A[2,2]           ...                A[64,2]
         ...       ...             ...
        A[1,8]    A[2,8]           ...                A[64,8]
    
    and
        Mem 0     Mem 1    ...   Mem 7   Mem 8  ...   Mem 63
    
        C[1,1]    C[1,2]   ...   C[1,8]   xxx   ...    xxx
        C[2,1]    C[2,2]   ...   C[2,8]   xxx   ...    xxx
         ...       ...            ...
        C[64,1]   C[64,2]  ...   C[64,8]  xxx   ...    xxx
    

    The former is specified by A[ BLOCK, : ], which says that different values of the first subscript are stored in different processor memories, while different values of the second subscript are stored sequentially within the same memory. Thus there is parallelism available along the BLOCK dimension but not along the : dimension. The latter is specified by C[ :, BLOCK ], which distributes the second subscript across the processor memories and stores different values of the first subscript sequentially within the same memory.

    Since A and B have the same layout, the statement A=B is perfectly parallel. On the other hand, A=C essentially performs a matrix transpose, which incurs a large amount of communication. When optimizing programs for time, it is essential to choose layouts that minimize communication.

    We also have the keyword CYCLIC, with its blocked form CYCLIC(k). Block cyclic decompositions are used extensively in numerical linear algebra; parallel Gaussian elimination, for example, uses two-dimensional block cyclic layouts.
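
    To make the distributions concrete, here is a small Python sketch (my own notation, not HPF) of which processor owns element i of an n-element array under BLOCK, CYCLIC, and CYCLIC(k):

    def owner_block(i, n, p):
        """BLOCK: contiguous chunks of ceil(n/p) elements per processor."""
        b = -(-n // p)                      # ceiling division
        return i // b

    def owner_cyclic(i, p):
        """CYCLIC: elements dealt out round-robin, one at a time."""
        return i % p

    def owner_block_cyclic(i, k, p):
        """CYCLIC(k): blocks of k elements dealt out round-robin."""
        return (i // k) % p

    # With n = 16 elements on p = 4 processors:
    print([owner_block(i, 16, 4) for i in range(16)])         # 0000 1111 2222 3333
    print([owner_cyclic(i, 4) for i in range(16)])            # 0123 0123 0123 0123
    print([owner_block_cyclic(i, 2, 4) for i in range(16)])   # 0011 2233 0011 2233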

    Libraries and Intrinsics. Data parallel languages usually provide intrinsic functions for moving data, such as the CSHIFT, TRANSPOSE, and SPREAD operations shown above. If you can find an appropriate intrinsic routine, it will often run faster than code you write yourself.

    More information about the data parallel language HPF can be found in Chapter 7 of Ian Foster's text.

    Programming Example: Many body simulations

    We talked about many body simulations earlier in the course. In many body problems, a collection of bodies, called particles (the Greek word somati'dion literally means "little body"), interact according to a force law. The force law depends on the application; gravitational interaction is one example. In particle computations we compute the trajectory of the particles at discrete instants of time called timesteps. At each timestep we compute the force induced by each particle on every other particle, and then we advance the position of each particle under that force for one timestep of size dt. The basic algorithm is

         t = 0
         while t < t_final
             for i = 1 to n                  ... n = number_of_particles
                 compute f[i] = force on particle i
                 move particle i under force f[i] for time dt
             end for
             compute interesting properties of particles
             t = t + dt                      ... dt = time increment
         end while
    
    The forces are expensive to compute because the force on each particle depends on all the other particles. The simplest expression for the force f[i] on particle i is
         for i = 1 to n
             f[i] = Σ_{j = 1..n, j != i} f(i,j)
         end for
    
    where f(i,j) is the force on particle i due to particle j. Thus, the cost grows as O(n^2). We will look at a simple force: gravitational interaction. Assuming that each body has a mass m[i], and is located at position ( x[i], y[i], z[i] ), then f(i,j) is:
                                  x[i] - x[j]   y[i] - y[j]   z[i] - z[j]
    force = G * m[i] * m[j] *   ( ----------- , ----------- , ----------- )
                                     r^3           r^3            r^3
    
    where
    r = sqrt ( (x[i] - x[j])^2 + (y[i] - y[j])^2 + (z[i] - z[j])^2)
    
    and G is the gravitational constant.
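
    A small Python sketch of this pairwise force, following the formula above (the function name and the test values are made up for illustration):

    import numpy as np

    G = 6.674e-11                          # gravitational constant

    def pair_force(xi, xj, mi, mj):
        """f(i,j): force on particle i due to particle j, per the formula above."""
        d = xi - xj                        # (x[i]-x[j], y[i]-y[j], z[i]-z[j])
        r = np.sqrt(np.sum(d * d))
        return G * mi * mj * d / r**3

    print(pair_force(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), 2.0, 3.0))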

    To solve this problem in a data parallel fashion, we use the following program, where we use the notation [x, y, z] to pack three quantities into a single object (this is not standard HPF, but it simplifies our discussion):

    float x[N], y[N], z[N], m[N], force[N]
    
    for k = 1 to N
        force = force +  F([x, y, z], m, [x[k], y[k], z[k]], m[k])
    end for
    
    
    where we have used an intuitive vector notation, and F() is appropriately defined to ignore the self-induced force (the constant G is omitted):

        vector3 function F ( [x, y, z], m, [xk, yk, zk], mk )
            d = sqrt ( (x-xk)^2 + (y-yk)^2 + (z-zk)^2 )
            where (d != 0)
                F = m * mk * [ x-xk, y-yk, z-zk ] / d^3
            end where
        end function
    
    
    Is this algorithm feasible? Let's look at the communication cost. For each iteration of the k loop we must compute n forces (ignoring the degenerate computation of 0). However, because the computation within F() involves scalar values (the components x[k], y[k], z[k] and the mass m[k]), these scalars must be broadcast to all other processors. If we are running on p < n processors, this costs log(p) message starts per iteration of the k loop.

    Alternatively, let's configure the processors in a ring and keep a second copy of the x, y, z, and m vectors. We then pass the copies around the ring using the CSHIFT operation, one CSHIFT per iteration of the k loop:

    float x[N], y[N], z[N], m[N], force[N]
    float xc[N], yc[N], zc[N], mc[N]
    
    [xc,yc,zc] = [x,y,z]
    mc = m
    
    
    for k = 1 to N
        force = force +  F([x, y, z], m, [xc, yc, zc], mc )
        xc = CSHIFT( xc, 1 )            ! similarly shift yc, zc, and mc
    end for
    
    
    where we now have

        vector3 function F ( [x, y, z], m, [xc, yc, zc], mc )
            d = sqrt ( (x-xc)^2 + (y-yc)^2 + (z-zc)^2 )
            where (d != 0)
                F = m * mc * [ x-xc, y-yc, z-zc ] / d^3
            end where
        end function
    
    Now we have only vector operations, so there is no need to broadcast: there is just 1 message start per step versus log(P) before. Of course, we are now moving a chunk of the position and mass vectors at each step, so we must include the cost of moving the data; the communication cost for one step is
    	Ts + (N/P)*W*Tw for the ring algorithm vs.
    	log(P)*Ts for the broadcast algorithm
    	(for the broadcast we can ignore the time to move the data, since only a few scalars are sent per step)
    
    where W is the number of bytes in the position components and the mass. Assuming 4-byte floating point numbers, W = 4 × 4 = 16. When will our ring algorithm be faster? When
    	
    	Ts + (N/P)*16*Tw < log(P)*Ts
    
    or, subtracting Ts from both sides and solving for N (with logs base 2, log(P) - 1 = log(P/2)), when

    	N < (Ts/Tw) * (P/16) * log(P/2)
    
    Consider Valkyrie, where Ts/Tw ≈ 2000. If we use 16 processors, then the ring algorithm is faster for N up to approximately 2000 × (16/16) × log(8) = 6000.
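
    For reference, here is a serial NumPy sketch of the ring scheme, with np.roll standing in for CSHIFT (this only illustrates the data movement pattern on one processor; the positions and masses are random test values, and G is omitted as in the pseudocode above):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 16
    pos   = rng.random((N, 3))             # x, y, z packed as columns
    mass  = rng.random(N)
    force = np.zeros((N, 3))

    pc, mc = pos.copy(), mass.copy()       # the travelling copies
    for k in range(N):
        d = pos - pc                       # componentwise differences
        r = np.sqrt(np.sum(d * d, axis=1))
        ok = r != 0                        # skip the self-interaction
        force[ok] += (mass[ok] * mc[ok] / r[ok]**3)[:, None] * d[ok]
        pc = np.roll(pc, 1, axis=0)        # circular shift, like CSHIFT(xc, 1)
        mc = np.roll(mc, 1)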



