In order to achieve parallel execution, data must be distributed by means of the DISTRIBUTE directive. Programs without any DISTRIBUTE directives are always compiled to run serially.
For parallel execution of array operations, each array must be split up in memory, with each processor storing some portion of the array in its own local memory. Splitting up the array into parts is known as distributing the data. The DISTRIBUTE directive controls the distribution of arrays across each processor's local memory.
Because communication of data is very time consuming, a distribution of data that minimizes communication between processors is absolutely critical for application performance.
The DISTRIBUTE directive specifies a mapping pattern of data objects onto processors. It is used with the two keywords BLOCK and CYCLIC, which specify the distribution pattern.
The use of the DISTRIBUTE directive is best explained by examining some example distributions. Consider the case of a 16 x 16 array A in an environment with 4 processors. Here is one possible specification for A:
REAL A(16,16) !HPF$ DISTRIBUTE A(*, BLOCK)
Figure 2-1 shows this distribution.
The asterisk ( * ) for the first dimension of A means that the array elements are not distributed along the first (vertical) axis. In other words, the elements in any given column are not divided up among different processors, but assigned as a single block to one processor. This type of mapping is sometimes called "on processor" distribution. It can also be referred to as "collapsed" or "serial" distribution.
The BLOCK keyword for the second dimension means that for any given
row, the array elements are distributed over each processor in
large blocks. The blocks are of approximately equal size, with each
processor assigned to only one block. As a result, A
is broken into four contiguous groups of columns, with each group
assigned to one processor.
Another possibility is ( * , CYCLIC) distribution. As in ( * , BLOCK), the elements in any given column are assigned as a single block to one processor. The elements in any given row, however, are dealt out to the processors in round-robin order, like playing cards dealt out to players around the table. When elements are distributed over n processors, each processor, starting from a different offset, contains every n[th] column. Figure 2-2 shows the same array and processor arrangement, distributed CYCLIC, instead of BLOCK.
The pattern of distribution is figured independently for each dimension: the elements in any given column of the array are distributed according to the keyword for the first dimension, and the elements in any given row are distributed according to the keyword for the second dimension. For example, in an array distributed (BLOCK, CYCLIC), the elements in any given column are layed out in blocks, and the elements in any given row are layed out cyclically, as in Figure 2-3:
Figure 2-4 shows an example array distributed (BLOCK, BLOCK):
BLOCK, BLOCK distribution divides the array into large rectangles. The array elements in any given column or any given row are divided into two large blocks. In the above example, processor 0 gets A(1:8, 1:8), processor 1 gets A(9:16, 1:8), processor 2 gets A(1:8, 9:16), and processor 3 gets A(9:16, 9:16).
There is no simple rule for computing data distribution, because optimal distribution is highly algorithm-dependent. When the best-performing distribution is not obvious, it is possible to find a suitable distribution through trial and error, because the DISTRIBUTE directive affects only the performance of a program (not the meaning or result). In many cases, however, you can find an appropriate distribution simply by answering the following questions:
If the algorithm is oriented toward a certain dimension, the DISTRIBUTE directive can be used to map the data appropriately. For example, ( * , BLOCK) is vertically oriented, whereas (BLOCK, * ) is horizontally oriented (for detailed distribution illustrations, see Section 6.3.6).
Nearest-neighbor calculations generally run faster with a BLOCK distribution, because this lets the processor calculating any given array element have all of the necessary data in its own memory in most cases. The Digital Fortran 90 compiler includes an optimization which minimizes communication in nearest-neighbor calculations even along the edges of blocks (see Section 3.5.2). For an example of a nearest neighbor calculation, see Chapter 3, Solving Nearest Neighbor Problems.
When the calculation of an array element requires information from distant elements in the array, a CYCLIC distribution is frequently faster because of load-balancing considerations. This turns out to be the case for LU decomposition.
In the LU decomposition example, the submatrix modification uses information from other columns and rows, so it has neither a row-wise nor column-wise orientation. The column normalization statement, however, has an entirely column-wise orientation, needing information from a single column of the matrix only. Therefore, a column-wise orientation is preferred, either ( * , BLOCK) or ( * , CYCLIC).
Both of these structures make use of distant elements in the array, which means that little advantage would be gained from a block distribution. On the other hand, there is much to be gained from using a cyclic distribution in the case of our algorithm. To see why this is the case, see Figure 2-5, which depicts the computation with ( * , BLOCK) distribution (the illustration shows a 16 by 16 array worked on by four processors):
Computation is done on progressively smaller submatrices in the lower right-hand corner of the array. The first panel of the figure shows the first iteration of the DO loop, in which the entire array is worked on by all four processors. The second panel shows the seventh iteration, by which time Peer 0 is completely idle, because none of the elements of the submatrix are stored in its memory. The third panel of the figure shows the eleventh iteration of the DO loop, by which time both Peer 0 and Peer 1 are idle. The fourth panel shows the fifteenth iteration, where only Peer 3 is working, with the other three processors idle. For most of time spent in the DO loop, one or more processors are left idle.
In contrast, ( * , CYCLIC) distribution has all four processors active until only 3 out of 16 columns remain to be completed (see Figure 2-6. This load balancing consideration makes ( * , CYCLIC) distribution the better choice for this algorithm.
If you are familiar with the low-level details of parallel programming, you may wonder how any speed-up is achieved with the LU decomposition algorithm, because the sub-matrix modification appears to require a separate communication for each element in the submatrix. If the submatrix were 1000 by 1000, this would mean one million communications for each iteration of the outer DO loop. This would clearly cost considerably more time than any speed-up achieved through parallelization, because message start-up time overwhelmingly overshadows the sending of the actual data for small messages. However, the Digital Fortran 90 compiler minimizes this communication cost through communications vectorization. Instead of sending one separate message for each array element, messages with the same destination are packaged in very large bundles or "vectors", greatly reducing message start-up overhead.
Even though LU decomposition is not an embarrassingly parallel algorithm, parallel speed-up for this algorithm with the Digital Fortran 90 compiler is excellent. With sufficiently large arrays, parallel speed-up comes close to scaling linearly: with performance increasing in near direct proportion to the number of processors used.