FORALL and Fortran 90 array assignment statements by themselves are not enough to achieve parallel execution. In order to achieve any improvement in performance through parallel execution, these structures must operate on arrays that have been distributed using the DISTRIBUTE directive. In general, parallel performance cannot be achieved without explicit use of the DISTRIBUTE directive.
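As a minimal sketch of this point, the following program places a DISTRIBUTE directive on one array and aligns a second array with it before performing array assignments. The program name, the array names a and b, and the size n are hypothetical and chosen only for illustration. Because HPF directives are structured comments, a Fortran compiler without HPF support ignores them and runs the same code serially.

```fortran
program distribute_sketch
  implicit none
  integer, parameter :: n = 16     ! hypothetical problem size
  real :: a(n, n), b(n, n)

  ! Without these directives the array assignments below give the
  ! compiler no distributed data to operate on in parallel.
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)
!HPF$ ALIGN b(:, :) WITH a(:, :)

  a = 1.0                          ! Fortran 90 array assignment
  b = 2.0 * a + 1.0                ! element-wise; b(i, j) lives with a(i, j)

  print *, sum(b)
end program distribute_sketch
```

Aligning b with a keeps corresponding elements on the same processor, so the element-wise assignment should need no interprocessor communication.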
Because communication of data is very time consuming, a distribution of data that minimizes communication between processors is absolutely critical for application performance.
In many calculations (such as the LU decomposition example considered in Chapter 2), cyclic distribution proves to be preferable because of load-balancing considerations. However, in nearest neighbor problems such as the heat-flow example, it is advantageous to keep nearby elements on the same processor as much as possible in order to minimize communication. Distributing the array in large blocks allows all four nearest-neighbor elements to be on the same processor in most instances. If this is done, only the elements along the perimeter of the blocks require communication between processors in order to obtain the values of neighboring elements.
Large blocks can be obtained with the keyword BLOCK. For two-dimensional nearest neighbor problems, the most important options are (BLOCK, BLOCK) distribution and (*, BLOCK) distribution. In the current version of Digital Fortran 90, (*, BLOCK) distribution is highly optimized (see Section 3.5.2), and frequently produces superior results. Because distributions are so easy to change in HPF, it can be worthwhile to time both options, and choose whichever performs better.
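The following sketch shows how little has to change to try both distributions on a heat-flow style update. The array names t and tnew, the sizes n and nsteps, and the boundary values are assumptions made for illustration; they are not taken from the heat-flow example elsewhere in this manual. Only the DISTRIBUTE line differs between the two options.

```fortran
program heat_distribution_sketch
  implicit none
  integer, parameter :: n = 16, nsteps = 10   ! hypothetical sizes
  real :: t(n, n), tnew(n, n)
  integer :: i, j, step

  ! Option 1: two-dimensional blocks.
!HPF$ DISTRIBUTE t(BLOCK, BLOCK)
  ! Option 2: whole columns on each processor. To time it, comment out
  ! the directive above and activate the line below by removing the
  ! leading "! " so that it begins with !HPF$.
  ! !HPF$ DISTRIBUTE t(*, BLOCK)
!HPF$ ALIGN tnew(:, :) WITH t(:, :)

  t = 0.0
  t(1, :) = 100.0                  ! assumed hot boundary along one edge
  tnew = t

  do step = 1, nsteps
    ! Each interior element becomes the average of its four neighbors.
    forall (i = 2:n-1, j = 2:n-1)
      tnew(i, j) = 0.25 * (t(i-1, j) + t(i+1, j) + t(i, j-1) + t(i, j+1))
    end forall
    t = tnew
  end do

  print *, t(2, n/2)
end program heat_distribution_sketch
```

The computational statements are identical under both options, which is what makes timing each distribution inexpensive.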
Figures showing (*, BLOCK) distribution and (BLOCK, BLOCK) distribution can be found in Chapter 2 (Figures 2-1 and 2-4) and in Chapter 6 (Figures 6-5, 6-6, 6-17, and 6-18).
For a fuller discussion of basic distribution options, see Section 2.3.
The Digital Fortran 90 compiler optimizes nearest neighbor calculations to reduce communication. During the computation, each processor sends its shadow edges to every processor that needs that data. For instance, if a 16 x 16 array A is allocated (BLOCK, BLOCK) on a machine with 16 processors, the array can be thought of as divided into 16 blocks, as follows:
In addition to the area in each processor's memory allocated to the storage of that processor's block of the array, extra space is allocated for a surrounding shadow area, which holds copies of the array elements owned by neighboring processors that are needed for the computation. Within any one program unit, the compiler automatically determines the correct shadow-edge widths on an array-by-array, dimension-by-dimension basis. The shadow area for array A in the memory of one of the sixteen processors is shown shaded in Figure 3-2.
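The index arithmetic behind the shadow area can be sketched as follows for one dimension of the 16 x 16 example. This is an illustration only: it assumes that the 16 processors are arranged 4 x 4, that the shadow-edge width is 1 (as a four-point nearest-neighbor update would need), and that the shadow is clipped at the array boundary. The compiler's actual storage layout is not described here, and all variable names are hypothetical.

```fortran
program shadow_bounds_sketch
  implicit none
  integer, parameter :: n  = 16    ! array extent in this dimension
  integer, parameter :: np = 4     ! assumed processors along this dimension
  integer, parameter :: w  = 1     ! assumed shadow-edge width
  integer :: bs, p, lo, hi, slo, shi

  bs = (n + np - 1) / np           ! BLOCK size: ceiling(n / np)

  do p = 1, np
    lo  = (p - 1) * bs + 1         ! first index owned by processor p
    hi  = min(p * bs, n)           ! last index owned by processor p
    slo = max(lo - w, 1)           ! owned range widened by the shadow edge,
    shi = min(hi + w, n)           ! clipped where no neighbor exists
    print '(a,i2,a,i3,a,i3,a,i3,a,i3)', 'proc ', p, ': owns ', lo, ':', hi, &
          ', with shadow ', slo, ':', shi
  end do
end program shadow_bounds_sketch
```

Under these assumptions, the second processor along a dimension owns indices 5 through 8 and also holds copies of indices 4 and 9, which are owned by its neighbors.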
The Digital Fortran 90 compiler automatically detects nearest-neighbor algorithms, and allocates appropriately-sized shadow edges. The -show hpf option indicates which statements were detected by the compiler as nearest-neighbor statements.
In order to make use of the compiler's nearest-neighbor optimization, Digital recommends coding nearest-neighbor algorithms using FORALL or Fortran 90 array syntax, rather than using INDEPENDENT DO loops.
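The two recommended forms can be sketched side by side as follows. The array names and the size n are hypothetical, and whether the compiler recognizes a particular statement as a nearest-neighbor statement depends on the compiler's own conditions, which this sketch does not attempt to reproduce; the -show hpf option reports what was actually detected.

```fortran
program stencil_forms_sketch
  implicit none
  integer, parameter :: n = 16     ! hypothetical problem size
  real :: a(n, n), b(n, n), c(n, n)
  integer :: i, j
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)
!HPF$ ALIGN b(:, :) WITH a(:, :)
!HPF$ ALIGN c(:, :) WITH a(:, :)

  call random_number(a)
  b = a
  c = a

  ! FORALL form of the four-point average.
  forall (i = 2:n-1, j = 2:n-1)
    b(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
  end forall

  ! Fortran 90 array-syntax form of the same update, written with
  ! shifted array sections in place of index expressions.
  c(2:n-1, 2:n-1) = 0.25 * (a(1:n-2, 2:n-1) + a(3:n, 2:n-1) &
                          + a(2:n-1, 1:n-2) + a(2:n-1, 3:n))

  ! Largest difference between the two results.
  print *, maxval(abs(b - c))
end program stencil_forms_sketch
```

Both statements describe the same update, so the printed difference should be zero or at rounding level.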
You can also use the SHADOW keyword and the -nearest_neighbor command-line option to control the nearest neighbor optimization; see Sections 6.3.7 and 8.1.1.4.
For more information on the -show hpf option, see Section 8.1.1.7.