This section explains data mapping in HPF. Proper data mapping is critical for the performance of any HPF program. The discussion of data mapping is divided as follows:
Section 6.3.6 (DISTRIBUTE Directive) includes an extensive set of figures showing many of the major distributions for one- and two-dimensional arrays.
HPF is designed for data parallel programming, a programming model in which the work of large-scale uniform array operations is divided up among a number of processors in order to increase performance. In data parallel programming, each array is split up into parts, and each part is stored on a different processor. In most cases, it is most efficient for operations to be performed by the processor storing the data in its local memory. If the arrays are mapped onto the processors in such a way that each processor has most of the information necessary to perform a given array operation on the part of the array stored locally, each processor can work independently on its own section of the array at the same time the other processors are working on other sections of the array. In this manner, an array operation is completed more quickly than the same operation performed by a single processor. In the optimal case, the speed-up scales linearly; in an environment with n processors, the operation is completed n times faster than with a single processor.
This section explains HPF's general conception or model of data mapping. After explaining basic concepts, a brief code fragment describing a sample data mapping, together with a series of figures that represent this mapping schematically, is shown.
In HPF's data mapping model, arrays are aligned into groups, which are distributed onto an abstract processor arrangement. The underlying software environment (in this case, PSE in conjunction with the Digital UNIX operating system) maps this processor arrangement onto the physical processors in the PSE cluster.
HPF data mapping can be thought of as occurring in the following five stages:
Although the program must explicitly specify at least array declaration and distribution (stages 1 and 4) in order to successfully map an array, it is not usually necessary for the program to specify all five stages.
It is easiest to summarize these five stages pictorially. Each of the illustrations on the following pages show one of the five stages. The illustrations of the first four stages are based on the code fragment in Example 6-1.
REAL A(12, 12) ! Array
REAL B(16, 16) ! and template
!HPF$ TEMPLATE T(16,16) ! declarations
!HPF$ ALIGN B WITH T ! Array
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2) ! alignment
!HPF$ PROCESSORS P(2, 2) ! Declaration of processor arrangement
!HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P ! Array distribution
The code fragment in Example 6-1 does not do anything; it represents only the mapping of data in preparation for some other array operations not specified here.
Assume that Example 6-1 is part of a
larger program whose source code is called foo.f90 ,
and whose executable file is foo.out . The following
two command lines show user control over the fifth stage of HPF data
mapping, at compile time, and at run time:
% f90 -wsf 4 -o foo.out foo.f90 % foo.out -peers 4 -on Fred,Greg,Hilda,Ingrid
Stage 1: Array Declaration (required) and Template Declaration (optional)
REAL A(12, 12)
REAL B(16, 16)
!HPF$ TEMPLATE T(16, 16)
Array declaration, which is the same as in a nonparallel Fortran environment, is mandatory. Template declaration is optional. Templates are used for data alignment (see the discussion of ).
Stage 2: Array Alignment (optional)
!HPF$ TEMPLATE T(16,16) !HPF$ ALIGN B WITH T !HPF$ ALIGN A(i, j) WITH T(i+2, j+2)
When two arrays will be interacting with one another, it is usually advantageous to use the ALIGN directive. The ALIGN directive ensures that corresponding elements of two arrays are always stored on the same processor.
Arrays are lined up together onto a template. A template describes an index space with a specified shape, but with no content. A template can be thought of as "an array of nothings". No storage is allocated for templates. In this example, the two arrays A and B are aligned with the template T using the ALIGN directive. B is aligned with the whole template T, whereas A is aligned with only part of T.
In the ALIGN directives, arrays A and B are the alignees, and template T is the align target. The subscripts i and j are dummy variables, which do not represent any specific values. They refer to all the valid subscript values for A. They are used to specify the correspondence between elements of A and elements of T.
The explicit naming of templates is required only for certain specialized alignments. In most cases, it is possible to use a template implicitly, by aligning the arrays with one another (see Section 6.3.6.10).
Stage 3: Declaration of Abstract Processor Arrangement (optional)
!HPF$ PROCESSORS P(2, 2)
This stage of the data mapping process defines a conceptual arrangement of the processors in preparation for a DISTRIBUTE directive (see Stage 4). Processor arrangements are called "abstract", because at compile time the processors in the arrangement are not yet identified with particular physical processors. If the program does not contain a PROCESSORS directive, the compiler defines an appropriate default processor arrangement.
When processor arrangements are explicitly declared, they must be defined to conform with both the anticipated array distribution (stage 4), and the anticipated size of the PSE cluster on which the program will be run. In this example, a 2 by 2 arrangement is used, because it is two dimensional (to support the two-dimensional BLOCK, BLOCK distribution used in stage 4), and contains a total of four processors (to conform to the PSE cluster that used in stage 5 at run time). It is also possible to determine the size of the processor arrangement dynamically at run time using the intrinsic function NUMBER_OF_PROCESSORS().
Stage 4: Distribution of the Arrays onto the Processor Arrangement (required)
DISTRIBUTE T(BLOCK, BLOCK) ONTO P
Distribution means dividing up the storage of the arrays among the processors. This usually means that each processor has only a subsection of each array in its own local memory. Array distribution must be explicitly selected by the program in order to achieve any parallel speed-up. Proper selection of a distribution is absolutely critical for application performance.
Many of these are explained in Section 6.3.6. This example uses (BLOCK, BLOCK) distribution, one of a large number of possibilities.
In the case of the example distribution (BLOCK, BLOCK), it is useful to visualize the arrays as superimposed over the processor arrangement. However, other distributions require more complex visualizations. A number of example illustrations can be found in Section 6.3.6.
Because arrays A and B have already been aligned with template T, the distribution of both arrays is implied when T is distributed.
When templates are not explicitly named, array names can be used in place of template names in DISTRIBUTE directives. See Section 6.3.6.10.
The ONTO clause may be used only when a processor arrangement has been explicitly declared. When an ONTO clause is not specified, the array is distributed onto a default processor arrangement. See Section 6.3.6.11.
Stage 5: Mapping of the Processor Arrangement onto Physical Processors
This final stage of data distribution is handled transparently by PSE at run time. PSE environment variables or command-line options can be used to include or exclude particular machines. These are described in Chapter 8, Compiling and Running Parallel Programs.
In this example, the program is run in an environment comprising workstations named Kate, Mary, Dan, and Bob. If desired, you can specify the hosts to be included in the execution, as in the example. It is usually better to leave out this specification, so that PSE can select PSE cluster members based upon load-balancing considerations.
The -wsf n compile-time command line option controls the number of processors that the program is designed to use:
% f90 -wsf 4 -o a.out a.f90
The number of processors [n] specified by the -wsf option must be equal to the number of processors specified in the PROCESSORS directive.
The -peers command line option can be used at run time to specify the number of processors to be used, and the -on command line option can be used to specify particular hosts:
% a.out -peers 4 -on Kate,Mary,Dan,Bob
In this example, the number of peers is equal to the number of hosts specified with -on. However, in some cases the number of hosts will be less than the number of peers, such as when the -virtual option is used.
The ALIGN directive is used to specify that certain data objects are to be mapped in the same way as a certain other data objects. Corresponding elements in aligned arrays are always mapped to the same processor; array operations between aligned arrays are usually more efficient than array operations between arrays that are not known to be aligned. Arrays can in some cases be aligned through careful use of matching DISTRIBUTE directives; however, Digital recommends that you use ALIGN, which is more general and more convenient.
The most common use of ALIGN is to specify that the corresponding elements of two or more arrays be mapped identically, as in the following example:
!HPF$ ALIGN A WITH B
This example specifies that the two arrays A and B are always distributed the same way. More complex alignments can also be specified. For example:
!HPF$ ALIGN E(i) WITH F(2*i-1)
In this example, the elements of E correspond to the odd elements of F. In this case, E can have, at most, half as many elements as F.
As shown in the example given in Section 6.3.2, an array can be aligned with the interior of a larger array or template:
REAL A(12, 12)
REAL B(16, 16)
!HPF$ TEMPLATE T(16,16)
!HPF$ ALIGN B WITH T
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2)
In this example, the 16 x 16 array B is aligned with the template T of the same size, and the 12 x 12 array A is aligned with the interior of T. Because A and B are both aligned with the template T, A and B are said to be indirectly aligned. Each interior element of B is always stored on the same processor as the corresponding element of A :
When an asterisk ( * ) is specified for a given dimension, it specifies that alignment occurs between the non-asterisk dimensions of the alignee and the align target. For example:
!HPF$ ALIGN P(i) WITH Q(*, i)
In this example, P is aligned with the second dimension (with every row) of Q. Each element of P is available on the same processor as every element in the corresponding column of Q. This means that any given element P(i) is available on each processor that stores any element in the ith column of Q. Depending upon the mapping of Q, P may need to be partially or fully replicated in order to achieve this result.
When a whole array is aligned, the ALIGN directive can be written either with an align subscript, like this:
!HPF$ ALIGN b(i) WITH c(i)
or without an align subscript, like this:
!HPF$ ALIGN b WITH c
These two forms have slightly different semantics. When an align subscript is used, the align target is permitted to be larger than the alignee. Also, elements whose subscripts are equal are aligned, regardless of what the lower bound of each array happen to be.
When an align subscript is not used, the alignee and the align target must be exactly the same size. Corresponding elements are aligned beginning with the lower bound of each array, regardless of whether the subscripts of the corresponding elements are equal.
Using (or not using) an align subscript can have an effect on performance when the arrays are allocatable.
Other more complex alignments are possible. See the High Performance Fortran Language Specification.
Circular alignments are not permitted. For example, the following code is illegal:
!HPF$ ALIGN A WITH B !HPF$ ALIGN B WITH A ! Illegal circular alignment!
Each array can be the alignee (to the left of the WITH) only once. When a given set of data objects are aligned with each other, the object array or template) that is never an alignee (is never to the left of the WITH) is known as the ultimate align target. Only the ultimate align target is permitted to appear in a DISTRIBUTE directive. The other arrays that are aligned with the ultimate align target are implicitly distributed together with the ultimate align target.
The ALIGN directive causes data objects to be mapped across processors only if the the ultimate align target appears in a DISTRIBUTE directive. For more information, see Section 6.1.4.
Because the ALIGN directive implicitly determines the distribution of the aligned arrays, it has a direct effect on how much or little communication occurs among the processors. A poorly chosen alignment can cause severe application performance degradation, whereas a well chosen alignment can cause dramatic improvement in performance.
A template is an empty array space (or an array of nothings). A template is used as an align target (the object after WITH in an ALIGN directive), and can be distributed with the DISTRIBUTE directive.
For most programs, declaration of an explicit template is not necessary. When you do not explicitly declare a template, you can use an array name in place of a template name in the ALIGN and DISTRIBUTE directives. For an example, see Section 6.3.6.10.
Because they have no content, no storage space is allocated for templates. Templates are declared in the specification part of a scoping unit with the !HPF$ TEMPLATE directive.
Templates cannot be in COMMON. Two templates declared in different scoping units are always distinct, even if they are given the same name. Templates cannot be passed through the subprogram argument interface. For an example of passing an array that is aligned with a template to a subprogram, see Section 6.4.6.
Some specialized alignments require the use of an explicit template. For example, an explicit template is needed when a particular array needs to be distributed over only some of the processors in the executing PSE cluster partition. This cannot be done by declaring a smaller processor
arrangement, because processor arrangements must always have exactly the same number of processors as the executing PSE cluster partition. However, an array can be restricted to a subset of the partition with the following technique: A template is distributed over a full-sized processor arrangement, after which an array can be aligned with a slice of the template. For instance:
!HPF$ PROCESSORS P(4, 4) !HPF$ TEMPLATE T(4, 4) !HPF$ DISTRIBUTE(BLOCK, BLOCK) ONTO P :: T !HPF$ ALIGN A(J) WITH T(J, 1)
This technique is used in an Input/Output (I/O) optimization explained in Section 7.11.4.
Another instance where explicit declaration of a template is useful is a program where smaller arrays are to be aligned with a larger index space, but no single array spans the entire index space. For example, if four nxn arrays are aligned to the four corners of a TEMPLATE of size (n+100)x(n+100) :
!HPF$ TEMPLATE, DISTRIBUTE(BLOCK, BLOCK) :: inclusive(n+100,n+100)
REAL, DIMENSION(n,n) ::NW, NE, SW, SE
!HPF$ ALIGN NW(i,j) WITH inclusive( i , j)
!HPF$ ALIGN NE(i,j) WITH inclusive( i , j+100 )
!HPF$ ALIGN SW(i,j) WITH inclusive( i+100, j )
!HPF$ ALIGN SE(i,j) WITH inclusive( i+100, j+100 )
In this example, the template inclusive allows the four smaller arrays to be aligned together and distributed, even though no single one of them spans the entire index space.
Rather than distributing arrays directly onto physical processors, HPF uses abstract processor arrangements, which allow distributions to be expressed without reference to any particular hardware configuration. This greatly improves the portability of parallel programs.
The use of processor arrangements also permits a greater variety of data mappings to be expressed. For instance, in a program written for 16 processors, processor arrangements can be declared not only of shape 4 x 4, but also 2 x 8, 8 x 2, 1 x 16, or 16 x 1. Even though all of these shapes have the same number of processors, each shape results in a different data mapping, because the distribution in any given dimension of an array is determined by the extent of the processor arrangement in that dimension (see Section 6.3.6.4).
Here are examples of declarations of a one-, two-, and three- dimensional processor arrangement:
!HPF$ PROCESSORS P(4) !HPF$ PROCESSORS Q(4,6) !HPF$ PROCESSORS R(4,3,3)
PROCESSORS, like TEMPLATE, is an optional directive. If an array is distributed without an explicit processor arrangement (see Section 6.3.6.11), the compiler creates a default processor arrangement.
The total number of processors in a processor arrangement (the product of the values specified for each of its dimensions) must be equal to the number of PSE peers specified at compile time and run time. If a general program that can run on any number of processors is desired, the processor arrangement must be dimensioned dynamically using the intrinsic function NUMBER_OF_PROCESSORS():
!HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())
In order to produce a general program that can run on any number of processors, use the -wsf option at compile time without any numerical argument.
Any number of peers is allowed, but performance is improved in some cases if the number of processors is a power of two.
You can simulate a PSE cluster larger than the number of available physical processors (CPUs) with the -virtual run-time option.
Although a processor arrangement smaller than the number of PSE peers in the executing PSE cluster partition is not permitted, the storage of an array can be restricted to a subset of the PSE cluster partition using the TEMPLATE directive.
Like array elements, processors in each dimension of an abstract processor arrangement are by default indexed starting with one (1). This is a different numbering system from that used by PSE for physical processors, in which physical processors (referred to as peers) are numbered starting with zero.
The choice of an appropriate distribution for any given algorithm is critical to application performance. A carefully chosen DISTRIBUTE directive can improve the performance of HPF code by orders of magnitude over otherwise identical code with a poorly chosen DISTRIBUTE directive.
All HPF data mappings are constructed from various combinations of two basic types of parallel distribution: BLOCK and CYCLIC.
The DISTRIBUTE directive has two basic forms, as shown in the following two example lines:
!HPF$ DISTRIBUTE A(CYCLIC, BLOCK) !HPF$ DISTRIBUTE A(CYCLIC, BLOCK) ONTO P
Use the ONTO clause when a template or array is distributed onto an explicitly named processor arrangement. Use the DISTRIBUTE directive without an ONTO clause when a processor arrangement is not explicitly named.
The template or array named in a DISTRIBUTE directive must be an ultimate align target.
In the distribution figures on the following pages, each distribution is shown in two views: array view and processor view. For each distribution, a code fragment is given showing the declaration of an array A and an abstract processor arrangement P, followed by the distribution of A onto P. The code fragment, which describes both views, is printed under the array view.
In array view, the data mapping is shown with a series of boxes, each box representing one array element. The large letter in each box represents the name of the workstation which stores that array element in its memory.
The pattern formed by the large letters is the most important feature of the array views. Although the arrays declared in the code fragments are artificially small (by several orders of magnitude), they are large enough to show the broad patterns that appear in more realistically sized arrays. For example, in BLOCK, BLOCK distribution the array elements are always divided into (roughly) square blocks (see Figure 6-5); in * , BLOCK they are grouped in broad vertical stripes (see Figure 6-17); in * , CYCLIC they are grouped in narrow vertical stripes (see Figure 6-19).
The processor view is a different way of representing the same code fragment. In processor view, the processors are positioned to show each processor's place in the abstract processor arrangement. The processors are lined up in a single row when the processor arrangement is one dimensional, and in a rectangular pattern when the processor arrangement is two dimensional.
In processor view, each processor contains a list of the array elements stored on that processor according to the given code fragment. Also included is a schematic representation of the physical storage sequence of the array elements. This information is useful when EXTRINSIC(HPF_LOCAL) routines are used. The array elements are stored on the processor in the order listed. Each black or white box represents the storage space for one array element. Each black box represents one array element. Some processors have white boxes, representing unused storage space. Unused storage space occurs because all processors allocate the same amount of storage space for a given array, even though the number of elements stored on each processor is not necessarily equal.
In BLOCK distribution of a one-dimensional array, the array elements are distributed over each processor in large blocks. To the extent possible, each processor is given an equal number of array elements. If the number of elements is not evenly divisible by the number of processors, all processors have an equal number of elements except for the last processor, which has fewer elements than the others. [1]
Figure 6-1 is the array view of an 11 element one-dimensional array A distributed (BLOCK) onto four processors. The processor view is shown in Figure 6-2. The first three processors (Kate, Mary, and Dan) each get 3 elements, and the last processor (Bob) receives 2.
The formulas that generate the information found in Figures 6-1 and 6-2 can be found in the High Performance Fortran Language Specification . A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.
[1] The one exception to this rule is the case where the number of elements in the array is relatively small compared to the number of processors in the PSE cluster. In that case, it is possible that one or more processors have zero elements. Nevertheless, the rule still holds for those processors with a non-zero number of elements.
In cyclic distribution, the array elements are dealt out to the processors in round-robin order, like playing cards dealt out to players around the table. When elements are distributed over n processors, each processor, starting from a different offset, contains every n[th] column. Figure 6-3 is the array view of the same array and processor arrangement, distributed CYCLIC, instead of BLOCK. The processor view is shown in Figure 6-4.
The formulas that generate the information found in Figures 6-3 and 6-4 can be found in the High Performance Fortran Language Specification . A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.
When multidimensional arrays are distributed, the pattern of distribution is figured independently for each dimension, based upon the shape of the processor array. For example, when both dimensions are distributed BLOCK, the array is divided into large rectangles. BLOCK, BLOCK distribution is shown in Figures 6-5 and 6-6.
Because each dimension is distributed independently, any single column (or row) of Figure 6-5 considered in isolation resembles a one-dimensional array, distributed (BLOCK).
You may wonder why any given column (or row) of a two-dimensional array distributed (BLOCK, BLOCK) onto four processors (as in Figure 6-5) is divided into only two blocks, whereas a one-dimensional array distributed BLOCK onto the same number of processors is divided into four blocks (as in Figure 6-1).
The answer is that the number of blocks in any dimension is determined by the extent of the processor arrangement in that dimension. In Figure 6-1, the array is divided into four blocks because the processor arrangement has an extent of four. However, in BLOCK, BLOCK distribution, the processor arrangement is 2 by 2 (see the processor view, Figure 6-6. Therefore, each column (or row) has two blocks, because the processor arrangement has an extent of 2 in each dimension.
CYCLIC, CYCLIC distribution produces a sort of checkerboard effect, in which no element is on the same processor as its immediate neighbors. See Figures 6-7 and 6-8.
Each dimension is distributed independently. This means that any single column (or row) of Figure 6-7 considered in isolation resembles a one-dimensional array with a CYCLIC distribution.
For an explanation of why each column (or row) alternates between two (rather than four) processors, see Section 6.3.6.4.
The formulas that generate the information found in Figures 6-7 and 6-8 can be found in the High Performance Fortran Language Specification . A visually-oriented technique for reproducing the results of these formulas for two- dimensional distributions can be found in Section 6.3.6.9. A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.
It is not necessary for multidimensional arrays to have the same distribution in each dimension. In CYCLIC, BLOCK distribution, any row considered in isolation is divided into blocks (as in BLOCK distribution), but elements in any column alternate between processors (as in CYCLIC distribution).
This manual refers to the first dimension as vertical and the second dimension as horizontal. (CYCLIC, BLOCK) distribution means that elements are distributed cyclically along the vertical axis, and in blocks along the horizontal axis.
CYCLIC, BLOCK distribution is shown in Figures 6-9 and 6-10.
The formulas that generate the information found in Figures 6-9 and 6-10 can be found in High Performance Fortran Language Specification . A visually-oriented technique for reproducing the results of these formulas for two- dimensional distributions can be found in Section 6.3.6.9. A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.
BLOCK, CYCLIC distribution is analogous to CYCLIC, BLOCK, with the opposite orientation. See Figures 6-11 and 6-12.
The formulas that generate the information found in Figures 6-11 and 6-12 can be found in High Performance Fortran Language Specification . A visually-oriented technique for reproducing the results of these formulas for two- dimensional distributions can be found in Section 6.3.6.9. A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.
When an asterisk ( * ) occurs inside the parentheses of a DISTRIBUTE directive, it refers to array elements not being distributed along one of the axes. In other words, array elements along the axis marked with an asterisk in the DISTRIBUTE directive are not divided up among different processors, but assigned as a single block to one processor. This type of mapping is sometimes called "on processor" distribution. It can also be referred to as "collapsed" or "serial" distribution.
For example, in (BLOCK, * ) distribution, the asterisk for the second dimension means that each row [1] is assigned as a single block to one processor.
Even though (BLOCK, * ) distribution is used for two- dimensional arrays, it is considered a one-dimensional distribution, because only the first dimension is distributed. It must therefore be distributed onto a one-dimensional processor arrangement. The general rule for this is the following: The rank of the processor arrangement must be equal to the number of non-asterisk dimensions in the DISTRIBUTE directive.
Figures 6-13 and 6-14 depict (BLOCK, * ) distribution. Figures 6-15, 6-16, 6-17, 6-1
show other combinations of CYCLIC and BLOCK with * .
The formulas that generate the information found in these figures
can be found in High Performance
Fortran Language Specification . A detailed explanation
of the format of the figures found in this chapter can be found in
Section 6.3.6.1.
[1] In this manual, the second dimension is referred
to as horizontal. See the "Note on Rows and Columns" in Section 6.3.6.1.
Figure 6-21 presents a visually-oriented
technique for constructing the array view of two-dimensional
distributions. In this technique, the elements in the upper left-
hand corner of the array are assigned in the same pattern as the
processor arrangement. Figure 6-21
shows how this pattern is expanded and/or repeated to construct
the appropriate array view.
This technique can be used for all two-dimensional distributions.
This technique cannot be used to figure distributions containing an
asterisk ( * ), distributions of one-dimensional arrays, or
distributions of arrays with more than two dimensions.
This manual refers to the first axis as vertical for both arrays and
processor arrangements. See the note about "Rows and Columns" in Section 6.3.6.1.
Precise formulas that are valid for all distributions can be
found in the High Performance Fortran
Language Specification.
When a template is not explicitly named in a TEMPLATE directive, the
name of an array takes the place of the name of the template in the
DISTRIBUTE directive. Any array can be distributed, as long as it
is never an alignee (to the left of the keyword WITH) in
an ALIGN directive. The following two versions of the code fragment
from Example 6-1 are equivalent when
compiled for four processors:
When the PROCESSORS directive is not used, the ONTO clause must
be omitted from the DISTRIBUTE directive. The code fragment from
Example 6-1 without a PROCESSORS
directive looks like this:
When the DISTRIBUTE directive is used without an ONTO clause, the
compiler provides a default processor arrangement. The compiler
attempts to select an efficient shape for the default processor
arrangement, but you should not rely upon the arrangement being
any one particular shape. In the above code fragment, for example,
possible processor arrangements shapes are 4 x 1, 1 x
4, or 2 x 2. If you want one particular shape, use the
PROCESSORS directive, and distribute the array with an ONTO clause.
There is no completely general rule for determining which data
mapping is most efficient, because optimal data distribution
is highly algorithm-dependent. In most parallel programming
environments, including Digital's
PSE, communication between processors is more time consuming
than computation by a huge margin. Therefore, the primary goal
in choosing a data mapping is to minimize communication between
processors. A secondary goal is to balance the computational load
among the processors.
Array assignments in which the calculation of each array element
requires information only from its near neighbors generally
run faster with a BLOCK distribution, because this allows the
processor calculating any given array element to have all of the
necessary data in its own memory in most cases. The Digital Fortran 90 compiler includes an optimization
which minimizes communication even along the edges of blocks in
nearest-neighbor calculations. For an example of a nearest neighbor
calculation, see Chapter 3.
When the calculation requires information from distant elements
in the array, a CYCLIC distribution is frequently faster because
it improves load-balancing, dividing the work more evenly among
processors. See Section 2.3.3.
Some algorithms are column-oriented or row-oriented in the data they
require for each calculation. These algorithms frequently benefit
from a distribution with an asterisk ( * ) in one of the
dimensions.
The distribution figures in this chapter can be used as a guide
to the basic distribution choices for two-dimensional arrays; more
complex distributions such as CYCLIC(n) (CYCLIC with an
intr-expr) are also possible. See the High Performance Fortran Language Specification.
It is worthwhile to make an initial guess at a distribution, and
then try a few alternatives to see which performs best. One of
the advantages of HPF is that changing distributions is an easy
change to the coding of the program. Precise analysis of program
performance can be obtained with the pprof profiler.
The Digital Fortran 90 compiler
performs an optimization of nearest-neighbor algorithms to reduce
communications. Shadow edges are allocated to hold copies
of array elements that are near neighbors to each processor's edge
elements.
The DISTRIBUTE directive can be used to manually set shadow-edge
widths for each array dimension. When you do not set a shadow-edge
width for a given dimension, the compiler will size the shadow edge
automatically, based upon your algorithm. For example:
In this example, shadow edges 3 array elements wide will be
allocated for the first dimension of array A. The compiler will
automatically size the shadow edges for the second dimension,
because no width is specified. No shadow storage will be allocated
for the third dimension, because a shadow-edge width of zero (0) is
specified.
When you do not specify any shadow-width edges, the compiler
automatically sizes the shadow edge for all dimensions. You will
usually obtain the full performance benefit of the nearest neighbor
optimization by relying upon the compiler's automatic shadow-edge
sizing.
The primary use of the SHADOW keyword is to preventing copy in
/copy out when arrays used in nearest-neighbor computations are
passed through the procedure interface. If you want to conserve
memory by limiting the sizes of shadow-edge widths, it is usually
preferable to use the
However, there are some situations where shadow-edge widths should
be set manually:
In these cases, setting shadow-edge widths manually leads to more
efficient memory usage and prevents unnecessary local copying of
data.
You can limit shadow-edge widths to a certain maximum value with the
The nearest neighbor optimization can be disabled with the
Figure 6-13 BLOCK,* Distribution - Array
View
Figure 6-14 BLOCK, * Distribution - Processor
View
Figure 6-15 CYCLIC, * Distribution - Array
View
Figure 6-16 CYCLIC, * Distribution - Processor
View
Figure 6-17 *, BLOCK Distribution - Array
View
Figure 6-18 *, BLOCK Distribution - Processor
View
Figure 6-19 *, CYCLIC Distribution - Array
View
Figure 6-20 *, CYCLIC Distribution - Processor
View
6.3.6.9 Visual Technique for Computing
Two-Dimensional Distributions
Figure 6-21 Visual Technique for Computing
Two-Dimensional Distributions
6.3.6.10 Using DISTRIBUTE Without an
Explicit Template
REAL A(12, 12)
REAL B(16, 16)
!HPF$ TEMPLATE T(16,16)
!HPF$ ALIGN B WITH T
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2)
!HPF$ PROCESSORS P(2, 2)
!HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P
REAL A(12, 12)
REAL B(16, 16)
!HPF$ ALIGN A(i, j) WITH B(i+2, j+2)
!HPF$ PROCESSORS P(2, 2)
!HPF$ DISTRIBUTE B(BLOCK, BLOCK) ONTO P
6.3.6.11 Using DISTRIBUTE Without an
Explicit PROCESSORS Directive
REAL A(12, 12)
REAL B(16, 16)
!HPF$ ALIGN A WITH B(i+2, j+2)
!HPF$ DISTRIBUTE B(BLOCK, BLOCK)
6.3.6.12 Deciding on a Distribution
6.3.7 Shadow Edges for Nearest-Neighbor
Algorithms
!HPF$ DISTRIBUTE A(BLOCK(SHADOW=3), BLOCK, BLOCK(SHADOW=0))
-nearest_neighbor compile-time command-line
option.
-nearest_neighbor command-line option.
-nonearest_neighbor command-line option. This option
has the same effect as setting all shadow-edge widths to zero (0).