6.3 Data Mapping

This section explains data mapping in HPF. Proper data mapping is critical for the performance of any HPF program. The discussion of data mapping is divided as follows:

Section 6.3.6 (DISTRIBUTE Directive) includes an extensive set of figures showing many of the major distributions for one- and two-dimensional arrays.

6.3.1 Data Mapping Basics

HPF is designed for data parallel programming, a programming model in which the work of large-scale uniform array operations is divided up among a number of processors in order to increase performance. In data parallel programming, each array is split up into parts, and each part is stored on a different processor. In most cases, it is most efficient for operations to be performed by the processor storing the data in its local memory. If the arrays are mapped onto the processors in such a way that each processor has most of the information necessary to perform a given array operation on the part of the array stored locally, each processor can work independently on its own section of the array at the same time the other processors are working on other sections of the array. In this manner, an array operation is completed more quickly than the same operation performed by a single processor. In the optimal case, the speed-up scales linearly; in an environment with n processors, the operation is completed n times faster than with a single processor.

6.3.2 Illustrated Summary of HPF Data Mapping

This section explains HPF's general conception or model of data mapping. After explaining basic concepts, a brief code fragment describing a sample data mapping, together with a series of figures that represent this mapping schematically, is shown.

In HPF's data mapping model, arrays are aligned into groups, which are distributed onto an abstract processor arrangement. The underlying software environment (in this case, PSE in conjunction with the Digital UNIX operating system) maps this processor arrangement onto the physical processors in the PSE cluster.

HPF data mapping can be thought of as occurring in the following five stages:

  1. Array and template declaration (standard Fortran declarations and TEMPLATE directive)
  2. Array alignment (ALIGN)
  3. Declaration of abstract processor arrangement (PROCESSORS)
  4. Distribution onto the abstract processor arrangement (DISTRIBUTE)
  5. Mapping the abstract processor arrangement onto the physical processors (compile-time and run-time command line options)

Although the program must explicitly specify at least array declaration and distribution (stages 1 and 4) in order to successfully map an array, it is not usually necessary for the program to specify all five stages.

It is easiest to summarize these five stages pictorially. Each of the illustrations on the following pages show one of the five stages. The illustrations of the first four stages are based on the code fragment in Example 6-1.

Example 6-1 Code Fragment for Mapping Illustrations

      REAL A(12, 12)     ! Array
      REAL B(16, 16)     ! and template
!HPF$ TEMPLATE T(16,16)  ! declarations
!HPF$ ALIGN B WITH T                 ! Array
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2) ! alignment
!HPF$ PROCESSORS P(2, 2)       ! Declaration of processor arrangement
!HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P    ! Array distribution

The code fragment in Example 6-1 does not do anything; it represents only the mapping of data in preparation for some other array operations not specified here.

Assume that Example 6-1 is part of a larger program whose source code is called foo.f90 , and whose executable file is foo.out . The following two command lines show user control over the fifth stage of HPF data mapping, at compile time, and at run time:

% f90  -wsf 4  -o foo.out  foo.f90
% foo.out  -peers 4  -on Fred,Greg,Hilda,Ingrid

Stage 1: Array Declaration (required) and Template Declaration (optional)

      REAL A(12, 12)
      REAL B(16, 16)
!HPF$ TEMPLATE T(16, 16)

Array declaration, which is the same as in a nonparallel Fortran environment, is mandatory. Template declaration is optional. Templates are used for data alignment (see the discussion of ).

For More Information:

Stage 2: Array Alignment (optional)

!HPF$ TEMPLATE T(16,16)
!HPF$ ALIGN B WITH T
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2)

When two arrays will be interacting with one another, it is usually advantageous to use the ALIGN directive. The ALIGN directive ensures that corresponding elements of two arrays are always stored on the same processor.

Arrays are lined up together onto a template. A template describes an index space with a specified shape, but with no content. A template can be thought of as "an array of nothings". No storage is allocated for templates. In this example, the two arrays A and B are aligned with the template T using the ALIGN directive. B is aligned with the whole template T, whereas A is aligned with only part of T.

In the ALIGN directives, arrays A and B are the alignees, and template T is the align target. The subscripts i and j are dummy variables, which do not represent any specific values. They refer to all the valid subscript values for A. They are used to specify the correspondence between elements of A and elements of T.

The explicit naming of templates is required only for certain specialized alignments. In most cases, it is possible to use a template implicitly, by aligning the arrays with one another (see Section 6.3.6.10).

For More Information:

Stage 3: Declaration of Abstract Processor Arrangement (optional)

!HPF$ PROCESSORS P(2, 2)

This stage of the data mapping process defines a conceptual arrangement of the processors in preparation for a DISTRIBUTE directive (see Stage 4). Processor arrangements are called "abstract", because at compile time the processors in the arrangement are not yet identified with particular physical processors. If the program does not contain a PROCESSORS directive, the compiler defines an appropriate default processor arrangement.

When processor arrangements are explicitly declared, they must be defined to conform with both the anticipated array distribution (stage 4), and the anticipated size of the PSE cluster on which the program will be run. In this example, a 2 by 2 arrangement is used, because it is two dimensional (to support the two-dimensional BLOCK, BLOCK distribution used in stage 4), and contains a total of four processors (to conform to the PSE cluster that used in stage 5 at run time). It is also possible to determine the size of the processor arrangement dynamically at run time using the intrinsic function NUMBER_OF_PROCESSORS().

For More Information:

Stage 4: Distribution of the Arrays onto the Processor Arrangement (required)

DISTRIBUTE T(BLOCK, BLOCK) ONTO P

Distribution means dividing up the storage of the arrays among the processors. This usually means that each processor has only a subsection of each array in its own local memory. Array distribution must be explicitly selected by the program in order to achieve any parallel speed-up. Proper selection of a distribution is absolutely critical for application performance.

Many of these are explained in Section 6.3.6. This example uses (BLOCK, BLOCK) distribution, one of a large number of possibilities.

In the case of the example distribution (BLOCK, BLOCK), it is useful to visualize the arrays as superimposed over the processor arrangement. However, other distributions require more complex visualizations. A number of example illustrations can be found in Section 6.3.6.

Because arrays A and B have already been aligned with template T, the distribution of both arrays is implied when T is distributed.

When templates are not explicitly named, array names can be used in place of template names in DISTRIBUTE directives. See Section 6.3.6.10.

The ONTO clause may be used only when a processor arrangement has been explicitly declared. When an ONTO clause is not specified, the array is distributed onto a default processor arrangement. See Section 6.3.6.11.

For More Information:

Stage 5: Mapping of the Processor Arrangement onto Physical Processors

This final stage of data distribution is handled transparently by PSE at run time. PSE environment variables or command-line options can be used to include or exclude particular machines. These are described in Chapter 8, Compiling and Running Parallel Programs.

In this example, the program is run in an environment comprising workstations named Kate, Mary, Dan, and Bob. If desired, you can specify the hosts to be included in the execution, as in the example. It is usually better to leave out this specification, so that PSE can select PSE cluster members based upon load-balancing considerations.

The -wsf n compile-time command line option controls the number of processors that the program is designed to use:

% f90 -wsf 4 -o a.out a.f90

The number of processors [n] specified by the -wsf option must be equal to the number of processors specified in the PROCESSORS directive.

The -peers command line option can be used at run time to specify the number of processors to be used, and the -on command line option can be used to specify particular hosts:

% a.out -peers 4 -on Kate,Mary,Dan,Bob

In this example, the number of peers is equal to the number of hosts specified with -on. However, in some cases the number of hosts will be less than the number of peers, such as when the -virtual option is used.

For More Information:

6.3.3 ALIGN Directive

The ALIGN directive is used to specify that certain data objects are to be mapped in the same way as a certain other data objects. Corresponding elements in aligned arrays are always mapped to the same processor; array operations between aligned arrays are usually more efficient than array operations between arrays that are not known to be aligned. Arrays can in some cases be aligned through careful use of matching DISTRIBUTE directives; however, Digital recommends that you use ALIGN, which is more general and more convenient.

The most common use of ALIGN is to specify that the corresponding elements of two or more arrays be mapped identically, as in the following example:

!HPF$ ALIGN A WITH B

This example specifies that the two arrays A and B are always distributed the same way. More complex alignments can also be specified. For example:

!HPF$ ALIGN E(i) WITH F(2*i-1)

In this example, the elements of E correspond to the odd elements of F. In this case, E can have, at most, half as many elements as F.

As shown in the example given in Section 6.3.2, an array can be aligned with the interior of a larger array or template:

      REAL A(12, 12)
      REAL B(16, 16)
!HPF$ TEMPLATE T(16,16)
!HPF$ ALIGN B WITH T
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2)

In this example, the 16 x 16 array B is aligned with the template T of the same size, and the 12 x 12 array A is aligned with the interior of T. Because A and B are both aligned with the template T, A and B are said to be indirectly aligned. Each interior element of B is always stored on the same processor as the corresponding element of A :

When an asterisk ( * ) is specified for a given dimension, it specifies that alignment occurs between the non-asterisk dimensions of the alignee and the align target. For example:

!HPF$ ALIGN P(i) WITH Q(*, i)

In this example, P is aligned with the second dimension (with every row) of Q. Each element of P is available on the same processor as every element in the corresponding column of Q. This means that any given element P(i) is available on each processor that stores any element in the ith column of Q. Depending upon the mapping of Q, P may need to be partially or fully replicated in order to achieve this result.

When a whole array is aligned, the ALIGN directive can be written either with an align subscript, like this:

!HPF$ ALIGN b(i) WITH c(i)

or without an align subscript, like this:

!HPF$ ALIGN b WITH c

These two forms have slightly different semantics. When an align subscript is used, the align target is permitted to be larger than the alignee. Also, elements whose subscripts are equal are aligned, regardless of what the lower bound of each array happen to be.

When an align subscript is not used, the alignee and the align target must be exactly the same size. Corresponding elements are aligned beginning with the lower bound of each array, regardless of whether the subscripts of the corresponding elements are equal.

Using (or not using) an align subscript can have an effect on performance when the arrays are allocatable.

Other more complex alignments are possible. See the High Performance Fortran Language Specification.

Circular alignments are not permitted. For example, the following code is illegal:

!HPF$ ALIGN A WITH B
!HPF$ ALIGN B WITH A  ! Illegal circular alignment!

Each array can be the alignee (to the left of the WITH) only once. When a given set of data objects are aligned with each other, the object array or template) that is never an alignee (is never to the left of the WITH) is known as the ultimate align target. Only the ultimate align target is permitted to appear in a DISTRIBUTE directive. The other arrays that are aligned with the ultimate align target are implicitly distributed together with the ultimate align target.

The ALIGN directive causes data objects to be mapped across processors only if the the ultimate align target appears in a DISTRIBUTE directive. For more information, see Section 6.1.4.

Because the ALIGN directive implicitly determines the distribution of the aligned arrays, it has a direct effect on how much or little communication occurs among the processors. A poorly chosen alignment can cause severe application performance degradation, whereas a well chosen alignment can cause dramatic improvement in performance.

For More Information:

6.3.4 TEMPLATE Directive

A template is an empty array space (or an array of nothings). A template is used as an align target (the object after WITH in an ALIGN directive), and can be distributed with the DISTRIBUTE directive.

For most programs, declaration of an explicit template is not necessary. When you do not explicitly declare a template, you can use an array name in place of a template name in the ALIGN and DISTRIBUTE directives. For an example, see Section 6.3.6.10.

Because they have no content, no storage space is allocated for templates. Templates are declared in the specification part of a scoping unit with the !HPF$ TEMPLATE directive.

Templates cannot be in COMMON. Two templates declared in different scoping units are always distinct, even if they are given the same name. Templates cannot be passed through the subprogram argument interface. For an example of passing an array that is aligned with a template to a subprogram, see Section 6.4.6.

Some specialized alignments require the use of an explicit template. For example, an explicit template is needed when a particular array needs to be distributed over only some of the processors in the executing PSE cluster partition. This cannot be done by declaring a smaller processor

arrangement, because processor arrangements must always have exactly the same number of processors as the executing PSE cluster partition. However, an array can be restricted to a subset of the partition with the following technique: A template is distributed over a full-sized processor arrangement, after which an array can be aligned with a slice of the template. For instance:

!HPF$ PROCESSORS P(4, 4)
!HPF$ TEMPLATE T(4, 4)
!HPF$ DISTRIBUTE(BLOCK, BLOCK) ONTO P :: T
!HPF$ ALIGN A(J) WITH T(J, 1)

This technique is used in an Input/Output (I/O) optimization explained in Section 7.11.4.

Another instance where explicit declaration of a template is useful is a program where smaller arrays are to be aligned with a larger index space, but no single array spans the entire index space. For example, if four nxn arrays are aligned to the four corners of a TEMPLATE of size (n+100)x(n+100) :

!HPF$ TEMPLATE, DISTRIBUTE(BLOCK, BLOCK) :: inclusive(n+100,n+100)
      REAL, DIMENSION(n,n)  ::NW, NE, SW, SE
!HPF$ ALIGN NW(i,j) WITH inclusive( i , j)
!HPF$ ALIGN NE(i,j) WITH inclusive( i , j+100 )
!HPF$ ALIGN SW(i,j) WITH inclusive( i+100, j )
!HPF$ ALIGN SE(i,j) WITH inclusive( i+100, j+100 )

In this example, the template inclusive allows the four smaller arrays to be aligned together and distributed, even though no single one of them spans the entire index space.

6.3.5 PROCESSORS Directive

Rather than distributing arrays directly onto physical processors, HPF uses abstract processor arrangements, which allow distributions to be expressed without reference to any particular hardware configuration. This greatly improves the portability of parallel programs.

The use of processor arrangements also permits a greater variety of data mappings to be expressed. For instance, in a program written for 16 processors, processor arrangements can be declared not only of shape 4 x 4, but also 2 x 8, 8 x 2, 1 x 16, or 16 x 1. Even though all of these shapes have the same number of processors, each shape results in a different data mapping, because the distribution in any given dimension of an array is determined by the extent of the processor arrangement in that dimension (see Section 6.3.6.4).

Here are examples of declarations of a one-, two-, and three- dimensional processor arrangement:

!HPF$ PROCESSORS P(4)
!HPF$ PROCESSORS Q(4,6)
!HPF$ PROCESSORS R(4,3,3)

PROCESSORS, like TEMPLATE, is an optional directive. If an array is distributed without an explicit processor arrangement (see Section 6.3.6.11), the compiler creates a default processor arrangement.

The total number of processors in a processor arrangement (the product of the values specified for each of its dimensions) must be equal to the number of PSE peers specified at compile time and run time. If a general program that can run on any number of processors is desired, the processor arrangement must be dimensioned dynamically using the intrinsic function NUMBER_OF_PROCESSORS():

!HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())

In order to produce a general program that can run on any number of processors, use the -wsf option at compile time without any numerical argument.

Any number of peers is allowed, but performance is improved in some cases if the number of processors is a power of two.

You can simulate a PSE cluster larger than the number of available physical processors (CPUs) with the -virtual run-time option.

Although a processor arrangement smaller than the number of PSE peers in the executing PSE cluster partition is not permitted, the storage of an array can be restricted to a subset of the PSE cluster partition using the TEMPLATE directive.

Like array elements, processors in each dimension of an abstract processor arrangement are by default indexed starting with one (1). This is a different numbering system from that used by PSE for physical processors, in which physical processors (referred to as peers) are numbered starting with zero.

For More Information:

6.3.6 DISTRIBUTE Directive

The choice of an appropriate distribution for any given algorithm is critical to application performance. A carefully chosen DISTRIBUTE directive can improve the performance of HPF code by orders of magnitude over otherwise identical code with a poorly chosen DISTRIBUTE directive.

All HPF data mappings are constructed from various combinations of two basic types of parallel distribution: BLOCK and CYCLIC.

The DISTRIBUTE directive has two basic forms, as shown in the following two example lines:

!HPF$ DISTRIBUTE A(CYCLIC, BLOCK)
!HPF$ DISTRIBUTE A(CYCLIC, BLOCK) ONTO P

Use the ONTO clause when a template or array is distributed onto an explicitly named processor arrangement. Use the DISTRIBUTE directive without an ONTO clause when a processor arrangement is not explicitly named.

The template or array named in a DISTRIBUTE directive must be an ultimate align target.

For More Information:

6.3.6.1 Explanation of the Distribution Figures

In the distribution figures on the following pages, each distribution is shown in two views: array view and processor view. For each distribution, a code fragment is given showing the declaration of an array A and an abstract processor arrangement P, followed by the distribution of A onto P. The code fragment, which describes both views, is printed under the array view.

In array view, the data mapping is shown with a series of boxes, each box representing one array element. The large letter in each box represents the name of the workstation which stores that array element in its memory.

The pattern formed by the large letters is the most important feature of the array views. Although the arrays declared in the code fragments are artificially small (by several orders of magnitude), they are large enough to show the broad patterns that appear in more realistically sized arrays. For example, in BLOCK, BLOCK distribution the array elements are always divided into (roughly) square blocks (see Figure 6-5); in * , BLOCK they are grouped in broad vertical stripes (see Figure 6-17); in * , CYCLIC they are grouped in narrow vertical stripes (see Figure 6-19).


Note about Array Views
These figures are meant to illustrate data mapping only. No information is given about the values of array elements.

The processor view is a different way of representing the same code fragment. In processor view, the processors are positioned to show each processor's place in the abstract processor arrangement. The processors are lined up in a single row when the processor arrangement is one dimensional, and in a rectangular pattern when the processor arrangement is two dimensional.


Note about Processor Views
The physical processors are shown organized according to the abstract processor arrangement as an aid to conceptualization only. No information is given about the actual connectivity of the processors or configuration of the network.

In processor view, each processor contains a list of the array elements stored on that processor according to the given code fragment. Also included is a schematic representation of the physical storage sequence of the array elements. This information is useful when EXTRINSIC(HPF_LOCAL) routines are used. The array elements are stored on the processor in the order listed. Each black or white box represents the storage space for one array element. Each black box represents one array element. Some processors have white boxes, representing unused storage space. Unused storage space occurs because all processors allocate the same amount of storage space for a given array, even though the number of elements stored on each processor is not necessarily equal.


Note
Physical storage sequence in HPF is processor dependent. The information about physical storage sequence given in the distribution illustrations describes the current implementation of Digital Fortran 90. Programs that depend upon this information may not be portable to other HPF implementations.


Note about Rows and Columns
When referring to elements of a two-dimensional array or processor arrangement, this manual refers to the first subscript as varying with vertical movement through the array, and the second subscript as varying with horizontal movement. In other words, the first axis is vertical, and the second axis is horizontal. This notation is patterned after matrix notation in mathematics, where the elements in the first row of a matrix M are referred to as M11 , M12 , M13 . . . , the second row as M21 , M22 , M23 , and so on. This terminology is used for both arrays and processor arrangements. Array element subscripts should not be confused with Cartesian ordered pairs (x,y) , in which x varies with horizontal movement, and y varies with vertical movement.

6.3.6.2 BLOCK Distribution

In BLOCK distribution of a one-dimensional array, the array elements are distributed over each processor in large blocks. To the extent possible, each processor is given an equal number of array elements. If the number of elements is not evenly divisible by the number of processors, all processors have an equal number of elements except for the last processor, which has fewer elements than the others. [1]

Figure 6-1 is the array view of an 11 element one-dimensional array A distributed (BLOCK) onto four processors. The processor view is shown in Figure 6-2. The first three processors (Kate, Mary, and Dan) each get 3 elements, and the last processor (Bob) receives 2.

Figure 6-1 BLOCK Distribution - Array View

Figure 6-2 BLOCK Distribution - Processor View

The formulas that generate the information found in Figures 6-1 and 6-2 can be found in the High Performance Fortran Language Specification . A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.


[1] The one exception to this rule is the case where the number of elements in the array is relatively small compared to the number of processors in the PSE cluster. In that case, it is possible that one or more processors have zero elements. Nevertheless, the rule still holds for those processors with a non-zero number of elements.

6.3.6.3 CYCLIC Distribution

In cyclic distribution, the array elements are dealt out to the processors in round-robin order, like playing cards dealt out to players around the table. When elements are distributed over n processors, each processor, starting from a different offset, contains every n[th] column. Figure 6-3 is the array view of the same array and processor arrangement, distributed CYCLIC, instead of BLOCK. The processor view is shown in Figure 6-4.

Figure 6-3 CYCLIC Distribution - Array View

Figure 6-4 CYCLIC Distribution - Processor View

The formulas that generate the information found in Figures 6-3 and 6-4 can be found in the High Performance Fortran Language Specification . A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.

6.3.6.4 BLOCK, BLOCK Distribution

When multidimensional arrays are distributed, the pattern of distribution is figured independently for each dimension, based upon the shape of the processor array. For example, when both dimensions are distributed BLOCK, the array is divided into large rectangles. BLOCK, BLOCK distribution is shown in Figures 6-5 and 6-6.

Figure 6-5 BLOCK, BLOCK Distribution - Array View

Figure 6-6 BLOCK, BLOCK Distribution - Processor view

Because each dimension is distributed independently, any single column (or row) of Figure 6-5 considered in isolation resembles a one-dimensional array, distributed (BLOCK).

You may wonder why any given column (or row) of a two-dimensional array distributed (BLOCK, BLOCK) onto four processors (as in Figure 6-5) is divided into only two blocks, whereas a one-dimensional array distributed BLOCK onto the same number of processors is divided into four blocks (as in Figure 6-1).

The answer is that the number of blocks in any dimension is determined by the extent of the processor arrangement in that dimension. In Figure 6-1, the array is divided into four blocks because the processor arrangement has an extent of four. However, in BLOCK, BLOCK distribution, the processor arrangement is 2 by 2 (see the processor view, Figure 6-6. Therefore, each column (or row) has two blocks, because the processor arrangement has an extent of 2 in each dimension.

6.3.6.5 CYCLIC, CYCLIC Distribution

CYCLIC, CYCLIC distribution produces a sort of checkerboard effect, in which no element is on the same processor as its immediate neighbors. See Figures 6-7 and 6-8.

Figure 6-7 CYCLIC, CYCLIC Distribution - Array View

Figure 6-8 CYCLIC, CYCLIC Distribution - Processor View

Each dimension is distributed independently. This means that any single column (or row) of Figure 6-7 considered in isolation resembles a one-dimensional array with a CYCLIC distribution.

For an explanation of why each column (or row) alternates between two (rather than four) processors, see Section 6.3.6.4.

The formulas that generate the information found in Figures 6-7 and 6-8 can be found in the High Performance Fortran Language Specification . A visually-oriented technique for reproducing the results of these formulas for two- dimensional distributions can be found in Section 6.3.6.9. A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.

6.3.6.6 CYCLIC, BLOCK Distribution

It is not necessary for multidimensional arrays to have the same distribution in each dimension. In CYCLIC, BLOCK distribution, any row considered in isolation is divided into blocks (as in BLOCK distribution), but elements in any column alternate between processors (as in CYCLIC distribution).

This manual refers to the first dimension as vertical and the second dimension as horizontal. (CYCLIC, BLOCK) distribution means that elements are distributed cyclically along the vertical axis, and in blocks along the horizontal axis.

CYCLIC, BLOCK distribution is shown in Figures 6-9 and 6-10.

Figure 6-9 CYCLIC, BLOCK Distribution - Array View

Figure 6-10 CYCLIC, BLOCK Distribution - Processor View

The formulas that generate the information found in Figures 6-9 and 6-10 can be found in High Performance Fortran Language Specification . A visually-oriented technique for reproducing the results of these formulas for two- dimensional distributions can be found in Section 6.3.6.9. A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.

6.3.6.7 BLOCK, CYCLIC Distribution

BLOCK, CYCLIC distribution is analogous to CYCLIC, BLOCK, with the opposite orientation. See Figures 6-11 and 6-12.

Figure 6-11 BLOCK, CYCLIC Distribution - Array View

Figure 6-12 BLOCK, CYCLIC Distribution - Processor View

The formulas that generate the information found in Figures 6-11 and 6-12 can be found in High Performance Fortran Language Specification . A visually-oriented technique for reproducing the results of these formulas for two- dimensional distributions can be found in Section 6.3.6.9. A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.

6.3.6.8 Asterisk Distributions

When an asterisk ( * ) occurs inside the parentheses of a DISTRIBUTE directive, it refers to array elements not being distributed along one of the axes. In other words, array elements along the axis marked with an asterisk in the DISTRIBUTE directive are not divided up among different processors, but assigned as a single block to one processor. This type of mapping is sometimes called "on processor" distribution. It can also be referred to as "collapsed" or "serial" distribution.

For example, in (BLOCK, * ) distribution, the asterisk for the second dimension means that each row [1] is assigned as a single block to one processor.

Even though (BLOCK, * ) distribution is used for two- dimensional arrays, it is considered a one-dimensional distribution, because only the first dimension is distributed. It must therefore be distributed onto a one-dimensional processor arrangement. The general rule for this is the following: The rank of the processor arrangement must be equal to the number of non-asterisk dimensions in the DISTRIBUTE directive.

Figures 6-13 and 6-14 depict (BLOCK, * ) distribution. Figures 6-15, 6-16, 6-17, 6-1 show other combinations of CYCLIC and BLOCK with * .

The formulas that generate the information found in these figures can be found in High Performance Fortran Language Specification . A detailed explanation of the format of the figures found in this chapter can be found in Section 6.3.6.1.

Figure 6-13 BLOCK,* Distribution - Array View

Figure 6-14 BLOCK, * Distribution - Processor View

Figure 6-15 CYCLIC, * Distribution - Array View

Figure 6-16 CYCLIC, * Distribution - Processor View

Figure 6-17 *, BLOCK Distribution - Array View

Figure 6-18 *, BLOCK Distribution - Processor View

Figure 6-19 *, CYCLIC Distribution - Array View

Figure 6-20 *, CYCLIC Distribution - Processor View


[1] In this manual, the second dimension is referred to as horizontal. See the "Note on Rows and Columns" in Section 6.3.6.1.

6.3.6.9 Visual Technique for Computing Two-Dimensional Distributions

Figure 6-21 presents a visually-oriented technique for constructing the array view of two-dimensional distributions. In this technique, the elements in the upper left- hand corner of the array are assigned in the same pattern as the processor arrangement. Figure 6-21 shows how this pattern is expanded and/or repeated to construct the appropriate array view.

This technique can be used for all two-dimensional distributions.

This technique cannot be used to figure distributions containing an asterisk ( * ), distributions of one-dimensional arrays, or distributions of arrays with more than two dimensions.

This manual refers to the first axis as vertical for both arrays and processor arrangements. See the note about "Rows and Columns" in Section 6.3.6.1.

Precise formulas that are valid for all distributions can be found in the High Performance Fortran Language Specification.

Figure 6-21 Visual Technique for Computing Two-Dimensional Distributions

6.3.6.10 Using DISTRIBUTE Without an Explicit Template

When a template is not explicitly named in a TEMPLATE directive, the name of an array takes the place of the name of the template in the DISTRIBUTE directive. Any array can be distributed, as long as it is never an alignee (to the left of the keyword WITH) in an ALIGN directive. The following two versions of the code fragment from Example 6-1 are equivalent when compiled for four processors:

6.3.6.11 Using DISTRIBUTE Without an Explicit PROCESSORS Directive

When the PROCESSORS directive is not used, the ONTO clause must be omitted from the DISTRIBUTE directive. The code fragment from Example 6-1 without a PROCESSORS directive looks like this:

      REAL A(12, 12)
      REAL B(16, 16)
!HPF$ ALIGN A WITH B(i+2, j+2)
!HPF$ DISTRIBUTE B(BLOCK, BLOCK)

When the DISTRIBUTE directive is used without an ONTO clause, the compiler provides a default processor arrangement. The compiler attempts to select an efficient shape for the default processor arrangement, but you should not rely upon the arrangement being any one particular shape. In the above code fragment, for example, possible processor arrangements shapes are 4 x 1, 1 x 4, or 2 x 2. If you want one particular shape, use the PROCESSORS directive, and distribute the array with an ONTO clause.

6.3.6.12 Deciding on a Distribution

There is no completely general rule for determining which data mapping is most efficient, because optimal data distribution is highly algorithm-dependent. In most parallel programming environments, including Digital's PSE, communication between processors is more time consuming than computation by a huge margin. Therefore, the primary goal in choosing a data mapping is to minimize communication between processors. A secondary goal is to balance the computational load among the processors.

Array assignments in which the calculation of each array element requires information only from its near neighbors generally run faster with a BLOCK distribution, because this allows the processor calculating any given array element to have all of the necessary data in its own memory in most cases. The Digital Fortran 90 compiler includes an optimization which minimizes communication even along the edges of blocks in nearest-neighbor calculations. For an example of a nearest neighbor calculation, see Chapter 3.

When the calculation requires information from distant elements in the array, a CYCLIC distribution is frequently faster because it improves load-balancing, dividing the work more evenly among processors. See Section 2.3.3.

Some algorithms are column-oriented or row-oriented in the data they require for each calculation. These algorithms frequently benefit from a distribution with an asterisk ( * ) in one of the dimensions.

The distribution figures in this chapter can be used as a guide to the basic distribution choices for two-dimensional arrays; more complex distributions such as CYCLIC(n) (CYCLIC with an intr-expr) are also possible. See the High Performance Fortran Language Specification.

It is worthwhile to make an initial guess at a distribution, and then try a few alternatives to see which performs best. One of the advantages of HPF is that changing distributions is an easy change to the coding of the program. Precise analysis of program performance can be obtained with the pprof profiler.

For More Information:

6.3.7 Shadow Edges for Nearest-Neighbor Algorithms

The Digital Fortran 90 compiler performs an optimization of nearest-neighbor algorithms to reduce communications. Shadow edges are allocated to hold copies of array elements that are near neighbors to each processor's edge elements.

The DISTRIBUTE directive can be used to manually set shadow-edge widths for each array dimension. When you do not set a shadow-edge width for a given dimension, the compiler will size the shadow edge automatically, based upon your algorithm. For example:

!HPF$ DISTRIBUTE A(BLOCK(SHADOW=3), BLOCK, BLOCK(SHADOW=0))

In this example, shadow edges 3 array elements wide will be allocated for the first dimension of array A. The compiler will automatically size the shadow edges for the second dimension, because no width is specified. No shadow storage will be allocated for the third dimension, because a shadow-edge width of zero (0) is specified.

When you do not specify any shadow-width edges, the compiler automatically sizes the shadow edge for all dimensions. You will usually obtain the full performance benefit of the nearest neighbor optimization by relying upon the compiler's automatic shadow-edge sizing.

The primary use of the SHADOW keyword is to preventing copy in /copy out when arrays used in nearest-neighbor computations are passed through the procedure interface. If you want to conserve memory by limiting the sizes of shadow-edge widths, it is usually preferable to use the -nearest_neighbor compile-time command-line option.

However, there are some situations where shadow-edge widths should be set manually:

In these cases, setting shadow-edge widths manually leads to more efficient memory usage and prevents unnecessary local copying of data.

You can limit shadow-edge widths to a certain maximum value with the -nearest_neighbor command-line option.

The nearest neighbor optimization can be disabled with the -nonearest_neighbor command-line option. This option has the same effect as setting all shadow-edge widths to zero (0).


Note
In future releases, Digital intends to implement a SHADOW directive for manually setting shadow-edge widths, instead of using the SHADOW keyword inside the DISTRIBUTE directive. The new syntax will conform with Version 2.0 of the High Performance Fortran Language Specification.

For More Information: