A standard algorithm for LU decomposition, described in Numerical Recipes,[1] transforms a square matrix "in place", storing the values for all the elements of L and U in the same space in memory where the original square matrix was stored. This can be done by overlapping the two arrays so that the mandatory zeros on the opposite sides of both L and U, and the ones on the diagonal of L, are not explicitly represented in memory. The algorithm transforms the array A in the previous example into the following array:
The lower triangle of this array contains all the significant elements of L, and its upper triangle contains all the significant elements of U.
The algorithm for accomplishing this transformation is constructed of three controlling structures:
For the sake of simplicity, the issue of pivoting is ignored here, although the algorithm can be unstable without it.
[1] William H. Press [et al.], Numerical recipes in FORTRAN : the art of scientific computing. 2nd ed. Cambridge University Press, 1992.
In Fortran 77 syntax, the algorithm (without pivoting), is coded as follows:
DO k = 1, n-1
DO x = k+1, n ! Column
A(x, k) = A(x, k) / A(k, k) ! Normalization
END DO !
DO i = k+1, n !
DO j = k+1, n ! Submatrix
A(i, j) = A(i, j) - A(i, k)*A(k, j) ! Modification
END DO !
END DO !
END DO
Like all Fortran 77 code, this code is compiled and executed correctly by Digital Fortran 90. However, the compiler does not recognize it as parallelizable, and compiles it to run serially, with no parallel speed-up.
In order to achieve parallel speed-up, eligible DO loops should be changed to one of these:
Some caution is required, because these three parallel constructs are not the same as a non-parallel DO loop. In many cases, simply re-writing a DO loop into one of these three forms can result in different answers, or even be illegal. In other cases, the three forms are equivalent (for a comparison between the three, see Section 2.2.3).
In our example, the column normalization DO loop can be expressed any of these three ways:
INDEPENDENT DO loop:
!HPF$ INDEPENDENT
DO x = k+1, n
A(x, k) = A(x, k) / A(k, k)
END DO
Fortran 90 array assignment statement:
A(k+1:n, k) = A(k+1:n, k) / A(k, k)
FORALL statement:
FORALL (i=k+1:n) A(i, k) = A(i, k) / A(k, k)
The submatrix modification DO loop is too complex to be expressed by a single array assignment statement. It can, however, be marked with the INDEPENDENT directive, or changed into a FORALL.
INDEPENDENT version:
!HPF$ INDEPENDENT, NEW(j)
DO i = k+1, n
!HPF$ INDEPENDENT
DO j = k+1, n
A(i, j) = A(i, j) - A(i, k)*A(k, j)
END DO
END DO
The NEW(j) keyword tells the compiler that in each iteration, the inner DO loop variable j is unrelated to the j from the previous iteration. Digital's compiler currently requires the NEW keyword in order to parallelize nested INDEPENDENT DO loops.
FORALL version:
FORALL (i=k+1:n, j=k+1:n)
A(i, j) = A(i, j) - A(i, k)*A(k, j)
END FORALL
Putting the two together, here are two versions of the complete parallelized algorithm:
Fortran 90/95 syntax version:
DO k = 1, n-1
A(k+1:n, k) = A(k+1:n, k) / A(k, k) ! Column Normalization
FORALL (i=k+1:n, j=k+1:n) ! Sub-
A(i, j) = A(i, j) - A(i, k)*A(k, j) ! matrix
END FORALL ! Modification
END DO
DO INDEPENDENT version:
DO k = 1, n-1
!HPF$ INDEPENDENT
DO x = k+1, n ! Column
A(x, k) = A(x, k) / A(k, k) ! Normalization
END DO !
!HPF$ INDEPENDENT, NEW(j)
DO i = k+1, n !
!HPF$ INDEPENDENT !
DO j = k+1, n ! Submatrix
A(i, j) = A(i, j) - A(i, k)*A(k, j) ! Modification
END DO !
END DO !
END DO
Although Fortran 90/95 array syntax or FORALLs can serve the same purpose as DO loops did in Fortran 77, FORALLs and array assignments are parallel assignment statements, not loops, and in many cases produce a result different from analogous DO loops.
It is crucial to understand the semantic difference between DO and parallel assignment statements such as FORALL or Fortran 90 array assignment. Statements inside DO loops are executed immediately with each iteration. If a DO loop contains an assignment, an assignment will occur with each iteration.
In contrast, a FORALL specifies that the right-hand side of an assignment is computed for every iteration before any stores are done.
For example, consider the following array C:
Applying the FORALL statement
FORALL (i = 2:5) C(i, i) = C(i-1, i-1)
to this array produces the following result:
However, the following apparently similar DO loop
DO i = 2, 5 C(i, i) = C(i-1, i-1) END DO
produces a completely different result:
Because a DO loop assigns new values to array elements with each iteration of the loop, you must take into account that later iterations of the loop are operating on an array that has already been partially modified. In the above example, by the time the DO loop is ready to assign a value to element C(3,3) , element C(2,2) has already been changed from its original value. In the FORALL structure, on the other hand, no assignments are made until the right side of the assignment statement has been computed for every case.
Some operations require the use of DO loops rather than FORALL structures. For example, in the previous LU decomposition code, the outer DO loop that moves down the diagonal is a sequential operation in which a FORALL structure cannot be used. Later iterations of the loop rely upon the fact that the array has already been partially modified.
Some DO loops are eligible to be tagged with the INDEPENDENT directive, which allows for parallel execution. A loop can be tagged INDEPENDENT if the iterations can be performed in any order (forwards, backwards, or even random) and still produce the same result. For example:
!HPF$ INDEPENDENT
DO I=1, 100
A(I) = B(I)
END DO
Each of the three parallel structures (Fortran 90 array syntax, FORALL, and INDEPENDENT DO loops) has advantages and disadvantages:
DO i=1, 100
A(i) = A(i+1) + A(i)
END DO
Expressed as a DO loop, this computation is not INDEPENDENT and cannot be parallelized, because the result will vary if the iterations are not performed in sequential order.
However, the same computation can be performed in parallel with this FORALL assignment:
FORALL (i=1:100)
A(i) = A(i+1) + A(i)
END DO
FORALL guarantees that the computation will be done as if the right-hand side were computed for all 100 iterations before any stores are done, which in this particular case yields the same answers as if a DO loop were used.
The limitations of FORALL are that it can contain only assignment statements, and can contain function calls only if the function is PURE.
The disadvantage of INDEPENDENT DO is that some cases (such as the example in the previous bullet) can be expressed as a FORALL, but not as an INDEPENDENT DO. Also, in some cases using FORALL results in better optimization than INDEPENDENT DO.
[1] In the current version of Digital Fortran 90, there are a number of restrictions on function calls inside INDEPENDENT DO loops. Please consult the Release Notes.