This section explains how I/O is implemented in High Performance Fortran, and gives some guidelines about how to optimize the use of I/O.
In the current implementation of Digital Fortran 90, all I/O operations are serialized through a single processor. The farm consists of members, from which a set of peer processors is chosen to execute the parallel application. The peers are formally numbered starting from 0 as Peer 0, Peer 1, and so on (see Section 8.3.1.4), and all I/O operations are serialized through Peer 0. This means that any data in an output statement that is not already available on Peer 0 is copied to a buffer on Peer 0. Consider the following example:
INTEGER, DIMENSION(10000000) :: A
!HPF$ DISTRIBUTE A(BLOCK)
PRINT *, A
Since A is distributed BLOCK, not all of the values exist on Peer 0. Therefore the entire array A is copied from the various peers into a rather large buffer on Peer 0. The print operation is executed by Peer 0 only.
Input behaves in a similar manner: data being read is copied from the file to a buffer on Peer 0 and distributed according to the data mapping of its destination. This makes I/O slow.
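As a minimal sketch of the input case, the following program reads a BLOCK-distributed array back from a file. The program name, the small array size, unit number 10, and the scratch file are assumptions made only so that the sketch is self-contained; they do not come from the description above. A compiler without HPF support treats the !HPF$ line as a comment and runs the program serially.

```fortran
PROGRAM read_distributed
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000      ! illustrative size (assumption)
  INTEGER, DIMENSION(n) :: a
  INTEGER :: i
!HPF$ DISTRIBUTE a(BLOCK)

  ! Create a scratch file holding n integers so the sketch needs no input file.
  OPEN (UNIT=10, STATUS='SCRATCH', FORM='FORMATTED')
  WRITE (10, *) (i, i = 1, n)
  REWIND (10)

  ! Under the implementation described above, the data would first be
  ! read into a buffer on Peer 0 and then distributed according to the
  ! BLOCK mapping of a.
  READ (10, *) a
  CLOSE (10)

  ! Print the first and last elements as a quick check.
  PRINT *, a(1), a(n)
END PROGRAM read_distributed
```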
The following list contains some guidelines to minimize the performance degradation caused by I/O. If your program reads or writes large volumes of data, read this carefully. When choosing among these guidelines, remember that generally speaking, computation is faster than communication, which is faster than I/O.
To designate a specific processor as Peer 0, use the -on command-line switch when executing the program (see Section 8.4.3). Ideally, Peer 0 should be a processor that has a lot of real memory and is attached to the file system to which you are reading or writing. For example, to make a processor named FRED Peer 0:
% a.out -peers 4 -on FRED,...
This specifies that the first processor (Peer 0) should be FRED. The three periods (...) specify that PSE should automatically select the other peers, based on load-balancing considerations. No spaces are allowed before the three periods or between them.
To avoid running out of memory or swamping the network, the best way to print a huge array is to print it in sections. For example, the array A can be printed in 1000-element sections. This causes the compiler to generate a buffer on Peer 0 sufficient for only 1000 elements, instead of a buffer sufficient for 10000000 elements.
DO i = 1, 10000000, 1000
  PRINT *, a(i:i+1000-1)
ENDDO
There may not be enough stack space available for the 10000000-element buffer needed to print all of A at once. In this case, a segmentation fault may occur, or the program may behave differently each time it is executed, depending on how much stack space is available on the node that is Peer 0 for that run. If segmentation faults occur at execution time, try increasing the stack space (see Section 7.12). Use the following csh commands to increase the sizes of the stack and data space (other shells require different commands):
% limit stacksize unlimited
% limit datasize unlimited
This increases the size available for the buffer to the maximum allowed by the current kernel configuration (see your system administrator to raise these limits further). A much more efficient solution, however, is to print the array in sections, instead of all at once.
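When the array size is not an exact multiple of the section size, the upper bound of the last section has to be clamped. The following is a minimal sketch of that case, assuming a small illustrative array size and a hypothetical section size named chunk; neither value comes from the text above. A compiler without HPF support treats the !HPF$ line as a comment.

```fortran
PROGRAM print_in_sections
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 2500       ! illustrative size, not a multiple of chunk
  INTEGER, PARAMETER :: chunk = 1000   ! hypothetical section size
  INTEGER, DIMENSION(n) :: a
  INTEGER :: i
!HPF$ DISTRIBUTE a(BLOCK)

  ! Fill the array with its own indices so the output is easy to check.
  a = (/ (i, i = 1, n) /)

  ! Print one section per iteration. MIN clamps the upper bound so the
  ! final, shorter section (2001:2500 here) stays within the array.
  DO i = 1, n, chunk
     PRINT *, a(i:MIN(i + chunk - 1, n))
  END DO
END PROGRAM print_in_sections
```

With this form, no section is ever larger than chunk elements, which is the property the guideline above relies on to keep the Peer 0 buffer small.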
Another way to speed up I/O is to read into, or write from, data that is mapped in such a way that it is stored on Peer 0 only. Since I/O happens only on Peer 0, no buffers are needed.
The following code:
INTEGER JSEQ
REAL SEQ_ARRAY(N)
DO I = 1, N
  READ(*,*) JSEQ
  SEQ_ARRAY(I) = REAL(JSEQ) - 0.5
ENDDO
can be re-written as:
INTEGER JSEQ
REAL SEQ_ARRAY(N)
REAL TMP(N)
!HPF$ TEMPLATE T(N)
!HPF$ DISTRIBUTE T(BLOCK)
!HPF$ ALIGN TMP WITH T(1)
DO I = 1, N
  READ(*,*) TMP(I)
ENDDO
SEQ_ARRAY = REAL(TMP) - 0.5
The first program reads into JSEQ. Because I/O occurs only on Peer 0, each value read on Peer 0 must subsequently be sent to all the other processors. Because the READ statement is inside a loop, this transfer is repeated on every iteration, which causes a dramatic reduction in performance.
In the second example, a temporary array TMP is created and mapped in such a way that it is stored only on Peer 0. There is no need to send the values on Peer 0 to all the other processors, because TMP, which is read into directly, resides only on Peer 0. Later, outside the loop, a single array assignment computes SEQ_ARRAY from the entire TMP array. This assignment involves data motion, because TMP is not distributed in the same way as SEQ_ARRAY; however, I/O is in general so much slower than anything else that simplifying the I/O should be the priority.
The template T is required in order to arrange for TMP to reside only on Peer 0. The TEMPLATE directive creates a template T, which the DISTRIBUTE directive distributes BLOCK:
!HPF$ TEMPLATE t(n)
!HPF$ DISTRIBUTE t(BLOCK)
The entire array TMP is aligned with only the first element of T. Since T is distributed BLOCK, this first element resides on Peer 0. The effect is that all of array TMP is serial on Peer 0:
!HPF$ ALIGN TMP with T(1)
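For reference, the rewritten fragment can be assembled into a complete program along the following lines. This is a sketch under stated assumptions: the program name, the small value of N, unit number 10, and the scratch file that stands in for the input are all hypothetical and chosen only to make the example self-contained. The directives are written exactly as in the fragment above; a compiler without HPF support treats them as comments.

```fortran
PROGRAM read_on_peer0
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 8          ! small illustrative size (assumption)
  REAL :: seq_array(n)
  REAL :: tmp(n)
  INTEGER :: i
!HPF$ TEMPLATE t(n)
!HPF$ DISTRIBUTE t(BLOCK)
!HPF$ ALIGN tmp WITH t(1)

  ! Stand-in for the real input: a scratch file with one value per record.
  OPEN (UNIT=10, STATUS='SCRATCH', FORM='FORMATTED')
  DO i = 1, n
     WRITE (10, *) i
  END DO
  REWIND (10)

  ! Each READ targets tmp, which is mapped entirely to Peer 0, so no
  ! value has to be sent to the other peers inside the loop.
  DO i = 1, n
     READ (10, *) tmp(i)
  END DO
  CLOSE (10)

  ! One array assignment after the loop moves the data into seq_array.
  seq_array = REAL(tmp) - 0.5

  PRINT *, seq_array
END PROGRAM read_on_peer0
```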