This section explains issues that may occur while profiling HPF programs.
Array operations in HPF programs may be computed in parallel processes. Information about these arrays is profiled by interval profiling in two primary ways:
INTEGER Z(M), A(MB), V(MB), B(MB)
The following array section assignments are profiled:
Z = 2
A = B
Z(1:M) = A(3:M+2)
a(1:3) = b(v(2:4))
The following array element assignments are not profiled:
a(1) = b(2)
a(v(2)) = b(3)
PARAMETER (N=28)
PARAMETER (M=180)
PARAMETER (MA=280)
PARAMETER (MB=208)
INTEGER X(N),Y(N),Z(M),A(MB),SERIAL(MB)
!HPF$ TEMPLATE T(M)
!HPF$ TEMPLATE T1(MA)
!HPF$ DISTRIBUTE T(CYCLIC)
!HPF$ DISTRIBUTE T1(BLOCK)
!HPF$ ALIGN Z(i) WITH T(i)
!HPF$ ALIGN A(i) WITH T1(i+10)
!HPF$ ALIGN X(i) WITH T(i)
!HPF$ ALIGN Y(i) WITH T(i)
Communications are profiled in the following array reference operations:
Z(1:M) = A(3:M+2)   ! arrays Z and A are not aligned
SERIAL = A          ! A is distributed, SERIAL is not
Communications are not profiled in the following array reference operation because only local data is copied:
X(1:N) = Y(1:N)     ! arrays X and Y are aligned
Communications are not profiled in the following example because the array SERIAL is not distributed:
SERIAL(2) = -1      ! array SERIAL is not distributed
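Whether communication occurs comes down to whether both sides of an assignment live on the same peer. The following sketch is a plain Fortran 90 model of that idea, not compiler output. It assumes 4 peers (the hypothetical parameter np), the usual HPF mapping rules (CYCLIC deals template elements round-robin, BLOCK uses blocks of ceiling(size/np) elements), and two hypothetical helper functions, owner_cyclic and owner_block. It counts how many elements of each assignment would have to move between peers.

```fortran
program owner_model
  implicit none
  integer, parameter :: n = 28, m = 180, ma = 280
  integer, parameter :: np = 4            ! assumed number of peers (hypothetical)
  integer :: i, moved_za, moved_xy

  moved_za = 0
  moved_xy = 0

  ! Z(1:M) = A(3:M+2): Z(i) sits on T(i) (CYCLIC),
  ! while A(i+2) sits on T1(i+2+10) (BLOCK)
  do i = 1, m
     if (owner_cyclic(i) /= owner_block(i + 2 + 10, ma)) moved_za = moved_za + 1
  end do

  ! X(1:N) = Y(1:N): X(i) and Y(i) both sit on T(i), so the owners always match
  do i = 1, n
     if (owner_cyclic(i) /= owner_cyclic(i)) moved_xy = moved_xy + 1
  end do

  print *, 'Z = A elements needing communication:', moved_za
  print *, 'X = Y elements needing communication:', moved_xy

contains

  ! Peer (0-based) that owns template element k under CYCLIC
  integer function owner_cyclic(k)
    integer, intent(in) :: k
    owner_cyclic = mod(k - 1, np)
  end function owner_cyclic

  ! Peer (0-based) that owns element k of a BLOCK template with tsize elements
  integer function owner_block(k, tsize)
    integer, intent(in) :: k, tsize
    integer :: blk
    blk = (tsize + np - 1) / np           ! ceiling(tsize/np)
    owner_block = (k - 1) / blk
  end function owner_block

end program owner_model
```

In this model the count for X = Y is always zero because both arrays share one template and one alignment, while the count for Z = A is nonzero because the two templates are distributed differently.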
Array operations are reported in a format different from the user's original source code. In addition to the identifiers matching the user-specified arrays, compiler-generated temporaries are reported using the notation @n, where n is a compiler-generated integer. Internal operators are reported using the notation vXXXX, where v indicates a vector operation and XXXX is the operator type, such as PLUS (so vPLUS is a vector add). An operator without a leading v is a scalar operator.
Some of the internal operators for vector operations are:
The same operators without the v prefix are used in scalar expression contexts:
The following code fragment is an example of transformed distributed array names:
!! DECLARATIONS FOR ARRAY ASSIGNMENT STATEMENT SAMPLES
PARAMETER (N=15)
!HPF$ TEMPLATE T(N)
!HPF$ DISTRIBUTE T(BLOCK)
!HPF$ TEMPLATE T2(N,N)
!HPF$ DISTRIBUTE T2(BLOCK,BLOCK)
INTEGER A(N), B(N), C(N,N), V(N), SERIAL(N)
!HPF$ ALIGN A(i) WITH T(i)
!HPF$ ALIGN B(i) WITH T(i)
!HPF$ ALIGN V(i) WITH T(i)
!HPF$ ALIGN C(i,j) WITH T2(i,j)
LOGICAL FAIL
| Transformed Sending Variables | | Transformed Receiving Variables | Original Source Statement using Distributed Arrays |
|---|---|---|---|
| B(vPLUS((1:14),1)) | => | A((1:14)) | a(1:n-1) = b(2:n) |
| B(2) | => | A(1) | a(1) = b(2) |
| B((1:15)) | => | SERIAL((1:15)) | serial = b |
| B((1:3)) | => | A(V((1:3))) | a(v(1:3)) = b(1:3) |
| B(vPLUS((1:3),1)) | => | @2((1:3)) | a(v(1:3)) = b(2:4) |
| @2((1:3)) | => | A(V((1:3))) | |
| B(2) | => | A(V(2)) | a(v(2)) = b(2) |
| B(3) | => | @3 | a(v(2)) = b(3) |
| @3 | => | A(V(2)) | |
| B(V((1:15))) | => | A((1:15)) | a(1:n) = b(v(1:n)) |
| V(vPLUS((1:3),1)) | => | @4((1:3)) | a(1:3) = b(v(2:4)) |
| B(@4((1:3))) | => | A((1:3)) | |
| B(V(2)) | => | A(2) | a(2) = b(v(2)) |
| V(3) | => | @5 | a(1) = b(v(3)) |
| B(@5) | => | A(1) | |
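The transformed notation in the table can be read back into ordinary Fortran 90. The sketch below is one reading of two table rows, not output from the profiler: the index vector idx stands in for the range (1:14), adding 1 to it plays the role of vPLUS, and the array tmp stands in for the compiler temporary @2. The contents of b and v are arbitrary test values chosen for the illustration.

```fortran
program notation_model
  implicit none
  integer, parameter :: n = 15
  integer :: a(n), b(n), v(n), tmp(3), idx(14), i

  b = (/ (10*i, i = 1, n) /)             ! arbitrary data
  v = (/ (n + 1 - i, i = 1, n) /)        ! arbitrary permutation used as a vector subscript
  idx = (/ (i, i = 1, 14) /)             ! the range (1:14)

  ! B(vPLUS((1:14),1)) => A((1:14))   corresponds to   a(1:n-1) = b(2:n)
  a = 0
  a(1:14) = b(idx + 1)
  print *, all(a(1:n-1) == b(2:n))

  ! B(vPLUS((1:3),1)) => @2((1:3)), then @2((1:3)) => A(V((1:3)))
  ! corresponds to   a(v(1:3)) = b(2:4), with tmp standing in for @2
  tmp = b(idx(1:3) + 1)
  a(v(1:3)) = tmp
  print *, all(a(v(1:3)) == b(2:4))
end program notation_model
```

Both checks print T, which shows that the two-step form with a temporary produces the same result as the original single statement.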
To profile a non-F90 routine, call it from an F90 subroutine that is, in turn, called from elsewhere in the application. This "wrapper", which does nothing but call the non-F90 routine, is required so that interval profiling can start and finish around the call.
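A wrapper of this kind can be as small as the sketch below. All names are hypothetical. The routine legacy_scale stands in for the non-F90 routine, which in practice would be compiled separately (for example from C or FORTRAN 77); it is written in Fortran here only so that the example is self-contained.

```fortran
! Stand-in for the non-F90 routine (hypothetical name)
subroutine legacy_scale(x, n)
  integer n, i
  real x(n)
  do i = 1, n
     x(i) = 2.0 * x(i)
  end do
end subroutine legacy_scale

! F90 wrapper: its only job is to call the non-F90 routine
subroutine legacy_scale_wrapper(x, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: x(n)
  external legacy_scale
  call legacy_scale(x, n)
end subroutine legacy_scale_wrapper

program call_wrapper
  implicit none
  real :: x(4)
  external legacy_scale_wrapper
  x = (/ 1.0, 2.0, 3.0, 4.0 /)
  call legacy_scale_wrapper(x, 4)        ! the application calls the wrapper, not the legacy routine
  print *, x
end program call_wrapper
```

The rest of the application calls legacy_scale_wrapper everywhere it previously called the non-F90 routine directly.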
Information about DO loops in an HPF program can also be gathered. For each DO statement, the total number of invocations, the cumulative number of iterations, and the cumulative time spent are recorded. Interval profiling understands many different forms of the DO construct. For example:
ii = 0
do 80 while(ii < 5)
. . .
ii = ii+1
80 continue
ii = 0
do while(ii < 5)
. . .
ii = ii+1
end do
ii = 0
do 80 ii = 1,5
do 80 jj = 1,10,2
. . .
80 print *, "ii, jj:", ii, jj
do 90 ii = 1,10
if(ii > 5) cycle
do 80 jj = 2,5
. . .
80 if(jj > 4) exit
. . .
90 continue
ii = 0
test: do ii = 1,8
. . .
if(ii .gt. xx) cycle test
. . .
end do test
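To make the difference between invocations and cumulative iterations concrete, the sketch below counts both by hand for a nested loop with the same bounds as the ii/jj example above. It assumes that an invocation is one execution of the DO statement itself and that an iteration is one execution of the loop body; the counter names are hypothetical, and the profiler's own bookkeeping is not shown.

```fortran
program do_counts
  implicit none
  integer :: ii, jj
  integer :: outer_inv, outer_iter, inner_inv, inner_iter

  outer_inv = 0
  outer_iter = 0
  inner_inv = 0
  inner_iter = 0

  outer_inv = outer_inv + 1              ! the outer DO statement executes once
  do ii = 1, 5
     outer_iter = outer_iter + 1
     inner_inv = inner_inv + 1           ! the inner DO statement executes once per outer iteration
     do jj = 1, 10, 2
        inner_iter = inner_iter + 1
     end do
  end do

  print *, 'outer DO: invocations =', outer_inv, ' iterations =', outer_iter
  print *, 'inner DO: invocations =', inner_inv, ' iterations =', inner_iter
end program do_counts
```

Under these assumptions the outer DO has 1 invocation and 5 iterations, while the inner DO has 5 invocations and 25 cumulative iterations.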
Recording an interval profiling event is handled by a pair of profiling routine calls. The HPF compiler inserts (instruments) a call to a start profiling routine before the event of interest and a call to a finish profiling routine after it. To complete the recording for an event, both the start and finish profiling routines must be called successfully and in the proper sequence. However, many programming styles of exiting and branching can disrupt this sequence. Some cases are:
If a program stops prematurely (such as by a STOP statement) or branches out in the middle of an event recording, without a closing finish profiling call for that recording event, the missing finish call can disrupt the interval profiling call stacks and may result in timing and call trace errors. The current version of interval profiling provides a mechanism to handle the closing of finish calls automatically for all cases in the previous list, with the exception of the last two.
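The effect of a missing finish call can be modeled with a simple counter. In the sketch below, prof_start and prof_finish are hypothetical stand-ins for the instrumented start and finish routines, and depth counts events that were started but never finished. The sketch shows only the unbalanced state that a branch can leave behind; it does not reproduce the automatic closing mechanism described above.

```fortran
module prof_model
  implicit none
  integer :: depth = 0                   ! events started but not yet finished
contains
  subroutine prof_start()                ! hypothetical stand-in for the start routine
    depth = depth + 1
  end subroutine prof_start
  subroutine prof_finish()               ! hypothetical stand-in for the finish routine
    depth = depth - 1
  end subroutine prof_finish
end module prof_model

program unbalanced_event
  use prof_model
  implicit none
  integer :: ii

  call prof_start()                      ! event: the DO loop
  do ii = 1, 10
     if (ii > 5) go to 100               ! branch out in the middle of the event
  end do
  call prof_finish()                     ! skipped when the branch is taken

100 continue
  print *, 'events left open:', depth
end program unbalanced_event
```

The program reports one event left open, which is the kind of unmatched start call that can corrupt timing and call trace data if nothing closes it.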
For PC sampling, statistics from alternate entries are attributed only to their parent entries, and only parent entries are reported.
In Digital Fortran 90, you can have the same name for local routines in different CONTAINS statements. Interval profiling recognizes local routines with the same name as different routines using their unique traced patterns and locations (line numbers) in the code.
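As an illustration of the situation described, the following sketch defines two local routines that are both named helper, each inside the CONTAINS section of a different host routine. The module and routine names are hypothetical.

```fortran
module same_name_demo
  implicit none
contains

  subroutine first()
    call helper()
  contains
    subroutine helper()                  ! local to first
      print *, 'helper inside first'
    end subroutine helper
  end subroutine first

  subroutine second()
    call helper()
  contains
    subroutine helper()                  ! local to second, same name
      print *, 'helper inside second'
    end subroutine helper
  end subroutine second

end module same_name_demo

program same_name_main
  use same_name_demo
  implicit none
  call first()
  call second()
end program same_name_main
```

Each call resolves to the helper of its own host, so the two routines are distinct even though they share a name.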
pprof is designed to help you find the causes of performance problems by detecting:
Use the suggestions shown in Table 10-7 as guidelines for choosing and using the various report options to collect data and generate reports to analyze the performance problems at both system and program levels.
| Objective | Use |
|---|---|
| Look at an entire nonshared program | PC sampling profiling (-method sampling) |
| Focus only on the user code in the program, at a finer granularity in the user source code | Interval profiling (-method interval) |
| View the big picture | Averaging with minimum and skew analysis (-statistics average) |
| Produce a detailed analysis of a particular peer | Per peer analysis (-statistics per_peer) |
| Find any unbalanced system loads in the PSE cluster that may have caused the performance problems in your program's execution | The -process_summary option |
| Produce a communication profile | -comm_peers option |
| Find the locations in user code where communication involves the use of HPF distribution directives | -comm_statements option |
| Measure and analyze the time the program spends executing particular event types | -interval_summary and -event_type together |
The following is a recommended road map for using the profiler.
The report identifies which routines account for most of the execution time. If most of the execution time is in user-written code, continue with the next item. Otherwise, look for ways to improve the libraries and other non-user code that the program uses. Use PC sampling's -statements option, which details relative execution time at the source statement level.
If communication overhead is an insignificant factor, performance bottlenecks may exist in the use of FORALL, DO loops and local array operations in the program. You can examine time spent in specific program operations by using the -interval_summary or the -call_trace option.
Other options to try:
The resulting process summary report compares the minimum and maximum values related to:
It gives you clues about how well the system load is balanced among peers. To achieve better system balance and performance in subsequent program runs, use run-time switches (such as -on and -partition) to select a set of machines that better matches the application.
The report details the communication activity within the program, including time spent communicating, idle time, and message counts and sizes. Use it to compare communications at a high level, for example when the application is built using different distribution directives.