10.3 HPF Profiling Issues

This section explains issues that may occur while profiling HPF programs.

10.3.1 Array Profiles

Array operations in HPF programs may be computed in parallel processes. Interval profiling collects information about these array operations in two primary ways:

  1. Array assignment statements involving whole arrays or array sections. For example:
    INTEGER Z(M), A(MB), V(MB), B(MB)
    

    The following array section assignments are profiled:

    Z = 2
    A = B
    Z(1:M) = A(3:M+2)
    a(1:3) = b(v(2:4))
    

    The following array element assignments are not profiled:

    a(1) = b(2)
    a(v(2)) = b(3)
    
  2. Communication cost associated with an array reference involving data distribution can be profiled. For example:
    PARAMETER (N=28)
    PARAMETER (M=180)
    PARAMETER (MA=280)
    PARAMETER (MB=208)
    
    INTEGER X(N),Y(N),Z(M),A(MB),SERIAL(MB)
    !HPF$ TEMPLATE T(M)
    !HPF$ TEMPLATE T1(MA)
    !HPF$ DISTRIBUTE T(CYCLIC)
    !HPF$ DISTRIBUTE T1(BLOCK)
    !HPF$ ALIGN Z(i) WITH T(i)
    !HPF$ ALIGN A(i) WITH T1(i+10)
    !HPF$ ALIGN X(i) WITH T(i)
    !HPF$ ALIGN Y(i) WITH T(i)
    

    Communications are profiled in the following array reference operations:

    Z(1:M) = A(3:M+2)  - arrays Z and A are not aligned
    
    SERIAL = A         - A is distributed, SERIAL is not
    

    Communications are not profiled in the following array reference operation because only local data is copied:

    X(1:N) = Y(1:N)   - arrays X and Y are aligned
    

    Communications are not profiled in the following example because the array SERIAL is not distributed:

    SERIAL(2) = -1   - array serial is not distributed
    

10.3.2 Reporting Array Operations

Array operations are reported in a format different from the user's original source code. In addition to identifiers matching the user-specified arrays, compiler-generated temporaries are reported using the notation @n, where n is a compiler-generated integer. Internal operators are reported using the notation vXXXX, where v indicates a vector operation and XXXX is the operation type, such as PLUS (vPLUS is a vector add). An operator without a leading v is a scalar operator.

Some of the internal operators for vector operations are:

The same operators without the v prefix are used in scalar expression contexts:

10.3.2.1 Examples of Transformed Distributed Array Names

The following code fragment is an example of transformed distributed array names:

!! DECLARATIONS FOR ARRAY ASSIGNMENT STATEMENT SAMPLES

      PARAMETER (N=15)
!HPF$ TEMPLATE T(N)
!HPF$ DISTRIBUTE T(BLOCK)
!HPF$ TEMPLATE T2(N,N)
!HPF$ DISTRIBUTE T2(BLOCK,BLOCK)

      INTEGER A(N), B(N), C(N,N), V(N), SERIAL(N)
!HPF$ ALIGN A(i) WITH T(i)
!HPF$ ALIGN B(i) WITH T(i)
!HPF$ ALIGN V(i) WITH T(i)
!HPF$ ALIGN C(i,j) WITH T2(i,j)
      LOGICAL FAIL

Transformed Sending Variables      Transformed Receiving Variables  Original Source Statement using Distributed Arrays
B(vPLUS((1:14),1))             =>  A((1:14))                        a(1:n-1) = b(2:n)
B(2)                           =>  A(1)                             a(1) = b(2)
B((1:15))                      =>  SERIAL((1:15))                   serial = b
B((1:3))                       =>  A(V((1:3)))                      a(v(1:3)) = b(1:3)
B(vPLUS((1:3),1))              =>  @2((1:3))                        a(v(1:3)) = b(2:4)
@2((1:3))                      =>  A(V((1:3)))
B(2)                           =>  A(V(2))                          a(v(2)) = b(2)
B(3)                           =>  @3                               a(v(2)) = b(3)
@3                             =>  A(V(2))
B(V((1:15)))                   =>  A((1:15))                        a(1:n) = b(v(1:n))
V(vPLUS((1:3),1))              =>  @4((1:3))                        a(1:3) = b(v(2:4))
B(@4((1:3)))                   =>  A((1:3))
B(V(2))                        =>  A(2)                             a(2) = b(v(2))
V(3)                           =>  @5                               a(1) = b(v(3))
B(@5)                          =>  A(1)

10.3.3 Profiling Extrinsic and Library Routines Using Interval Profiling

To profile a non-F90 routine with interval profiling, call the routine from an F90 subroutine, and call that subroutine from elsewhere in the application. This "wrapper", which does nothing but call the non-F90 routine, is required so that interval profiling can record the start and finish of the call.
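
The wrapper arrangement described above might look like the following sketch. To keep the example self-contained, a stand-in routine written in Fortran takes the place of the real non-F90 routine, which would normally be an external C or library routine linked into the program. The names legacy_scale, scale_wrapper, and wrapper_demo are hypothetical and do not come from the profiler documentation.

```fortran
! Stand-in for a non-F90 routine (hypothetical). In a real application this
! would be an external routine linked in, not Fortran source in the same file.
subroutine legacy_scale(x, n, factor)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: x(n)
  real, intent(in)    :: factor
  integer :: i
  do i = 1, n
    x(i) = x(i) * factor
  end do
end subroutine legacy_scale

! F90 wrapper: it only forwards its arguments to the non-F90 routine,
! giving the application an F90 routine to call in its place.
subroutine scale_wrapper(x, n, factor)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: x(n)
  real, intent(in)    :: factor
  external legacy_scale
  call legacy_scale(x, n, factor)
end subroutine scale_wrapper

program wrapper_demo
  implicit none
  integer, parameter :: n = 4
  real :: x(n)
  x = 1.0
  ! The application calls the wrapper, never the non-F90 routine directly.
  call scale_wrapper(x, n, 2.0)
  print *, x
end program wrapper_demo
```

Keeping the wrapper free of any other work means that the time attributed to it should correspond closely to the time spent in the wrapped routine.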

10.3.4 DO Loop Profiles

Information about DO loops in an HPF program can also be gathered. The total number of invocations, cumulative number of iterations, and the cumulative time spent associated with each DO statement execution are recorded. Interval profiling understands many different forms of the DO construct. For example:

10.3.5 Branching and Premature Exit in Interval Profiling

Recording an interval profiling event is handled by a pair of profiling routine calls. The HPF compiler inserts (instruments) a call to a start profiling routine before the location of the event of interest and a call to a finish profiling routine after it. To complete the recording of an event, both the start and finish profiling routines must be called successfully and in the proper sequence. However, many programming styles of exiting and branching in a program can disrupt this sequence. Some cases are:

If a program stops prematurely (such as by a STOP statement) or branches out in the middle of an event recording, without a closing finish profiling call for that recording event, the missing finish call can disrupt the interval profiling call stacks and may result in timing and call trace errors. The current version of interval profiling provides a mechanism to handle the closing of finish calls automatically for all cases in the previous list, with the exception of the last two.
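
The effect of a missing finish call can be illustrated with a simplified model. The routines prof_start and prof_finish below are hypothetical stand-ins, not the actual profiling routines, and the calls are written by hand where the text above says the compiler would insert them. The model only counts events that have been started but not finished.

```fortran
module prof_model
  implicit none
  integer :: depth = 0   ! events started but not yet finished
contains
  ! Hypothetical stand-in for the start profiling routine.
  subroutine prof_start()
    depth = depth + 1
  end subroutine prof_start

  ! Hypothetical stand-in for the finish profiling routine.
  subroutine prof_finish()
    depth = depth - 1
  end subroutine prof_finish
end module prof_model

program pairing_demo
  use prof_model
  implicit none
  integer :: i

  call prof_start()            ! start call placed before the event (a DO loop)
  do i = 1, 10
    if (i == 3) go to 100      ! branch out in the middle of the event
  end do
  call prof_finish()           ! finish call placed after the event; the branch skips it
100 continue

  ! A nonzero count means a start call was never matched by a finish call.
  print *, 'unfinished events:', depth
end program pairing_demo
```

In this model the branch leaves one event open, which is the kind of unbalanced start and finish sequence that the automatic closing mechanism described above is meant to repair.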

10.3.6 Alternate Routine Entries

With pc sampling, statistics for alternate entries (ENTRY statements) are attributed to their parent entries, and only the parent entries are reported.
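
As a reminder of what an alternate entry looks like, the following sketch defines a subroutine with one ENTRY statement. The names set_total, add_to_total, and entry_demo are hypothetical. Going by the rule above, time sampled while executing through add_to_total would be expected to appear under set_total in a pc sampling report.

```fortran
! Parent entry: set_total. Alternate entry: add_to_total.
subroutine set_total(start)
  implicit none
  real :: start, amount
  real, save :: total = 0.0

  total = start
  return

entry add_to_total(amount)
  ! Execution begins here when the alternate entry is called.
  total = total + amount
  print *, 'running total:', total
end subroutine set_total

program entry_demo
  implicit none
  call set_total(10.0)      ! enters through the parent entry
  call add_to_total(2.5)    ! enters through the alternate entry
end program entry_demo
```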

10.3.7 Local Routines in CONTAINS Statements

In Digital Fortran 90, local routines in different CONTAINS statements can have the same name. Interval profiling treats local routines with the same name as different routines, distinguishing them by their unique traced patterns and locations (line numbers) in the code.
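
The following sketch shows the situation described above: two host routines that each contain a local routine named helper. The names process_a, process_b, helper, and contains_demo are hypothetical.

```fortran
subroutine process_a()
  implicit none
  call helper()              ! resolves to the helper local to process_a
contains
  subroutine helper()
    print *, 'helper in process_a'
  end subroutine helper
end subroutine process_a

subroutine process_b()
  implicit none
  call helper()              ! resolves to the helper local to process_b
contains
  subroutine helper()
    print *, 'helper in process_b'
  end subroutine helper
end subroutine process_b

program contains_demo
  implicit none
  call process_a()
  call process_b()
end program contains_demo
```

Because the two helper routines live at different line numbers and are reached through different callers, a profile can keep them apart even though the names match.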

10.3.8 Using pprof to Interpret Program Performance

pprof is designed to help you find the causes of performance problems by detecting:

Use the suggestions shown in Table 10-7 as guidelines for choosing and using the various report options to collect data and generate reports to analyze the performance problems at both system and program levels.

Table 10-7 Profiling Strategies

Objective  Use
Look at an entire nonshared program  PC sampling profiling (-method sampling)
Focus only on the user code in the program, at a finer granularity in the user source code  Interval profiling (-method interval)
View the big picture  Averaging with minimum and skew analysis (-statistics average)
Produce a detailed analysis of a particular peer  Per-peer analysis (-statistics per_peer)
Find any unbalanced system loads in the PSE cluster that may have caused the performance problems in your program's execution  The -process_summary option
Produce a communication profile  The -comm_peers option
Find the locations in user code where communication involves the use of HPF distribution directives  The -comm_statements option
Measure and analyze the time the program spends executing particular event types  The -interval_summary and -event_type options together

Note
If you compile or analyze a program several times, assign unique names to the output files so that you can distinguish them from one another.

10.3.8.1 A Recommended Step by Step Road Map

The following is a recommended road map for using the profiler.

  1. Start at a coarse level of detail and work down toward finer detail: program -> routine -> statement -> communications and array distribution.
  2. Choose a quiet partition to run your profiled application, avoiding perturbation by other jobs. The machines should be very similar in terms of CPU type, clock speed, memory, and network interface type to avoid skew between peer processes.
  3. Look at the big picture for potential performance problems by using pc sampling to examine the entire application. Use the -routines and -statistics average options with pprof (this is the pprof default report).

    The report identifies which routines account for most of the execution time. If most of the execution time is in user-written code, continue with the next step. Otherwise, look for ways to improve the libraries and other non-user code being used. Use the pc sampling -statements option, which details relative execution at the source statement level.

  4. Use interval profiling to collect more detailed data on the execution of the user-written HPF code.
  5. Use pprof with the -routines and -comm_statements options to find out which routines are consuming the most time, and where operations on distributed arrays impact the communication activity in your program.

    If communication overhead is an insignificant factor, performance bottlenecks may exist in the use of FORALL, DO loops and local array operations in the program. You can examine time spent in specific program operations by using the -interval_summary or the -call_trace option.

Other options to try: