Load Sharing Facility (LSF)
Questions and Answers

HPC is an abbreviation of High Performance Computing. In this document, HPC refers to a set of Digital Alpha systems dedicated to CPU-intensive computing (named Alcor, Alioth, Alphaid, and Mizar). Tempus is our AlphaServer 4100; it has 4 CPUs and 4 gigabytes of physical memory. Tempus has been made available to researchers outside Memorial University as well as those within MUN. Kronos is a front-end system used by these "visitors" from outside MUN to submit jobs to Tempus.


What is LSF?

LSF is the Load Sharing Facility, a suite of programs developed by Platform Computing to manage batch compute jobs on clusters of computer systems. Users of C&C's Unix systems use LSF to submit compute jobs that require significant CPU time, memory, or disk space. Several server processes running on each system co-ordinate to distribute the load across the cluster. User jobs are submitted using the LSF queuing software and LSF determines where jobs will be run. LSF jobs are simply commands. (This is different from DQS where jobs were required to be Unix shell scripts.)

How do I submit a job to LSF?

Use the bsub command to submit jobs to LSF.
plato> bsub -q queue-name command
You may also choose to run your job on a specific machine or machines.
plato> bsub -q queue -m machine-name command
plato> bsub -q queue -m "machine-name1 machine-name2" command
Note that you must bind multiple machine names together using quotes as shown above.

It's also possible to specify commands as the input to bsub. Commands are terminated by typing the End-of-File character CTRL-D.

plato> bsub
bsub> command
bsub> [CTRL-D]
The bsub command can be interrupted by typing CTRL-C. More typically, you'll put the commands in a file (often called a "script"). You can also specify job options--like queue selection--in the script file.
plato> bsub < script-file
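Here is a minimal sketch of such a script file, offered as an illustration rather than a recommended setup. The file name myjob.lsf and the commands inside it are made up for the example; short is one of the queues described later in this document. Lines beginning with #BSUB carry options you would otherwise type on the bsub command line. The sketch submits the job only if bsub is found on your path.

```shell
# Write a small job script. The "#BSUB" line carries a bsub option.
# "myjob.lsf" and the commands in it are hypothetical.
cat > myjob.lsf <<'EOF'
#BSUB -q short
echo "job started"
hostname
EOF

# Submit only where LSF is installed.
if command -v bsub >/dev/null 2>&1; then
    bsub < myjob.lsf
else
    echo "bsub not found; myjob.lsf was written but not submitted"
fi
```

Keeping the options in the file means the same job can be resubmitted later without retyping them.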

We have some examples of bsub on-line: submitting SPSS jobs, compiling and submitting Fortran programs, and using Job Script files to simplify the command line.

Where will I find the errors and output from my job?

By default, output from your job is sent back to you via e-mail. You can request that the output be redirected into files using the `-o' and `-e' options to bsub. Of course, if your program explicitly writes to specific files, rather than using the standard input and output streams, you may find little output in the logfiles or your mail. If you are having trouble with your LSF jobs, remember to check your e-mail for error messages.
Morgan Users
Error messages are addressed to <user@morgan.ucs.mun.ca>.
Kronos Users
Error messages are addressed to <user@kronos.ucs.mun.ca>.
Remember that you may have turned on mail forwarding. Kronos users should note that Kronos now uses the forward command.
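As a sketch of the `-o' and `-e' options: the command built below should send the job's standard output to myjob.out and its standard error to myjob.err. The program name myprogram and both file names are hypothetical, and the command is only run if bsub is installed.

```shell
# Hypothetical file names for the job's output and error streams.
out=myjob.out
err=myjob.err

# -o names the file for standard output, -e the file for standard error.
cmd="bsub -q short -o $out -e $err ./myprogram"

if command -v bsub >/dev/null 2>&1; then
    $cmd
else
    echo "LSF not found; would run: $cmd"
fi
```

If the job fails, look in the error file first.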

How do I select a queue?

General Users
If you're not an HPC user or a Tempus user, then your jobs should be sent to one of fortran, short, or night. The fortran queue is used for very short compile and run jobs. It is used primarily by students in the introductory Fortran course (CS2602); it has a CPU limit of one minute. The short queue is more appropriate for jobs that require too much resource to be run interactively: large SPSS jobs, for example. The night queue is suitable for jobs that can be run overnight when the system is less loaded.

HPC Users
The queues on C&C's HPC cluster are named based on the CPU time limits given to jobs that execute from the particular queue; the queue names reflect this time limit. You should determine the maximum time the job will need and submit to the shortest queue group that offers the time you need. The queue configuration remains a work in progress. Currently there are several queues, but the primary HPC queues are long and week. Standard memory and file resources for these queues will be limited. Extended resource queues have also been created; these queues will have larger memory quotas: long-big and week-big.

Tempus Users
Users of the AlphaServer 4100 Tempus must submit to the long-tempus, 3day-tempus, or week-tempus queues.
Use of the week and week-tempus queues is discouraged as it hampers us in scheduling system maintenance.

Use the command bqueues to list the available queues. You may not be able to submit to some queues and other queues won't be useful for your application due to CPU limits (or memory limits).

plato> bqueues
QUEUE_NAME      PRIO NICE     STATUS      MAX  JL/U JL/P NJOBS  PEND  RUN  SUSP
priority-tempus 100    0   Open:Active      3    3    3     0     0     0     0
priority         43   10   Open:Active      -    -    2     0     0     0     0
fortran          35    8   Open:Active      -    -    2     0     0     0     0
morgan           35    8   Open:Active      -    -    2     0     0     0     0
night            35    8   Open:Active      -    -    2     0     0     0     0
wide-tempus      30    0   Open:Active      6    3    3     0     0     0     0
test             20    8   Open:Active      -    -    -     0     0     0     0
long             20    0   Open:Active      6    1    3     3     1     2     0
long-big         20    0   Open:Active      6    1    3     2     1     1     0
long-tempus      20    0   Open:Active      6    1    3     3     2     1     0
3day-tempus      15    0   Open:Active      3    1    3     7     4     3     0
week             10    0   Open:Active      6    1    3     1     0     1     0
week-big         10    0   Open:Active      6    1    3     1     0     1     0
week-tempus      10    0   Open:Active      3    1    3     5     3     2     0
For more detailed information, use the -l option. This command can produce a lot of output, so you may want to select specific queues.

There's more information here than most users will ever need to understand--let alone know :-). There are two classes of information that are important to all users: process limits and access restrictions. The primary process limits are CPULIMIT and MEMLIMIT. The primary access restrictions are USERS and HOSTS.

CPULIMIT
The CPULIMIT is shown as a number of minutes relative to the speed of some system in the cluster: a minute of CPU time on Tempus is more valuable than a minute on Plato or Alcor because the CPUs in Tempus are faster than those in Plato and Alcor. LSF attempts to compensate and give equal value to every job; faster CPUs result in reduced CPU limits.
MEMLIMIT
The MEMLIMIT places a limit on how big a program can get: the intent is to prevent any given system from running out of swap space.
USERS
The USERS parameter determines who is permitted to submit jobs to this queue.
HOSTS
The HOSTS parameter defines the hosts that serve the queue.
If your needs exceed the current limitations, please contact C&C.
plato> bqueues -l long-big
QUEUE: long-big
  -- No description provided.

PARAMETERS/STATISTICS
 PRIO NICE     STATUS       MAX JL/U JL/P NJOBS  PEND  RUN  SSUSP USUSP
  20    0    Open:Active      6    1    3    2     1     1     0     0

 CPULIMIT                           PROCLIMIT    
   1440.0 min of alioth.ucs.mun.ca      1     

 DATALIMIT    MEMLIMIT
262144 K     262144 K

SCHEDULING PARAMETERS
		 r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched  8.0   8.0   4.0    -       -     -    -     -     -      -      -  
 loadStop  16.0  12.0   8.0    -       -     -    -     -     -      -      -  

USERS:  hpc/ 
HOSTS:  hpc/ 
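If you only want the CPU limit for a few queues, you can filter the long listing. The sketch below assumes the layout shown in the sample output above, where the value sits on the line after the CPULIMIT heading; the helper name cpulimit_of is made up for this example, and the loop is skipped where bqueues is not installed.

```shell
# Print the line that follows the CPULIMIT heading in "bqueues -l" output.
# Assumes the heading and its value are on consecutive lines.
cpulimit_of() {
    awk '/CPULIMIT/ { if ((getline line) > 0) print line }'
}

if command -v bqueues >/dev/null 2>&1; then
    for q in long long-big week; do
        echo "$q:"
        bqueues -l "$q" | cpulimit_of
    done
fi
```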

How do I check on my jobs (and others)?

There are two commands that may help you in checking up on your jobs and the activity of the LSF system as a whole. The bjobs command will report on current, pending, or recently completed jobs. The bhosts command will report on the status of each system in the cluster.
bjobs
Run without arguments, bjobs will tell you the current status of your running and pending jobs.
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
18617 myname   RUN   priority   plato.ucs.m tempus.ucs. myjob1     Nov 19 18:51
18766 myname   PEND  priority   plato.ucs.m tempus.ucs. myjob2     Nov 19 21:48

bjobs -l
This will produce a long listing. It's most useful when you wish to determine why a pending job is not running. Reasons will vary: the systems may be busy, you may already be running your maximum allowable number of jobs, or a system with a needed resource may be unavailable. The long listing should tell you why your job is not running.

bjobs -a
This will show information on recently terminated jobs as well as those pending or running.

bjobs -u all
This will show you the jobs of all users.

bhosts
This will report the availability and the number of active jobs on each system. If one or more hosts is unavailable or closed, use bhosts -l to see why.

bhosts -l [hostname]
This will give a longer report on the specified system (or all systems if no hostname is specified). With the `-l' option, bhosts will show why a host is closed.
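When you have many jobs, a quick count by state can be easier to read than the full bjobs listing. This is a rough sketch that assumes the column layout shown in the bjobs sample above, with the state in the third column; the helper name count_by_stat is made up for this example.

```shell
# Count jobs per state (RUN, PEND, ...) from "bjobs" output on standard input.
# Skips the header line and any blank lines.
count_by_stat() {
    awk 'NR > 1 && NF > 0 { n[$3]++ } END { for (s in n) print s, n[s] }'
}

if command -v bjobs >/dev/null 2>&1; then
    bjobs | count_by_stat
fi
```

The order of the states in the output is not fixed.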

How do I kill a running job?

There are several ways to terminate running jobs. The most common way is to use bdel. After using bjobs to check on your jobs (as noted above), you can delete the job from the LSF queue with the command:
bdel <jobid>
Here the <jobid> would be the Job ID of the job you wish to terminate.
The bdel command will delete pending jobs as well as running jobs.

Special Topics for HPC and Tempus Users:

Where do I keep my data and programs?

When using the four older systems, keep your programs and data on the /h1 filesystem; you'll find you own a directory /h1/username. When using the 4100, Tempus, keep your programs and data in /h2/username. If you're an external user accessing our system from outside MUN, then you should keep data in /k1, the Kronos filesystem.

You should change your current working directory to your /h1 (/h2 or /k1 as appropriate) directory to do all your HPC work. In particular, your current working directory should be in /h1 (/h2 or /k1) when you submit a job.

plato> cd /h1/username
plato> bsub scriptfile
You may use a subdirectory of your /h1 directory without problem; the important thing is that your current working directory should be within the /h1 (/h2 or /k1) filesystem.
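If you want a reminder before you submit, a small check like the one below can tell you whether your current directory is inside one of these filesystems. It is only a sketch: the function name in_hpc_filesystem is made up, and it tests the path name as your shell reports it in $PWD.

```shell
# Succeed if the given path lies within /h1, /h2 or /k1.
in_hpc_filesystem() {
    case "$1" in
        /h1/*|/h2/*|/k1/*) return 0 ;;
        *) return 1 ;;
    esac
}

if in_hpc_filesystem "$PWD"; then
    echo "ok to submit from $PWD"
else
    echo "change to your /h1, /h2 or /k1 directory before running bsub"
fi
```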

If you require additional space during computation, you can use the /scratch filesystem local to the execution host. Using the scratch disks will also help solve some performance problems (see the notes on NFS below). See the notes on copying files for how to make use of the /scratch filesystems. Users of the four older systems must keep in mind that there are four distinct scratch disks. This is particularly important if one wants to use the output of one job as input for a succeeding job. If a job running on Alphaid puts data in /scratch, it won't be visible to a job that runs on Alcor. Thus you should copy the necessary output files back to /h1 after your model completes.

What is NFS and how does it affect my jobs?

Tempus Users
If your jobs run only on Tempus, NFS will not affect them. Because of the problems associated with NFS--described below--Tempus has been configured so that it does not import any files from other hosts. Thus, for your active jobs, all I/O will be local to the system.
NFS stands for Network File System. Mizar, Alphaid, and Alioth use NFS to access the file systems on Alcor, which is the host of the /h1 filesystem. If your files are stored on Alcor and your job is executed on Mizar, Mizar must access your files across the network. This means that any Input/Output (including loading your program) will be affected by the speed of the network. How much this affects your job depends on how much I/O you do. If you have a large program or you read or write a lot of data, then your job may be affected by network lag. If the size of the compiled executable is larger than 10 or 15 Megabytes, there is a potential for one system to be reading and re-reading that file across the network many times (see the Q&A on swapping). If you read or write more than 10 Megabytes every 2 to 3 minutes, you may also find that network lag impacts your jobs.

If you're not using a lot of I/O, you needn't worry about NFS delays. If you are using I/O intensive processing (e.g. image processing), you may want to specify that jobs run on the same system that holds your files. You may instead decide to copy the input files at the beginning of a job and output files at the end of the job so that both input and output are on the system that will execute the job (see below).
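To judge whether your program is large enough for NFS to matter, you can check the size of the executable before you submit. The sketch below uses the 10 Megabyte figure mentioned above as its threshold. The helper name size_mb, the program name ./model, and the host name alcor as typed in the suggested command are assumptions made for this example.

```shell
# Report a file's size in whole megabytes.
size_mb() {
    bytes=$(wc -c < "$1")
    echo $((bytes / 1048576))
}

prog=./model    # hypothetical program name
if [ -f "$prog" ] && [ "$(size_mb "$prog")" -gt 10 ]; then
    # Suggest pinning the job to the host that holds the files.
    echo "$prog is large; consider: bsub -q long -m alcor $prog"
fi
```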

How do I copy files from my source filesystem to a disk local to the system that executes the job? (or ``How can I use the /scratch filesystems?'')

Read the Q&A on NFS to understand why this question is important. This is not an issue for most Morgan users: Morgan user jobs will normally run on Plato, so the home directory should be local and Plato jobs won't be affected by NFS; only HPC and Tempus users actually have access to the /scratch filesystems. If you're working on Tempus, you won't likely need to use /scratch, though it does provide additional temporary disk space.

The systems in C&C's HPC cluster are configured so that /scratch always references a local file system. Thus data placed on /scratch by a job running on Alcor will not be visible to a job running on Alphaid. It's best to copy the necessary input files onto /scratch at the beginning of your job and copy the necessary output files out of /scratch at the end of your job. That way, all the essential files will be kept in your /h1 directory and will be accessible from all systems.

Example:
I have a program called model with four input files: domain, sample, snap.in, and param. It will create two output files run.log and snap.out. I want to copy the input files onto the scratch disk, run the model, and copy the output files back to my current directory. The following script will do the trick:
#BSUB -q long

dest=/scratch/$USER

# Make sure the scratch directory exists (harmless if it already does),
# then copy the program and its input files into it.
mkdir -p $dest
cp model param domain sample snap.in $dest

# The parentheses on the following line make the effects of
# "cd $dest" temporary.  The "&&" runs the model only if the cd
# succeeded.  The program "model" is coded to create "run.log"
# and "snap.out".
(cd $dest && ./model)

# Copy the outputs back to the current directory
cp $dest/run.log $dest/snap.out .

# This would be a good place to remove any unnecessary output files.
# (cd $dest && rm file1 file2 ...)

What is swapping and how does it affect my jobs?

UNIX systems are multi-processing systems: several processes share the system's CPU and memory. The system is designed so that program size is not limited by physical memory (i.e. the memory cards in the system box). The disk(s) may be used to store data that is not immediately required by a running process. This memory management allows a computer to run several large programs even if they cannot all fit in physical memory at the same time. For example, a system with 128 megabytes of physical memory could run 3 or 4 processes that each require 64 megabytes of memory; it's also possible to run a single job that requires more than physical memory. The disk space allocated for saving the data of running processes is called swap and the activity of copying data to and from swap space is called swapping. This memory management system is called virtual memory.

There are three ways that swapping will affect your processes. First, the amount of swap space is still limited to the space allocated on the swap disks. There may not be enough swap space available to service your process and your program will either fail to start or it may possibly be killed by the system later, if the system runs out of space. You can monitor the current swap space usage with the command swapon -s. Second, running processes claim physical memory by pushing other processes' data out of physical memory and onto the swap disk(s). If several large jobs start pushing each other out of memory, the system wastes a great deal of time doing disk copies and the work is slowed down for all processes; in the extreme, this is called thrashing. Third, swap space shares the disk(s) with the programs and data space. Any swapping activity can create contention for disk reads and writes of data or program.

NFS can also be a factor in swapping. When program code is marked for removal in memory, it isn't copied to the swap disk. Instead, the memory is simply cleared for use. When the process becomes active again and needs that program code, the system reads the code from the original file. If your file is being accessed via NFS, then network lag will be a factor in how quickly your program starts running again.

What is checkpointing and how can I use it?

LSF provides the ability to perform process checkpointing. Checkpointing involves periodically creating an image of the running process that can be used as an initial state for a subsequent job. This is useful if your job requires more CPU time than the available queues allow, or it will let you recover from a system crash. To avail of this facility, you must replace the standard Unix linker ld with the LSF linker ckpt_ld (or ckpt_ld_f for Fortran users). In the simplest case, this will require you to use the "-c" flag to the compiler to create the object file and issue a ckpt_ld_f command manually to link your object files. In the worst case, it will mean some minor adjustments to your Makefile. An example is given in the ckpt_ld man page and on the examples page.
Paul Fardy
Modified 2000-08-07.