Load Sharing Facility (LSF)
Questions and Answers
HPC is an abbreviation of High Performance Computing.
In this document, HPC refers to a set of Digital Alpha systems
dedicated to CPU intensive computing (named Alcor, Alioth, Alphaid,
and Mizar).
Tempus is our AlphaServer 4100; it has 4 CPUs and 4 gigabytes
of physical memory. Tempus has been made available
to researchers outside Memorial University as well as those
within MUN.
Kronos is a front-end system used by
these "visitors" from outside MUN to submit jobs to Tempus.
LSF is the Load Sharing Facility, a suite of programs developed by
Platform Computing to manage
batch compute jobs on clusters of computer systems. Users of
C&C's Unix systems use LSF to submit compute jobs that require
significant CPU time, memory, or disk space. Several server
processes running on each system co-ordinate to distribute the
load across the cluster. User jobs are submitted using the LSF
queuing software and LSF determines where jobs will be run. LSF
jobs are simply commands. (This is different from DQS where jobs
were required to be Unix shell scripts.)
Use the bsub command to submit jobs to LSF.
plato> bsub -q queue-name command
You may also choose to run your job on a specific machine or machines.
plato> bsub -q queue -m machine-name command
plato> bsub -q queue -m "machine-name1 machine-name2" command
Note that you must bind multiple machine names together using quotes as shown
above.
It's also possible to specify commands as the input to bsub.
Commands are terminated by typing the End-of-File character CTRL-D.
plato> bsub
bsub> command
bsub> [CTRL-D]
The bsub command can be interrupted by typing CTRL-C.
More typically, you'll put the commands in a file (often called a
"script"). You can also specify job options--like queue selection--in
the script file.
plato> bsub < script-file
We have some examples of bsub on-line:
submitting SPSS jobs, compiling and submitting
Fortran programs, and using Job Script files to simplify the command
line.
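A script file of the kind described above might look like the following
sketch. The file name myjob.sh, the job name, and the program ./myprog are
hypothetical. The -J (job name) option is a standard bsub option in the LSF
releases we know of, but check the bsub man page on your system.

```shell
# Write a small job script. Lines starting with "#BSUB" are read by
# bsub as options; the shell that runs the job treats them as comments.
cat > myjob.sh <<'EOF'
#BSUB -q short
#BSUB -J myjob
# Hypothetical program and input file; replace with your own.
./myprog < input.dat
EOF

# Submit it from the shell prompt (not run here):
#   bsub < myjob.sh
```

Keeping the options in the file means the same submission can be repeated
later without retyping the command line.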
By default, output from your job is sent back to you via e-mail.
You can request that the output be redirected into files using
the `-o' (standard output) and `-e' (standard error) options
to bsub. Of course, if your program explicitly
writes to specific files, rather than using the standard
output and error streams, you may find little output in the logfiles or your
mail.
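As a rough sketch of the `-o' and `-e' options, the helper below submits a
command and names the two log files after a label you choose. The function
name submit_logged and the .out/.err naming are our own invention, not part
of LSF.

```shell
# submit_logged QUEUE NAME COMMAND [ARGS...]
# Hypothetical helper: submits COMMAND and asks LSF to write standard
# output to NAME.out and standard error to NAME.err instead of mailing them.
submit_logged() {
    queue=$1
    name=$2
    shift 2
    bsub -q "$queue" -o "$name.out" -e "$name.err" "$@"
}

# Example (not run here):
#   submit_logged short run1 ./myprog
```

The log files are written relative to the directory you submit from, so
submit from a directory where you can find them again.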
If you are having trouble with your LSF jobs, remember to check your e-mail
for error messages.
- Morgan Users
- Error messages are addressed to <user@morgan.ucs.mun.ca>.
- Kronos Users
- Error messages are addressed to <user@kronos.ucs.mun.ca>.
Remember that you may have turned on mail
forwarding.
Kronos users should note that Kronos now uses the forward
command.
- General Users
- If you're not an HPC user or a Tempus user, then your jobs should
be sent to one of fortran, short,
or night. The fortran queue
is used for very short compile and run jobs. It is used primarily
by students in the introductory Fortran course (CS2602); it has
a CPU limit of one minute. The short queue
is more appropriate for jobs that require too much resource to
be run interactively: large SPSS jobs, for example. The
night queue is suitable for jobs that can
be run overnight when the system is less loaded.
- HPC Users
- The queues on C&C's HPC cluster are named for the CPU time limit
given to jobs that run from each queue.
You should determine the maximum time the job
will need and submit to the shortest queue group that offers the time
you need.
The queue configuration remains a work in progress.
Currently there are several queues, but the primary HPC queues
are long and week.
Standard memory and file resources for these queues will be limited.
Extended resource queues have also been created; these queues will
have larger memory quotas: long-big and
week-big.
- Tempus Users
- Users of the AlphaServer 4100 Tempus must submit
to the long-tempus, 3day-tempus, or
week-tempus queues.
Use of the week and
week-tempus queues is discouraged as it hampers us in
scheduling system maintenance.
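The advice to pick the shortest queue that covers your job can be sketched
as a small helper. The thresholds here are ASSUMPTIONS read off the queue
names (24 hours for long, 168 hours for week); the real limits are whatever
bqueues -l reports, and they are stated in normalized CPU minutes rather
than wall-clock hours.

```shell
# pick_hpc_queue HOURS
# Hypothetical helper: prints the shortest HPC queue whose CPU limit
# covers HOURS. The limits below are ASSUMED (24 hours for "long",
# 168 hours for "week"); confirm the real values with "bqueues -l".
pick_hpc_queue() {
    hours=$1
    if [ "$hours" -le 24 ]; then
        echo long
    elif [ "$hours" -le 168 ]; then
        echo week
    else
        echo "no queue offers $hours hours" >&2
        return 1
    fi
}

# Example (not run here):
#   bsub -q "$(pick_hpc_queue 10)" < myjob.sh
```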
Use the command bqueues to list the available queues.
You may not be able to submit to some queues and other queues won't
be useful for your application due to CPU limits (or memory limits).
plato> bqueues
QUEUE_NAME       PRIO  NICE  STATUS       MAX  JL/U  JL/P  NJOBS  PEND  RUN  SUSP
priority-tempus   100     0  Open:Active    3     3     3      0     0    0     0
priority           43    10  Open:Active    -     -     2      0     0    0     0
fortran            35     8  Open:Active    -     -     2      0     0    0     0
morgan             35     8  Open:Active    -     -     2      0     0    0     0
night              35     8  Open:Active    -     -     2      0     0    0     0
wide-tempus        30     0  Open:Active    6     3     3      0     0    0     0
test               20     8  Open:Active    -     -     -      0     0    0     0
long               20     0  Open:Active    6     1     3      3     1    2     0
long-big           20     0  Open:Active    6     1     3      2     1    1     0
long-tempus        20     0  Open:Active    6     1     3      3     2    1     0
3day-tempus        15     0  Open:Active    3     1     3      7     4    3     0
week               10     0  Open:Active    6     1     3      1     0    1     0
week-big           10     0  Open:Active    6     1     3      1     0    1     0
week-tempus        10     0  Open:Active    3     1     3      5     3    2     0
For more detailed information, use the -l option.
This command can produce a lot of output, so you may want to select
specific queues.
There's more information here than most users will ever need to
understand--let alone know :-). There are two classes of information
that are important to all users: process limits and access restrictions.
The primary process limits are CPULIMIT and MEMLIMIT.
The primary access restrictions are USERS and HOSTS.
- CPULIMIT
- The CPULIMIT is shown as a number of minutes relative to the speed
of some system in the cluster: a minute of CPU time on Tempus is
more valuable than a minute on Plato or Alcor because the CPUs
in Tempus are faster than those in Plato and Alcor.
LSF attempts to compensate and give equal value to every job;
faster CPUs result in reduced CPU limits.
- MEMLIMIT
- The MEMLIMIT places a limit on how big a program can get: the
intent is to prevent any given system from running out of
swap space.
- USERS
- The USERS parameter determines who is permitted to submit
jobs to this queue.
- HOSTS
- The HOSTS parameter defines the hosts that serve the queue.
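To make the CPULIMIT scaling concrete, here is a small worked sketch. It
ASSUMES that the limit on the execution host is the queue limit multiplied
by the ratio of the reference host's CPU factor to the execution host's CPU
factor. The factor values used below are hypothetical, not measurements of
our systems.

```shell
# scaled_cpulimit LIMIT_MIN REF_FACTOR EXEC_FACTOR
# Prints the CPU limit, in minutes, that applies on the execution host
# when the queue limit is stated relative to a reference host.
# ASSUMPTION: the limit scales by REF_FACTOR / EXEC_FACTOR.
scaled_cpulimit() {
    awk -v limit="$1" -v ref="$2" -v exec_f="$3" \
        'BEGIN { printf "%.1f\n", limit * ref / exec_f }'
}

# 1440 minutes of a reference host with (hypothetical) factor 1.0,
# run on a host with (hypothetical) factor 2.0:
scaled_cpulimit 1440 1.0 2.0    # prints 720.0
```

In other words, under this assumption a job on a host twice as fast as the
reference host gets half as many CPU minutes, which amounts to the same
quantity of work.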
If your needs exceed the current limitations, please contact C&C.
plato> bqueues -l long-big
QUEUE: long-big
-- No description provided.
PARAMETERS/STATISTICS
PRIO  NICE  STATUS       MAX  JL/U  JL/P  NJOBS  PEND  RUN  SSUSP  USUSP
  20     0  Open:Active    6     1     3      2     1    1      0      0
CPULIMIT                           PROCLIMIT
1440.0 min of alioth.ucs.mun.ca    1
DATALIMIT    MEMLIMIT
262144 K     262144 K
SCHEDULING PARAMETERS
           r15s   r1m  r15m    ut    pg    io    ls    it   tmp   swp   mem
loadSched   8.0   8.0   4.0     -     -     -     -     -     -     -     -
loadStop   16.0  12.0   8.0     -     -     -     -     -     -     -     -
USERS: hpc/
HOSTS: hpc/
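Since only the limits and the access restrictions matter to most users, the
long listing can be trimmed with a filter like the sketch below. The
function name queue_limits is our own, and the filter assumes the layout of
the sample output above.

```shell
# queue_limits QUEUE...
# Hypothetical helper: trims "bqueues -l" output down to the queue name,
# the limit headings with their values, and the USERS and HOSTS lines.
# ASSUMPTION: each heading line containing "LIMIT" is followed directly
# by one line of values, as in the sample output above.
queue_limits() {
    bqueues -l "$@" | awk '
        /^ *QUEUE:/ || /^ *USERS:/ || /^ *HOSTS:/ { print; next }
        /LIMIT/ { print; if ((getline values) > 0) print values }
    '
}

# Example (not run here):
#   queue_limits long long-big
```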
How do I check on my jobs (and others)?
There are two commands that may help you in checking up on your jobs
and the activity of the LSF system as a whole. The bjobs
command will report on current, pending, or recently completed jobs.
The bhosts command will report on the status of
each system in the cluster.
- bjobs
- Run without arguments, bjobs will tell you the current
status of your running and pending jobs.
JOBID  USER    STAT  QUEUE     FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
18617  myname  RUN   priority  plato.ucs.m  tempus.ucs.  myjob1    Nov 19 18:51
18766  myname  PEND  priority  plato.ucs.m  tempus.ucs.  myjob2    Nov 19 21:48
- bjobs -l
- This will produce a long listing. It's most useful when you wish to
determine why a pending job is not running. Reasons will vary: the
systems may be busy, you may already be running your maximum allowable
number of jobs, or a system with a needed resource may be unavailable.
The long listing should tell you why your job is not running.
- bjobs -a
- This will show information on recently terminated jobs as well as
those pending or running.
- bjobs -u all
- This will show you the jobs of all users.
- bhosts
- This will report the availability and the number of active jobs
on each system. If one or more hosts is unavailable or closed,
use bhosts -l to see why.
- bhosts -l [hostname]
- This will give a longer report on the specified system (or
all systems if no hostname is specified). With the `-l'
option, bhosts will show why a host is
closed.
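If you have several jobs waiting, the two bjobs forms above can be combined.
The sketch below picks the pending job IDs out of the short listing and asks
for the long listing of each one. The function names are our own, and the
parsing relies on the default column order shown in the sample output
(JOBID first, STAT third).

```shell
# pending_job_ids
# Prints the job ID of each of your jobs whose STAT column is PEND.
pending_job_ids() {
    bjobs | awk 'NR > 1 && $3 == "PEND" { print $1 }'
}

# explain_pending
# Runs "bjobs -l" on each pending job so the reasons can be read in one go.
explain_pending() {
    for id in $(pending_job_ids); do
        bjobs -l "$id"
    done
}
```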
There are several ways to terminate running jobs. The most common way
is to use bdel. After using bjobs
to check on your jobs (as noted above), you can delete a job from the LSF
queue with the bdel command.
- bdel <jobid>
- Here the <jobid> would be the Job ID
of the job you wish to terminate.
The bdel command will delete pending jobs as well as
running jobs.
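When a batch of submissions turns out to be wrong, you may want to withdraw
every job that has not started yet. The following is a minimal sketch, with
a dry-run switch of our own invention (-n) so you can see what would be
deleted first. It relies on the default bjobs column order (JOBID first,
STAT third).

```shell
# delete_pending [-n]
# Hypothetical helper: runs "bdel" on each of your pending jobs.
# With -n it only prints the commands it would run.
delete_pending() {
    dry_run=${1:-}
    for id in $(bjobs | awk 'NR > 1 && $3 == "PEND" { print $1 }'); do
        if [ "$dry_run" = "-n" ]; then
            echo "bdel $id"
        else
            bdel "$id"
        fi
    done
}

# Example (not run here):
#   delete_pending -n    # preview
#   delete_pending       # really delete
```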
Special Topics for HPC and Tempus Users:
When using the four older systems, keep your programs and data on
the /h1 filesystem; you'll find you own a directory
/h1/username. When using the 4100, Tempus,
keep your programs and data in /h2/username.
If you're an external user accessing our system from outside MUN, then
you should keep data in /k1, the Kronos filesystem.
You should change your current working directory to your /h1
(/h2 or /k1 as appropriate) directory to do all your HPC work.
In particular, your current working directory should be in /h1
(/h2 or /k1) when you submit a job.
plato> cd /h1/username
plato> bsub scriptfile
You may use a subdirectory of your /h1 directory without
problem; the important thing is that your current working directory
should be within the /h1 (/h2 or /k1)
filesystem.
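If you tend to forget the cd, a small wrapper can check the current
directory before submitting. This is only a sketch: the function names are
hypothetical, and the check looks at the path in $PWD, so a directory
reached through a symbolic link may be judged by the link's name rather
than by where it really lives.

```shell
# in_hpc_filesystem DIR
# Succeeds when DIR is /h1, /h2, or /k1, or lies beneath one of them.
in_hpc_filesystem() {
    case "$1" in
        /h1|/h1/*|/h2|/h2/*|/k1|/k1/*) return 0 ;;
        *) return 1 ;;
    esac
}

# submit_from_hpc_dir BSUB-ARGS...
# Hypothetical wrapper: refuses to submit from outside those filesystems.
submit_from_hpc_dir() {
    if in_hpc_filesystem "$PWD"; then
        bsub "$@"
    else
        echo "current directory $PWD is not under /h1, /h2, or /k1" >&2
        return 1
    fi
}
```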
If you require additional space during computation, you can use the
/scratch filesystem local to the execution host.
Using the scratch disks will also help solve some performance
problems (see the notes on NFS below).
See the notes on copying files for how to make use of the
/scratch filesystems. Users of the four older
systems must keep in mind that there are four distinct scratch
disks. This is particularly important if one wants to use the output of
one job as input for a succeeding job. If a job running on Alphaid
puts data in /scratch, it won't be visible to a job that
runs on Alcor. Thus you should copy the necessary output files back
to /h1 after your model completes.
- Tempus Users
- If your jobs run only on Tempus, NFS will not affect them.
Because of the problems associated with NFS--described below--Tempus has
been configured so that it does not import any files from other hosts.
Thus, for your active jobs, all I/O will be local to the system.
NFS stands for Network File System. Mizar, Alphaid, and Alioth use NFS
to access the file systems on Alcor which is the host of the
/h1 filesystem. If your files are stored on Alcor and
your job is executed on Mizar, Mizar must access your files across the
network. This means that any Input/Output (including loading your
program) will be affected by the speed of the network. How much this
affects your job depends on how much I/O you do. If you have a large
program or you read or write a lot of data, then your job may be
affected by network lag. If the size of the compiled executable is larger
than 10 or 15 Megabytes, there is a potential for one system to be
reading and re-reading that file across the network many times (see the
Q&A on swapping). If you read or write more
than 10 Megabytes every 2 to 3 minutes, you may also find that network lag
impacts your jobs.
If you're not using a lot of I/O, you needn't worry about NFS delays.
If you are using I/O intensive processing (e.g. image processing), you
may want to specify that jobs run on the same system that holds your files.
You may instead decide to copy the input files at the beginning of a job
and output files at the end of the job so that both input and output are
on the system that will execute the job (see below).
Read the Q&A on NFS to understand why this
question is important. This is not an issue for most Morgan
users: Morgan user jobs will normally run on Plato, so the
home directory should be local and Plato jobs
won't be affected by NFS; only HPC and Tempus users actually have
access to the /scratch filesystems. If you're
working on Tempus, you won't likely need to use
/scratch, though it does provide additional
temporary disk space.
The systems in C&C's HPC cluster are configured so that /scratch
always references a local file system. Thus data placed on /scratch
by a job running on Alcor will not be visible to a job running on Alphaid.
It's best to copy the necessary input files onto /scratch at the
beginning of your job and copy the necessary output files out of
/scratch at the end of your job. That way, all the essential
files will be kept in your /h1 directory and will be accessible
from all systems.
- Example:
- I have a program called model with four input
files: domain, sample,
snap.in, and param. It will create
two output files run.log and
snap.out.
I want to copy the input files onto the scratch disk, run the
model, and copy the output files back to my current directory.
The following script will do the trick:
#BSUB -q long
dest=/scratch/$USER
# Make sure the scratch directory exists (assumes you may create it).
mkdir -p "$dest"
# Copy input to destination scratch.
cp model param domain sample snap.in "$dest"
# The parentheses on the following line make the effects of
# "cd $dest" temporary. The program "model" is coded to
# create "run.log" and "snap.out". Using "&&" keeps the model
# from running in the wrong directory if the "cd" fails.
(cd "$dest" && ./model)
# Copy the outputs back to the current directory
cp "$dest/run.log" "$dest/snap.out" .
# This would be a good place to remove any unnecessary output files.
# (cd "$dest" && rm file1 file2 ...)
UNIX systems are multi-processing systems: several processes share
the system's CPU and memory. The system is designed so that program
size is not
limited by physical memory (i.e. the memory cards in the system box).
The disk(s) may be used to store data that is not immediately required by
a running process. This memory management allows a computer to run several
large programs even if they cannot all fit in physical memory at
the same time. For example, a system with 128 megabytes
of physical memory could run 3 or 4 processes that each require 64
megabytes of memory; it's also possible to run a single job that requires
more than physical memory.
The disk space allocated for saving the data of running processes is called
swap and the activity of copying data to and from swap space is called
swapping.
This memory management system is called virtual memory.
There are three ways that swapping will affect your processes. First,
the amount of swap space is still limited to the space allocated on the
swap disks. There may not be enough swap space available to service
your process and your program will either fail to start or it may
possibly be killed by the system later, if the system runs out of space.
You can monitor the current swap
space usage with the command swapon -s.
Second, running processes
claim physical memory by pushing other processes' data out of physical
memory and onto the swap disk(s). If several large jobs start pushing
each other out of memory, the system wastes a great deal of time doing disk
copies and the work is slowed down for all processes; in the extreme, this
is called thrashing. Third, swap space shares the disk(s) with the
programs and data space. Any swapping activity can create contention for
disk reads and writes of data or program.
NFS can also be a factor in swapping. When program code is marked for
removal from memory, it isn't copied to the swap disk. Instead, the memory
is simply cleared for use. When the process becomes active again and needs
that program code, the system reads the code from the original file. If
your file is being accessed via NFS, then network lag will be a factor in
how quickly your program starts running again.
LSF provides the ability to perform process checkpointing.
Checkpointing involves periodically creating an image of the running
process that can be used as an initial state for a subsequent job. This
is useful if your job requires more CPU time than the available queues allow,
or it will let you recover from a system crash. To avail of this
facility, you must replace the standard Unix linker ld with the LSF
linker ckpt_ld (or ckpt_ld_f
for Fortran users). In the simplest case, this will require you to use
the "-c" flag to the compiler to create the object
file and issue a ckpt_ld_f command manually to link your
object files. In the worst case, it will mean some minor adjustments to
your Makefile. An example is given in the ckpt_ld
man page and on the examples page.
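The simplest case described above (compile with "-c", then link by hand)
might look like the sketch below. Treat it as an illustration only: the
helper name is hypothetical, the compiler is ASSUMED to be f77, and
ckpt_ld_f is ASSUMED to accept ld-style "-o PROGRAM OBJECT" arguments. The
ckpt_ld man page is the authority on the real syntax.

```shell
# build_checkpointable PROGRAM SOURCE.f
# Hypothetical helper: compiles SOURCE.f to an object file, then links it
# with ckpt_ld_f instead of letting the compiler call the standard linker.
# ASSUMPTIONS: the Fortran compiler is "f77" (override with FC) and
# ckpt_ld_f accepts ld-style "-o PROGRAM OBJECT" arguments.
build_checkpointable() {
    prog=$1
    src=$2
    # "-c" leaves the object file in the current directory.
    obj=$(basename "$src" .f).o
    "${FC:-f77}" -c "$src" && ckpt_ld_f -o "$prog" "$obj"
}

# Example (not run here):
#   build_checkpointable model model.f
```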
Paul Fardy
Modified 2000-08-07.