LSF Distributed Computing Environment
Definitions
This page is a brief introduction to some important LSF concepts and
commands. Before we start, here are some definitions of terms that will
be used frequently in this document:
- Cluster
- A collection of host computers configured to run the LSF software.
The host computers in the cluster need not be identical; they may be from
different manufacturers and run different operating systems.
- Hosts
- The individual computers within the cluster. Hosts may be single or
multiprocessor systems and as such may vary quite considerably in capability
and performance.
- Queues
- Users submit batch jobs to queues while the LSF software dispatches
the queued jobs to host computers for execution. The dispatching is done
subject to the availability of hosts, the queue configuration and individual
job priorities among other criteria.
Some useful LSF commands
The man pages for each of these commands are available on Plato and
Kronos, and you are encouraged to read them for further information. This
document is primarily intended as a brief synopsis of some of the more
useful LSF commands at your disposal.
Most of these commands share a common run time flag, -l, that will give
a long listing containing much more detailed information.
- bhist
- A database, or batch job history, is maintained of all jobs submitted
to the queues with bsub. The bhist command can be used to query this
database for a summary of jobs completed, or to get information about
running jobs. For example, bhist -a -n 2 will search
through the last two history files (choose a larger value than 2 for a
longer search back in time) and give a summary listing of all jobs you've
submitted. bhist -r -u all is another form of this command
that will give you a summary of all jobs currently running on the cluster.
This information may be useful in estimating when jobs will finish, or if
you have a waiting job, in estimating when a processor will become
available. Note that the run time reported here is elapsed time, not CPU
time, so a job submitted to a 24-hour queue may show a run time greater
than 24 hours.
- bhosts
- View all host computers available within the cluster. This command
is useful to get a quick status of the cluster showing information such
as which hosts are available (status) and how many jobs are running on
each host (run). See the man page for a complete description of the information
listed.
- bjobs
- Inquire about the status of all jobs you, or anyone else, have submitted
to the cluster. Jobs may be running on a host, or pending. Pending jobs
may be waiting for a host to become available, for another job to finish, or
for any of a number of other reasons. The -l flag will give the exact reason
a job is pending.
- bqueues
- View a list of all queues configured in the cluster. Some of our queues
are configured to dispatch to any one of several hosts; for example, the
"long" queue will dispatch jobs to any host in the MUN Alpha cluster
(Alioth, Alphaid, Alcor or Mizar), while other queues, such as long-tempus,
will only dispatch to Tempus.
Queues may be configured with certain resource limitations. For example,
a queue may be restricted to jobs requiring less than 1 GB of memory or less
than some amount of CPU time, or restricted to a particular host. Here are the
queues currently configured on our cluster.
- bsub
- This is the command you will use to submit your job to a queue. It
is recommended that you choose the queue name explicitly when submitting
your job to ensure that it runs under the correct queue. Here are some
examples using bsub.
- lsload
- This is a convenient command to inquire about the load average and
memory use of the hosts within the cluster, among other useful statistics.
Use lsload to see which hosts are busy and which are idle before submitting
your job.
- lsrun
- This command is used to run a process on a selected host within the
cluster. For example, lsrun -m tempus ps aux shows a process listing on
Tempus, and lsrun -m tempus make executes your makefile to compile your
code on Tempus.
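The commands above might be combined into a session along the following
lines. This is only a sketch under some assumptions: the job script name
myjob.sh is a placeholder, -q is the bsub option for naming the queue, and
the run helper and DRY_RUN variable are our own additions (not part of LSF)
so that each command is printed rather than executed unless you ask otherwise.

```shell
#!/bin/sh
# Sketch of a typical LSF session. QUEUE comes from the queue list above;
# JOB is a hypothetical job script used only for illustration.
QUEUE=long
JOB=./myjob.sh

# Helper (our own, not an LSF command): print the command by default,
# or really execute it when DRY_RUN=0 is set in the environment.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

run bqueues                    # list the queues configured in the cluster
run lsload                     # see which hosts are busy and which are idle
run bsub -q "$QUEUE" "$JOB"    # submit the job, naming the queue explicitly
run bjobs -l                   # long listing, including why a job is pending
run bhist -a -n 2              # summary from the last two history files
run bhist -r -u all            # all jobs currently running on the cluster
run lsrun -m tempus ps aux     # process listing on Tempus
```

Run it as is to see the commands it would issue, or with DRY_RUN=0 on a
host where LSF is installed to execute them.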
Checkpointing
LSF provides the ability to perform process checkpointing. Checkpointing
involves periodically creating an image of the running process that can
be used as an initial state for a subsequent job. This is useful if your
job requires more CPU time than the available queues allow, and it also lets
you recover from a system crash. To avail of this facility, you must replace
the standard Unix linker ld with the LSF linker ckpt_ld (or ckpt_ld_f for
Fortran users). In the simplest case, this will require you to use the
-c flag to the compiler to create the object files and then issue a ckpt_ld
(or ckpt_ld_f) command manually to link them. In the worst case, it will
mean some minor adjustments to your makefile. An example is given in the
ckpt_ld man page and here.
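The two-step build described above might look roughly like this. It is a
sketch only: the source file names and the cc and f77 compiler names are
placeholders, and we assume that ckpt_ld and ckpt_ld_f accept the same -o
option and object-file arguments as ld, so check the ckpt_ld man page before
relying on it. The run helper and DRY_RUN variable are our own additions
that print each command instead of executing it by default.

```shell
#!/bin/sh
# Sketch: compile to object files with -c, then link with the LSF
# checkpoint linker instead of ld. All file names are hypothetical.

# Helper (our own, not an LSF command): print the command by default,
# or really execute it when DRY_RUN=0 is set in the environment.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# C program: create the object file, then link it with ckpt_ld.
# ASSUMPTION: ckpt_ld takes ld-style arguments (-o output, object files).
run cc -c myprog.c
run ckpt_ld -o myprog myprog.o

# Fortran program: same two steps, linking with ckpt_ld_f.
run f77 -c mysim.f
run ckpt_ld_f -o mysim mysim.o
```

In a makefile, the corresponding adjustment would be to have the link rule
call ckpt_ld or ckpt_ld_f where it previously linked through the compiler
or ld.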
Last modified April 04, 1997 by
Allan Goulding