LSF Distributed Computing Environment

Definitions

This page is a brief introduction to some important LSF concepts and commands. Before we start, here are some definitions of terms that will be used frequently in this document:

Cluster
A collection of host computers configured to run the LSF software. The host computers in the cluster need not be identical; they may come from different manufacturers and run different operating systems.
Hosts
The individual computers within the cluster. Hosts may be single or multiprocessor systems and as such may vary quite considerably in capability and performance.
Queues
Users submit batch jobs to queues, and the LSF software dispatches the queued jobs to host computers for execution. Dispatching is subject to the availability of hosts, the queue configuration and individual job priorities, among other criteria.

Some useful LSF commands

The man pages for each of these commands are available on Plato and Kronos and you are encouraged to read them for further information. This document is primarily intended as a brief synopsis of some of the more useful LSF commands at your disposal.

Most of these commands share a common run-time flag, -l, which gives a long listing containing much more detailed information.

bhist
A database, or batch job history, is maintained of all jobs submitted to the queues with bsub. The bhist command can be used to query this database for a summary of completed jobs, or to get information about running jobs. For example, bhist -a -n 2 will search through the last two history files (choose a larger value than 2 to search further back in time) and give a summary listing of all jobs you have submitted. Another form of this command, bhist -r -u all, gives a summary of all jobs currently running on the cluster. This information may be useful in estimating when jobs will finish or, if you have a waiting job, when a processor will become available. Note that the run time reported here is elapsed time, not CPU time, so a job submitted to a 24-hour queue may show a run time greater than 24 hours.
bhosts
View all host computers available within the cluster. This command is useful to get a quick status of the cluster showing information such as which hosts are available (status) and how many jobs are running on each host (run). See the man page for a complete description of the information listed.
bjobs
Inquire about the status of all jobs that you, or anyone else, have submitted to the cluster. Jobs may be running on a host, or pending. A pending job may be waiting for a host to become available or for another job to finish, or it may be held for one of a number of other reasons. The -l flag will give the exact reason a job is pending.
bqueues
View a list of all queues configured in the cluster. Some of our queues are configured to dispatch to one of several hosts. For example, the "long" queue will dispatch jobs to any host in the MUN Alpha cluster (Alioth, Alphaid, Alcor or Mizar), while other queues, such as long-tempus, will only dispatch to Tempus.

Queues may be configured with certain resource limitations. For example, a queue may be restricted to jobs requiring less than 1 GB of memory, to jobs running on a particular host, or to jobs using less than some amount of CPU time. Here are the queues currently configured on our cluster.

bsub
This is the command you will use to submit your job to a queue. It is recommended that you choose the queue name explicitly when submitting your job to ensure that it runs under the correct queue. Here are some examples using bsub.
lsload
This is a convenient command for checking the load average and memory use of the hosts within the cluster, among other useful statistics. Use lsload to see which hosts are busy and which are idle before submitting your job.
lsrun
This command is used to run a process on a selected host within the cluster. For example, to view a process listing on Tempus, use lsrun -m tempus ps aux, or to execute your makefile and compile your code on Tempus, use lsrun -m tempus make.
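
As a rough illustration of how the commands above fit together, the following shell sketch walks through a typical session: check the load, list the queues, submit to an explicitly named queue, then follow the job. The job script myjob.sh, the output file myjob.out and the lsf helper function are hypothetical names chosen for this example; the long queue and the host tempus are the ones mentioned above. By default the sketch only prints each command. Set RUN=1 to execute them on a cluster host.

```shell
#!/bin/sh
# Sketch of a submit-and-monitor session (not an official site script).
# Hypothetical names: myjob.sh, myjob.out and the lsf() helper itself.

# Print the command by default; execute it only when RUN=1 is set.
lsf() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

lsf lsload                                  # which hosts are busy or idle
lsf bqueues                                 # which queues are configured
lsf lsrun -m tempus make                    # build the code on Tempus
lsf bsub -q long -o myjob.out ./myjob.sh    # submit, naming the queue explicitly
lsf bjobs                                   # status of your own jobs
lsf bjobs -l                                # long listing, with pending reasons
lsf bhist -a -n 2                           # history from the last two history files
lsf bhist -r -u all                         # all jobs currently running on the cluster
```

Naming the queue with -q follows the recommendation under bsub; the -o option, which sends the job's output to a file, is an addition made for this example.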

Checkpointing

LSF provides the ability to perform process checkpointing. Checkpointing involves periodically creating an image of the running process that can be used as the initial state for a subsequent job. This is useful if your job requires more CPU time than the available queues allow, and it also lets you recover from a system crash. To make use of this facility, you must replace the standard Unix linker ld with the LSF linker ckpt_ld (or ckpt_ld_f for Fortran users). In the simplest case, this requires you to pass the -c flag to the compiler to create the object files and then issue a ckpt_ld (or ckpt_ld_f) command manually to link them. In the worst case, it will mean some minor adjustments to your makefile. An example is given in the ckpt_ld man page and here.
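
The two-step build described above might look like the following minimal sketch for a Fortran program. It rests on several assumptions that are not confirmed by this page: the compiler is taken to be f77, the source file myprog.f is hypothetical, and ckpt_ld_f is assumed to accept the same -o and object-file arguments as ld. Check the ckpt_ld man page for the exact usage. The step helper is also hypothetical; it prints each command by default and runs it only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: compile to an object file, then link with the LSF
# checkpoint linker in place of ld.
# Assumptions: f77 compiler, file name myprog.f, ld-style arguments
# for ckpt_ld_f. Verify against the ckpt_ld man page.

# Print the command by default; execute it only when RUN=1 is set.
step() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

step f77 -c myprog.f                  # -c: compile only, producing myprog.o
step ckpt_ld_f -o myprog myprog.o     # link with ckpt_ld_f (ckpt_ld for non-Fortran code)
```

In a makefile, the equivalent change would be to point the link rule at ckpt_ld or ckpt_ld_f instead of letting the compiler invoke ld itself.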


Last modified April 04, 1997 by Allan Goulding