
Running jobs

This section describes details of how to use the resource manager / queue system Slurm. We recommend reading Overview first and then Introduction before delving into the details.

  • Official documentation for Slurm - AI Cloud is currently using version 21.08.8-2.
  • Interactive tutorial for Slurm - this is a tutorial made by DeiC to help introduce HPC users to Slurm. Please note that this tutorial environment is not identical to our AI Cloud, but it enables you to familiarise yourself with Slurm in a safe environment where you cannot accidentally break anything for anyone else.

Bits and pieces in the queue system

The Slurm queue system is built around some concepts which are important to know in order to understand the system and how to use it:

Node
We often refer to these as compute nodes in this documentation, because this is where the computational jobs take place.
The nodes are the individual servers in the platform; see Overview for illustration and the table in Introduction for details.
Partition
You can think of partitions as different queues for the compute nodes. There are several partitions in the AI Cloud and the same nodes can be present in more than one partition.
For any one node, the different partitions it is present in can, for example, give access to the node under different conditions. The AI Cloud currently has the partitions batch and prioritized. A few additional partitions (create, aicentre1, aicentre2) are only visible to the specific users who have access to them.
See more details about partitions in a later section.
Resources
Slurm manages access to resources in the nodes. The important resources to know about in AI Cloud are CPUs, memory, and GPUs. Each job you submit will require certain resources. These are either implied by default values or explicitly requested by you when submitting a job.
A job can only start when Slurm can find enough of the required resources available on one or more nodes. If resources are not currently available, your jobs wait in queue until other jobs have completed and relinquished the resources.
Time limit
Partitions may impose time limits. These limits define the longest time your job can run in the specific partition and can be viewed with the sinfo command. If your job has not ended by this time limit, it will be automatically cancelled.
In the days leading up to a service window, you will also be met by a time limit which prevents you from launching jobs with end dates that surpass the date of the service window. If the time parameter is not set, Slurm assumes you are asking for the default maximum time for the partition. You will thus have to calculate how much time you have before the service window and then submit your job with this parameter added. To submit a job that runs for 1 day and 8 hours, you can simply add --time=1-8:00:00 to your Slurm command (see the example below). Additionally, you can read about our recommendations for using checkpointing to work with time limits.
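
For instance (using a placeholder job script name and illustrative resource amounts), a job requesting 4 CPUs, 32 GB of memory, one GPU, and a time limit of 1 day and 8 hours could be submitted like this:

Example
sbatch --time=1-8:00:00 --cpus-per-task=4 --mem=32G --gres=gpu:1 job_script.sh

Requesting resources and a time limit explicitly when submitting a job.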

Checking the current state of the queue system

It is often desirable to be able to see what is going on in the queue system, for example to get an idea of whether there are many other jobs in the queue when you wish to run a job.

Checking the overall state of the platform

Use the command sinfo to see the state of the nodes in the AI Cloud. Run sinfo --help or man sinfo in AI Cloud for detailed documentation of the command.

Example
sinfo

PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*         up   12:00:00      1    mix nv-ai-04
batch*         up   12:00:00      8   idle a256-t4-[01-02],i256-a10-06,i256-a40-[01-02]...
prioritized    up 6-00:00:00      8   idle a256-t4-[01-02],i256-a10-06,i256-a40-[01-02]...

The sinfo command shows basic information about partitions in the queue system and what the states of nodes in these partitions are.

PARTITION shows which partition a line in the table relates to. Multiple lines can show the same partition, because different nodes within a partition may be in different states.

AVAIL shows the availability of the partition, where "up" is the normal, working state in which you can submit jobs to it.

TIMELIMIT shows the time limit imposed by each partition.

NODES shows how many nodes are in the shown state in the specific partition.

STATE shows which state the listed nodes are in: "mix" means that the nodes are partially full - some jobs are running on them and they still have available resources; "idle" means that they are completely vacant and have all resources available; "allocated" means that they are completely occupied. Many other states are possible, most of which mean that something is wrong.
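
If you are only interested in one particular partition, you can restrict the sinfo output with the "-p"/"--partition" option:

Example
sinfo -p batch

Restricting the sinfo output to the batch partition.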

Checking details of the nodes

Use the command scontrol show node or scontrol show node [node name] to show details about all nodes or a specific node, respectively. Run scontrol --help or man scontrol in AI Cloud for detailed documentation of the command.

Example

scontrol show node a256-t4-01

NodeName=a256-t4-01 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=12 CPUTot=64 CPULoad=0.50
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:t4:6
   NodeAddr=172.21.212.130 NodeHostName=a256-t4-01.srv.aau.dk Version=21.08.8-2
   OS=Linux 5.4.0-170-generic #188-Ubuntu SMP Wed Jan 10 09:51:01 UTC 2024
   RealMemory=244584 AllocMem=101440 FreeMem=242152 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=batch,prioritized
   BootTime=2024-01-27T15:28:12 SlurmdStartTime=2024-01-27T15:28:36
   LastBusyTime=2024-01-29T08:40:54
   CfgTRES=cpu=64,mem=244584M,billing=64,gres/gpu=6
   AllocTRES=cpu=12,mem=101440M,gres/gpu=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Nodesummary

The two commands sinfo and scontrol show node provide either too little or far too much detail in most situations. As an alternative, we provide the tool nodesummary, which shows a hopefully more intuitive overview of the used/available resources in AI Cloud.
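
Assuming the tool is invoked simply by its name without any arguments, it can be run like this:

Example
nodesummary

Running the nodesummary tool to get an overview of used and available resources.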

Screenshot of nodesummary in use.

Selecting a partition

The prioritized partition

The default partition in AI Cloud is prioritized. If you submit a job without specifying a partition, e.g. sbatch --gres=gpu:1 job_script.sh, your job automatically gets run in the prioritized partition. All users have access to the prioritized partition. As shown in the sinfo example above, this partition has a 6-day time limit and other jobs cannot cancel jobs in this partition.
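
Submitting to the default partition thus requires no partition option at all:

Example
sbatch --gres=gpu:1 job_script.sh

Submitting a job to the default (prioritized) partition without specifying a partition.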

The batch partition

As shown in the sinfo example above, the batch partition has a time limit of 12 hours, and furthermore, jobs in it can get cancelled (pre-empted) by jobs running in other partitions. For regular users, the batch partition is the only way to get access to the special compute nodes mentioned in Introduction - Overview which belong to particular research groups. Except for those compute nodes, the batch partition is not very interesting to use due to the pre-emption feature.

In order to use the batch partition, you must specify it for your jobs with the "--partition" or "-p" option:

Example
sbatch -p batch --gres=gpu:1 job_script.sh

Using the "-p" option to specify a partition for a batch job.

A more advanced way to work with the batch partition is to enable requeueing of your jobs. That way, your jobs can automatically continue running at a later point if they happen to get pre-empted by higher-priority jobs. See running longer jobs for more details about this principle.
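
For example (using a placeholder job script name as in the other examples), combining the partition and requeue options could look like this:

Example
sbatch -p batch --requeue --gres=gpu:1 job_script.sh

Combining the "-p" and "--requeue" options so a pre-empted job in the batch partition can be requeued.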

Special partitions

If you belong to one of the research groups with your own server in AI Cloud, you have been informed personally how to get first-priority access to it.

Currently, these servers are associated with the partitions: create, aicentre1, and aicentre2. By submitting your jobs to your group's partition, you can run jobs on the server, even if it requires cancelling jobs of users in the batch partition to provide you the requested resources. For example, users from VAP lab at CREATE can use their server nv-ai-04 by submitting jobs to the create partition:

Example
sbatch -p create --gres=gpu:1 job_script.sh

Using the "-p" option to access a special partition. Only designated users of these partitions can access them.

The special partitions have no time limit.

What is in the queue?

When using the cluster, you typically wish to see an overview of what is currently in the queue, for example to see how many jobs might be waiting ahead of you or to get an overview of your own jobs.

The command squeue can be used to get a general overview:

Example
squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
31623     batch     DRSC xxxxxxxx  R    6:45:14      1 i256-a10-10
31693     batch singular yyyyyyyy  R      24:20      1 i256-a40-01
31694     batch singular yyyyyyyy  R      24:20      1 i256-a40-01
31695     batch singular yyyyyyyy  R      24:20      1 i256-a40-01
31696     batch singular yyyyyyyy  R      24:20      1 i256-a40-01
31502 prioritiz runQHGK. zzzzzzzz PD       0:00      1 (Dependency)
31504 prioritiz runQHGK. zzzzzzzz PD       0:00      1 (Dependency)

The column JOBID shows the ID number of each job in the queue.

PARTITION shows which partition each job is running in.

NAME is the name of the job, which can be specified by the user creating it.

USER is the username of the user who created the job.

ST is the current state of each job; for example "R" means a job is running and "PD" means pending. There are other states as well - see man squeue for more details (under "JOB STATE CODES").

TIME shows how long each job has been running.

NODES shows how many nodes are involved in each job allocation.

Finally, NODELIST shows which node(s) each job is running on, or alternatively, why it is not running yet.

Showing your own jobs only:

Example
squeue --me

squeue can show many other details about jobs as well. Run man squeue to see detailed documentation on how to do this.
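
For example, if you would like an estimate of when your own pending jobs are expected to start, the "--start" option can show this:

Example
squeue --me --start

Showing the expected start times of your own pending jobs.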

Using a specific type of GPU

In some cases your work requires a specific type of GPU. It could be, for example, that you need at least 20 GB of GPU RAM available; in that case a T4 GPU does not meet the requirement. It could also be that you know that an A10 GPU would be sufficient for your job, so there is no need to allocate an A40 GPU to it.

In cases like these, you can specify a particular type of GPU to allocate to your job. This is done by adding a GPU type label to the "--gres" option of the sbatch or srun commands.

Example
sbatch --gres=gpu:a10:1 my_job_script.sh

The --gres option lets you specify the type of GPU(s) to allocate to your job. In this example, the specification "gpu:a10:1" means that you are asking for 1 A10 GPU.

If you merely use "--gres=gpu:1" you will be allocated an arbitrary available GPU in the cluster.

Please see the overview table in Introduction for the types of GPU that you can specify with this option. For an NVIDIA T4 GPU, the corresponding label for the "--gres" option is "t4", i.e. the name of the GPU type in lower-case letters.
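
For instance, requesting one NVIDIA T4 GPU looks like this:

Example
sbatch --gres=gpu:t4:1 my_job_script.sh

Requesting one T4 GPU specifically with the "--gres" option.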

Running longer jobs

In some cases, you need to run jobs that take longer than the 6 days which is the maximum run-time of jobs in the prioritized partition. The way to do this is to configure your jobs to be re-queued if they run out of time. There are two necessary ingredients to making this work:

  1. Instruct Slurm that your job should be requeued if it gets stopped.
  2. Program/configure your job workload to use checkpointing of working data so that the work can continue from the latest checkpoint when it gets the opportunity to start again.

Instruct Slurm to requeue your job

Note that this only makes sense if you have programmed or configured your workload to use checkpointing so that it is able to continue from where it last stopped. If this is not the case, your job would merely start over from the beginning when requeued and you could end up with a job that keeps starting over forever but never really finishes.

To instruct Slurm that your job can be requeued if stopped (due to, for example, time-out or pre-emption as mentioned above for the batch partition), add the parameter --requeue to the sbatch command when submitting your job:

Example
sbatch --requeue --gres=gpu:t4:1 job_script.sh

Using the "--requeue" option to instruct Slurm that your job can be requeued

We advise that you request a specific type of GPU (for example T4 above) or a specific node when working with requeueable jobs, since we cannot guarantee what would happen if your job initially started running with one type of GPU and then subsequently attempted to continue from a checkpoint with a different type of GPU.
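
For example, to keep a requeueable job on one particular node, the "--requeue" option can be combined with the "--nodelist"/"-w" option (the node name below is just an example taken from the sinfo output earlier):

Example
sbatch --requeue --nodelist=a256-t4-01 --gres=gpu:t4:1 job_script.sh

Pinning a requeueable job to a specific node with the "--nodelist" option.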

Use checkpointing

Checkpointing means that you configure or program your workload to store its working data so far to a temporary location on disk at regular intervals. When the workload starts running, it should first check if it has an already stored checkpoint on disk, and continue from there if it finds one.

This way, if your job suddenly gets stopped, you can start it again and it automatically continues running from its latest saved checkpoint.

How to implement checkpointing depends on the details of how your workload has been programmed. If you have programmed your workload from scratch yourself, the general recipe is to add the following functionality to your program (see the sketch after this list):

  1. Look for an existing checkpoint file.
  2. If the file exists, load it and continue work from there.
  3. If not, start the work from scratch.
  4. While working, save the necessary internal data and output data so far to a checkpoint file at regular time intervals.
  5. When the program completes without errors, save the final output data the way you would normally save your output data and delete the checkpoint file.
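
A minimal sketch of this recipe at the job script level could look as follows. Note that the training script train.py, its --resume option, and the checkpoint file name are purely hypothetical placeholders; in practice, the same logic can also live inside your program itself.

Example
#!/bin/bash
#SBATCH --gres=gpu:t4:1
#SBATCH --requeue

# Hypothetical checkpoint file written by the workload at regular intervals.
CHECKPOINT=checkpoint.pt

if [ -f "$CHECKPOINT" ]; then
    # A checkpoint exists: continue the work from where it last stopped.
    python train.py --resume "$CHECKPOINT"
else
    # No checkpoint yet: start the work from scratch.
    python train.py
fi

Sketch of a requeueable job script that continues from a checkpoint if one exists.

Saving checkpoints at regular intervals and deleting the checkpoint file after a successful run (steps 4 and 5 above) must be handled inside the program itself.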

For example, if your program stores a checkpoint every 15 minutes, you would only risk losing up to 15 minutes of work if the job gets stopped. All of the prior work is stored in your most recent checkpoint which your workload can automatically load and continue from.

Some popular libraries often used in AI Cloud have built-in features you can use for checkpointing:

  • TensorFlow provides a guide here on how to use checkpointing in TensorFlow model training.
    Similarly, the Keras interface has a checkpointing mechanism that can be used for the same purpose.
  • PyTorch provides a guide here on how to use checkpointing in PyTorch model training.

You are welcome to suggest additions to this list if you know useful checkpointing mechanisms for other software that can be used on AI Cloud.