Introduction
This section describes how to get started with the AI Cloud at AAU.
There are many users on this system, so please consult the section on Fair usage and follow the guidelines. We also have a community site at AAU Yammer where users can share experiences and administrators announce workshops, changes, and service work on the system. If you have support questions, please contact us at support@its.aau.dk.
For new users, we recommend reading the front page and the Overview section, plus the section on Fair usage.
Overview
The AI Cloud consists of the following compute nodes:
Name | Nodes in total | GPUs per node | CPU cores per node | CPU HW threads | RAM per node | RAM per GPU | Disk | NVLINK / NVSWITCH | Primary usage |
---|---|---|---|---|---|---|---|---|---|
a256-t4-[01-03] | 3 | 6 (NVIDIA T4) | 32 (AMD EPYC) | 64 | 256 GB | 16 GB | None locally | No | Interactive / smaller single-GPU jobs |
i256-a10-[06-10] | 5 | 4 (NVIDIA A10) | 32 (Intel Xeon) | 64 | 256 GB | 24 GB | None locally | No | Interactive / medium single-GPU jobs |
a256-a40-[04-07] | 4 | 3 (NVIDIA A40) | 32 (AMD EPYC) | 32 | 256 GB | 48 GB | None locally | No | Large single-GPU jobs |
i256-a40-[01-02] | 2 | 4 (NVIDIA A40) | 24 (Intel Xeon) | 24 | 256 GB | 48 GB | 6.4 TB /raid | Yes (2×2) | Large single-/multi-GPU jobs |
nv-ai-[01-03] | 3 | 16 (NVIDIA V100) | 48 (Intel Xeon) | 96 | 1470 GB | 32 GB | 30 TB /raid | Yes | Large / batch / multi-GPU jobs |
nv-ai-04 | 1 | 8 (NVIDIA A100) | 128 (AMD EPYC) | 256 | 980 GB | 40 GB | 14 TB /raid | Yes | Large / batch / multi-GPU jobs |
a768-l40-[01-02] | 2 | 8 (NVIDIA L40) | | | | 48 GB | | Yes | Large / batch / multi-GPU jobs |
a512-mi100-01 | 1 | 8 (AMD MI100) | | | | 32 GB | | Yes (Infinity Fabric link) | Large / batch / multi-GPU jobs |
Important
The compute nodes nv-ai-04, i256-a40-01, and i256-a40-02 are owned
by specific research groups or centers which have first-priority
access to them. Other users can only access them on a limited basis,
where your jobs may be cancelled by higher-priority jobs. Users
outside the prioritised group can only use these nodes via the "batch"
partition (use the option --partition=batch for your jobs). See
Slurm jobs.
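For instance, a user outside the prioritised groups could submit a job to the batch partition like this (a minimal sketch; the srun command, the -G option, and nvidia-smi are introduced in the sections below):
Example
srun --partition=batch -G 1 nvidia-smi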
Getting started
An alternative introduction in workshop format is also available, but it only applies to the AI Cloud pilot platform.
Logging in
You can access the platform using SSH. How to use SSH depends on which operating system you use on your local computer:
- Linux
- If you use a modern Linux distribution such as Ubuntu on your computer, the necessary tools to connect to AI Cloud are usually already installed by default.
- OS X (Apple)
- OS X has SSH built into the command line terminal. This means you
can invoke SSH commands as shown in the following examples directly
from your command line.
OS X does, however, not have a built-in X server so it is necessary to install additional software if you wish to be able to show the graphical user interface (GUI) of applications (using X forwarding).
Installing XQuartz should enable OS X to use X forwarding.
- Windows (Microsoft)
- Newer versions of Windows have SSH built into the command line
terminal. This means you can invoke SSH commands as shown in the
following examples directly from your command line.
Windows does, however, not have a built-in X server either so it is necessary to install additional software if you wish to be able to show the graphical user interface (GUI) of applications (using X forwarding).
If you wish to use a convenient way to connect to the AI Cloud from Windows with the ability to display the GUI of applications you run in AI Cloud, we recommend that you install and use MobaXterm on your local computer.
The AI Cloud is only directly accessible when you are on the AAU network (including VPN). You can connect to the AI Cloud front-end node by running the following command on the command line of your local computer:
Example
ssh -l <aau email> ai-fe02.srv.aau.dk
Replace <aau email>
with your AAU email address, e.g.
ssh -l tari@its.aau.dk ai-fe02.srv.aau.dk
If you wish to access while not being connected to the AAU network, you have two options: Use VPN or use AAU's SSH gateway.
Info
If you are often outside AAU, you can use the SSH gateway by default
through your personal SSH configuration (in Linux/OS X this is often
located in: $HOME/.ssh/config
).
Host ai-fe02.srv.aau.dk
User <aau email>
ProxyJump %r@sshgw.aau.dk
Add the above configuration to your personal ssh config file (often
located in: $HOME/.ssh/config
on Linux or OS X systems). Now you
can easily connect to the platform regardless of network using the
commands from preceding examples.
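If you only connect from outside AAU occasionally, you can instead specify the gateway as a jump host directly on the command line, assuming your local OpenSSH version supports the -J option:
Example
ssh -J <aau email>@sshgw.aau.dk -l <aau email> ai-fe02.srv.aau.dk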
Transferring files
You can transfer files to/from AI Cloud using the command line utility
scp
from your local computer (Linux and OS X). To AI Cloud:
Example
scp some-file <aau email>@ai-fe02.srv.aau.dk:~
where '~' means your user directory on AI Cloud. You can append directories below that to your destination:
Example
scp some-file <aau email>@ai-fe02.srv.aau.dk:~/some-dir/some-sub-dir/
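To transfer a whole directory, you can add scp's -r (recursive) option; here some-dir is a placeholder for a directory on your local computer:
Example
scp -r some-dir <aau email>@ai-fe02.srv.aau.dk:~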
You can also copy in the opposite direction, e.g. from the AI Cloud to your local computer with:
Example
scp <aau email>@ai-fe02.srv.aau.dk:~/some-folder/some-subfolder/some-file .
where '.' means the current directory you are located in on your local computer.
In general, file transfer tools that can use SSH as the protocol should work. Common choices are FileZilla or the Windows application WinSCP.
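As a sketch, rsync (if installed on your local computer) can also transfer over SSH and only copies files that have changed, which is convenient for large directories:
Example
rsync -avP some-dir <aau email>@ai-fe02.srv.aau.dk:~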
Running jobs
As mentioned on the Overview page, two important building blocks of the AI Cloud are the queue system Slurm and the container technology Singularity. In order to run a computation, analysis, simulation, training of a machine learning algorithm etc., it is necessary to first obtain a container to run your application in (in most cases) and to queue it as a job in the queue system (mandatory). The queue system is the only way to get access to running applications on the compute nodes.
Warning
The front-end nodes of the AI Cloud pilot platform and the AI Cloud are not meant for running computationally intensive applications. If you attempt to do so anyway, this risks rendering the entire AI Cloud inaccessible to you and all other users on it, because you exhaust the memory and/or CPU capacity of the front-end node. This is not considered acceptable use of the platform and is not allowed.
Slurm jobs
The first important building block to be aware of is the queue system Slurm.
Applications can only be run on the compute nodes through Slurm. You quite literally only have access to the compute nodes when you have a job allocated on them.
The simplest way to run a job via Slurm is to use the command srun
.
Example
srun hostname
This runs the command hostname as a job in the queue system. When run like this with no further parameters specified, Slurm will run the command on the first compute node available. On the AI Cloud pilot platform, this will either be nv-ai-01.srv.aau.dk or nv-ai-03.srv.aau.dk. On the AI Cloud, it will be a256-t4-01.srv.aau.dk, a256-t4-02.srv.aau.dk, or a256-t4-03.srv.aau.dk.
The command will return one of these host names. If the command displays "srun: job XXXXXX queued and waiting for resources", this means that all compute nodes are fully occupied (by other users' jobs) and your job is waiting in the queue to be executed when resources become available.
This was your first Slurm job. You will need this (srun
) and other
Slurm commands for most of your work in AI Cloud. You will see more
examples in combination with Singularity in the next section. Further
details about Slurm can be found on the Running jobs
page. The Danish e-Infrastructure Cooperation (DeiC) also hosts a great
e-learning course that might be helpful.
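Two further Slurm commands you will use frequently are squeue, which lists jobs in the queue, and scancel, which cancels a job by its job ID (the ID below is a placeholder):
Example
squeue --me
scancel 123456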
Singularity containers
Only a small set of typical Linux command line tools are installed on the compute nodes and can be run without a container. For all other applications, you must first obtain a container to run your applications in. The container provides an encapsulated software environment with your application(s) installed inside.
Obtaining containers
The recommended way to obtain containers is to first visit NVIDIA GPU Cloud (NGC) and check whether NVIDIA already provides a container with the application you need.
As an example, this could be TensorFlow. You can search on NGC and find TensorFlow. Here you can choose the desired version from the "Copy image path" dropdown menu:
This copies a link to the container which we will use shortly to download it.
We need to use Singularity to download the image, and in order to run Singularity, we must run it through the Slurm queue system. This results in the following command:
Example
srun --mem 32G singularity pull docker://nvcr.io/nvidia/tensorflow:22.07-tf2-py3
The above example consists of four parts:
- srun: the Slurm command which gets the following command executed on a compute node.
- --mem: a Slurm option that allows you to allocate memory to your job. A higher amount of memory than the default is needed specifically for this TensorFlow container. Please see the Containers page for a better way to avoid excessive memory requirements.
- singularity pull: the Singularity command which downloads a specified container.
- docker://nvcr.io/nvidia/tensorflow:22.07-tf2-py3: this part of the command itself consists of two parts. docker:// tells Singularity that we are downloading a Docker container, which Singularity automatically converts to a Singularity container upon download. nvcr.io/nvidia/tensorflow:22.07-tf2-py3 is the container label copied from the NGC webpage which identifies the particular container and version that we want. This part can be pasted into the command line by pressing <CTRL>+<SHIFT>+V in the AI Cloud command line.
Once the singularity pull
command has completed, you should have a
file called "tensorflow_22.07-tf2-py3.sif" in your user directory (use
the command ls
to see the files in your current directory).
Running applications in containers
Now that you have a container image in your user directory (the file "tensorflow_22.07-tf2-py3.sif"), we can run the container. This can be done in several ways:
- Shell
- You can open a shell in the container. This basically gives you a
command line in the runtime environment inside the container where
you can work interactively, i.e. type commands in the command line
to run scripts and open applications.
This is good for experimenting with how you wish to run your computations, analyses etc. in the container.
Example
srun --pty singularity shell tensorflow_22.07-tf2-py3.sif
The --pty
parameter is necessary in order to enable typing commands
into the command line in the job. After opening the shell in the
container, your command line terminal should display:
Singularity>
This means that it is ready for you to type in commands. Type exit
and hit ENTER to exit the container and stop the running job.
- Exec
- Execute a specified command (such as running a script) in a
container.
This is useful if you know exactly which command you wish to run in your container (a more realistic example is sketched after this list).
Example
srun singularity exec tensorflow_22.07-tf2-py3.sif hostname
Notice here that the --pty
option is not necessary if the executed
command does not need keyboard input while running. Here we use
hostname
as a toy example of a command that prints out a simple
piece of information and then exits.
- Run
- Run the default action configured in the container.
The container determines what the default action is. This is useful if you have obtained a container constructed to carry out a specific task.
Example
srun --pty singularity run tensorflow_22.07-tf2-py3.sif
In some cases, the default action of a container is to open the shell.
This is why we use the --pty
option here.
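As a more realistic use of exec than the hostname toy example, you could run a Python script inside the container. This is only a sketch; my_script.py is a placeholder for a script in your user directory:
Example
srun singularity exec tensorflow_22.07-tf2-py3.sif python3 my_script.py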
Allocating a GPU to your job
The primary role of the AI Cloud is to run software that utilises one or more GPUs for computations.
The final step we need here in order to run applications with a GPU is to actually allocate a GPU to a job using Slurm. The examples up to now have all run jobs without a GPU. It is necessary to explicitly ask Slurm for a GPU in order to be able to use one.
Example
You can allocate a GPU to a job by using the -G or --gres=gpu option for Slurm:
srun -G 1 nvidia-smi
This example allocates 1 GPU to a job running the command
nvidia-smi
. This command prints information about the allocated
GPU and then exits.
The following commands achieve the same:
srun --gres=gpu nvidia-smi
srun --gres=gpu:1 nvidia-smi
Note that the above examples all allocate 1 GPU to the job. It is
possible to allocate more, for example -G 2
for two GPUs.
Software for computing on GPU is not necessarily able to utilise more than one GPU at a time. It is your responsibility to ensure that the software you run can indeed utilise as many GPUs as you allocate. It is not allowed to allocate more GPUs than your job can utilise.
Note that you can ask for a specific type of GPU if you need to, please see: Using a specific type of GPU.
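As a sketch, a specific GPU type is typically requested with the gres syntax gpu:<type>:<count>; the exact type names depend on the cluster configuration, so please check the page linked above. Assuming a type named t4, such a request could look like:
Example
srun --gres=gpu:t4:1 nvidia-smi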
In most cases you probably want to use a GPU in combination with a Singularity container. In this case, we also need to remember to enable support for NVIDIA GPUs in Singularity:
Example
srun --gres=gpu:1 singularity exec --nv tensorflow_22.07-tf2-py3.sif nvidia-smi
The --nv
option enables NVIDIA GPUs in the container and must always
be used when running jobs that utilise GPU(s). Otherwise, the GPU(s)
will not be available inside the container.
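To check that a framework inside the container can actually see the allocated GPU, you can combine the pieces above. A sketch using the TensorFlow container pulled earlier:
Example
srun --gres=gpu:1 singularity exec --nv tensorflow_22.07-tf2-py3.sif python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"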
These were a few of the most basic details to get started using the AI Cloud. Once you have familiarised yourself a bit with the AI Cloud, we suggest you take a closer look at the Slurm and Singularity pages here for more details and features. The Additional examples page contains more detailed examples of concrete use cases for the AI Cloud.
Fair usage
The following guidelines are put in place to ensure fair usage of the system for all users. This text might be updated from time to time in order to provide the best possible service for as many users as possible.
ITS/CLAAUDIA work from the following principles for fair usage:
- Good research is the success criterion and ITS/CLAAUDIA should lower the barrier for allowing this.
- Researchers should enter on a level playing field.
- ITS has an administrative and technical role and should in general not determine which research should have higher priority. Students are vetted with a recommendation from a supervisor or staff member backing that their use is for research purposes.
- Aim at the most open and unrestricted access model.
Based on these principles we kindly ask that all users consider the following guidelines:
- Please be mindful of your allocations and refrain from allocating many resources without knowing/testing/verifying that you indeed can utilise all of the allocated resources.
- Please be mindful and de-allocate the resources when you are no longer using them. This enables other users to run their jobs.
If in doubt, you can run:
squeue --me
A few key points to remember:
- Please refrain from allocating jobs that sit idle in order to
"reserve" resources for you. For example
srun --pty bash -l
opens an interactive (command line) job that keeps running without doing anything unless you actively type in commands.
This is unacceptable practice and we will cancel such inactive jobs if we encounter them.
- There are typically more resources available in the evenings/nights
and on weekends. If possible, start your job as a batch script
(sbatch) and let it queue and rest while the computer does the work.
Maybe even better, if the job is not urgent, queue the job to run
late in the afternoon, or use the -b or --begin option with your
batch script, e.g. add the line
#SBATCH --begin=18:00:00
to your script (a minimal batch script is sketched below).
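As an illustration, a minimal batch script combining the options from this guide could look like the sketch below; the job name, resource amounts, and my_script.py are placeholders you should adapt. Save it to a file, e.g. job.sh, and submit it with sbatch job.sh:
Example
#!/usr/bin/env bash
#SBATCH --job-name=my-job    # placeholder job name
#SBATCH --gres=gpu:1         # allocate one GPU
#SBATCH --mem=32G            # memory for the job
#SBATCH --begin=18:00:00     # start no earlier than 18:00

# run a placeholder script inside the TensorFlow container pulled earlier
singularity exec --nv tensorflow_22.07-tf2-py3.sif python3 my_script.py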
ITS/CLAAUDIA will keep analysing and observing the usage of the system to make the best use of the available resources based on the above principles and guidelines. If ITS/CLAAUDIA is in doubt, we will contact users and ask if the resource allocations are in line with the above principles and guidelines. We will be more active in periods of high utilisation.