Note: The CSCS Knowledge Base has been migrated to docs.cscs.ch. This page is out of date; its content now lives at docs.cscs.ch/guides/mlp_tutorials/, and all new documentation will be written on the new site.
This tutorial will guide you through the steps required to set up a PyTorch container and run ML inference with it. This means that we load an existing machine learning model, prompt it with some custom data, and run the model to see what output it generates for our data.
To complete the tutorial, we get a PyTorch container from Nvidia, customize it to suit our needs, and tell the Container Engine how to run it. Finally, we set up and run a Python script that runs the machine learning model and generates some output.
The model we will be running is Google's Gemma-7B, an LLM similar in style to the popular ChatGPT, which can generate text responses to text prompts that we feed into it.
Before we can pull the container image, we need to store our NGC credentials at `~/.config/enroot/.credentials` for enroot. Enroot will be responsible for fetching the container image from NGC behind the scenes. The credentials file should look like this:

```
machine nvcr.io login $oauthtoken password <api-token>
```
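If you prefer to create this file from the shell, here is a minimal sketch; the `chmod` is our suggestion to keep the token private, not something enroot requires, and `<api-token>` stays a placeholder for your own token:

```bash
# Create the enroot credentials file for NGC.
# The quoted 'EOF' keeps $oauthtoken literal, as NGC expects.
mkdir -p ~/.config/enroot
cat > ~/.config/enroot/.credentials <<'EOF'
machine nvcr.io login $oauthtoken password <api-token>
EOF
chmod 600 ~/.config/enroot/.credentials  # restrict access to the stored token
```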
Make sure to replace `<api-token>` with your actual token. Then move to your scratch directory and create a folder for the container build:

```console
[cluster][user@cluster-ln001 ~]$ cd $SCRATCH
[cluster][user@cluster-ln001 user]$ mkdir pytorch-24.01-py3-venv && cd pytorch-24.01-py3-venv
```
Use your favorite text editor to create a file called `Dockerfile` here. The Dockerfile should look like this:

```dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
The first line specifies that we are working on top of an existing container; in this case we start `FROM` an NGC PyTorch container. Next, we set an `ENV`ironment variable that lets us run `apt-get` non-interactively in the container. Finally, we `RUN` the package installer `apt-get` to install support for Python virtual environments. This will let us install Python packages later on without having to rebuild the container again and again. The extra commands on that line clean up the apt caches to keep the image small; if you want to understand exactly what is happening, take a look at the Docker documentation.
Now that we've set up the Dockerfile, we can go ahead and pass it to Podman to build a container. Podman is a tool that enables us to fetch, manipulate, and interact with containers on the cluster. To use Podman, we first need to configure some storage locations for it. This step is straightforward: create the file `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf` if `XDG_CONFIG_HOME` is set) with the following contents:
```toml
[storage]
  driver = "overlay"
  runroot = "/dev/shm/$USER/runroot"
  graphroot = "/dev/shm/$USER/root"

[storage.options.overlay]
  mount_program = "/usr/bin/fuse-overlayfs-1.13"
```
To build a container with Podman, we need to request a shell on a compute node from Slurm, pass the Dockerfile to Podman, and finally import the freshly built container using enroot. Slurm is a workload manager that distributes workloads on the cluster; through Slurm, many people can use the supercomputer at the same time without interfering with one another:
```console
[cluster][user@cluster-ln001 pytorch-24.01-py3-venv]$ srun --pty bash
[cluster][user@nid001234 pytorch-24.01-py3-venv]$ podman build -t pytorch:24.01-py3-venv .
# ... lots of output here ...
[cluster][user@nid001234 pytorch-24.01-py3-venv]$ enroot import -x mount -o pytorch-24.01-py3-venv.sqsh podman://pytorch:24.01-py3-venv
# ... more output here ...
```
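If you want to double-check that the build succeeded before importing it, you can list the images Podman knows about. Do this inside the same Slurm allocation: with the configuration above, Podman's storage lives in `/dev/shm` on the compute node, so it does not survive beyond the job. This check is our suggestion, not part of the original steps:

```bash
# List locally built images; pytorch:24.01-py3-venv should appear here.
podman images
```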
At this point, you can exit the Slurm allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:

```console
[cluster][user@cluster-ln001 pytorch-24.01-py3-venv]$ ls
Dockerfile  pytorch-24.01-py3-venv.sqsh
```
This squashfile is essentially a compressed container image, which can be run directly by the Container Engine. We will use our freshly built container `pytorch-24.01-py3-venv.sqsh` in the following steps to run a PyTorch script that loads the Google Gemma-7B model and performs some inference with it.
Next, use your favorite text editor to create an environment definition file (EDF) at `~/.edf/gemma-pytorch.toml` for the Container Engine. The EDF should look like this:

```toml
image = "/capstor/scratch/cscs/<user>/pytorch-24.01-py3-venv/pytorch-24.01-py3-venv.sqsh"
mounts = ["/capstor", "/users"]
writable = true

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda12"

[env]
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
NCCL_DEBUG = "INFO"
```
Make sure to replace `<user>` with your actual CSCS username. If you've decided to build the container somewhere else, make sure to supply the correct path to the `image` variable.

The `image` variable defines which container we want to load. This could either be a container from an online docker repository, like `nvcr.io/nvidia/pytorch:24.01-py3`, or, as in our case, a local squashfile which we built ourselves. The `mounts` variable defines which directories we want to mount where in our container. In general, it's a good idea to use the scratch directory to store outputs from any scientific software; we won't generate a lot of output in this tutorial, but it's good practice to stick to this anyway. An optional `workdir` variable tells the Container Engine where to start working; if we request a shell, this is where we will find ourselves dropped initially after starting the container. Here we set the working directory on the command line instead.

We can now use srun's `--environment` option to specify that we want to launch a shell inside the container described by our gemma-pytorch EDF:

```console
[cluster][user@cluster-ln001 ~]$ cd $SCRATCH && mkdir -p gemma-inference && cd gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
```
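Once you have a shell inside the container, a quick sanity check (our suggestion, not part of the original workflow) confirms that PyTorch can see the node's GPUs:

```bash
# Inside the container: expect "True" and a nonzero device count on a GH200 node.
python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```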
PyTorch is already set up in the container for us. We can verify this by asking pip for a list of installed packages:

```console
user@nid001234:/capstor/scratch/cscs/user/gemma-inference$ python -m pip list | grep torch
pytorch-quantization          2.1.2
torch                         2.2.0a0+81ea7a4
torch-tensorrt                2.2.0a0
torchdata                     0.7.0a0
torchtext                     0.17.0a0
torchvision                   0.17.0a0
```
However, we will need to install a few more Python packages to make it easier to do inference with Gemma-7B. We create a virtual environment using Python's venv module. The `--system-site-packages` option ensures that we install packages in addition to the existing ones and don't accidentally install a new version of PyTorch over the one that has been put in place by Nvidia. Next, we activate the environment and use pip to install the two packages we need, `accelerate` and `transformers`:
```console
user@nid001234:gemma-inference$ python -m venv --system-site-packages ./gemma-venv
user@nid001234:gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/capstor/scratch/cscs/user/gemma-inference$ python -m pip install accelerate==0.30.1 transformers==4.38.1
# ... pip output ...
```
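As an optional check (our addition, not part of the original steps), you can confirm that the two packages are importable at the pinned versions:

```bash
# Run inside the activated gemma-venv; prints the installed versions.
python -c 'import accelerate, transformers; print(accelerate.__version__, transformers.__version__)'
```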
Before we move on to running the Gemma-7B model, we additionally need to create an account at HuggingFace, get an API token, and accept the license agreement for the Gemma-7B model. You can save the token to $SCRATCH using the huggingface-cli:
```console
user@nid001234:gemma-inference$ pip install -U "huggingface_hub[cli]"
user@nid001234:gemma-inference$ HF_HOME=$SCRATCH/huggingface huggingface-cli login
```
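To verify that the login worked, you can ask the CLI who you are; `whoami` is a standard huggingface-cli subcommand, though this check is our addition to the tutorial:

```bash
# Point HF_HOME at the same scratch location used for the login above.
HF_HOME=$SCRATCH/huggingface huggingface-cli whoami
```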
At this point, you can exit the Slurm allocation again by typing `exit`. If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the Slurm job. Keep in mind that this virtual environment won't actually work unless you're running something from inside the PyTorch container, because it ultimately relies on the resources packaged inside the container, as the sketch below demonstrates.
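For example, a one-liner that activates the environment inside the container via srun (a minimal sketch, assuming you run it from `$SCRATCH/gemma-inference`):

```bash
# Launch the container, activate the venv inside it, and print the transformers version.
srun --environment=gemma-pytorch --container-workdir=$PWD bash -c \
    'source ./gemma-venv/bin/activate && python -c "import transformers; print(transformers.__version__)"'
```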
Next, create a new file `$SCRATCH/gemma-inference/gemma-inference.py` that loads the model and prompts it with some custom text. The Python script should look like this:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")

input_text = "Write me a poem about the Swiss Alps."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=1024)

print(tokenizer.decode(outputs[0]))
```
Feel free to change the `input_text` variable to whatever prompt you like.

Finally, create a Slurm batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory. The Slurm batch file should look like this:

```bash
#!/bin/bash
#SBATCH --job-name=gemma-inference
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=288
#SBATCH --environment=gemma-pytorch
#SBATCH --account=<project>

export HF_HOME=$SCRATCH/huggingface
export TRANSFORMERS_VERBOSITY=info

cd $SCRATCH/gemma-inference/
source ./gemma-venv/bin/activate

set -x
python ./gemma-inference.py
```
Make sure to replace `<project>` with your project account. The script sets some environment variables, `cd`s to our working directory, `source`s our virtual environment, and finally runs our inference script. As an alternative to the `#SBATCH --environment=gemma-pytorch` option, you can also run the code in the above script wrapped into an `srun -ul --environment=gemma-pytorch bash -c "..."` statement; the tutorial on nanotron, for example, uses this pattern in `run_tiny_llama.sh`. Submit the job from the login node:

```console
[cluster][user@cluster-ln001 ~]$ sbatch ./gemma-inference.sbatch
```
While the job runs, you can follow its output live with `tail -f slurm-<job-id>.out`. Otherwise, you're free to do whatever you like: you can close the terminal, keep working, or just wait for the Slurm job to finish. You can always check on the state of your job by logging back into the cluster and running `squeue -l --me`. Once your job finishes, you will find a file in the same directory you ran it from, named something like `slurm-<job-id>.out`, containing the output generated by your Slurm job. For this tutorial, you should see something like the following:

```console
[cluster][user@cluster-ln001 gemma-inference]$ cat ./slurm-543210.out
/capstor/scratch/cscs/user/gemma-inference/gemma-venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Gemma's activation function should be approximate GeLU and not exact GeLU. Changing the activation function to `gelu_pytorch_tanh`. If you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.13it/s]
/capstor/scratch/cscs/user/gemma-inference/gemma-venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
<bos>Write me a poem about the Swiss Alps.

In the heart of Switzerland, where towering peaks touch sky,
Lies a playground of beauty, beneath the watchful eye.
The Swiss Alps, a majestic force,
A symphony of granite, snow, and force.

Snow-laden peaks pierce the heavens above,
Their glaciers whisper secrets of ancient love.
Emerald valleys bloom with flowers,
A tapestry of colors, a breathtaking sight.

Hiking trails wind through meadows and woods,
Where waterfalls cascade, a silent song unfolds.
The crystal clear lakes reflect the sky above,
A mirror of dreams, a place of peace and love.

The Swiss Alps, a treasure to behold,
A land of wonder, a story untold.
From towering peaks to shimmering shores,
They inspire awe, forevermore.<eos>
```
Congrats! You've run Google Gemma-7B inference on four GH200 chips simultaneously. Move on to the next tutorial or try the challenge.
If you would like to watch the GPUs working while the model runs, you can use `nvitop`, a tool that shows a concise real-time summary of GPU activity (if it isn't already available in the container, you can install it into your virtual environment with pip). Then, run Gemma and launch nvitop at the same time:

```console
(gemma-venv) user@nid001234:/capstor/scratch/cscs/user/gemma-inference$ python ./gemma-inference.py > ./gemma-output.log 2>&1 & nvitop
```
Note that we redirect the output of the Python command with `> ./gemma-output.log 2>&1` to hide it from the terminal. Note also the use of the single ampersand `&`, which backgrounds the first command and runs `nvitop` on top. Once the run is done, press `q` to quit nvitop, and you will find the output of your Python script in `./gemma-output.log`.
As a final challenge, you can copy the squashfile you built out of your `$SCRATCH` and share it with colleagues.