In this tutorial, we will build a container image to run nanotron training jobs. We will train a 109M-parameter model on ~100M wikitext tokens as a proof of concept.
- Prerequisites.
This tutorial assumes you are able to access Todi via SSH. It is also recommended to follow the previously linked tutorial first, as this one builds on it.
- Set up podman.
Edit your `$HOME/.config/containers/storage.conf` according to the following minimal template:

```toml
[storage]
  driver = "overlay"
  runroot = "/dev/shm/$USER/runroot"
  graphroot = "/dev/shm/$USER/root"

[storage.options.overlay]
  mount_program = "/usr/bin/fuse-overlayfs-1.13"
```
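To check that podman actually picks up this storage configuration, you can ask it to print the active graph driver and graph root. This is an optional sanity check, not part of the original steps; run it on the node where you will build the image:

```bash
# Should report the "overlay" driver and a graph root under /dev/shm/$USER.
podman info --format '{{.Store.GraphDriverName}} {{.Store.GraphRoot}}'
```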
- Modify the NGC container.
See the previous tutorial for context. For simplicity, we already provide a working nanotron image at `/capstor/scratch/cscs/ahernnde/container-image/nanotron/nanotron-v1.0.sqsh`. If you want to build the container image from scratch, follow the instructions below; otherwise, jump straight to the next step.
Here, we assume we are already on a compute node (run `srun --pty bash` to get an interactive session). In this case, we will create the Dockerfile at `$SCRATCH/container-image/nanotron/Dockerfile`.
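The Dockerfile extends the NGC container from the previous tutorial with nanotron's Python dependencies. As a rough sketch only — the base image tag and the package list below are assumptions, adjust them to your needs:

```bash
mkdir -p $SCRATCH/container-image/nanotron
cd $SCRATCH/container-image/nanotron

# Hypothetical Dockerfile: NGC PyTorch base image plus common nanotron dependencies.
cat > Dockerfile <<'EOF'
FROM nvcr.io/nvidia/pytorch:24.04-py3

RUN pip install --no-cache-dir datasets transformers wandb dacite pyyaml
EOF
```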
Then build and import the container:

```bash
cd $SCRATCH/container-image/nanotron
podman build -t nanotron:v1.0 .
enroot import -x mount -o nanotron-v1.0.sqsh podman://nanotron:v1.0
```
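If the import succeeded, the squashfs image should now sit next to the Dockerfile (a quick check, not part of the original steps):

```bash
ls -lh $SCRATCH/container-image/nanotron/nanotron-v1.0.sqsh
```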
Now exit the interactive session by running `exit`.
- Set up an EDF.
See the previous tutorial for context. In this case, the EDF will be at `$HOME/.edf/nanotron.toml` and will have the following contents:

```toml
image = "/capstor/scratch/cscs/ahernnde/container-image/nanotron/nanotron-v1.0.sqsh"
mounts = [
  "/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>",
  "/users/<username>:/users/<username>",
]
workdir = "/users/<username>/"

[annotations.com.hooks]
aws_ofi_nccl.enabled = "true"
aws_ofi_nccl.variant = "cuda12"
```

Note that if you built your own container image, you will need to modify the image path.
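To verify that the EDF resolves and the container starts, you can run a one-off command through it. This is an optional check; the `--environment` flag is the same one the batch script uses later:

```bash
# Launch a single task inside the nanotron environment and confirm that
# PyTorch is importable and the GPUs are visible from within the container.
srun --environment=nanotron python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```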
- Preparing a training job.
Now let's download nanotron. On the login node, run:

```bash
git clone https://github.com/huggingface/nanotron.git
cd nanotron
```
And with your favorite text editor, create the following nanotron configuration file in `$HOME/nanotron/examples/config_tiny_llama_wikitext.yaml`:

This configuration file will train, as a proof of concept, a GPT-2-like, 109M-parameter Llama model on approximately 100M tokens of wikitext with the parallelism settings `tp=4, dp=2, pp=1`. Since each task gets four GPUs, the 4 × 2 × 1 = 8 ranks require two nodes. The training job takes approximately 10 minutes to run.
Now, create a batch file at `$HOME/nanotron/run_tiny_llama.sh` with the following contents:

```bash
#!/bin/bash
#SBATCH --job-name=nanotron      # create a short name for your job
#SBATCH --nodes=2                # total number of nodes
#SBATCH --ntasks-per-node=1      # total number of tasks per node
#SBATCH --gpus-per-task=4
#SBATCH --time=1:00:00
#SBATCH --output=logs/%x_%j.log  # control where the stdout will be
#SBATCH --error=logs/%x_%j.err   # control where the error messages will be

# Initialization.
set -x
cat $0
export MASTER_PORT=25678
export MASTER_ADDR=$(hostname)
export HF_HOME=$SCRATCH/huggingface_home
export CUDA_DEVICE_MAX_CONNECTIONS=1  # required by nanotron
# export either WANDB_API_KEY=<api key> or WANDB_MODE=offline

# Run main script.
srun -ul --environment=nanotron bash -c "
  # Change cwd and run the main training script.
  cd nanotron/
  pip install -e .  # Only required the first time.

  TORCHRUN_ARGS=\"
    --node-rank=\${SLURM_PROCID} \
    --master-addr=\${MASTER_ADDR} \
    --master-port=\${MASTER_PORT} \
    --nnodes=\${SLURM_NNODES} \
    --nproc-per-node=\${SLURM_GPUS_PER_TASK} \
  \"

  torchrun \${TORCHRUN_ARGS} run_train.py --config-file examples/config_tiny_llama_wikitext.yaml
"
```

A few comments:
- If you have a wandb API key and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Otherwise, set `WANDB_MODE=offline` instead.
- Note that we are setting `HF_HOME` to a directory in scratch. This is done to place the downloaded dataset in scratch instead of in your home directory.
- Launch a training job with the new image.
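One practical note before submitting: the `#SBATCH --output` and `--error` paths point into a `logs/` directory, and Slurm will not create that directory for you. Make sure it exists first (this step is implied rather than spelled out above):

```bash
cd $HOME/nanotron
mkdir -p logs   # Slurm writes %x_%j.log / %x_%j.err here.
```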
Run:

```bash
sbatch run_tiny_llama.sh
```
You can check whether your job has been submitted successfully by running `squeue --me` and looking for your username. Once the run starts, a new file will appear under `logs/`. You can inspect the status of your run with:

```bash
tail -f logs/<logfile>
```
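If you need to stop the run (for example, to fix the configuration and resubmit), standard Slurm usage applies; this is an extra tip rather than part of the tutorial itself:

```bash
# <jobid> is the number shown by squeue --me.
scancel <jobid>
```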
At the end of the run, the model checkpoints will be saved in `checkpoints/`.