GPU-Accelerated, Deterministic ML Dev Environments with Docker and CUDA

Development Environments Should Be Deterministic to Each Commit


For the past two months, I’ve been building a desktop AI video editor. Some of its inference runs locally, powered by an NVIDIA GPU.

Despite it being a desktop machine learning application, I’ve done nearly all the development so far in a Docker container.

In the past, I learned that machine learning dependencies can be brittle, especially when working with academic papers and their associated code.

Each experimental open source machine learning project seems to have its own, conflicting requirements to run.

As soon as you get a single project working, something inevitably ends up breaking somewhere else.

What if instead of this mess we had dedicated, isolated environments for each ML project, including all of its dependencies? What if the development environments were deterministic for each specific commit? You could check out and compare your latest model against last month’s in a few minutes.

With Docker and some light tooling, it’s possible to have a better experience.

Creating a GPU Accelerated Development Container

NVIDIA Accelerated Containers

NVIDIA releases and maintains a set of containers to improve the experience of developing with GPUs and deploying to production. They offer prebuilt images with libraries like TensorFlow and PyTorch, often with builds very close to the bleeding edge.

Using these containers on your machine requires a few extra steps beyond the normal Docker flow, because Docker containers don’t normally have access to GPUs. To bridge that gap, NVIDIA offers an extension called the NVIDIA Container Toolkit, which lets containers access the GPU.

Installation is easy enough, as long as you have a supported, recent version of Ubuntu, the official Docker tools installed, and use the provided package repository.

The commands to install Docker and NVIDIA’s Container Toolkit look like this:

$ curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker

Finally, you can verify access to the GPU by pulling a CUDA container and running nvidia-smi inside it:

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

If you’ve set things up properly, you should get a text output that shows your GPU device.

Working with GPU Accelerated Docker Images

But how do you add the project specific requirements, beyond CUDA?

For this, I like to iterate through step by step. First, I’ll pick the base image I need from NVIDIA’s NGC catalog. In this case, I’ll use PyTorch:

FROM nvcr.io/nvidia/pytorch:23.07-py3
ENV NVIDIA_DRIVER_CAPABILITIES=all

With this, I can then do a docker build . -t localdev in the directory with this file named Dockerfile.

This builds a Docker image that I can then iterate on as I build out my environment.

Once the image is built, I can then use a small bash script to run it and mount my development directory:

xhost +local:docker
docker run -it --gpus 'all,"capabilities=compute,graphics,display,utility,video"' --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=:1 -e SAM_MODEL_PATH=sam_vit_h_4b8939.pth -v .:/app localdev bash

There are a few unique(ish) flags in our Docker call that you may not have seen before.

For one, I’m exposing the X Window desktop to my application with the -v /tmp/.X11-unix stuff, along with the DISPLAY=:1. Additionally, my application expects an environment variable to set where my machine learning model (Facebook’s Segment Anything) resides.

Besides this, we’ve got the --gpus 'all,"capabilities=compute,graphics,display,utility,video"' flag. This sets up access to hardware acceleration for graphics, compute, display, video, and more. I need this for my desktop video editor, as it uses CUDA, hardware accelerated display, and hardware video decoders.

We also set --ulimit memlock=-1, which removes the limit on how much memory the container can lock (pin) in RAM, and --ulimit stack=67108864, which raises the stack size limit to 64 MiB. The --ipc=host flag shares the host’s IPC namespace, including shared memory, with the container, which PyTorch’s multi-process data loaders rely on.

With these flags, we’re giving Docker access to the physical hardware capabilities and operating system resources that efficient machine learning workloads need.

If you don’t need or want access to run desktop applications, you can just call the following from the command line:

docker run -it --gpus 'all,"capabilities=compute,graphics,display,utility,video"' --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v .:/app CONTAINER-IMAGE bash 

This will grant access to the GPU’s capabilities, raise limits for execution, and live mount your project directory into the container.
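Since the docker run invocation above is long and easy to mistype, one option is to assemble it programmatically. Here's a minimal Python sketch of that idea. It is not my actual run script: the function name build_run_command and its parameters are hypothetical, and it only builds the argument list from the flags already shown above.

```python
import shlex

GPU_CAPABILITIES = "compute,graphics,display,utility,video"


def build_run_command(image, project_dir=".", desktop=False, display=":1"):
    """Assemble the `docker run` argument list used in this post.

    `image`, `project_dir`, `desktop`, and `display` are illustrative
    parameters, not part of any Docker API.
    """
    cmd = [
        "docker", "run", "-it",
        "--gpus", f'all,"capabilities={GPU_CAPABILITIES}"',
        "--ipc=host",
        "--ulimit", "memlock=-1",
        "--ulimit", f"stack={64 * 1024 * 1024}",  # 64 MiB = 67108864 bytes
    ]
    if desktop:
        # Expose the host's X11 socket and display to the container.
        cmd += ["-v", "/tmp/.X11-unix:/tmp/.X11-unix", "-e", f"DISPLAY={display}"]
    cmd += ["-v", f"{project_dir}:/app", image, "bash"]
    return cmd


# Print a copy-pasteable shell command for the headless case.
print(shlex.join(build_run_command("localdev")))
```

To actually start the container, the list can be handed to subprocess.run(build_run_command("localdev")). Keeping the flags in one place means the desktop and headless variants can't drift apart.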

Adding Library Dependencies to GPU Accelerated Docker Images

Once we’ve got a shell container, the next task is to begin adding our dependencies.

For most machine learning repositories, a README.md will contain instructions to get the project set up.

I like to start with the operating system level dependencies, and then add the Python dependencies afterwards. (Some Python dependencies won’t install properly until after operating system dependencies are installed.)

Installing operating system level dependencies is easy enough in Ubuntu, usually with a single line in a Dockerfile with the following pattern:

RUN apt-get update && apt-get install -y build-essential yasm cmake libtool libc6 libc6-dev unzip wget libnuma1 libnuma-dev

Chaining apt-get update and apt-get install -y with && keeps both commands in the same layer, so every time the line changes and gets re-run, the package index is refreshed before anything is installed. It’s important to keep this prefix on the same line: if apt-get update sat in its own RUN line, Docker could reuse a cached layer with a stale package index, and the install step could fail or pull outdated packages.

As my development environment grows, I tend to create new lines with my added dependencies, as Docker will re-use the cached layers from already existing, unchanged lines. This means I don’t have to re-download my dependencies over and over again during development loops.
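To see why appending new lines keeps rebuilds fast, here's a toy Python model of layer caching. It is a simplified sketch and not Docker's actual implementation: it assumes each layer's cache key depends only on the instruction text and on the parent layer's key, and the function names are made up for illustration.

```python
import hashlib


def layer_keys(instructions):
    """Return one cache key per instruction, chained to its parent's key."""
    keys, parent = [], ""
    for line in instructions:
        parent = hashlib.sha256((parent + "\n" + line).encode()).hexdigest()
        keys.append(parent)
    return keys


def layers_to_rebuild(old, new):
    """Return the instructions in `new` that miss the cache left by `old`."""
    cached = set(layer_keys(old))
    return [line for line, key in zip(new, layer_keys(new)) if key not in cached]


base = [
    "FROM nvcr.io/nvidia/pytorch:23.07-py3",
    "RUN apt-get update && apt-get install -y build-essential cmake",
]
# Option 1: append a new RUN line for the new dependency.
appended = base + ["RUN apt-get update && apt-get install -y libnuma-dev"]
# Option 2: edit the existing RUN line instead.
edited = [base[0], base[1] + " libnuma-dev"]

print(layers_to_rebuild(base, appended))  # only the appended line misses the cache
print(layers_to_rebuild(base, edited))    # the edited line misses the cache
```

In this model, editing an early line changes its key and therefore the key of every line after it, which is why a change near the top of a Dockerfile forces everything below it to rebuild.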

Once the operating system level dependencies are covered, it’s time to add the Python specific requirements.

Here’s where things can get a bit tricky. By default, your Python library installs may not actually be deterministic. This is especially true if one of your requirements is a git repository URL: if the repository updates, rebuilding your Docker container at a later date will not give you the same result.

For this reason, it makes sense to build, tag, and push the development container to a remote registry at commit time, where you can access the same result later.
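One way to spot the risky entries is to scan the requirements file for anything that isn't pinned to an exact version. Here's a minimal sketch in Python. The function name and the sample entries are illustrative assumptions (the git URL is made up), and the check is deliberately crude: anything that isn't a plain name==version line gets flagged.

```python
import re

# Matches an exact pin such as "Pillow==10.0.0".
EXACT_PIN = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[^=\s]+$")
# A comment starts at a "#" that begins the line or follows whitespace.
COMMENT = re.compile(r"(^|\s+)#.*$")


def unpinned_requirements(text):
    """Return requirement lines that are not pinned with `==`."""
    loose = []
    for raw in text.splitlines():
        line = COMMENT.sub("", raw).strip()
        if line and not EXACT_PIN.match(line):
            loose.append(line)
    return loose


sample = """\
torch>=1.9.0
open_clip_torch==2.20.0
Pillow==10.0.0
protobuf<4
git+https://example.com/some/repo.git  # hypothetical URL
"""
print(unpinned_requirements(sample))
```

Running pip freeze inside a working container is one way to capture exact versions for the flagged entries, and a git URL can be made repeatable by appending @ followed by a specific commit hash.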

Building a GPU Accelerated OpenCLIP Docker Container

Rather than speak in hypotheticals about building a container, let’s try to build a container image for the OpenCLIP repository.

Looking at the repository itself, it seems there are no listed dependencies, only a suggestion to do a pip install open_clip_torch.

Looking at the example application, however, it imports torch and PIL (Pillow). Opening up the repository’s requirements.txt, we can see Pillow is not listed as a requirement, so we’ll need to add it in order to get the example application running. Additionally, it seems the repository was written for PyTorch >= 1.9.0. We can try using the latest PyTorch image on the NGC list, and then work backwards if that doesn’t work.

We can then hop over to NVIDIA’s PyTorch release notes, and try to find the matching container tags for a 1.x version of PyTorch.

Looking at the release notes, we see that container version 23.02 has the latest version of the PyTorch 1.x series. So we can write our FROM instruction, followed by the installation of our remaining dependencies. The full Dockerfile looks like this:

FROM nvcr.io/nvidia/pytorch:23.02-py3
RUN python -m pip install --upgrade pip
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
RUN mkdir /app
WORKDIR /app

Here we pull the PyTorch image, upgrade pip, and then copy and install the requirements listed in our requirements.txt.

To create the requirements.txt file, we can copy paste the requirements.txt of OpenCLIP, and add in Pillow and OpenCLIP itself. The final version of our requirements.txt looks like this:

torch>=1.9.0
torchvision
regex
ftfy
tqdm
huggingface_hub
open_clip_torch==2.20.0
Pillow==10.0.0
sentencepiece
protobuf<4
timm

Next, we need to verify that our application actually works, by building and then running our container. Let’s do that now:

$ docker build . -t clip-container
$ docker run -it --gpus 'all,"capabilities=compute,graphics,display,utility,video"' --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v .:/app clip-container bash

Once we’re in the container, we can verify access to the GPU by running nvidia-smi. When this runs, you should see the current status of your NVIDIA GPUs. If you don’t, check to make sure you granted Docker permissions properly.
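If you'd rather check this from a script than by eye, here's a small Python sketch that shells out to nvidia-smi and parses its CSV output. The function names are hypothetical, and the GPU name in the usage note below is only an illustrative sample, not output from my machine.

```python
import shutil
import subprocess

# Ask nvidia-smi for one "name, total memory" CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]


def parse_gpu_lines(output):
    """Turn `name, memory` CSV lines into a list of (name, memory) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append((name, memory))
    return gpus


def visible_gpus():
    """Return the GPUs nvidia-smi reports, or an empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    return parse_gpu_lines(result.stdout) if result.returncode == 0 else []
```

For a line such as "NVIDIA GeForce RTX 3090, 24576 MiB", parse_gpu_lines returns a single (name, memory) pair. An empty list from visible_gpus() inside the container suggests the --gpus flag or the Container Toolkit setup needs another look. Inside the PyTorch image, torch.cuda.is_available() is another quick check.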

Effective ML Development Feedback Loops in a GPU Accelerated Container

With our docker run command, we’re mounting our project directory into the container, so filesystem changes stay in sync between the container and the local operating system. With this, we can keep editing on the host and use things like live-reloading, increasing the speed of our feedback loops.

Again, here I like to start with the minimum viable application. In the case of OpenCLIP, that’s probably the example application. So let’s add that in:

import torch
from PIL import Image
import open_clip

# Load the pretrained model, its preprocessing transform, and the matching tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Prepare a batch of one image and three candidate captions.
image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # roughly [[1., 0., 0.]]

Looks like this program expects an image named CLIP.png, which resides in a subdirectory of the repository. I copied it over to my local directory and saved the program as example.py. Now I can run the container, and then the application in the container:

$ docker run -it --gpus 'all,"capabilities=compute,graphics,display,utility,video"' --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v .:/app clip-container bash
$ python3 example.py
Downloading (…)ip_pytorch_model.bin: 100%|██████████████████████████████████████████████████████| 605M/605M [00:09<00:00, 64.5MB/s]
Label probs: tensor([[9.9950e-01, 4.1207e-04, 8.5317e-05]])
$

It works! But as you may have noticed, a fresh container re-downloads the model every time we run the application, which is a waste of time and bandwidth. So we should fix that too.

Adding another volume for our host computer’s ~/.cache directory should fix it. Let’s add that directory as a volume to our Docker command:

$ docker run -it --gpus 'all,"capabilities=compute,graphics,display,utility,video"' --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v .:/app -v ~/.cache:/root/.cache clip-container bash

With this, I can run the application container, and each time it has the model available to run, as long as it exists in the host machine’s ~/.cache directory.
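To confirm the cache is actually being shared, it can help to check how much data sits in the host's cache directory before and after a run. Here's a minimal Python sketch. The function name is hypothetical, and the ~/.cache/huggingface path is an assumption about where the model download lands, so adjust it if your files end up elsewhere.

```python
from pathlib import Path


def cache_size_bytes(cache_dir):
    """Sum the sizes of all regular files under `cache_dir` (0 if missing).

    Symlinks are skipped so that linked files are not counted twice.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return 0
    return sum(
        path.stat().st_size
        for path in root.rglob("*")
        if path.is_file() and not path.is_symlink()
    )


if __name__ == "__main__":
    # Assumed location of the downloaded model files on the host.
    size = cache_size_bytes(Path.home() / ".cache" / "huggingface")
    print(f"{size / 1e6:.1f} MB cached")
```

If the number grows after the first run and stays the same after the second, the container is reading the model from the mounted directory instead of downloading it again.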

Working with Jupyter Notebooks in Accelerated Containers

Jupyter notebooks are a great way to explore, learn, and share code and data. Most machine learning models will have an accompanying Jupyter notebook, showcasing how to interact with, shape data, or train a model.

By default, Jupyter comes with a few security measures that lock down the development environment, so that nefarious individuals can’t just open up your notebooks and start injecting their own code.

One of these measures is a token that gets generated on startup. If we’re running our development container locally, we can see the token when we run our Docker container with the following command:

docker run -it -p 8888:8888 --gpus 'all,"capabilities=compute,graphics,display,utility,video"' --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v .:/app -v ~/.cache:/root/.cache clip-container /usr/local/bin/jupyter lab --allow-root --ip='*' --port=8888 --no-browser --NotebookApp.allow_origin='*' --notebook-dir=/app

This publishes the Jupyter port (8888) to the host, mounts our project and model cache directories, and lets the host computer reach the Jupyter server running in the Docker container.

But notably, we don’t have our token before we run the container. So we’ll need to watch the container run, and wait for the following message to appear:

Or copy and paste this URL:
        http://hostname:8888/?token=f8cdd01ae9449804b1035caee036fd8ac42cc76706c79edf

On our host computer, we replace hostname with localhost, and are able to connect to the JupyterLab instance running within the container. We’re able to then interactively run our code, and discover the shape and structure of our data.
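The hostname swap is easy to script. Here's a small Python sketch that finds the tokenized URL in the container's log text and rewrites it for the host. The function name is hypothetical, and the log snippet is the one shown above.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# A Jupyter login URL ending in ?token=<hex digits>.
TOKEN_URL = re.compile(r"https?://\S+\?token=[0-9a-f]+")


def local_jupyter_url(log_text, host="localhost"):
    """Find the first tokenized URL in the log and point it at `host`."""
    match = TOKEN_URL.search(log_text)
    if match is None:
        return None
    parts = urlsplit(match.group(0))
    # Keep the port, swap only the hostname.
    netloc = f"{host}:{parts.port}" if parts.port else host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))


log = """
    Or copy and paste this URL:
        http://hostname:8888/?token=f8cdd01ae9449804b1035caee036fd8ac42cc76706c79edf
"""
print(local_jupyter_url(log))
```

If the container runs in the background, its log text can be fetched with docker logs followed by the container ID.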

If we want to get a shell in our running development container, we can either open a terminal in JupyterLab, or use Docker to execute into the container:

$ docker ps
CONTAINER ID   IMAGE            COMMAND                  CREATED         STATUS         PORTS                                                 NAMES
83fa2ca10dcf   clip-container   "/opt/nvidia/nvidia_…"   2 minutes ago   Up 2 minutes   6006/tcp, 0.0.0.0:8888->8888/tcp, :::8888->8888/tcp   sleepy_cray
$ docker exec -it 83fa2ca10dcf bash

The docker ps command lets us see the container IDs of all the running containers. The docker exec -it command lets us connect to the running container, and the bash argument gives us our shell to work with. We can then use the shell to do what we need, or attach to the container multiple times by running the command in different terminals.
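Looking up the container ID by hand gets old quickly, so here's a minimal Python sketch that finds running containers started from a given image and builds the matching docker exec call. The function names are hypothetical. The --filter ancestor= and --format options are standard docker ps options.

```python
import subprocess


def ps_command(image):
    """Build a `docker ps` call that prints only IDs of containers from `image`."""
    return ["docker", "ps", "--filter", f"ancestor={image}", "--format", "{{.ID}}"]


def exec_command(container_id, shell="bash"):
    """Build the `docker exec` call that opens an interactive shell."""
    return ["docker", "exec", "-it", container_id, shell]


def running_container_ids(image):
    """Run `docker ps` and return the matching container IDs as a list."""
    result = subprocess.run(ps_command(image), capture_output=True, text=True, check=True)
    return result.stdout.split()
```

With a container running, subprocess.run(exec_command(running_container_ids("clip-container")[0])) opens a shell in the first match.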

Adding Deterministic Container Builds to Each Commit

So now we’ve got a reproducible environment, with a container image completely isolated from the rest of our computer.

But now we should take our efforts further, and pin container image versions to each commit we do.

Luckily, git has hooks, which allow us to take action after every commit. In our case, we can create an executable file named post-commit in the .git/hooks/ directory of our git project. This will allow us to push the container image associated with each commit to our container registry.

The bash script could look something like this:

#!/bin/bash
# Build and push an image tagged with the current commit hash.
set -e
export HASH=$(git rev-parse HEAD)
docker build . -t clip-container:$HASH
docker push clip-container:$HASH

You should change clip-container to match the name of the container image you’d like to build, including your registry’s prefix so that docker push knows where to send it.
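Because .git/hooks is not checked in with the repository, every fresh clone needs the hook installed again. Here's a minimal Python sketch of an installer. The function name and the registry.example.com prefix are hypothetical. The template follows the script above and starts with a shebang line, and the installer marks the file executable, since git only runs hooks that are executable.

```python
import stat
from pathlib import Path

# Hook body: build and push an image tagged with the commit hash.
HOOK_TEMPLATE = """#!/bin/bash
set -e
export HASH=$(git rev-parse HEAD)
docker build . -t {image}:$HASH
docker push {image}:$HASH
"""


def install_post_commit_hook(repo_dir, image):
    """Write .git/hooks/post-commit for `image` and mark it executable."""
    hook = Path(repo_dir) / ".git" / "hooks" / "post-commit"
    hook.parent.mkdir(parents=True, exist_ok=True)
    hook.write_text(HOOK_TEMPLATE.format(image=image))
    # Add the execute bits on top of the existing permissions.
    hook.chmod(hook.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return hook
```

Calling install_post_commit_hook(".", "registry.example.com/clip-container") from the project root writes the hook for that (made-up) image name.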

With this, we can make a simple change to our run_dev.sh script when we need to go back in time: set the tag on our container image to the commit hash we’d like to go back to.

To check out a previous commit:

git checkout <HASH>

After the checkout, git rev-parse HEAD returns that commit’s hash, so we can update run_dev.sh with the matching image tag. From there, we can make changes, test again, or re-build a model from where we left off.
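The tag lookup can also be derived from the checkout instead of edited by hand. Here's a minimal Python sketch under the assumption that images were tagged with the full commit hash, as in the post-commit script. The function names are hypothetical.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the full hash of the commit currently checked out."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def image_for_commit(commit_hash, image="clip-container"):
    """Return the image reference the post-commit hook would have pushed."""
    return f"{image}:{commit_hash}"
```

After a git checkout, image_for_commit(current_commit()) gives the image reference to pass to docker pull and docker run, so the environment follows whichever commit is checked out.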

Building Better Tooling for ML Development Environments

Even though we’ve covered a bit of the tooling to improve the local development experience of machine learning models, there’s still a lot of manual process and room for error. For example, we’re still pushing and changing each container hash manually after a checkout. Our tooling should support that automatically, and transparently.

I encourage you to think through how you may improve upon the development experience, and reach out. I’ve created a repository showcasing the techniques outlined in the post, and would welcome an issue or a PR improving upon the workflow.

Feel free to check out the repository and let me know what you think.


Kirk Kaiser
