Editor’s note: Emily Curtin is a speaker for ODSC East 2023 this May 9th-11th. Be sure to check out her talk, “Containers + GPUs In Depth,” there!

It’s time: you’ve got a great model, validated with customers, tested against use cases, all packaged up and ready to go. Now you need to deploy it to your ultra-modern, cloud-based, serverless systems.

But there’s a problem: how on earth do you hook up that GPU?

GPUs are like printers: frustratingly physical, and very picky about configuration. The cloud computing world piles abstraction layer upon abstraction layer so that developers can stay focused on their application code without worrying about hardware and resources, but those abstractions start to break down when a run-anywhere-all-packaged-up-super-virtual-it’s-a-container-it’s-fine ML model has to talk to a not-at-all-virtual GPU device.

Ingredients

Let’s assume a basic live prediction service. For the sake of example, let’s say we have a text classification model. Before the model can do anything with the text, the text first has to be embedded into some mathematical space, and that embedding step benefits enormously from GPU acceleration. For some of our heavier applications at Mailchimp, we’ve seen as much as a 6-10x speedup for applications like this.
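
That speedup comes from doing the embedding math on the GPU rather than the CPU. As a rough sketch (assuming PyTorch, with a placeholder embedding layer standing in for a real trained model), the application-level code only has to ask for a CUDA device; everything underneath that request is what the rest of this article is about:

    # A minimal sketch, assuming PyTorch is installed and a CUDA device is visible.
    # The EmbeddingBag here is a stand-in for whatever embedding model you actually serve.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    embedder = torch.nn.EmbeddingBag(num_embeddings=30_000, embedding_dim=768).to(device)

    def embed(token_ids: list[int]) -> torch.Tensor:
        """Embed one tokenized document, on the GPU when one is available."""
        batch = torch.tensor([token_ids], dtype=torch.long, device=device)
        with torch.no_grad():
            return embedder(batch)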

In this example, our custom application code for this text classification server is built using some open-source ML library like PyTorch, TensorFlow, mlpack, etc.

These libraries in turn rely on a series of CUDA toolkit libraries for mathematical operations. Some examples:

  • cuBLAS
  • cuRAND
  • cuSPARSE
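
If you’re ever unsure which of these toolkit libraries your environment can actually see, a few lines of Python will tell you whether the dynamic linker can load them. This is only a diagnostic sketch; the exact .so names and version suffixes depend on your CUDA release and base image:

    # Check whether the dynamic linker can load the CUDA toolkit libraries.
    # Adjust the sonames for your CUDA version (e.g. libcublas.so.11 vs .so.12).
    import ctypes

    for name in ("libcublas.so", "libcurand.so", "libcusparse.so", "libcudart.so", "libcuda.so.1"):
        try:
            ctypes.CDLL(name)
            print(f"{name}: loadable")
        except OSError as err:
            print(f"{name}: NOT loadable ({err})")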

These CUDA toolkit libraries have their own dependencies on lower-level CUDA libraries:

  • libcudart – runtime API or “high-level” API
  • libcuda – User-mode driver or “low-level” API
  • nvidia.ko – Kernel-mode driver
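
One way to see the runtime/driver split in action is to ask each layer what version it reports. The sketch below assumes the libraries are loadable under these sonames; adjust them for your environment:

    # Query the "high-level" runtime API and the "low-level" driver API directly.
    # Version integers are encoded as major*1000 + minor*10 (e.g. 11080 -> 11.8).
    import ctypes

    version = ctypes.c_int()

    ctypes.CDLL("libcudart.so").cudaRuntimeGetVersion(ctypes.byref(version))
    print("runtime API (libcudart) version:", version.value)

    ctypes.CDLL("libcuda.so.1").cuDriverGetVersion(ctypes.byref(version))
    print("driver API (libcuda) version:", version.value)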

And at the bottom of this call stack, you’ve got the operating system of your server and the physical GPU device.

The call stack on a single server, without containers, looks roughly like this:

  • HTTP or RPC server
  • Custom ML model and business logic
  • Open-source ML library (PyTorch, TensorFlow, XGBoost, etc.)
  • CUDA toolkit libraries (cuSPARSE, cuRAND, cuSOLVER, etc.)
  • libcudart: high-level CUDA API
  • libcuda: low-level API / user-space driver
  • nvidia.ko: kernel-space driver
  • Operating system
  • GPU

Now let’s make this more complicated.

In the cloud world, we’re rarely working with bare metal servers. It’s much more common to be working with containerized applications and a container orchestration framework like Kubernetes, or maybe some other cloud vendor-branded serverless framework (that is almost certainly Kubernetes under the hood). This call stack adds a few more layers and also spreads out over multiple servers in a cluster.

This is a simplified version of that call stack:

  • Application container
      ◦ HTTP or RPC server
      ◦ Custom ML model and business logic
      ◦ Open-source ML library (PyTorch, TensorFlow, XGBoost, etc.)
      ◦ CUDA toolkit libraries (cuSPARSE, cuRAND, cuSOLVER, etc.)
      ◦ libcudart: high-level CUDA API
  • Each GPU-enabled Kubernetes node (server)
      ◦ DaemonSet that forwards GPU drivers
      ◦ libcuda: low-level API / user-space driver
      ◦ nvidia.ko: kernel-space driver
      ◦ Node operating system
  • Hardware device
      ◦ GPU

From GPU to Container

Assuming you’re using a managed Kubernetes service like GKE or EKS, your provider will have instructions about the initial hookup between GPU devices and your Kubernetes nodes.

For instance, GKE documentation specifies how to set up a cluster with a GPU pool attached. This comes with some restrictions regarding the operating system of the nodes.

With GKE, what you get is the ability to attach GPU devices to your deployments in much the same way you attach storage devices. The two driver libraries, user-space and kernel-space, are then provided by a DaemonSet on the cluster and forwarded into your pod, where they show up in /usr/local/nvidia/lib64.
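
A small startup check inside the container can confirm that forwarding actually happened. This is a sketch assuming a GKE-style setup where the DaemonSet mounts the drivers at /usr/local/nvidia/lib64:

    # Confirm the daemonset-forwarded driver libraries are present and on the
    # linker's search path before the prediction server starts.
    import os
    import pathlib

    driver_dir = pathlib.Path("/usr/local/nvidia/lib64")
    libs = sorted(p.name for p in driver_dir.glob("libcuda.so*")) if driver_dir.exists() else []

    print("forwarded driver libs:", libs or "none found")
    print("on LD_LIBRARY_PATH:", str(driver_dir) in os.environ.get("LD_LIBRARY_PATH", ""))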

All of this handles the bottom of the stack, but how do we marshal those middle layers?

From Container to Application

Different ML libraries have different ways of handling the CUDA runtime API and CUDA toolkit libraries.

For instance, the PyTorch developers have historically done a lot of work to wrap the CUDA toolkit dependencies in their own libtorch_cuda library, which is then distributed with the PyTorch wheel package. All the user has to bring to the table (in addition to the drivers mentioned above) is libcudart.

To take another example, TensorFlow does not try to wrap the toolkit libraries. It expects those dependencies to be available in the environment so that it can dynamically link against them.
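
Whichever library you use, it’s worth a quick sanity check that it can actually see the device once everything above is wired up. A sketch covering both frameworks (run only the half that matches your application):

    # "Can my ML library see the GPU?" sanity checks for PyTorch and TensorFlow.
    try:
        import torch
        print("PyTorch sees CUDA:", torch.cuda.is_available())
        print("PyTorch built against CUDA:", torch.version.cuda)
    except ImportError:
        pass

    try:
        import tensorflow as tf
        print("TensorFlow GPUs:", tf.config.list_physical_devices("GPU"))
    except ImportError:
        pass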

To satisfy the requirements for libcudart and the toolkit libraries, you will have to employ one or more of the following strategies:

  • Rely on your ML library to distribute these higher-level dependencies in its wheel package.
  • Use a base container that already has these toolkit libraries installed, in a CUDA version that matches your ML library (see the version-check sketch below this list).
  • DIY install them into your base container.
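
Whichever strategy you choose, the CUDA versions have to line up. A rough version-check sketch (assuming PyTorch and a loadable libcudart; adjust the soname for your image):

    # Compare the CUDA version the library was built against with the runtime
    # library present in the container. Mismatches here are a classic failure mode.
    import ctypes
    import torch

    built_against = torch.version.cuda                       # e.g. "11.8"

    v = ctypes.c_int()
    ctypes.CDLL("libcudart.so").cudaRuntimeGetVersion(ctypes.byref(v))
    in_container = f"{v.value // 1000}.{(v.value % 1000) // 10}"  # 11080 -> "11.8"

    print(f"built against CUDA {built_against}, container provides CUDA {in_container}")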

Takeaways

GPUs are physical devices with very picky requirements. As such, the ML applications that benefit from GPU acceleration are so tied to their runtime environments that distributing the model without its environment is often nonsensical.

I hope this tutorial has helped to elucidate the process of getting your GPU-accelerated ML models into production without having to dig through too many CMake files! In my talk, “Containers + GPUs In Depth,” at ODSC East 2023, we’ll dive deeper into the human side of this problem. I’ll discuss strategies for smoothing the developer experience for Data Scientists so that they can spend more time refining their models and less time wrestling with the GPU call stack.

About the author:

Emily Curtin is a Staff MLOps Engineer at Intuit Mailchimp, meaning she gets paid to say “it depends” and “well actually.” Professionally she leads a crazy good team focused on helping Data Scientists do higher-quality work faster and more intuitively. Non-professionally she paints huge landscapes and hurricanes in oils, crushes sweet V1s (as long as they’re not too crimpy), rides her bike, reads a lot, and bothers her cats. She lives in Atlanta, GA, which is inarguably the best city in the world, with her husband Ryan who’s a pretty darn cool guy.