
Abstract: There’s a perfect storm happening in AI infrastructure: while new applications of AI across industries are creating workloads with ever more complex compute requirements, many organizations are choosing to build their AI infrastructure from the ground up using containers and Kubernetes. The problem is that Kubernetes wasn’t built to run these compute-intensive AI workloads; it was built to run microservices on commodity CPUs. In this talk, we’ll tackle the challenge of optimizing the AI infrastructure stack using Kubernetes, NVIDIA GPUs, and Run:AI. Walking through an example of a well-architected AI infrastructure stack, we’ll discuss how Kubernetes can be augmented with advanced GPU scheduling to maximize efficiency and speed up data science initiatives.
Session Outline
You’ll learn:
● The dynamic resource requirements of Artificial Intelligence/Machine Learning (AI/ML) workloads
● Best practices for managing compute resources such as Graphics Processing Units (GPUs) to drive better resource allocation and increase cluster utilization (see the first sketch after this list)
● How to apply advanced scheduling methods to dynamically set priorities and policies, leading to more efficient job orchestration (see the second sketch after this list)
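
As a concrete starting point for the GPU-management topic above, here is a minimal sketch using the official kubernetes Python client to submit a pod that requests a GPU through the nvidia.com/gpu extended resource (exposed by NVIDIA's Kubernetes device plugin). The pod name, container image, and train.py entry point are illustrative assumptions, not session material.

from kubernetes import client, config

# Assumes a reachable cluster where the NVIDIA device plugin is installed
# and advertises GPUs as the extended resource "nvidia.com/gpu".
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-train-demo"),  # hypothetical name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative image
                command=["python", "train.py"],            # hypothetical script
                # GPUs are requested via resource limits; the scheduler places
                # the pod only on a node with an unallocated GPU.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

Because a whole GPU is the smallest schedulable unit here, a job that needs only a fraction of a GPU still occupies one entirely, which is exactly the utilization gap the session addresses.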
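
For the priorities-and-policies bullet, Kubernetes’ built-in PriorityClass API is the baseline mechanism: higher-priority pods are scheduled first and can preempt lower-priority ones under resource pressure. Schedulers like the one discussed in the talk build richer policy layers (quotas, fairness, over-quota behavior) on top of primitives like this. A minimal sketch, again using the kubernetes Python client; the class name and value are illustrative assumptions.

from kubernetes import client, config

config.load_kube_config()

# A cluster-scoped PriorityClass; the name and value are illustrative.
high_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="training-high"),
    value=1000,                # larger value means higher scheduling priority
    global_default=False,      # pods must opt in explicitly
    description="High-priority AI training jobs (demo only)",
)
client.SchedulingV1Api().create_priority_class(body=high_priority)

# Pods opt in by name; when the cluster is full, the scheduler may evict
# lower-priority pods to make room for this one.
pod_spec = client.V1PodSpec(
    priority_class_name="training-high",
    restart_policy="Never",
    containers=[
        client.V1Container(name="job", image="busybox", command=["sleep", "3600"])
    ],
)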
Bio: Rob Magno is a Sales Engineer/Solution Architect at Run:AI, based in New Jersey. He has been working in the Docker and Kubernetes space for the past five years, and he enjoys tackling the diverse customer challenges that come with orchestrating AI/ML workloads on Kubernetes.