How to Make LLMs Fit into Commodity Hardware Again: A Practical Guide


LLMs like ChatGPT are all the rage. Using them as they are, or as the key component of a RAG (Retrieval-Augmented Generation) system, stretches the limits of what is possible in software development today.

Unfortunately, those models typically run in the cloud, either because vendors simply don't share their weights or because the hardware required to run them is not available for purchase at scale. There are, however, good reasons to want an LLM running on machines you manage yourself:
- Cost of operation
- Privacy / data protection
- Latency
- Full control of availability and scaling

In this hands-on workshop we will show different approaches to making powerful LLMs fit onto affordable GPUs (like a T4) or, in special cases, even run on a CPU. We will round this out by showing you how to evaluate and compare the performance of these small LLMs.
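As a rough, back-of-the-envelope illustration of why these approaches matter (this sketch is not from the workshop materials): the memory needed just for a model's weights scales with its parameter count and the bits used per parameter, which is what makes quantization the key lever for fitting a model onto a 16 GB T4.

```python
# Back-of-the-envelope memory footprint of LLM weights at various precisions.
# A T4 GPU offers 16 GB of VRAM, so the numbers below show why quantization
# matters for a 7B-parameter model.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    gb = weight_memory_gb(7, bits)
    print(f"7B params @ {bits:2d}-bit: {gb:5.1f} GB")
# 32-bit: 28.0 GB and 16-bit: 14.0 GB exceed or nearly fill a T4,
# while 8-bit (7.0 GB) and 4-bit (3.5 GB) leave room for activations.
```

Note that this counts weights only; the KV cache and activations add further memory on top, so real headroom is smaller than these figures suggest.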

All examples are provided as Google Colab notebooks for you to follow along, so all you need is a laptop and a browser.


Oliver Zeigermann has been developing software with different approaches and programming languages for more than three decades. In the past decade, he has focused on Machine Learning and its interactions with humans.

Open Data Science
One Broadway
Cambridge, MA 02142
