Editor’s Note: James Demmel, PhD and Yang You, PhD are speakers for ODSC West 2022 coming this November 1st-3rd. Be sure to check out their talk, “Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training,” there!

The success of Transformer models has pushed deep learning to the scale of trillions of parameters. This growth in model size has outpaced advances in hardware, resulting in an urgent need to distribute enormous models across multiple GPUs. Despite this trend, best practices for choosing an optimal strategy are still lacking because of the breadth of knowledge required across both deep learning and parallel computing. This has led AI researchers and developers to ask further questions:

  • How can we improve the training and inference efficiency of large models to reduce costs?
  • Can we accommodate larger models with limited resources?
  • What can we do to give more members of the AI community easy access to big models?

In this blog post, we use one of the most popular AI models on the Hugging Face Hub, OPT from Meta, to demonstrate how to train and fine-tune your large AI models at a low cost with minimal modifications to your code.


Open source code: https://github.com/hpcaitech/ColossalAI

Trends in Large AI Models

Forbes recently named large AI models as one of six AI trends to watch in 2022. As large-scale AI models continue to deliver superior performance across different domains, they are giving rise to distinctive and efficient AI applications never before seen in the industry, e.g., Copilot and DALL-E 2.

Images generated by Imagen (left two columns) vs. DALL-E 2 (right two columns) for the prompt “Greek statue of a man tripping over a cat”

In recent years, the outstanding results of model scaling have led to an escalation in the size of pre-trained models. Unfortunately, training and even simply fine-tuning large AI models is usually unaffordable, requiring tens or hundreds of GPUs. For example, training GPT-3, a model with 175 billion parameters, on a state-of-the-art GPU, the NVIDIA A100, would take more than 100 years and cost $12 million.

In addition, existing deep learning frameworks like PyTorch and TensorFlow may not offer a satisfactory solution for very large AI models. Furthermore, advanced knowledge of AI systems is typically required for sophisticated configurations and optimization of specific models. Therefore, many AI users, such as engineers from small and medium-sized enterprises, can’t help but feel overwhelmed by the emergence of large AI models.

Colossal-AI example

To address the above challenges, Prof. James Demmel of UC Berkeley and Prof. Yang You of NUS led the Colossal-AI project, a unified deep learning system for the big model era that integrates many efficient techniques such as multi-dimensional tensor parallelism, sequence parallelism, and heterogeneous memory management. With Colossal-AI, users can deploy large AI model training and inference quickly and efficiently, reducing both the budget for large AI models and the labor cost of learning and deployment.


Best of all, it is completely open-source and requires only minimal modifications to allow existing deep learning projects to be trained with much larger models on a single consumer-grade graphics card. In particular, it makes downstream tasks and application deployments such as large AI model fine-tuning and inference much easier. It even grants the convenience of training AI models at home!

Use Large Model OPT with Low Cost

About Open Pretrained Transformer (OPT)

Meta recently released Open Pretrained Transformer (OPT), a 175-billion-parameter AI language model. To encourage AI democratization in the community, Meta has released both the code and the trained model weights, which enables AI programmers to perform various downstream tasks and application deployments. We will now demonstrate fine-tuning for Causal Language Modeling with the pre-trained weights of the OPT model provided by the Hugging Face Hub.

About Hugging Face

Hugging Face is a popular AI community that strives to advance and democratize AI through open source and open science. Hugging Face has had success collating large-scale models into their own model hub with over 50,000 models, including trendy large AI models like GPT and OPT.

Configure with Colossal-AI 

It is very simple to use the powerful features of Colossal-AI. Users only need a simple configuration file, and are not required to alter their training logic to equip models with their desired features (e.g. mixed-precision training, gradient accumulation, multi-dimensional parallel training, and memory redundancy elimination). 

Suppose we intend to train OPT on one GPU. We can accomplish this by leveraging heterogeneous training from Colossal-AI, which only requires users to add the relevant items to the configuration file. Among the items added, tensor_placement_policy, which can be configured as cuda, cpu, or auto, determines our heterogeneous training strategy. Each strategy has its distinct advantages:

  • cuda: puts all model parameters on the GPU, suitable when the model can be trained without offloading weights;
  • cpu: puts all model parameters on the CPU and keeps in GPU memory only the weights that participate in the current computation step, suitable for giant model training;
  • auto: determines how many parameters to keep on the GPU by closely monitoring the current memory status. It optimizes the usage of GPU memory and minimizes the expensive data transmission between GPU and CPU.

Typical users can simply select the auto strategy, which maximizes training efficiency by dynamically adapting the heterogeneous strategy to the current memory state.
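To make the three policies more concrete, here is a minimal Python sketch of how such a placement decision could work. It is a toy illustration and not Colossal-AI's implementation: the `Param` class, the `place_parameters` function, and the greedy rule used for `auto` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a model parameter tensor."""
    name: str
    size_bytes: int


def place_parameters(params, policy, gpu_free_bytes):
    """Return a {name: device} map for a toy placement policy.

    "cuda" keeps everything on the GPU, "cpu" keeps everything on the CPU,
    and "auto" (assumed greedy rule) fills the GPU until the free memory
    budget is used up and sends the rest to the CPU.
    """
    if policy == "cuda":
        return {p.name: "cuda" for p in params}
    if policy == "cpu":
        return {p.name: "cpu" for p in params}
    if policy == "auto":
        placement = {}
        remaining = gpu_free_bytes
        for p in params:
            if p.size_bytes <= remaining:
                placement[p.name] = "cuda"
                remaining -= p.size_bytes
            else:
                placement[p.name] = "cpu"
        return placement
    raise ValueError(f"unknown policy: {policy!r}")


params = [Param("layer0", 4), Param("layer1", 4), Param("layer2", 4)]
print(place_parameters(params, "auto", gpu_free_bytes=8))
```

With 8 bytes of free GPU memory, the first two toy parameters stay on the GPU and the third is placed on the CPU. A real system would re-evaluate this decision as the memory state changes during training, rather than computing it once.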

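from colossalai.zero.shard_utils import TensorShardStrategy
# Shard model tensors and select the "auto" placement policy described above.
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy="auto"))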

Launch with Colossal-AI

With the configuration file ready, only a few lines of code are needed to enable the newly declared features.

First, launch Colossal-AI with a single line of code and the configuration file. Colossal-AI will automatically initialize the distributed environment, read in the configuration settings, and integrate them into its components (i.e., models and optimizers).


After that, users may define their own datasets, models, optimizers, and loss functions as usual, using raw PyTorch code. Only the model needs to be initialized under ZeroInitContext. In the given example, we adopt the OPTForCausalLM model along with its pretrained weights from Hugging Face, and fine-tune it on the Wikitext dataset.

with ZeroInitContext(target_device=torch.cuda.current_device(),
    model = OPTForCausalLM.from_pretrained(

Next, use colossalai.initialize to integrate the heterogeneous memory functions defined in the configuration file into the training engine, which enables the feature.

engine, train_dataloader, eval_dataloader, lr_scheduler = colossalai.initialize(...)

Remarkable Performance from Colossal-AI

On a single GPU, Colossal-AI’s automatic strategy provides remarkable performance gains over the ZeRO Offloading strategy from Microsoft DeepSpeed, with speedups of up to 40% across a variety of model scales. By contrast, a traditional deep learning framework like PyTorch can no longer train models at such a scale on a single GPU.


Adopting the distributed training strategy with 8 GPUs is as simple as adding a -nprocs 8 to the training command of Colossal-AI!

Behind the Scenes

Such remarkable improvements come from Colossal-AI’s efficient heterogeneous memory management system, Gemini. Put simply, Gemini uses a few warm-up steps during model training to collect memory usage information from PyTorch computational graphs. After warm-up, and before performing each operation, Gemini pre-allocates memory for the operator equal to its peak usage, based on the collected memory usage records. At the same time, it moves some model tensors from GPU memory to CPU memory.

Gemini’s built-in memory manager attaches a state to each tensor, such as HOLD, COMPUTE, or FREE. Based on the queried memory usage, the manager continually updates the tensor states and adjusts tensor positions. Compared to the static memory classification of DeepSpeed’s ZeRO Offload, Colossal-AI’s Gemini makes more efficient use of GPU and CPU memory, maximizes model capacity, and balances training speed, all with a small amount of hardware.
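The idea of state-driven placement can be sketched in a few lines of Python. This is a simplified toy, not Gemini's actual code: the `ToyMemoryManager` class, its method names, and the interpretation of the states (HOLD for tensors that must be kept but are idle, COMPUTE for tensors used by the current operator, FREE for tensors that can be released) are assumptions made for illustration.

```python
from enum import Enum, auto


class TensorState(Enum):
    HOLD = auto()     # kept alive, but not needed by the current operator
    COMPUTE = auto()  # needed by the operator about to run
    FREE = auto()     # no longer needed; occupies no memory in this toy


class ToyMemoryManager:
    def __init__(self, gpu_capacity, peak_op_usage):
        self.gpu_capacity = gpu_capacity
        # {operator name: peak non-model bytes}, as recorded during warm-up
        self.peak_op_usage = peak_op_usage
        self.tensors = {}

    def register(self, name, size, device="cuda"):
        self.tensors[name] = {"size": size, "state": TensorState.HOLD, "device": device}

    def gpu_model_bytes(self):
        return sum(t["size"] for t in self.tensors.values()
                   if t["device"] == "cuda" and t["state"] is not TensorState.FREE)

    def before_op(self, op_name, needed):
        # Tensors used by this operator must be on the GPU.
        for name in needed:
            self.tensors[name]["state"] = TensorState.COMPUTE
            self.tensors[name]["device"] = "cuda"
        # Reserve the operator's recorded peak usage, then evict idle tensors.
        budget = self.gpu_capacity - self.peak_op_usage[op_name]
        for t in self.tensors.values():
            if self.gpu_model_bytes() <= budget:
                break
            if t["state"] is TensorState.HOLD and t["device"] == "cuda":
                t["device"] = "cpu"

    def after_op(self, needed):
        for name in needed:
            self.tensors[name]["state"] = TensorState.HOLD


manager = ToyMemoryManager(gpu_capacity=10, peak_op_usage={"matmul": 4})
manager.register("w1", 4)
manager.register("w2", 4)
manager.before_op("matmul", needed=["w2"])
print({name: t["device"] for name, t in manager.tensors.items()})
```

In this toy run, the operator reserves 4 of the 10 available units, so the idle tensor `w1` is moved to the CPU while `w2` stays on the GPU for computation.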

For GPT, a representative large model, Colossal-AI is capable of training up to 1.5 billion parameters on a gaming laptop with an RTX 2060 6GB. On a PC with an RTX 3090 24GB, Colossal-AI can train GPT with 18 billion parameters. Colossal-AI can also bring significant improvements to high-performance graphics cards such as the Tesla V100.
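A rough estimate shows why these numbers require offloading. The sketch below assumes mixed-precision training with the Adam optimizer, where each parameter needs about 16 bytes of model state (fp16 weight, fp16 gradient, and fp32 master weight, momentum, and variance). This accounting is an assumption of the example rather than a figure from the Colossal-AI team, and it ignores activations entirely.

```python
# Assumed bytes of model state per parameter (mixed precision + Adam):
# fp16 weight (2) + fp16 gradient (2) + fp32 master weight (4)
# + fp32 momentum (4) + fp32 variance (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def model_state_gb(num_params):
    """Model-state memory in GB (10^9 bytes), excluding activations."""
    return num_params * BYTES_PER_PARAM / 1e9


for label, num_params, gpu_gb in [
    ("GPT 1.5B on RTX 2060", 1_500_000_000, 6),
    ("GPT 18B on RTX 3090", 18_000_000_000, 24),
]:
    needed = model_state_gb(num_params)
    print(f"{label}: ~{needed:.0f} GB of model state vs {gpu_gb} GB of GPU memory")
```

Under these assumptions the model state alone is several times larger than the GPU memory in both cases, which is why part of it has to live in CPU memory.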

By additionally using NVMe offloading to take advantage of cheaper SSD storage, we can even train a model with 12 billion parameters on an RTX 3080 10GB, 120 times what PyTorch alone can handle!

Furthermore: Convenient and Efficient Parallelization

Parallel and distributed technologies are vital methods to further accelerate model training. To train the world’s largest and most advanced AI models within the shortest time, efficient distributed parallelization is still a necessity. Issues found in existing solutions include limited parallel dimension, low efficiency, poor versatility, difficult deployment, and lack of maintenance. With this in mind, Colossal-AI uses technologies such as efficient multi-dimensional parallelism and heterogeneous parallelism to allow users to deploy large AI models efficiently and rapidly with minimal modifications to their code.

To use data, pipeline, and 2.5D parallelism simultaneously, a simple declaration in the configuration suffices with Colossal-AI. The approach typical of other systems and frameworks, hacking into the underlying code logic, is no longer necessary.

parallel = dict(
    tensor=dict(mode='2.5d', depth=1, size=4)
)
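To see how the numbers in such a declaration relate to the GPU count, here is a small Python sketch. It assumes that a 2.5D layout arranges `size` GPUs as `depth` layers of a `q x q` grid, and that the data-parallel size is whatever remains after dividing the total GPU count by the pipeline and tensor sizes. The helper names `grid_side_2p5d` and `data_parallel_size`, and the pipeline value of 2 in the second call, are hypothetical and used only for illustration.

```python
import math


def grid_side_2p5d(size, depth):
    """Return q such that size == depth * q * q, or raise if impossible."""
    if depth <= 0 or size % depth != 0:
        raise ValueError("size must be a positive multiple of depth")
    q = math.isqrt(size // depth)
    if q * q * depth != size:
        raise ValueError("size / depth must be a perfect square")
    return q


def data_parallel_size(world_size, pipeline, tensor_size):
    """GPUs left for data parallelism after pipeline and tensor parallelism."""
    model_parallel = pipeline * tensor_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by pipeline * tensor_size")
    return world_size // model_parallel


print(grid_side_2p5d(size=4, depth=1))
print(data_parallel_size(world_size=8, pipeline=1, tensor_size=4))
print(data_parallel_size(world_size=16, pipeline=2, tensor_size=4))
```

For the configuration above, `size=4` with `depth=1` corresponds to a 2 x 2 grid, so 8 GPUs without pipeline parallelism would leave room for two data-parallel replicas under these assumptions.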

For a super-large AI model such as GPT-3, Colossal-AI needs only half the computing resources of the NVIDIA solution to start training. With the same computing resources, training speed can be increased by a further 11%, which could reduce the training cost of GPT-3 by over a million dollars.
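The savings figure can be checked with a short calculation. This sketch assumes that the baseline is the $12 million GPT-3 training cost quoted earlier in this post and that cost scales linearly with training time; both are simplifying assumptions for illustration.

```python
def training_cost_savings(baseline_cost, speedup):
    """Savings when throughput rises by `speedup` (0.11 means 11% faster).

    Assumes cost is proportional to training time, so the new time is
    the old time divided by (1 + speedup).
    """
    return baseline_cost * (1 - 1 / (1 + speedup))


savings = training_cost_savings(baseline_cost=12_000_000, speedup=0.11)
print(f"Estimated savings: ${savings:,.0f}")
```

Under these assumptions, an 11% speedup shortens training time by roughly 10%, which is a little under $1.2 million on a $12 million budget and consistent with the "over a million dollars" claim.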

In theory, this sounds fantastic, but what about in practice? Colossal-AI has proven its capabilities on real-world problems across a variety of industries, including autonomous driving, cloud computing, retail, medicine, and chip production.

One such monumental application is to accelerate the iterative development of AI technology in protein structure prediction (e.g., AlphaFold from DeepMind, one of the top 10 scientific breakthroughs in 2021). We applied the efficient Colossal-AI scheme to develop an accelerated AI model to predict protein structures: FastFold. FastFold has successfully surpassed other schemes including those proposed by Google and Columbia University. It successfully reduces the training time of AlphaFold from 11 days to 67 hours, simultaneously lowering the overall cost. Moreover, the process of long sequence inference is accelerated by about 9.3 to 11.6 times.

The Colossal-AI team also worked with BioMap on xTrimo Multimer, which can predict both monomer and multimer structures, accelerating the process by up to 11 times!

Colossal-AI is committed to open source community development. We offer detailed tutorials and support the latest cutting-edge applications such as PaLM, OPT, and AlphaFold, and we will regularly release new and innovative features. We always welcome suggestions and discussions, and would be more than willing to help if you encounter any issues. Thanks to all of its open source developers, Colossal-AI recently reached No. 1 in trending projects on GitHub and Papers With Code, alongside projects that have as many as 10K stars.


Project address: https://github.com/hpcaitech/ColossalAI

About the authors/ODSC West 2022 speakers:

Prof. James Demmel is the Dr. Richard Carl Dehmel Distinguished Professor of Computer Science and Mathematics at the University of California at Berkeley, and former Chair of the EECS Dept. He also serves as Chief Strategy Officer for the start-up HPC-AI Tech, whose goal is to make large-scale machine learning much more efficient, with little programming effort required by users.  Demmel’s research is in high performance computing, numerical linear algebra, and communication avoiding algorithms. He is known for his work on the widely used LAPACK and ScaLAPACK linear algebra libraries.  He is a member of the National Academy of Sciences, National Academy of Engineering, and American Academy of Arts and Sciences; a Fellow of the AAAS, ACM, AMS, IEEE and SIAM; and winner of the IPDPS Charles Babbage Award, IEEE Computer Society Sidney Fernbach Award, the ACM Paris Kanellakis Award, the J. H. Wilkinson Prize in Numerical Analysis and Scientific Computing, and numerous best paper prizes.

Prof. Yang You is a Presidential Young Professor at the National University of Singapore. He received his Ph.D. in Computer Science from UC Berkeley. The focus of his current research is scaling up deep neural network training on distributed systems or supercomputers. In 2017, his team broke the world record for ImageNet training speed, which was covered by technology media such as NSF, ScienceDaily, Science NewsLine, and i-programmer. In 2019, his team broke the world record for BERT training speed. The BERT training techniques have been used by many tech giants like Google, Microsoft, and NVIDIA. Yang You’s LARS and LAMB optimizers are available in the industry benchmark MLPerf. He is a winner of the IPDPS 2015 Best Paper Award (0.8%), the ICPP 2018 Best Paper Award (0.3%), and the ACM/IEEE George Michael HPC Fellowship. Yang You is a Siebel Scholar and a winner of the Lotfi A. Zadeh Prize. He was nominated by UC Berkeley for the ACM Doctoral Dissertation Award (2 out of 81 Berkeley EECS PhD students who graduated in 2020). He also made the Forbes 30 Under 30 Asia list (2021) for young leaders and received the IEEE-CS TCHPC early career award.