Setting Up Training, Fine-Tuning, and Inference of LLMs with NVIDIA GPUs and CUDA

The field of artificial intelligence (AI) has witnessed remarkable advancements in recent years, and at the heart of it lies the powerful combination of graphics processing units (GPUs) and the CUDA parallel computing platform.

Models such as GPT, BERT, and more recently Llama and Mistral are capable of understanding and generating human-like text with unprecedented fluency and coherence. However, training these models requires vast amounts of data and computational resources, making GPUs and CUDA indispensable tools in this endeavor.

This comprehensive guide will walk you through the process of setting up an NVIDIA GPU on Ubuntu, covering the installation of essential software components such as the NVIDIA driver, CUDA Toolkit, cuDNN, PyTorch, and more.

The Rise of CUDA-Accelerated AI Frameworks

GPU-accelerated deep learning has been fueled by the development of popular AI frameworks that leverage CUDA for efficient computation. Frameworks such as TensorFlow, PyTorch, and MXNet have built-in support for CUDA, enabling seamless integration of GPU acceleration into deep learning pipelines.

According to the NVIDIA Data Center Deep Learning Product Performance Study, CUDA-accelerated deep learning models can run up to hundreds of times faster than CPU-based implementations.

NVIDIA’s Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture, allows a single GPU to be partitioned into multiple secure instances, each with its own dedicated resources. This feature enables efficient sharing of GPU resources among multiple users or workloads, maximizing utilization and reducing overall costs.

Accelerating LLM Inference with NVIDIA TensorRT

While GPUs have been instrumental in training LLMs, efficient inference is equally crucial for deploying these models in production environments. NVIDIA TensorRT, a high-performance deep learning inference optimizer and runtime, plays a vital role in accelerating LLM inference on CUDA-enabled GPUs.

According to NVIDIA’s benchmarks, TensorRT can provide up to 8x faster inference performance and 5x lower total cost of ownership compared to CPU-based inference for large language models like GPT-3.

NVIDIA’s freely available libraries have been a driving force behind the widespread adoption of CUDA in the AI research community. Libraries like cuDNN, cuBLAS, and NCCL can be downloaded at no cost (NCCL is open source, while cuDNN and cuBLAS are distributed as binaries), enabling researchers and developers to leverage the full potential of CUDA for their deep learning work.

Installation

When setting up an environment for AI development, using the latest drivers and libraries may not always be the best choice. For instance, while the latest NVIDIA driver at the time of writing (545.xx) supports CUDA 12.3, PyTorch and other libraries might not yet support this version. Therefore, we will use driver version 535.146.02 with CUDA 12.2 to ensure compatibility.
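
The compatibility reasoning above can be sketched as a small check. This is a minimal illustration, not an official compatibility matrix: the table holds only the two driver/CUDA pairings mentioned in this guide, and the names `DRIVER_MAX_CUDA`, `parse_cuda_tag`, and `is_supported` are hypothetical.

```python
# Highest CUDA version supported by each driver branch.
# Only the pairings stated in this guide are listed (an assumption for
# illustration); extend the table from NVIDIA's release notes as needed.
DRIVER_MAX_CUDA = {
    545: (12, 3),
    535: (12, 2),
}


def parse_cuda_tag(tag):
    """Convert a wheel tag such as 'cu121' into a (major, minor) tuple."""
    if not tag.startswith("cu"):
        raise ValueError(f"Unexpected CUDA tag: {tag}")
    digits = tag[2:]
    # The last digit is the minor version, the rest is the major version
    return int(digits[:-1]), int(digits[-1])


def is_supported(driver_version, cuda_tag):
    """Return True if the driver branch supports the CUDA version of the wheel."""
    branch = int(driver_version.split(".")[0])
    max_cuda = DRIVER_MAX_CUDA.get(branch)
    if max_cuda is None:
        raise ValueError(f"Unknown driver branch: {branch}")
    return parse_cuda_tag(cuda_tag) <= max_cuda


if __name__ == "__main__":
    print(is_supported("535.146.02", "cu121"))
    print(is_supported("535.146.02", "cu123"))
```

A check like this makes the trade-off explicit: a library built against a CUDA version newer than the driver supports will not run, so the driver choice bounds the library builds you can install.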

Installation Steps

1. Install NVIDIA Driver

First, identify your GPU model. This guide assumes an NVIDIA GPU. Visit the NVIDIA Driver Download page, select the appropriate driver for your GPU, and note the driver version.

To check for prebuilt GPU packages on Ubuntu, run:

sudo ubuntu-drivers list --gpgpu

After installing the driver you selected, reboot your computer and verify the installation:

nvidia-smi
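
If you prefer to check the driver from a script, the same information can be read programmatically. The following is a minimal sketch that assumes `nvidia-smi` is on your PATH; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, while `--query-gpu` and `--format=csv,noheader` are standard `nvidia-smi` options.

```python
import shutil
import subprocess

QUERY_FIELDS = "name,driver_version,memory.total"


def parse_gpu_csv(text):
    """Parse the CSV lines produced by nvidia-smi --query-gpu into dictionaries."""
    keys = QUERY_FIELDS.split(",")
    gpus = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        gpus.append(dict(zip(keys, values)))
    return gpus


def query_gpus():
    """Return one dictionary per GPU, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

An empty result means the tool was not found, which usually indicates that the driver is not installed or that the reboot is still pending.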

2. Install CUDA Toolkit

The CUDA Toolkit provides the development environment for creating high-performance GPU-accelerated applications.

For a non-LLM/deep learning setup, you can use:

sudo apt install nvidia-cuda-toolkit

However, to ensure compatibility with BitsAndBytes, we will follow these steps:

git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes/
bash install_cuda.sh 122 ~/local 1

Verify the installation:

~/local/cuda-12.2/bin/nvcc --version

Set the environment variables:

export CUDA_HOME=$HOME/local/cuda-12.2/
export LD_LIBRARY_PATH=$HOME/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
export BNB_CUDA_VERSION=122
export CUDA_VERSION=122
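
To confirm that the environment variables point at the toolkit you just installed, a short script can run the compiler found under `CUDA_HOME` and read its release number. This is a sketch under the assumption that `nvcc --version` prints a line containing `release X.Y`; the function names are hypothetical.

```python
import os
import re
import subprocess
from pathlib import Path
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from nvcc --version output."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


def check_cuda_home(expected: str = "12.2") -> bool:
    """Return True if $CUDA_HOME/bin/nvcc exists and reports the expected release."""
    cuda_home = os.environ.get("CUDA_HOME")
    if not cuda_home:
        return False
    nvcc = Path(cuda_home) / "bin" / "nvcc"
    if not nvcc.is_file():
        return False
    result = subprocess.run([str(nvcc), "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout) == expected


if __name__ == "__main__":
    print("CUDA_HOME points at CUDA 12.2:", check_cuda_home())
```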

3. Install cuDNN

Download the cuDNN package from the NVIDIA Developer website. Install it with:

sudo apt install ./cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb

Follow the instructions to add the keyring:

sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-08A7D361-keyring.gpg /usr/share/keyrings/

Install the cuDNN libraries:

sudo apt update
sudo apt install libcudnn8 libcudnn8-dev libcudnn8-samples
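
One way to verify the cuDNN installation is to load the shared library and ask it for its version. The sketch below assumes the library is installed as `libcudnn.so.8` and that cuDNN 8.x encodes its version as major*1000 + minor*100 + patch; `cudnnGetVersion` is part of the cuDNN C API, while the Python helper names are hypothetical.

```python
import ctypes


def decode_cudnn_version(version):
    """Decode a cuDNN 8.x version integer (major*1000 + minor*100 + patch)."""
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def installed_cudnn_version(lib_name="libcudnn.so.8"):
    """Return the installed cuDNN version as a tuple, or None if the library is not found."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    # cudnnGetVersion returns a size_t
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


if __name__ == "__main__":
    print("cuDNN version:", installed_cudnn_version())
```

With the package installed above, the decoded tuple should start with 8 and 9; a result of `None` means the dynamic loader could not find the library.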

4. Set Up a Python Virtual Environment

Ubuntu 22.04 comes with Python 3.10. Install pip and the venv module:

sudo apt-get install python3-pip
sudo apt install python3.10-venv

Create and activate the virtual environment:

cd
mkdir test-gpu
cd test-gpu
python3 -m venv venv
source venv/bin/activate
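
Before installing packages, it can help to confirm that the interpreter you are running really is the one inside the virtual environment. A minimal sketch, using only the standard library (the function name `in_virtualenv` is hypothetical):

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside virtual environment:", in_virtualenv())
```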

5. Install BitsAndBytes from Source

Navigate to the BitsAndBytes directory and build from source:

cd ~/bitsandbytes
CUDA_HOME=$HOME/local/cuda-12.2/ \
LD_LIBRARY_PATH=$HOME/local/cuda-12.2/lib64 \
BNB_CUDA_VERSION=122 \
CUDA_VERSION=122 \
make cuda12x
CUDA_HOME=$HOME/local/cuda-12.2/ \
LD_LIBRARY_PATH=$HOME/local/cuda-12.2/lib64 \
BNB_CUDA_VERSION=122 \
CUDA_VERSION=122 \
python setup.py install

6. Install PyTorch

Install PyTorch with the following command. The cu121 wheels ship with their own CUDA 12.1 runtime, which the 535 driver supports:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
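
After the install finishes, a quick check confirms that PyTorch can see the GPU. The following is a sketch rather than an official diagnostic: `torch_cuda_report` is a hypothetical helper, and it returns early if PyTorch is not importable so that it can be run at any point in the setup.

```python
def torch_cuda_report():
    """Collect basic facts about the PyTorch install and its view of the GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built with, or None for CPU-only builds
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # A small matrix multiply confirms that kernels actually launch on the GPU
        x = torch.rand(256, 256, device="cuda")
        report["matmul_ok"] = bool(torch.isfinite(x @ x).all())
    return report


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is false even though `nvidia-smi` works, the most likely cause is a CPU-only wheel, which shows up as `built_with_cuda` being `None`.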

7. Install Hugging Face Transformers and Accelerate

Install the transformers and accelerate libraries:

pip install transformers
pip install accelerate
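
With all components installed, it is useful to record the exact versions in your environment, since compatibility between them was the reason for pinning the driver and CUDA versions earlier. A minimal sketch using the standard library's `importlib.metadata` (the package list and the helper name are assumptions for illustration):

```python
from importlib import metadata

PACKAGES = ["torch", "transformers", "accelerate", "bitsandbytes"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```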

The Power of Parallel Processing

At their core, GPUs are highly parallel processors designed to handle thousands of concurrent threads efficiently. This architecture makes them well-suited for the computationally intensive tasks involved in training deep learning models, including LLMs. The CUDA platform, developed by NVIDIA, provides a software environment that allows developers to harness the full potential of these GPUs, enabling them to write code that can leverage the parallel processing capabilities of the hardware.

Accelerating LLM Training with GPUs and CUDA

Training large language models is a computationally demanding task that requires processing vast amounts of text data and performing numerous matrix operations. GPUs, with their thousands of cores and high memory bandwidth, are ideally suited for these tasks. By leveraging CUDA, developers can optimize their code to take advantage of the parallel processing capabilities of GPUs, significantly reducing the time required to train LLMs.
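
To make the resource demands concrete, here is a rough back-of-the-envelope estimate of the GPU memory needed just to hold a model during training. It is a sketch under stated assumptions: 32-bit values throughout, an Adam-style optimizer that keeps two extra values per parameter, and no accounting for activations or framework overhead. The function name and the parameter count used in the usage example are illustrative.

```python
BYTES_FP32 = 4


def training_bytes(num_params, bytes_per_value=BYTES_FP32):
    """Estimate bytes for weights, gradients, and Adam's two moment buffers."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value
    return weights + gradients + adam_states


if __name__ == "__main__":
    # Approximate parameter count for the smallest GPT-2 model (an assumption)
    gpt2_small_params = 124_000_000
    gib = training_bytes(gpt2_small_params) / 2**30
    print(f"~{gib:.2f} GiB before activations")
```

Under these assumptions each parameter costs 16 bytes, which is why models with billions of parameters quickly exceed the memory of a single GPU.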

For example, the training of GPT-3, one of the largest language models at the time of its release, was made possible through the use of thousands of NVIDIA GPUs running CUDA-optimized code. This allowed the model to be trained on an unprecedented amount of data, leading to its impressive performance in natural language tasks.

import torch
import torch.optim as optim
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token by default; reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.train()

# Define training data and hyperparameters
train_data = [...]  # Your training data: a list of text strings
batch_size = 32
num_epochs = 10
learning_rate = 5e-5

# Define optimizer (the model computes the cross-entropy loss itself when labels are passed)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    for i in range(0, len(train_data), batch_size):
        # Tokenize a batch of texts and move the tensors to the device
        texts = train_data[i:i+batch_size]
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        inputs = inputs.to(device)
        # For causal language modeling the labels are the input IDs;
        # padding positions are set to -100 so the loss ignores them
        labels = inputs["input_ids"].masked_fill(inputs["attention_mask"] == 0, -100)
        # Forward pass
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')

In this example code snippet, we demonstrate fine-tuning a pre-trained GPT-2 language model using PyTorch on a CUDA-enabled GPU. The model is moved to the GPU (if available), and the training loop leverages the parallelism of the GPU to perform efficient forward and backward passes, accelerating the training process.

CUDA-Accelerated Libraries for Deep Learning

In addition to the CUDA platform itself, NVIDIA and the open-source community have developed a range of CUDA-accelerated libraries that enable efficient implementation of deep learning models, including LLMs. These libraries provide optimized implementations of common operations, such as matrix multiplications, convolutions, and activation functions, allowing developers to focus on the model architecture and training process rather than low-level optimization.

One such library is cuDNN (CUDA Deep Neural Network library), which provides highly tuned implementations of standard routines used in deep neural networks. By leveraging cuDNN, developers can significantly accelerate the training and inference of their models, achieving performance gains of up to several orders of magnitude compared to CPU-based implementations.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Identity shortcut unless the output shape changes
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))

    def forward(self, x):
        # Run the forward pass under automatic mixed precision
        with autocast():
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out += self.shortcut(x)
            out = F.relu(out)
        return out

In this code snippet, we define a residual block for a convolutional neural network (CNN) using PyTorch. The autocast context manager from PyTorch’s Automatic Mixed Precision (AMP) runs the forward pass in mixed precision, which can provide significant performance gains on CUDA-enabled GPUs while maintaining high accuracy. The convolution and batch normalization layers dispatch to cuDNN kernels on NVIDIA GPUs, ensuring efficient execution.
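
Why does mixed precision need extra care to maintain accuracy? Half-precision floats cannot represent very small values, so tiny gradients can silently become zero. AMP implementations typically counter this with gradient scaling (in PyTorch, the `GradScaler` class). The sketch below illustrates the idea with the standard library only; the gradient value and the scale factor of 65536 are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def scaled_roundtrip(grad, scale=65536.0):
    """Scale a gradient before the fp16 cast, then unscale it in full precision."""
    return to_fp16(grad * scale) / scale


if __name__ == "__main__":
    tiny_grad = 1e-8
    # Without scaling, the value is below the smallest fp16 number and is lost
    print("fp16 without scaling:", to_fp16(tiny_grad))
    # With scaling, the value survives the cast and is recovered afterwards
    print("fp16 with scaling:   ", scaled_roundtrip(tiny_grad))
```

The same principle applies during training: the loss is multiplied by a scale factor before the backward pass, and the gradients are divided by it before the optimizer step.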

Multi-GPU and Distributed Training for Scalability

As LLMs and deep learning models continue to grow in size and complexity, the computational requirements for training these models also increase. To address this challenge, researchers and developers have turned to multi-GPU and distributed training techniques, which allow them to leverage the combined processing power of multiple GPUs across multiple machines.

CUDA and associated libraries, such as NCCL (NVIDIA Collective Communications Library), provide efficient communication primitives for data transfer and synchronization across multiple GPUs, which makes distributed training possible at very large scale.

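import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training
dist.init_process_group(backend='nccl', init_method='...')
# Pick the GPU from the per-node local rank (set by launchers such as torchrun);
# dist.get_rank() returns the global rank, which is not a valid device index on multi-node jobs
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Create model and move to GPU (MyModel is your own nn.Module)
model = MyModel().cuda()
# Wrap model with DDP
model = DDP(model, device_ids=[local_rank])

# criterion, optimizer, train_loader, and num_epochs are assumed to be defined elsewhere;
# train_loader should use a DistributedSampler so each process sees a different shard
# Training loop (distributed)
for epoch in range(num_epochs):
    for data in train_loader:
        inputs, targets = data
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()  # DDP averages gradients across processes here
        optimizer.step()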

In this example, we demonstrate distributed training using PyTorch’s DistributedDataParallel (DDP) module. The model is wrapped in DDP, which automatically handles data parallelism, gradient synchronization, and communication across multiple GPUs using NCCL. This approach enables efficient scaling of the training process across multiple machines, allowing researchers and developers to train larger and more complex models in a reasonable amount of time.
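
The gradient synchronization that DDP performs can be illustrated without any GPU. The following is a toy simulation, not how NCCL is implemented: each simulated rank computes a gradient on its own shard of the data, and an all-reduce replaces every local gradient with the mean across ranks. The model (y = weight * x), the data, and the function names are assumptions for illustration.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for the model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Simulate an all-reduce: every rank ends up with the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    world_size = 2
    # Each rank receives an equally sized, disjoint shard of the data
    shards = [data[rank::world_size] for rank in range(world_size)]
    weight = 0.5
    synced = all_reduce_mean([local_gradient(weight, shard) for shard in shards])
    print("Synchronized gradient on each rank:", synced)
    print("Single-process gradient:          ", local_gradient(weight, data))
```

With equally sized shards, the averaged gradient matches the gradient computed over the full batch, which is why every replica can apply the same update and stay in sync.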

Deploying Deep Learning Models with CUDA

While GPUs and CUDA have primarily been used for training deep learning models, they are also crucial for efficient deployment and inference. As deep learning models become increasingly complex and resource-intensive, GPU acceleration is essential for achieving real-time performance in production environments.

NVIDIA’s TensorRT is a high-performance deep learning inference optimizer and runtime that provides low-latency and high-throughput inference on CUDA-enabled GPUs. TensorRT can optimize and accelerate models trained in frameworks like TensorFlow, PyTorch, and MXNet, enabling efficient deployment on various platforms, from embedded systems to data centers.

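import tensorrt as trt

# Path to a pre-trained model exported to ONNX
model_path = ...

# Create TensorRT builder, network definition, and ONNX parser
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# The ONNX parser expects an explicit-batch network (TensorRT 8.x)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse and optimize model
success = parser.parse_from_file(model_path)
if not success:
    raise RuntimeError("Failed to parse the ONNX model")
# build_cuda_engine is not available in recent TensorRT releases;
# build a serialized engine and deserialize it instead
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(serialized_engine)

# Run inference on GPU
context = engine.create_execution_context()
# allocate_buffers and set_input_data are helper functions you provide
inputs, outputs, bindings, stream = allocate_buffers(engine)
# Set input data and run inference
set_input_data(inputs, input_data)
context.execute_async_v2(bindings=bindings, stream_handle=stream.ptr)
# Process output
# ...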

In this example, we demonstrate the use of TensorRT for deploying a pre-trained deep learning model on a CUDA-enabled GPU. The model is first parsed and optimized by TensorRT, which generates a highly optimized inference engine tailored for the specific model and hardware. This engine can then be used to perform efficient inference on the GPU, leveraging CUDA for accelerated computation.
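
Claims about low latency and high throughput are easiest to evaluate by measuring them on your own model. The following is a minimal, framework-agnostic timing sketch; `benchmark` is a hypothetical helper, and it assumes that `fn` blocks until the result is ready (for asynchronous GPU calls, `fn` should synchronize the stream before returning).

```python
import statistics
import time


def benchmark(fn, warmup=5, iterations=50):
    """Time repeated calls to fn and summarize latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurements
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],
        "throughput_per_s": 1000.0 * len(latencies) / sum(latencies),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs your inference engine
    print(benchmark(lambda: sum(range(100_000))))
```

Running the same helper against a CPU baseline and the optimized engine gives a like-for-like comparison on your own hardware and input sizes.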

Conclusion

The combination of GPUs and CUDA has been instrumental in driving the advancements in large language models, computer vision, speech recognition, and various other domains of deep learning. By harnessing the parallel processing capabilities of GPUs and the optimized libraries provided by CUDA, researchers and developers can train and deploy increasingly complex models with high efficiency.

As the field of AI continues to evolve, the importance of GPUs and CUDA will only grow. With even more powerful hardware and software optimizations, we can expect to see further breakthroughs in the development and deployment of AI systems, pushing the boundaries of what is possible.