PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Due to their exceptional content creation capabilities, Generative Large Language Models are now at the forefront of the AI revolution, with ongoing efforts to enhance their generative abilities. However, despite rapid advancements, these models require substantial computational power and resources, largely because they consist of hundreds of billions of parameters. Moreover, to operate smoothly, generative AI models rely on thousands of GPUs, leading to significant operational costs. These high operational demands are a key reason why generative AI models are not yet effectively deployed on personal devices.

In this article, we will discuss PowerInfer, a high-speed LLM inference engine designed for standard computers powered by a single consumer-grade GPU. The PowerInfer framework seeks to exploit the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activations. This means that at any given time, a small subset of ‘hot’ neurons is consistently active across inputs, while the rest, termed ‘cold’ neurons, activate only in response to specific inputs. This approach enables the PowerInfer framework to reduce the computing power needed for generative AI to produce desired outputs.
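To make the hot/cold distinction concrete, the short sketch below uses synthetic activation counts (not data from PowerInfer) to show how, under a heavy-tailed, power-law-like distribution, a small fraction of neurons can account for the bulk of all activations.

```python
# Toy illustration of the hot/cold neuron split (synthetic data, not PowerInfer code).
import random

random.seed(0)
num_neurons = 10_000

# Synthetic activation counts drawn from a heavy-tailed (power-law-like) distribution:
# a handful of neurons fire very often, most fire rarely.
activation_counts = [int(1000 * random.paretovariate(1.5)) for _ in range(num_neurons)]

# Rank neurons by activation frequency and mark the most frequently activated
# ones -- enough to cover ~80% of all activations -- as "hot".
ranked = sorted(range(num_neurons), key=lambda i: activation_counts[i], reverse=True)
total = sum(activation_counts)
cumulative, hot = 0, set()
for idx in ranked:
    if cumulative >= 0.8 * total:
        break
    hot.add(idx)
    cumulative += activation_counts[idx]

print(f"{len(hot)} of {num_neurons} neurons ({100 * len(hot) / num_neurons:.1f}%) "
      f"account for 80% of activations")
```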

We will delve into the PowerInfer framework in detail, exploring its methodology, pipeline, and practical application results. Let’s begin.

PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU

Generative AI models, such as ChatGPT and DALL-E, are known for sophisticated content generation and natural language processing tasks. Due to their high computational requirements, these models are typically deployed in data centers with advanced GPUs. The need for such high computational power limits their deployment to data centers, highlighting the necessity of deploying large language models on more accessible local platforms like personal computers.

Increasing the accessibility of large language models could reduce inference and content generation costs, enhance data privacy, and allow for model customization. Furthermore, while data center deployments prioritize high throughput, local LLM deployments could focus on low latency due to smaller batch sizes.

However, deploying these models on local devices poses significant challenges due to their substantial memory requirements. Large language models, functioning as autoregressive transformers, generate text token-by-token, with each token requiring access to the entire model, comprising hundreds of billions of parameters. This necessitates numerous high-end GPUs for low-latency output generation. Additionally, local deployments typically process individual requests sequentially, limiting the potential for parallel processing.

To address the complex memory requirements of generative AI frameworks, existing solutions employ methods like model compression and offloading. Techniques such as distillation, pruning, and quantization reduce the model size, but even the compressed models remain too large for the consumer-grade GPUs found in personal computers. Model offloading, which partitions the model at the Transformer-layer level between the CPU and GPU, allows layers to be processed across CPU and GPU memories. However, this method is limited by the slow PCIe interconnect and the CPU's limited computational capabilities, leading to high inference latency.

The PowerInfer framework posits that the mismatch between LLM inference characteristics and hardware structure is the primary cause of memory issues in LLM inference. Ideally, frequently accessed data should reside in the high-bandwidth but limited-capacity GPU memory, while less frequently accessed data should sit in the low-bandwidth, high-capacity CPU memory. However, the large parameter volume of each LLM inference iteration makes the working set too large for a single GPU, resulting in inefficient exploitation of locality.

The inference process in large language models demonstrates high locality, with each iteration activating only a limited number of neurons. The PowerInfer framework exploits this locality by managing a small number of hot neurons on the GPU, while the CPU handles the cold neurons. It preselects and preloads hot neurons onto the GPU and identifies activated neurons during runtime. This approach minimizes costly PCIe data transfers, allowing the GPU and CPU to independently process their assigned neurons.

However, deploying LLMs on local devices in this way faces obstacles. Online predictors, which are crucial for identifying active neurons, consume considerable GPU memory. The PowerInfer framework uses an adaptive method to construct small predictors for layers with higher activation skewness and sparsity, maintaining accuracy while reducing size. Additionally, LLM frameworks require specialized sparse operators. The PowerInfer framework employs neuron-aware sparse operators that work directly on individual neurons, eliminating the need for specific sparse format conversions.

Lastly, optimally placing activated neurons between the CPU and GPU is challenging. The PowerInfer framework uses an offline stage to create a neuron placement policy, measuring each neuron's impact on LLM inference outcomes and framing the placement as an integer linear programming problem.

Architecture and Methodology

The following figure illustrates the architecture of the PowerInfer framework, whose pipeline consists of offline and online components.

Because locality properties vary among different large language models, the offline component profiles the activation sparsity of the LLM, allowing it to differentiate between hot and cold neurons. In the online phase, the inference engine loads both types of neurons into GPU and CPU memory, serving LLM requests at runtime with low latency.

Offline Phase: Policy Solver and LLM Profiler

In the offline phase, an LLM profiler component uses requests derived from a general dataset to collect activation data from the inference process. In the first step, it monitors the activation of neurons across all layers of the framework, and then uses a policy solver component to categorize the neurons as either hot or cold. The primary aim of the policy solver is to allocate the more frequently activated neurons to the GPU while allocating the remainder to the CPU. In the second stage, the policy solver component uses neuron impact metrics and hardware specifications to balance the workload between the two, using integer linear programming to maximize the GPU's impact metric for the neurons placed there.
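As a rough illustration of what this profiling step looks like, the sketch below counts how often each neuron in a ReLU-based MLP block fires over a batch of calibration inputs. The model, the random stand-in data, and the "non-zero output" criterion are placeholders for illustration, not PowerInfer's actual profiler.

```python
# Sketch of offline activation profiling (illustrative; assumes a ReLU-style MLP block).
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
activation_counts = torch.zeros(2048)

def count_active(module, inputs, output):
    # A neuron counts as "activated" when its post-ReLU output is non-zero.
    activation_counts.add_((output > 0).sum(dim=(0, 1)).float())

mlp[1].register_forward_hook(count_active)

# Run calibration requests derived from a general dataset (random stand-in here).
with torch.no_grad():
    for _ in range(100):
        mlp(torch.randn(1, 16, 512))   # (batch, tokens, hidden)

# Input to the policy solver: neurons ranked by how often they were activated.
hot_first = torch.argsort(activation_counts, descending=True)
print("most frequently activated neurons:", hot_first[:10].tolist())
```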

Online Phase: Neuron-Aware LLM Inference Engine

Once the offline stage completes successfully, the framework proceeds to the online phase. In the third step of the process, the online engine assigns hot and cold neurons to their respective processing units before processing user requests, as dictated by the output of the offline policy solver. During runtime, in step 4, the online engine manages GPU-CPU computation by creating CPU and GPU executors, which are threads running on the CPU side. The engine predicts which neurons will be activated and skips the non-activated ones. The activated neurons residing in GPU memory are then processed there, while the CPU executor computes the results for its own neurons and transfers them to the GPU for integration. The online engine is able to focus on individual neuron rows and columns within matrices because it uses sparse, neuron-aware operators on both the CPU and the GPU.
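The sketch below illustrates the idea behind step 4 using NumPy as a stand-in for both devices: only the neurons predicted to be active are computed, each by the executor that owns them, and the partial results are merged into one output. The hot/cold split and the predictor output here are synthetic, not PowerInfer's actual runtime behavior.

```python
# Sketch of one neuron-aware inference step (NumPy stand-in; both "devices" are simulated).
import numpy as np

rng = np.random.default_rng(0)
hidden, inter = 512, 2048
W = rng.standard_normal((inter, hidden)).astype(np.float32)  # row i = weights of neuron i
x = rng.standard_normal(hidden).astype(np.float32)

hot = np.arange(0, 256)            # neurons preloaded on the GPU (from the offline policy)
cold = np.arange(256, inter)       # neurons kept in CPU memory
predicted_active = rng.choice(inter, size=200, replace=False)  # output of the online predictor

# Each executor only touches the activated neurons it owns; everything else is skipped.
gpu_rows = np.intersect1d(hot, predicted_active)
cpu_rows = np.intersect1d(cold, predicted_active)

out = np.zeros(inter, dtype=np.float32)
out[gpu_rows] = W[gpu_rows] @ x     # computed by the GPU executor
out[cpu_rows] = W[cpu_rows] @ x     # computed by the CPU executor, then merged on the GPU

print(f"computed {len(gpu_rows) + len(cpu_rows)} of {inter} neurons")
```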

Adaptive Sparsity Predictors

The online inference engine in the PowerInfer framework reduces computational load by processing only the neurons it predicts will be activated. Traditionally, within each Transformer layer, a framework uses two separate predictors to predict the activation of neurons in the MLP and self-attention blocks, so that the inference computation is limited to the neurons predicted to be active. However, it is difficult to design effective predictors for local deployment, because limited resources make it hard to balance predictor size against prediction accuracy. Since these predictors are invoked frequently to predict active neurons, they need to be stored in the GPU to enable fast access. However, frameworks generally deploy a large number of predictors that occupy considerable memory, cutting into the memory needed to store the LLM's own parameters.

Furthermore, the size of a predictor is generally determined by two factors: the internal skewness and the sparsity of the LLM's layers.

To optimize for these factors, the PowerInfer framework uses an iterative training method, without a fixed predictor size, for each Transformer layer. In the first step of this method, a baseline predictor size is established on the basis of the layer's sparsity profile; the size is then adjusted iteratively, taking internal activation skewness into account, to maintain accuracy.
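The following sketch conveys the iterative sizing idea under simplified assumptions. The accuracy estimate is a stub standing in for actually training a small predictor on recorded activation traces and validating it; the thresholds and sizes are illustrative, not PowerInfer's.

```python
# Sketch of the iterative predictor-sizing idea (illustrative; the evaluation is a stub).
def estimate_accuracy(hidden_size, sparsity, skewness):
    # Placeholder: stands in for training a small predictor on recorded
    # activation traces and measuring its accuracy on held-out data.
    return min(0.99, 0.80 + 0.02 * hidden_size / 32 + 0.05 * skewness * sparsity)

def size_predictor(sparsity, skewness, target=0.95, max_hidden=1024):
    # Start from a baseline size derived from the layer's sparsity profile:
    # sparser layers activate fewer neurons and can use a smaller predictor.
    hidden = 32 if sparsity > 0.9 else 128
    while hidden <= max_hidden:
        if estimate_accuracy(hidden, sparsity, skewness) >= target:
            return hidden
        hidden *= 2   # grow the predictor and re-evaluate
    return max_hidden

print(size_predictor(sparsity=0.95, skewness=0.6))
print(size_predictor(sparsity=0.70, skewness=0.2))
```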

Neuron Placement and Management

As mentioned earlier, while the offline policy solver component determines the neuron placement policy, the online inference engine component loads the model into GPU and CPU memory according to the generated policy. For each layer, which may or may not have multiple weight matrices, the PowerInfer framework assigns each neuron either to the CPU or to the GPU on the basis of whether the neuron is hot-activated. Ensuring accurate computation of the segmented neurons in the determined sequence is essential for precise results. To this end, the PowerInfer framework generates two neuron tables, one located in GPU memory and one in CPU memory, with each table mapping individual neurons to their original positions in the matrix.
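Below is a minimal sketch of the neuron-table idea, assuming a toy weight matrix: each device stores only its own rows plus a table mapping every locally held neuron back to its original row, so partial results can be scattered back into the right positions. The data layout is illustrative, not PowerInfer's actual structures.

```python
# Sketch of per-device neuron tables (illustrative data layout).
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 4)).astype(np.float32)   # full weight matrix, 8 neurons

gpu_neurons = [0, 2, 5]               # hot neurons, as decided by the offline policy
cpu_neurons = [1, 3, 4, 6, 7]         # cold neurons

# Each device stores only its own rows, plus a table mapping local slot -> original row.
gpu_weights, gpu_table = W[gpu_neurons], np.array(gpu_neurons)
cpu_weights, cpu_table = W[cpu_neurons], np.array(cpu_neurons)

# Scatter partial results back to their original positions using the tables.
x = rng.standard_normal(4).astype(np.float32)
out = np.zeros(8, dtype=np.float32)
out[gpu_table] = gpu_weights @ x
out[cpu_table] = cpu_weights @ x

print(np.allclose(out, W @ x))        # matches the unsplit computation
```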

Neuron-Aware Operators

Given the activation sparsity observed in large language models, inactive neurons and their weights can be bypassed by matrix multiplication operations, creating a need for sparse operators. Instead of employing conventional sparse operators, which have several limitations, the PowerInfer framework employs neuron-aware operators that compute activated neurons and their weights directly on the GPU and CPU, without requiring conversion to specialized sparse formats at runtime. Neuron-aware operators differ from traditional sparse operators in that they focus on individual row and column vectors within a single matrix rather than on the entire matrix.
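The sketch below contrasts this with a dense computation, assuming a simple two-matrix feed-forward block: the activated neurons select rows of the up-projection and the matching columns of the down-projection directly, with no format conversion. The predictor is replaced here by the exact activation set, and the shapes are illustrative.

```python
# Sketch of a neuron-aware FFN: selected rows of the up-projection and the matching
# columns of the down-projection are used directly, with no sparse-format conversion.
import numpy as np

rng = np.random.default_rng(2)
hidden, inter = 512, 2048
W_up = rng.standard_normal((inter, hidden)).astype(np.float32)    # neuron i = row i
W_down = rng.standard_normal((hidden, inter)).astype(np.float32)  # neuron i = column i
x = rng.standard_normal(hidden).astype(np.float32)

active = np.flatnonzero(np.maximum(W_up @ x, 0.0))   # stand-in for the predictor's output

# Neuron-aware operators index the activated rows/columns of the dense weights directly.
h_active = np.maximum(W_up[active] @ x, 0.0)          # rows of the up-projection
y = W_down[:, active] @ h_active                      # matching columns of the down-projection

# Identical result to the dense computation, but only activated neurons were touched.
print(np.allclose(y, W_down @ np.maximum(W_up @ x, 0.0), atol=1e-4))
```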

Neuron Placement Policy

To exploit the computational capabilities of both CPUs and GPUs, the offline component in the PowerInfer framework generates a placement policy that guides the framework when allocating neurons to either the CPU or the GPU. The policy solver generates this policy and controls neuron placement within each layer, which determines the computational workload of each processing unit. When generating the placement policy, the policy solver component considers several factors, including the activation frequency of each neuron, the communication overhead, and the hardware characteristics of each processing unit, such as memory capacity and bandwidth.
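PowerInfer frames this placement as an integer linear program; the simplified sketch below conveys the same intuition with a greedy "impact per byte" heuristic under a GPU memory budget. It is a stand-in for the real policy solver, and the impact values, sizes, and budget are made up for illustration.

```python
# Simplified stand-in for the placement policy: a greedy knapsack by "impact per byte".
def place_neurons(impacts, sizes_bytes, gpu_budget_bytes):
    """impacts[i]: measured contribution of neuron i to inference outcomes (e.g. its
    activation frequency); sizes_bytes[i]: memory its weights occupy.
    Returns (gpu_set, cpu_set)."""
    order = sorted(range(len(impacts)),
                   key=lambda i: impacts[i] / sizes_bytes[i], reverse=True)
    gpu, cpu, used = set(), set(), 0
    for i in order:
        if used + sizes_bytes[i] <= gpu_budget_bytes:
            gpu.add(i)                # fits in GPU memory: place the hot neuron there
            used += sizes_bytes[i]
        else:
            cpu.add(i)                # otherwise it stays in CPU memory
    return gpu, cpu

impacts = [900, 850, 40, 30, 10, 5]   # hot neurons have much higher impact
sizes = [4096] * 6                    # bytes per neuron's weights (illustrative)
gpu, cpu = place_neurons(impacts, sizes, gpu_budget_bytes=3 * 4096)
print("GPU:", sorted(gpu), "CPU:", sorted(cpu))
```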

Results and Implementation

To demonstrate the generalization capabilities of the PowerInfer framework across devices with different hardware configurations, the experiments are conducted on two distinct personal computers: one equipped with an Intel i9-13900K processor, an NVIDIA RTX 4090 GPU, and 192 GB of host memory, and another with an Intel i7-12700K processor, an NVIDIA RTX 2080Ti GPU, and 64 GB of host memory.

The end-to-end performance of the PowerInfer framework is compared against llama.cpp with a batch size of 1 and default deployment settings. Prompts are sampled from the ChatGPT and Alpaca datasets, given the length variability observed in real-world dialogue inputs and outputs. The following figure demonstrates the generation speeds for different models.

As can be observed, the PowerInfer framework generates 8.32 tokens per second on average and reaches up to 16 tokens per second, thus outperforming the llama.cpp framework by a significant margin. Furthermore, as the number of output tokens increases, the advantage of the PowerInfer framework grows, since the generation phase accounts for a larger share of the overall inference time.

Furthermore, as can be observed in the above image, the PowerInfer framework also outperforms the llama.cpp framework on low-end PCs, with a peak generation rate of 7 tokens per second and an average generation speed of 5 tokens per second.

The above image demonstrates the distribution of neuron loads between the GPU and the CPU for the two frameworks. As can be seen, the PowerInfer framework significantly increases the GPU's share of the neuron load, from 20% to 70%.

The above image compares the performance of the two frameworks on two PCs with different specifications. As can be seen, the PowerInfer framework consistently delivers a higher output token generation speed than the llama.cpp framework.

Final Thoughts

In this article, we have talked about PowerInfer, a high-speed LLM inference engine for a standard computer powered by a single consumer-grade GPU. At its core, the PowerInfer framework exploits the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activations. PowerInfer is a fast inference system designed for large language models that uses adaptive predictors and neuron-aware operators to exploit neuron activation and computational sparsity.