Generative large language models (LLMs) are seeing increasingly wide use, and running them efficiently demands substantial computing power. PowerInfer was created to meet this need: it is an innovative GPU-CPU hybrid inference engine designed to improve the speed and efficiency of LLM inference on ordinary computers. PowerInfer plays to the strengths of both processors, placing cold-activated neurons on the CPU and preloading hot-activated neurons onto the GPU for fast access and computation. This design eases the performance bottleneck of running LLMs on devices with limited compute resources, giving users a more convenient and efficient experience.
Generative large language models are known for their outstanding performance across a variety of tasks, including complex natural language processing, creative writing, question answering, and code generation. The key observation behind PowerInfer is that neuron activations in an LLM are highly skewed: a small set of "hot" neurons is activated consistently across inputs, while the majority of "cold" neurons vary with the input. As a GPU-CPU hybrid inference engine, PowerInfer exploits this by preloading hot-activated neurons onto the GPU for immediate access while leaving cold-activated neurons to be computed on the CPU, which makes it practical to run LLMs on accessible local systems such as home PCs with consumer-grade GPUs. In evaluations, PowerInfer ran up to 11.69 times faster than the current llama.cpp system while maintaining model fidelity. In short, PowerInfer significantly improves LLM inference speed on desktop computers with limited GPU capability.
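The hot/cold placement idea can be sketched in a few lines of Python. This is only an illustration of the partitioning concept, not PowerInfer's actual implementation or API; the function name, the activation-count input, and the GPU capacity parameter are all assumptions made for the example.

```python
# Illustrative sketch of hot/cold neuron placement (not PowerInfer's real API).
# Given per-neuron activation statistics, the most frequently activated
# ("hot") neurons are assigned to the GPU, and the rest ("cold") to the CPU.

def partition_neurons(activation_counts, gpu_capacity):
    """activation_counts: list of (neuron_id, times_activated) pairs,
    e.g. collected from profiling runs; gpu_capacity: how many neurons
    fit in GPU memory. Returns (hot_ids_for_gpu, cold_ids_for_cpu)."""
    # Rank neurons by observed activation frequency, hottest first.
    ranked = sorted(activation_counts, key=lambda pair: pair[1], reverse=True)
    hot = [nid for nid, _ in ranked[:gpu_capacity]]   # preload onto GPU
    cold = [nid for nid, _ in ranked[gpu_capacity:]]  # compute on CPU
    return hot, cold

# A skewed activation profile: a few neurons fire on almost every input.
counts = [(0, 980), (1, 950), (2, 40), (3, 12), (4, 3)]
hot, cold = partition_neurons(counts, gpu_capacity=2)
print(hot)   # [0, 1]
print(cold)  # [2, 3, 4]
```

Because the activation distribution is so skewed, even a small GPU budget captures most of the activations, which is why the hybrid placement pays off.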
The emergence of PowerInfer marks a new milestone for running LLMs on ordinary computers. Its substantial performance gains, achieved without sacrificing model fidelity, give users a smoother and more convenient AI experience and point to broader possibilities for LLM applications in the future.