Hugging Face and NVIDIA have partnered to launch Inference-as-a-Service, built on NVIDIA's NIM technology, to significantly speed up the prototyping and deployment of AI models. The service was officially announced at the SIGGRAPH 2024 conference, marking a notable improvement in the efficiency of AI model deployment. Developers can easily access and deploy powerful open-source AI models, such as the Llama 3 and Mistral AI models, through the Hugging Face Hub, while NVIDIA's NIM microservices ensure these models run at optimal performance.
Recently, the open-source platform Hugging Face and NVIDIA announced this new service, Inference-as-a-Service, powered by NVIDIA's NIM technology. It lets developers prototype more quickly with the open-source AI models hosted on the Hugging Face Hub and deploy them efficiently.
The news was announced at the ongoing SIGGRAPH 2024 conference, which gathers a large number of experts in computer graphics and interactive technology. The collaboration between NVIDIA and Hugging Face, unveiled at this moment, brings new opportunities to developers: through the service, they can easily deploy powerful large language models (LLMs) such as the Llama 3 and Mistral AI models, with NVIDIA's NIM microservices optimizing these models.
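To make the workflow concrete, here is a minimal sketch of querying a Hub-hosted model with Hugging Face's InferenceClient. The model id shown is an assumption for illustration only (any chat model your account can access would work the same way) and is not taken from the announcement.

```python
# Minimal sketch: query a model hosted on the Hugging Face Hub.
# The model id below is an assumed example, not from the announcement.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct")

# Send an OpenAI-style chat request and print the generated reply.
response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarize what NVIDIA NIM is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```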
Specifically, when accessed as a NIM, a 70-billion-parameter Llama 3 model achieves up to five times higher throughput than an off-the-shelf deployment on an NVIDIA H100 Tensor Core GPU system. In addition, the new service supports "Train on DGX Cloud", which is currently available on Hugging Face.
NVIDIA's NIM is a set of AI microservices optimized for inference, covering both NVIDIA's AI foundation models and open-source community models. Accessed through standard APIs, it significantly improves token processing efficiency and strengthens the NVIDIA DGX Cloud infrastructure, improving the responsiveness and stability of AI applications.
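Because NIM microservices expose industry-standard, OpenAI-compatible APIs, existing client code can typically be pointed at a NIM endpoint with little change. The sketch below assumes NVIDIA's hosted API endpoint, an example model id, and an NVIDIA_API_KEY environment variable; none of these details come from the announcement itself.

```python
# Minimal sketch: call a NIM endpoint via its OpenAI-compatible API.
# The base_url, model id, and NVIDIA_API_KEY are assumptions for
# illustration, not details taken from the announcement.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted NIM endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",  # assumed model id
    messages=[{"role": "user", "content": "What does Inference-as-a-Service offer?"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```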
The NVIDIA DGX Cloud platform is purpose-built for generative AI, providing reliable, accelerated computing infrastructure that helps developers move from prototype to production without long-term commitments. The collaboration between Hugging Face and NVIDIA will further strengthen the developer community; Hugging Face also recently announced that it has reached profitability with a team of 220 people and launched the SmolLM series of small language models.
Highlights:
Hugging Face and NVIDIA launch Inference-as-a-Service, delivering up to five times higher token processing throughput for AI models.
The new service supports rapid deployment of powerful LLMs and streamlines the development process.
The NVIDIA DGX Cloud platform provides accelerated infrastructure for generative AI, simplifying the path from prototype to production for developers.
Through Inference-as-a-Service and the NVIDIA DGX Cloud platform, the collaboration between Hugging Face and NVIDIA gives AI developers an efficient, convenient environment for model deployment and training, significantly lowering the barrier to AI application development and accelerating the adoption of AI technology, which in turn promotes the vigorous growth of the AI industry.