Google has officially released Trillium, its sixth-generation TPU, and made it available to Google Cloud customers. Trillium is Google's most powerful TPU to date and was used to train Gemini 2.0, Google's most capable AI model. It delivers significantly better training performance, inference throughput, and energy efficiency at lower cost. This article takes an in-depth look at Trillium's performance improvements and key features, how it performs across different AI workloads, and practical applications at customers such as AI21 Labs.
Earlier this year, Google released Trillium, its sixth-generation and most powerful TPU to date. Today, Trillium is officially available to Google Cloud customers.
Google used Trillium TPUs to train Gemini 2.0, its most powerful AI model to date. Now enterprises and startups alike can take advantage of the same robust, efficient, and sustainable infrastructure.
The core of the AI supercomputer: Trillium TPU
Trillium TPU is a key component of Google Cloud's AI Hypercomputer. AI Hypercomputer is a groundbreaking supercomputer architecture that combines performance-optimized hardware, open software, leading ML frameworks, and flexible consumption models into an integrated system. With the general availability of Trillium, Google has also made key enhancements to the AI Hypercomputer's open software layer, including optimizations to the XLA compiler and to popular frameworks such as JAX, PyTorch, and TensorFlow, to achieve leading price/performance for AI training, tuning, and serving.
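As a rough illustration of that software layer (a minimal sketch, not code from the announcement), the snippet below shows how a JAX function is traced and compiled by XLA for whichever accelerator is attached; on a Cloud TPU VM, jax.devices() would list TPU chips. All shapes and names are illustrative.

```python
# Minimal sketch: JAX + XLA compile a function for the local backend
# (TPU on a Cloud TPU VM, otherwise CPU/GPU). Shapes and model are illustrative.
import jax
import jax.numpy as jnp

print(jax.devices())  # on a Trillium VM this would list TPU devices

@jax.jit  # XLA compiles the traced computation for the attached accelerator
def predict(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)

key = jax.random.PRNGKey(0)
params = (jax.random.normal(key, (128, 64)), jnp.zeros(64))
x = jax.random.normal(key, (32, 128))
print(predict(params, x).shape)  # (32, 64)
```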
In addition, features such as host offloading, which uses massive host DRAM to supplement high-bandwidth memory (HBM), provide higher levels of efficiency. AI Hypercomputer lets you extract maximum value from an unprecedented deployment of more than 100,000 Trillium chips per Jupiter network fabric, with 13 Petabits/second of bisection bandwidth and the ability to scale a single distributed training job to hundreds of thousands of accelerators.
Customers like AI21 Labs are already using Trillium to deliver meaningful AI solutions to their customers faster:
Barak Lenz, CTO of AI21 Labs, said: "At AI21, we are constantly working to improve the performance and efficiency of our Mamba and Jamba language models. As long-time users of TPU v4, we are impressed by the capabilities of Google Cloud's Trillium. The improvements in scale, speed, and cost efficiency are significant. We believe Trillium will play a vital role in accelerating the development of our next generation of sophisticated language models, allowing us to deliver more powerful and accessible AI solutions to our customers."
Trillium's performance is greatly improved, setting new records on many metrics
Compared to the previous generation, Trillium has made significant improvements in:
Training performance improved by more than 4 times
3x improvement in inference throughput
Energy efficiency increased by 67%
4.7x improvement in peak computing performance per chip
2x high-bandwidth memory (HBM) capacity
2x Inter-Chip Interconnect (ICI) bandwidth
100,000 Trillium chips in a single Jupiter network fabric
2.5x improvement in training performance per dollar and 1.4x improvement in inference performance per dollar
These enhancements enable Trillium to perform well across a variety of AI workloads, including:
Scaling AI training workloads
Train LLMs, including dense models and Mixture of Experts (MoE) models
Inference performance and collective scheduling
Embedding-intensive models
Provide training and inference cost-effectiveness
How does Trillium perform across different workloads?
Scaling AI training workloads
Training a large model like Gemini 2.0 requires enormous amounts of data and computation. Trillium's near-linear scaling lets these models train significantly faster by distributing the workload effectively and efficiently across multiple Trillium hosts, which are connected via high-speed inter-chip interconnects within 256-chip pods and via the state-of-the-art Jupiter data center network. This is achieved through TPU Multislice, full-stack technology for large-scale training, and is further optimized by Titanium, a dynamic, data-center-level offloading system that spans from host adapters to the network fabric.
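To make the distribution idea concrete, here is a minimal, hedged JAX sketch of a data-parallel training step sharded across whatever chips are visible to one process; pod-level interconnects, multislice, and the Jupiter network are handled by the runtime and are not represented here. All names and shapes are made up for the example.

```python
# Minimal data-parallel sharding sketch (illustrative; not Google's training code).
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())            # all chips visible to this process
mesh = Mesh(devices, axis_names=("data",))   # 1-D mesh: pure data parallelism

batch_sharding = NamedSharding(mesh, P("data"))  # split the batch dim across chips
replicated = NamedSharding(mesh, P())            # keep parameters on every chip

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

@jax.jit  # XLA partitions the step according to the input shardings
def train_step(w, x, y, lr=1e-3):
    loss, grads = jax.value_and_grad(loss_fn)(w, x, y)
    return w - lr * grads, loss

# Batch size must be divisible by the number of chips for this simple layout.
w = jax.device_put(jnp.zeros((512, 1)), replicated)
x = jax.device_put(jnp.ones((1024, 512)), batch_sharding)
y = jax.device_put(jnp.ones((1024, 1)), batch_sharding)
w, loss = train_step(w, x, y)
```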
When pre-training gpt3-175b, Trillium achieved 99% scaling efficiency in a deployment of 12 pods (3,072 chips) and 94% scaling efficiency across 24 pods (6,144 chips), even when running across a data center network.
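For readers unfamiliar with the metric, scaling efficiency compares measured throughput against perfect linear scaling from a smaller baseline; the numbers below are hypothetical and only illustrate the calculation, not Google's measurements.

```python
# Hypothetical worked example of scaling efficiency (not measured data).
base_chips, base_throughput = 256, 1.00      # throughput normalized to one pod
big_chips, big_throughput = 3072, 11.88      # assumed measurement at 12 pods

ideal = base_throughput * (big_chips / base_chips)   # 12.0 if scaling were linear
efficiency = big_throughput / ideal                  # 0.99 -> "99% scaling efficiency"
print(f"scaling efficiency = {efficiency:.0%}")
```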
Train LLMs, including dense models and Mix of Experts (MoE) models
LLMs like Gemini are inherently powerful and complex, with billions of parameters. Training such dense LLMs requires enormous computing power as well as co-designed software optimizations. Trillium is 4x faster than the previous-generation Cloud TPU v5e when training dense LLMs such as Llama-2-70b and gpt3-175b.
Beyond dense LLMs, an increasingly popular approach is to train LLMs with a Mixture of Experts (MoE) architecture, which combines multiple "expert" neural networks, each specializing in a different aspect of the AI task. Managing and coordinating these experts during training adds complexity compared to training a single monolithic model. Trillium is 3.8x faster than the previous-generation Cloud TPU v5e when training MoE models.
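To show what the extra coordination involves, here is a deliberately simplified MoE layer in JAX with top-1 routing; it illustrates the general technique, not Google's implementation, and every name and shape in it is an assumption for the example.

```python
# Toy Mixture-of-Experts layer with top-1 routing (illustrative only).
import jax
import jax.numpy as jnp

def moe_layer(params, x):
    """x: [tokens, d_model]; params = (router weights, per-expert weights)."""
    router_w, expert_w = params              # expert_w: [experts, d_model, d_model]
    logits = x @ router_w                    # [tokens, experts]
    expert_idx = jnp.argmax(logits, axis=-1)                 # pick one expert/token
    gate = jax.nn.softmax(logits, axis=-1)
    gate = jnp.take_along_axis(gate, expert_idx[:, None], axis=-1)   # [tokens, 1]
    # Run every expert, then select the routed output. Real systems instead
    # dispatch tokens to experts, which is where the coordination cost comes from.
    all_out = jnp.einsum("td,edf->tef", x, expert_w)         # [tokens, experts, d]
    routed = jnp.take_along_axis(all_out, expert_idx[:, None, None], axis=1)[:, 0]
    return gate * jax.nn.gelu(routed)

key = jax.random.PRNGKey(0)
d_model, n_experts, n_tokens = 64, 4, 8
params = (jax.random.normal(key, (d_model, n_experts)),
          jax.random.normal(key, (n_experts, d_model, d_model)))
x = jax.random.normal(key, (n_tokens, d_model))
print(moe_layer(params, x).shape)  # (8, 64)
```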
In addition, Trillium provides 3x the host dynamic random-access memory (DRAM) of Cloud TPU v5e. This makes it possible to offload some computation to the host, helping to maximize performance and goodput at scale. Trillium's host-offloading feature delivers more than a 50% improvement in Model FLOPs Utilization (MFU) when training the Llama-3.1-405B model.
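The basic idea behind host offloading can be sketched with ordinary JAX device placement: tensors that are not needed on-chip right now (for example, saved activations or optimizer state) are parked in host DRAM and moved back when required. This is only a hedged illustration of the concept; production offloading is managed by the compiler and runtime stack.

```python
# Conceptual host-offload sketch using plain device placement (illustrative only).
import jax
import jax.numpy as jnp

accelerator = jax.devices()[0]        # TPU on a Cloud TPU VM; CPU/GPU elsewhere
host = jax.devices("cpu")[0]          # host-side device backed by DRAM

activations = jnp.ones((4096, 4096))              # lives in accelerator memory (HBM)
parked = jax.device_put(activations, host)        # offload to host DRAM
restored = jax.device_put(parked, accelerator)    # bring back before it is needed
```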
Inference performance and collective scheduling
The growing importance of multi-step inference requires accelerators that can efficiently handle the increased computational demands. Trillium provides significant advancements for inference workloads, enabling faster and more efficient deployment of AI models. In fact, Trillium delivers our best TPU inference performance for image diffusion and dense LLMs. Our tests show that Stable Diffusion XL achieves more than 3x higher relative inference throughput (images per second) compared to Cloud TPU v5e, and Llama2-70B nearly 2x.
Trillium is our highest-performing TPU for both offline and server inference use cases. Compared to Cloud TPU v5e, Stable Diffusion XL's relative throughput (images per second) is 3.1x higher for offline inference and 2.9x higher for server inference.
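For context on how an images-per-second figure is typically obtained, the sketch below times a jit-compiled stand-in for an image model; the generate_images function and all sizes are assumptions for the example, and the numbers quoted above come from Google's benchmarks, not from this snippet.

```python
# Rough offline-throughput measurement sketch (illustrative stand-in model).
import time
import jax
import jax.numpy as jnp

@jax.jit
def generate_images(latents):
    return jnp.tanh(latents)           # placeholder for a real diffusion model

batch = jnp.ones((8, 64, 64, 4))
generate_images(batch).block_until_ready()   # warm-up so compile time is excluded

n_iters = 20
start = time.perf_counter()
for _ in range(n_iters):
    out = generate_images(batch)
out.block_until_ready()                      # wait for asynchronous dispatch
elapsed = time.perf_counter() - start
print(f"throughput ~ {n_iters * batch.shape[0] / elapsed:.1f} images/sec")
```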
In addition to better performance, Trillium introduces a new collection scheduling capability. This feature allows Google's scheduling systems to make intelligent job-scheduling decisions that improve the overall availability and efficiency of inference workloads when multiple replicas exist in a collection. It provides a way to manage multiple TPU slices running single-host or multi-host inference workloads, including through Google Kubernetes Engine (GKE). Grouping these slices into a collection makes it easy to adjust the number of replicas to match demand.
Embedding-intensive models
With its third-generation SparseCore, Trillium improves the performance of embedding-intensive models by 2x and of DLRM DCNv2 by 5x.
SparseCores are dataflow processors that provide a more adaptable architectural foundation for embedding-intensive workloads. Trillium's third-generation SparseCore excels at accelerating dynamic and data-dependent operations such as scatter-gather, sparse segment sums, and partitioning.
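As a small illustration of the kind of sparse, data-dependent work involved (expressed with standard JAX ops, not any SparseCore-specific API), the snippet below gathers embedding rows for a batch of feature ids and pools them per example with a segment sum; the table and ids are made up.

```python
# Embedding gather + sparse segment sum (illustrative, standard JAX ops only).
import jax
import jax.numpy as jnp

table = jax.random.normal(jax.random.PRNGKey(0), (1000, 16))  # embedding table
ids = jnp.array([3, 7, 7, 42, 9])        # feature ids looked up for a batch
segments = jnp.array([0, 0, 1, 1, 1])    # which example each id belongs to

gathered = table[ids]                                              # gather rows
pooled = jax.ops.segment_sum(gathered, segments, num_segments=2)   # [2, 16]
print(pooled.shape)
```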
Provide training and inference cost-effectiveness
In addition to the sheer performance and scale required to train some of the world's largest AI workloads, Trillium is designed to optimize performance per dollar. To date, Trillium has achieved 2.1x better performance per dollar than Cloud TPU v5e and 2.5x better than Cloud TPU v5p when training dense LLMs such as Llama2-70b and Llama3.1-405b.
Trillium also excels at processing large models in parallel cost-effectively. It is designed to let researchers and developers deliver powerful and efficient image models at a much lower cost than before. On SDXL, generating a thousand images on Trillium costs 27% less than on Cloud TPU v5e for offline inference and 22% less for server inference.
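Restated as simple arithmetic (with the v5e cost normalized to 1.0, since absolute prices are not given in the source):

```python
# Relative cost per thousand SDXL images, normalized to Cloud TPU v5e = 1.0.
v5e_cost_per_1k = 1.00
trillium_offline = v5e_cost_per_1k * (1 - 0.27)   # 27% lower for offline inference
trillium_server = v5e_cost_per_1k * (1 - 0.22)    # 22% lower for server inference
print(trillium_offline, trillium_server)          # 0.73 0.78
```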
Taking AI innovation to the next level
Trillium represents a major leap forward in Google Cloud AI infrastructure, delivering incredible performance, scalability and efficiency for a variety of AI workloads. With its ability to scale to hundreds of thousands of chips using world-class co-design software, Trillium enables you to achieve faster breakthroughs and deliver superior AI solutions. Additionally, Trillium's exceptional price/performance makes it a cost-effective choice for organizations looking to maximize the value of their AI investments. As the AI landscape continues to evolve, Trillium demonstrates Google Cloud's commitment to providing cutting-edge infrastructure to help enterprises unlock the full potential of AI.
Official introduction: https://cloud.google.com/blog/products/compute/trillium-tpu-is-ga
All in all, the arrival of Trillium TPU marks a significant step forward in cloud AI computing. Its strong performance, scalability, and cost efficiency will accelerate progress across the AI field and bring more powerful AI solutions to enterprises and research institutions.