Decoding speed has long been a key bottleneck limiting the deployment of Transformer models. Recently, researchers from KAIST, LG AI Research, and Google DeepMind jointly tackled this problem with a new architecture called Block Transformer, which boosts decoding speed by an impressive 10 to 20 times. The core of the breakthrough is a clever "slicing" of the Transformer's attention mechanism, which addresses the low GPU utilization of the standard Transformer and substantially reduces memory overhead.
The Transformer model is powerful, but its decoding efficiency has always been a headache. This time, however, the team delivered a pleasant surprise: their Block Transformer architecture speeds up decoding by a factor of 10 to 20.
How is this done? By "cutting up" the Transformer's attention mechanism. This overturns the vanilla Transformer's inefficient practice of reading the entire global KV cache every time a single token is generated.
The researchers analyzed the shortcomings of the vanilla Transformer during decoding: effective GPU utilization is below 1%, with the remaining 99% of the time spent on memory access. This is clearly wasteful, so they proposed Block Transformer, which decomposes attention into coarse block-level attention across blocks and fine-grained attention within each block, sending inference throughput soaring.
Specifically, Block Transformer works as follows: the input sequence is first cut into fixed-size blocks, and an Embedder converts each block into a single block embedding. The Block Decoder then attends over these block embeddings to capture global dependencies between blocks, while the Token Decoder handles local dependencies among the tokens inside a block and generates the output token sequence, as shown in the sketch below.
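To make the division of labor among the three components concrete, here is a minimal PyTorch sketch of the global-to-local pipeline. Everything below is an illustrative assumption rather than the authors' released code: the class name `BlockTransformerSketch`, the concatenate-and-project Embedder, the residual conditioning, and the use of stock `nn.TransformerEncoder` layers are stand-ins for the paper's actual design choices.

```python
import torch
import torch.nn as nn

class BlockTransformerSketch(nn.Module):
    """Illustrative global-to-local decoder: Embedder -> Block Decoder -> Token Decoder."""

    def __init__(self, vocab_size=32000, d_model=512, block_len=4, n_heads=8, n_layers=2):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Embedder: pools each block of `block_len` token embeddings into one block embedding
        # (a simple concatenate-and-project here; the paper explores other poolers).
        self.embedder = nn.Linear(block_len * d_model, d_model)
        # Block decoder: causal self-attention over block embeddings only (global, coarse).
        self.block_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Token decoder: causal self-attention restricted to the tokens of one block (local, fine).
        self.token_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        B, T = tokens.shape
        L = self.block_len
        n_blocks = T // L                        # assumes T is a multiple of block_len
        x = self.tok_emb(tokens)                 # (B, T, D)

        # 1) Embedder: one vector per block.
        block_emb = self.embedder(x.view(B, n_blocks, L * x.size(-1)))   # (B, n_blocks, D)

        # 2) Block decoder: attends over a sequence T / L times shorter than the input.
        blk_mask = nn.Transformer.generate_square_subsequent_mask(n_blocks)
        context = self.block_decoder(block_emb, mask=blk_mask)           # (B, n_blocks, D)

        # Shift by one block so block i is decoded only from blocks 0..i-1.
        context = torch.cat([torch.zeros_like(context[:, :1]), context[:, :-1]], dim=1)

        # 3) Token decoder: tokens attend only within their own block, conditioned on the
        #    block's global context vector (added as a residual here; conditioning may differ).
        tok_in = x.view(B * n_blocks, L, -1) + context.reshape(B * n_blocks, 1, -1)
        tok_mask = nn.Transformer.generate_square_subsequent_mask(L)
        h = self.token_decoder(tok_in, mask=tok_mask)                    # (B*n_blocks, L, D)
        return self.lm_head(h.reshape(B, T, -1))                         # (B, T, vocab)

model = BlockTransformerSketch()
logits = model(torch.randint(0, 32000, (2, 32)))   # 32 tokens = 8 blocks of 4
print(logits.shape)                                 # torch.Size([2, 32, 32000])
```

Because only the block decoder sees the full context, and it sees it at block granularity, the expensive global attention runs over a sequence that is block_len times shorter, while the token decoder's attention span never exceeds a single block.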
This method not only improves inference speed but also greatly reduces memory overhead. Some commenters noted that they had tried a similar idea before but could not get adequate model quality out of it; this work appears to make the idea practical and to sharply shrink the KV cache, as the rough comparison below suggests.
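To see where the KV-cache savings come from, here is a back-of-envelope comparison. All sizes below (sequence length, model width, layer counts, and the even split of layers between the two decoders) are hypothetical values chosen for illustration, not figures from the paper.

```python
# Hypothetical back-of-envelope KV-cache comparison (illustrative numbers only).
seq_len   = 8192   # tokens of generated context
block_len = 4      # tokens per block
d_model   = 4096   # model width
n_layers  = 32     # total decoder layers
bytes_per = 2      # fp16

# Vanilla Transformer: every layer caches K and V for every past token.
vanilla = seq_len * n_layers * 2 * d_model * bytes_per

# Block Transformer sketch: assume layers split evenly between the two decoders.
blk_layers, tok_layers = n_layers // 2, n_layers // 2
# Block decoder caches one K/V entry per *block*; token decoder only for the current block.
block_cache = (seq_len // block_len) * blk_layers * 2 * d_model * bytes_per
token_cache = block_len * tok_layers * 2 * d_model * bytes_per

print(f"vanilla KV cache          : {vanilla / 2**20:6.0f} MiB")
print(f"block transformer KV cache: {(block_cache + token_cache) / 2**20:6.0f} MiB")
```

Under these assumed numbers the cache shrinks by roughly 8x; the exact factor depends on the block length and on how layers are allocated between the block and token decoders.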
Moreover, Block Transformer's accuracy on multiple zero-shot tasks is comparable to, or even slightly higher than, that of a vanilla Transformer of the same size, showing that the efficiency gains do not come at the cost of quality.
The implications of this research don't stop there. It also reduces the model's training cost, cuts the quadratic memory-access overhead of global attention by a factor of 16, and raises GPU utilization from 1% to 44%.
Paper address: https://arxiv.org/abs/2406.02657
Block Transformer opens up new possibilities for applying Transformer models and points to a new direction for the future efficiency optimization of large language models. Its substantial gains in speed and efficiency are expected to drive the further development and adoption of AI technology.