In recent years, the cost of training large language models has remained stubbornly high, becoming a major constraint on AI development, and reducing that cost has become an industry-wide priority. Researchers at Harvard and Stanford have taken a different approach, starting from the numeric precision of model training to look for a more cost-effective training recipe. They found that lowering precision can substantially reduce the computation required and, in some cases, even improve model performance. The study offers new ideas for optimizing language model training and points to a direction for future AI development.
In artificial intelligence, bigger has generally meant more capable. In pursuit of ever more powerful language models, major technology companies have been stacking up parameters and training data, only to watch costs soar with them. Is there no more cost-effective way to train a language model?
Researchers from Harvard and Stanford recently published a paper arguing that training precision is a hidden key that unlocks the "cost code" of language model training.
What is model precision? Simply put, it is the number of bits used to represent numbers during computation. Deep learning models have traditionally been trained with 32-bit floating-point numbers (FP32), but as hardware has evolved, training with lower-precision types, such as 16-bit floating point (FP16/BF16) or 8-bit integers (INT8), has become practical.
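For a concrete sense of what this looks like in code, here is a minimal PyTorch sketch (a hypothetical toy model, not the paper's setup) running the same training step at full and reduced precision; true INT8 training requires specialized kernels beyond this snippet:

```python
import torch
import torch.nn as nn

# Toy model and batch; purely illustrative, not the paper's setup.
model = nn.Linear(512, 512)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, target = torch.randn(32, 512), torch.randn(32, 512)

# FP32 baseline: every tensor and every matmul uses 32-bit floats.
loss_fp32 = nn.functional.mse_loss(model(x), target)

# Lower precision: autocast runs the forward pass in 16-bit bfloat16
# while keeping the master weights in FP32. On supported hardware this
# roughly halves activation memory and speeds up the matmuls.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss_bf16 = nn.functional.mse_loss(model(x), target)

loss_bf16.backward()
opt.step()
```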
So what does lowering precision do to model performance? That is exactly the question this paper explores. Through extensive experiments, the researchers measured how the cost and performance of training and inference change at different precisions, and proposed a new set of "precision-aware" scaling laws.
They found that training at lower precision effectively reduces the model's "effective parameter count", which lowers the computation required for training. This means that under the same compute budget we can train larger models, or, at the same model scale, save substantial computing resources by using lower precision.
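How might one model this? The sketch below is a simplified illustration only: it plugs an assumed effective-parameter term, which decays as precision falls, into a Chinchilla-style loss. The saturating-exponential shape and the constant gamma are stand-ins; the paper fits its own functional form and constants.

```python
import math

def n_effective(n: float, bits: float, gamma: float = 4.0) -> float:
    """Illustrative effective parameter count: capacity saturates as
    precision rises and collapses as it falls. The shape and gamma are
    assumptions, not the paper's fitted values."""
    return n * (1.0 - math.exp(-bits / gamma))

def loss(n_eff: float, tokens: float) -> float:
    """Chinchilla-style loss with N replaced by N_eff. The constants are
    the published Chinchilla fits, used here only to make this runnable."""
    return 406.4 / n_eff**0.34 + 410.7 / tokens**0.28 + 1.69

n, d = 1e9, 20e9  # hypothetical 1B-parameter model, 20B training tokens
for bits in (16, 8, 4):
    n_eff = n_effective(n, bits)
    print(f"{bits:>2}-bit: N_eff = {n_eff:.3e}, loss ~ {loss(n_eff, d):.3f}")
```

At fixed model size and data, the printed loss creeps up as precision drops, because fewer of the raw parameters count as "effective".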
Even more surprisingly, the researchers found that in some cases lower-precision training can actually improve performance. For models that will undergo "post-training quantization", training at lower precision makes them more robust to the precision loss introduced by quantization, so they perform better at inference time.
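Post-training quantization takes a fully trained model and maps its weights to a lower-precision format. Here is a minimal sketch of symmetric per-tensor INT8 quantization, an illustration of the general technique rather than the paper's procedure:

```python
import torch

def quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Symmetric per-tensor INT8 quantization: map floats in
    [-max|w|, max|w|] onto the integer range [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.item()

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    """Recover approximate float weights for inference."""
    return q.to(torch.float32) * scale

w = torch.randn(512, 512)  # stand-in for trained weights
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().mean()
print(f"mean absolute quantization error: {err:.6f}")
```

The rounding error this introduces is exactly the degradation the paper finds low-precision-trained models tolerate better.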
So which precision should we choose for training? By analyzing their scaling laws, the researchers reached some interesting conclusions:
Traditional 16-bit training may not be the best choice. Their analysis suggests that 7- to 8-bit precision can be more cost-effective.
Pursuing ultra-low precision (such as 4-bit) training is unwise: at extremely low precision the model's effective parameter count drops sharply, and maintaining performance then requires a much larger model, driving compute costs back up (the back-of-the-envelope sketch after these conclusions illustrates both effects).
The optimal training precision can differ across models. For heavily "overtrained" models, trained on far more data than compute-optimal, such as the Llama-3 and Gemma-2 series, higher-precision training may be more cost-effective.
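The sketch below makes the 7-8-bit sweet spot and the 4-bit cliff concrete. It assumes a fixed hardware budget measured in parameter-bits (so halving precision doubles the raw parameter count) and extends the earlier illustrative penalty so that weights, activations, and the KV cache each contribute one factor. The product form and gamma are again assumptions for illustration, not the paper's fitted values.

```python
import math

def eff_factor(bits: float, gamma: float = 4.0, k: int = 3) -> float:
    """Assumed capacity penalty: one (1 - e^(-bits/gamma)) factor per
    quantized component (weights, activations, KV cache). The constants
    and the product form are illustrative stand-ins for the paper's fit."""
    return (1.0 - math.exp(-bits / gamma)) ** k

BUDGET = 16e9  # fixed budget in parameter-bits (e.g., 1B params at 16-bit)

for bits in (16, 12, 8, 7, 6, 4, 3):
    n = BUDGET / bits             # lower precision buys more raw parameters...
    n_eff = n * eff_factor(bits)  # ...but each parameter is worth less
    print(f"{bits:>2}-bit: raw N = {n:.2e}, effective N = {n_eff:.2e}")
```

Under these assumed constants, effective capacity peaks around 7-8 bits, while at 3-4 bits the penalty overwhelms the extra raw parameters, echoing the first two conclusions above.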
This study offers a fresh perspective for understanding and optimizing language model training. It tells us that the choice of precision is not fixed; it must be traded off against model size, training data volume, and the intended application.
Of course, the study has limitations. The models used are relatively small, so the results may not transfer directly to much larger models. Moreover, the researchers evaluated only the models' loss, not their performance on downstream tasks.
Nevertheless, the study is significant: it maps the intricate relationship between precision, performance, and training cost, and offers valuable guidance for designing stronger, more economical language models in the future.
Paper: https://arxiv.org/pdf/2411.04330
In short, this study provides new ideas and methods for reducing the training cost of large language models and serves as a useful reference for the field. Despite its limitations, the "precision-aware" scaling laws it proposes, and its close examination of how precision trades off against cost and performance, carry real theoretical and practical weight.