Apple recently released a research result that significantly improves the efficiency of large language models on memory-constrained devices. The approach stores model parameters in flash memory and loads them into DRAM on demand during inference, easing the memory bottleneck, and a set of optimization strategies delivers a large jump in inference speed. This breakthrough paves the way for running large language models in resource-constrained environments such as mobile devices and embedded systems, and has real practical significance.
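To make the basic mechanism more concrete, here is a minimal Python sketch of the idea (not Apple's actual implementation; the file name, matrix sizes, and the `load_rows_on_demand` helper are invented for illustration): a weight matrix lives on flash as a memory-mapped file, and only the rows a token actually needs are copied into DRAM.

```python
import numpy as np

D_FF, D_MODEL = 4096, 1024  # small sizes so the demo runs anywhere

# Stand-in for a model checkpoint stored on flash.
np.random.rand(D_FF, D_MODEL).astype(np.float16).tofile("ffn_up_proj.bin")

# np.memmap keeps the data on disk; nothing is read until it is indexed.
flash_weights = np.memmap("ffn_up_proj.bin", dtype=np.float16,
                          mode="r", shape=(D_FF, D_MODEL))

def load_rows_on_demand(row_ids):
    """Copy only the requested rows from flash into a DRAM-resident array."""
    return np.array(flash_weights[row_ids])  # np.array forces a real copy

# Pretend a sparsity predictor has flagged these neurons as active.
active_rows = np.array([3, 17, 512, 4000])
dram_block = load_rows_on_demand(active_rows)
print(dram_block.shape)  # (4, 1024): only a sliver of the matrix is in DRAM
```

The full matrix never has to fit in memory; only the slice needed for the current token is materialized, which is what allows the model to exceed the available DRAM.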
Apple's latest research shows that, on devices with limited memory, storing model parameters in flash memory and loading them into DRAM on demand during inference makes it possible to run models up to twice the size of the available DRAM. The method builds an inference cost model to minimize the volume of data transferred from flash, and introduces a windowing strategy together with row-column bundling. Compared with naive loading, inference speed improves by 4-5x on CPU and 20-25x on GPU. Combined with sparsity awareness, context-adaptive loading, and hardware-oriented design, this opens up new possibilities for running large language model inference on memory-limited devices. The full paper is available [here](https://arxiv.org/pdf/2312.11514.pdf).

This result is not just a significant speedup; more importantly, it makes large language models feasible on a far wider range of devices, suggesting that AI technology will become more widespread and accessible. Apple's work points the AI field in a promising new direction.
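As a rough illustration of the windowing and row-column bundling ideas, the sketch below (hypothetical throughout: sizes, file names, and the `load_for_token` helper are invented, and the sparsity predictor is replaced by simulated active sets) stores each neuron's up-projection column and down-projection row as one contiguous record so a single flash read fetches both, and only loads neurons that are not already cached for the recent token window.

```python
import numpy as np

D_MODEL, D_FF, WINDOW = 256, 1024, 5

# Bundled layout on "flash": one record per neuron, holding that neuron's
# up-projection column and down-projection row back to back.
bundles = np.random.rand(D_FF, 2 * D_MODEL).astype(np.float16)
bundles.tofile("ffn_bundles.bin")
flash = np.memmap("ffn_bundles.bin", dtype=np.float16, mode="r",
                  shape=(D_FF, 2 * D_MODEL))

dram_cache = {}     # neuron id -> bundle already resident in DRAM
recent_active = []  # active neuron sets for the last WINDOW tokens

def load_for_token(predicted_active):
    """Windowing: fetch only uncached neurons, then evict ones outside the window."""
    missing = [n for n in predicted_active if n not in dram_cache]
    for n in missing:                       # one bundled read per missing neuron
        dram_cache[n] = np.array(flash[n])  # copy the record into DRAM
    recent_active.append(set(predicted_active))
    if len(recent_active) > WINDOW:
        recent_active.pop(0)
        still_needed = set().union(*recent_active)
        for n in list(dram_cache):          # drop neurons no longer in the window
            if n not in still_needed:
                del dram_cache[n]
    return missing

# Simulate decoding steps where consecutive tokens reuse most active neurons.
rng = np.random.default_rng(0)
active = set(rng.choice(D_FF, size=64, replace=False).tolist())
for step in range(8):
    fetched = load_for_token(active)
    print(f"step {step}: loaded {len(fetched)} new bundles, "
          f"{len(dram_cache)} resident in DRAM")
    # Next token: keep most of the current set, swap in a few new neurons.
    keep = set(rng.choice(sorted(active), size=48, replace=False).tolist())
    new = set(rng.choice(D_FF, size=16, replace=False).tolist())
    active = keep | new
```

Because consecutive tokens tend to activate largely overlapping neuron sets, most steps only bring in a handful of new bundles, which is where the reduction in flash traffic comes from; the bundled layout additionally turns two small reads per neuron into one larger sequential read, which flash handles far more efficiently.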