This project is still under development. Some features may not be implemented yet, and documentation may be incomplete.
A tiny yet powerful LLM inference system tailored for research purposes.
vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
There are so many open source frameworks for LLM serving, including HuggingFace Transformers, vLLM, LightLLM, DistServe and DeepSpeed-MII. Why SwiftLLM?
The reason is that those frameworks are tailored for production rather than research. They are equipped with numerous features, such as support for 100+ models, various hardware backends, LoRA, quantization, multimodal inputs, prefix caching, beam search, and so on. While this makes them all-in-one solutions for production, their codebases are too big and complex to understand and modify (for example, vLLM has 100k+ lines of code), which makes them hard to use for research. Their historical burden is also a problem.
SwiftLLM is designed to be a tiny yet powerful LLM inference system tailored for research purposes. "Tiny" means that it only keeps features that are essential for research, "powerful" means that it makes no compromise on performance, and finally "swift" means that it is easy to understand and modify. While supporting basic features (see the list below) and achieving performance equivalent to vLLM, the codebase of SwiftLLM is less than 2k lines of code (~2% of vLLM), written in Python and OpenAI Triton (a DSL for writing CUDA kernels). This makes it easy to read, modify, debug, test, and extend, and easy to integrate with your novel and brilliant research ideas.
Currently, SwiftLLM supports the following features:
And we plan to add support for the following features in the future:
To keep the codebase tiny, we will not support the following features. If you want to use them in your research project, you may need to implement them by yourself:
Remember that SwiftLLM is NOT an all-in-one solution for production. It's best to think of it as a "foundation" for your research project, and you may need to implement some features yourself. We encourage you, my dear researcher, to read the code, understand it, modify it, and extend it to fit your research needs.
SwiftLLM's architecture can be divided into two major parts: the control plane and the data plane.
Briefly speaking, the control plane decides "what to compute" or "how to schedule", while the data plane decides "how to compute" or "how to implement" and performs the concrete computation. They work in a master-worker manner: the control plane acts like a master, who performs the high-level scheduling and coordination and sends jobs to the data plane, which acts like a worker, who performs the low-level computation.
The code for the control plane resides in the swiftllm/server directory, including components like Engine, Scheduler, the API server, and TokenizationEngine. The code for the data plane resides in the swiftllm/worker directory, including descriptions of the computation graph (in swiftllm/worker/model.py), implementation of layers in the model (in swiftllm/layers), and the OpenAI Triton kernels (in swiftllm/kernels; you can think of "kernels" as functions executed on the GPU).
Let's take the toy API server (located in swiftllm/server/api_server.py) as an example:
First, it uses an EngineConfig to create an Engine.
Next, the engine is initialized via Engine.initialize, where it creates the Scheduler, the TokenizationEngine, and a set of workers (currently only one, since Tensor Parallelism is not supported). Then it commands the worker to execute profile_num_blocks to calculate the number of GPU blocks, after which the engine commands all workers to allocate their KV cache and KV swap.
Finally, the event loops are started via Engine.start_all_event_loops. In each step of the loop, the engine queries the scheduler for the next batch of requests to compute, commands the worker to perform swap in/out, then sends the batch to the worker to compute.
Currently the control plane (Engine) and the data plane (LlamaModel) reside on the same node. After Tensor Parallelism / Pipeline Parallelism is implemented, the data plane may be distributed across multiple nodes.
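The lifecycle above can be illustrated with a toy model of the two planes. This is only a minimal sketch, not SwiftLLM's actual code: the classes ToyWorker, ToyScheduler, ToyEngine, and Request are hypothetical stand-ins, the block-count rule (memory budget divided by a fixed block size) is an assumption made for illustration, and swap in/out is left out entirely.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    max_new_tokens: int
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


class ToyWorker:
    """Data plane stand-in: decides *how* to compute (here it only counts tokens)."""

    def __init__(self, memory_bytes: int, bytes_per_block: int):
        self.memory_bytes = memory_bytes
        self.bytes_per_block = bytes_per_block
        self.num_blocks = 0

    def profile_num_blocks(self) -> int:
        # Assumed rule for illustration: the whole budget is split into fixed-size blocks.
        return self.memory_bytes // self.bytes_per_block

    def init_kv_cache(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks

    def forward(self, batch: list) -> None:
        # One forward step yields exactly one new token per request in the batch.
        for req in batch:
            req.generated += 1


class ToyScheduler:
    """Control plane policy: decides *what* to compute next (first come, first served)."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def next_batch(self) -> list:
        # Drop finished requests, then admit waiting ones until the batch is full.
        self.running = [r for r in self.running if not r.finished]
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        return list(self.running)


class ToyEngine:
    """Control plane driver: initializes the worker, then runs the step loop."""

    def __init__(self, worker: ToyWorker, scheduler: ToyScheduler):
        self.worker = worker
        self.scheduler = scheduler

    def initialize(self) -> None:
        self.worker.init_kv_cache(self.worker.profile_num_blocks())

    def step(self) -> int:
        batch = self.scheduler.next_batch()
        if batch:
            self.worker.forward(batch)
        return len(batch)

    def run_until_idle(self) -> int:
        steps = 0
        while self.step() > 0:
            steps += 1
        return steps


if __name__ == "__main__":
    engine = ToyEngine(ToyWorker(1000, 64), ToyScheduler(max_batch_size=2))
    engine.initialize()
    for i, n in enumerate([2, 1, 1]):
        engine.scheduler.add(Request(i, n))
    print("blocks:", engine.worker.num_blocks, "steps:", engine.run_until_idle())
```

The point of the split is that a new scheduling idea only needs a different next_batch policy, while the worker stays untouched.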
We offer two ways to use SwiftLLM: using both the control plane and the data plane, or using only the data plane.
If your idea is simple or elegant enough to be seamlessly integrated into the existing control plane, you may use both the control plane and the data plane. Otherwise, if you would like to implement a splendid idea that does not fit the existing control plane, you may use only the data plane and implement a new control plane yourself.
First let's set up the environment:
Install packaging via pip install packaging
And then comes the installation:
Clone the repo via git clone https://github.com/interestingLSY/swiftLLM.git
cd into the repo (cd swiftLLM) and install other dependencies via pip install -r requirements.txt.
Run pip install -e . to install SwiftLLM into your environment.
Run pip install -e csrc to install the package under csrc.
Here are some examples:
Both .bin format and .safetensors format are supported. Assume your model weight is stored at /data/to/weight/.
For an offline example, you can try python3 examples/offline.py --model-path /data/to/weight. This example utilizes the data plane only. If you plan to use SwiftLLM without the control plane, this is a good starting point.
For an example that uses the Engine, you can try python3 examples/online.py --model-path /data/to/weight. This is a great example if you plan to use both the control plane and the data plane.
For a more complete example, see swiftllm/server/api_server.py. It launches an API server and provides a vLLM-like interface for online serving.
Despite being tiny (tiny ones can be adorable too!), SwiftLLM makes no compromise on performance. We have evaluated SwiftLLM in several scenarios and found that it can achieve performance equivalent to, or even better than, vLLM.
The first scenario is "a single forward operation", where we feed the model a batch of inputs and let it generate one output token (equivalent to one "forward" operation). This is the basic operation of LLM inference (both online and offline), so its performance is crucial.
Here we use the LLaMA-3 7B model on an NVIDIA A100 80G PCIE / RTX 4090 GPU under FP16 precision. The results are shown below (lower is better):
It can be seen that SwiftLLM achieves performance equivalent to (or even better than) vLLM under the same settings.
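For readers who want to reproduce this kind of measurement, a latency harness might look like the sketch below. It is not the benchmark script used for the numbers above; time_forward and its warmup and iters parameters are hypothetical, and step stands for any callable that runs one forward operation.

```python
import statistics
import time
from typing import Callable


def time_forward(step: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Return the median wall-clock latency of `step` in milliseconds."""
    # Warm-up runs are executed but not measured (caches, lazy compilation, etc.).
    for _ in range(warmup):
        step()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a real single-forward call.
    print(time_forward(lambda: sum(range(100_000))))
```

When timing GPU work, the callable should wait for the device to finish before returning (in PyTorch, for example, by calling torch.cuda.synchronize()), otherwise only the kernel launch time is measured.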
The second scenario is "online serving", where we start an API server, sample prompts from a real-world dataset, and let the model generate completions. This reflects how LLMs are used in real-world applications such as chatbots and code completion.
Here we sample prompts from the ShareGPT dataset and use a Poisson process with different rates (lambdas) to simulate different request arrival rates. The results are shown below (lower is better):
It can be seen that on A100 80G PCIE, SwiftLLM achieves performance equivalent to vLLM, while on RTX 4090, SwiftLLM significantly outperforms vLLM (mainly because our control plane has lower overhead).
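The Poisson arrival pattern used in the online serving scenario can be sketched as follows. This is an illustration of the general technique, not the project's benchmark code: poisson_arrival_times and its parameters are hypothetical names. It relies on the fact that the gaps between consecutive arrivals of a Poisson process with rate lambda are independent exponential random variables with mean 1/lambda.

```python
import random


def poisson_arrival_times(rate: float, num_requests: int, seed: int = 0) -> list:
    """Return arrival timestamps (in seconds) for a Poisson process with `rate` requests/s."""
    rng = random.Random(seed)  # seeded so that a benchmark run is repeatable
    now = 0.0
    times = []
    for _ in range(num_requests):
        # Exponentially distributed gap with mean 1 / rate.
        now += rng.expovariate(rate)
        times.append(now)
    return times


if __name__ == "__main__":
    for t in poisson_arrival_times(rate=2.0, num_requests=5):
        print(f"send request at t = {t:.3f}s")
```

A benchmark client would then sleep until each timestamp and send one sampled prompt, with a higher rate producing a heavier load.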