YaLM 100B is a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
The model leverages 100 billion parameters. It took 65 days to train the model on a cluster of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources in both English and Russian.
Training details and best practices for acceleration and stabilization can be found in articles on Medium (English) and Habr (Russian).
We used DeepSpeed to train the model and drew inspiration from the Megatron-LM example. However, the code in this repo is not the same code that was used to train the model. Rather, it is the stock example from the DeepSpeed repo with the minimal changes needed to run inference with our model.
Make sure to have 200GB of free disk space before downloading weights. The model (code is based on microsoft/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3) is supposed to run on multiple GPUs with tensor parallelism. It was tested on 4 (A100 80g) and 8 (V100 32g) GPUs, but should work with other configurations that have ≈200GB of GPU memory in total and a GPU count that divides the weight dimensions evenly (e.g. 16, 64, 128).
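The resource requirements above can be checked before downloading anything. The following is a minimal sketch and not part of the repository: the helper names has_free_disk and total_gpu_memory_ok are hypothetical, and the thresholds are simply the ≈200GB figures from the text. It does not verify that the GPU count divides the weight dimensions.

```python
import shutil

REQUIRED_GB = 200  # from the text: ~200GB of disk and ~200GB of total GPU memory


def has_free_disk(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


def total_gpu_memory_ok(num_gpus: int, mem_per_gpu_gb: float,
                        required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the combined GPU memory reaches the required total."""
    return num_gpus * mem_per_gpu_gb >= required_gb


if __name__ == "__main__":
    print("disk ok:", has_free_disk("."))
    # The two configurations the text reports as tested
    print("4 x A100 80GB ok:", total_gpu_memory_ok(4, 80))
    print("8 x V100 32GB ok:", total_gpu_memory_ok(8, 32))
```

Both tested configurations pass the memory check (320GB and 256GB in total), while, for example, 4 GPUs with 32GB each would not.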
Run bash download/download.sh to download model weights and vocabulary.
Weights will be downloaded to ./yalm100b_checkpoint/weights/, and vocabulary will be downloaded to ./yalm100b_checkpoint/vocab/.
The Docker image can be pulled with docker/pull.sh. It is compatible with A100 and V100.
The image can also be built with docker/build.sh (which will just build the Docker image from docker/Dockerfile).
To run the container, use docker/run.sh (volumes, name and other parameters can be changed).
You can start with the following scripts:
examples/generate_interactive.sh: interactive generation from the command line, the simplest way to try the model.
examples/generate_conditional_sampling.sh: conditional generation with a sampling strategy. Top-p is used by default; feel free to change the temperature or use top-k. Input is jsonlines (example: examples/example_cond_input.json); output will be the same jsonlines with a generated text field added to each line.
examples/generate_conditional_greedy.sh: same as the previous one, but generation is greedy. Suitable for solving problems with few-shot prompting.
examples/generate_unconditional.sh: unconditional generation. No input is used; output will be jsonlines.
The model is published under the Apache 2.0 license, which permits both research and commercial use. Megatron-LM is licensed under the Megatron-LM license.
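The conditional generation scripts described above take a jsonlines file as input, that is, one JSON object per line. A minimal sketch of writing and reading such a file in Python follows. The field name "prefix" and the file name my_cond_input.json are assumptions made for illustration only; check examples/example_cond_input.json for the actual schema before relying on it.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (jsonlines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Russian text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a jsonlines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # "prefix" is a hypothetical field name, not confirmed by this README
    prompts = [{"prefix": "Once upon a time"}, {"prefix": "Привет, мир"}]
    out = Path("my_cond_input.json")  # hypothetical file name
    write_jsonl(out, prompts)
    print(read_jsonl(out))
```

The same read_jsonl helper can be used to load the output of the generation scripts, since the text states that the output is the same jsonlines with a generated text field added.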
The dataset used for training YaLM-100B consists of the following parts (rough percentages are measured in tokens seen by the model):
25% The Pile: an open English dataset by the EleutherAI team
75% Texts in Russian collected by our team (percentages of the whole dataset are given)
49% Russian web pages from the Yandex Search index, filtered from ~100 TB to ~1 TB using a set of heuristics
12% News from various sources from Yandex Search index
10% Books from the dataset used in the Russian Distributional Thesaurus
3% Misc texts from the Taiga Dataset
1.5% Dialogues from social media preprocessed in a manner similar to how Reddit is processed in The Pile
0.5% Russian portion of Wikipedia
Some subsets were traversed up to 3 times during training.
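To give a sense of scale, the rough percentages above can be converted into approximate token counts. This is only a back-of-the-envelope sketch: it assumes the 300B-token total reported in this README applies to the mix as a whole, and the dictionary keys are shortened labels chosen here. Because the percentages are rough, the shares do not sum to exactly 100%.

```python
TOTAL_TOKENS = 300e9  # tokens consumed during training, as stated in this README

# Rough shares (percent of tokens seen by the model), as listed above
SHARES = {
    "The Pile (English)": 25,
    "Russian web pages": 49,
    "News": 12,
    "Books": 10,
    "Taiga misc texts": 3,
    "Social media dialogues": 1.5,
    "Russian Wikipedia": 0.5,
}


def approx_tokens(shares, total=TOTAL_TOKENS):
    """Map each subset to its approximate token count."""
    return {name: total * pct / 100 for name, pct in shares.items()}


if __name__ == "__main__":
    for name, tokens in approx_tokens(SHARES).items():
        print(f"{name:<24} ~{tokens / 1e9:.1f}B tokens")
    # The rough percentages add up to slightly more than 100
    print("sum of shares:", sum(SHARES.values()))
```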
The model was trained on a cluster of 800 A100 GPUs for ~65 days. In that time it consumed 300B tokens. The TensorBoard with the LR and ramp-up schedule, training metrics, and our "thermometers" is available on the HF page.
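The figures above imply an average training throughput, which can be worked out directly. The sketch below treats ~65 days and 800 GPUs as exact and ignores any downtime or restarts, so the results are rough averages rather than measured numbers.

```python
TOKENS = 300e9   # tokens consumed, from the text
DAYS = 65        # approximate training time, from the text
NUM_GPUS = 800   # A100 GPUs in the cluster, from the text

SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_second(tokens=TOKENS, days=DAYS):
    """Average cluster-wide throughput in tokens per second."""
    return tokens / (days * SECONDS_PER_DAY)


def tokens_per_second_per_gpu(tokens=TOKENS, days=DAYS, num_gpus=NUM_GPUS):
    """Average throughput per GPU in tokens per second."""
    return tokens_per_second(tokens, days) / num_gpus


if __name__ == "__main__":
    print(f"cluster: ~{tokens_per_second():,.0f} tokens/s")
    print(f"per GPU: ~{tokens_per_second_per_gpu():.1f} tokens/s")
```

Under these assumptions the cluster averaged on the order of fifty thousand tokens per second, or a few dozen tokens per second per GPU.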