This repository contains the code and extended results for the paper *Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models*.

The most interesting thing in this repository is probably the detailed reports and summary table found in `results/`. For each model, there is a 'full' and a 'mini' report. The 'mini' version can always be opened on GitHub, but the full version may require downloading and viewing locally due to file size limitations.

In these reports:

- `▁` is a space (but not `_`)
- `¿entry?` represents tokens with a vocabulary entry which was not encoded as expected.

To set up the environment (the project uses Poetry):

```sh
poetry shell    # make/activate your virtual environment
poetry install  # only the first time or on updates
```

For some newer models you may need to install a newer `transformers` version using `pip install git+https://github.com/huggingface/transformers.git`.

See `run_verification.sh` for some example commands for running new models. The script itself is mainly a reference for reproducibility; running it directly is not recommended.

- For models with tied embeddings, or for nicer visualizations and results, you will need to hard-code some unused token ids in `magikarp/unused_tokens.py`. You can start with `[0]`, or use the tokenizer vocabulary to pick some (see the sketches after this list). A typical workflow is to run the `magikarp/fishing.py` script and kill it when it starts verifying, then look at `results/verifications/yourmodel.jsonl`, which allows you to look at the vocabulary and update suitable tokens.
- `generate_results.py` generates plots and markdown reports. This happens automatically after verification, but to regenerate you can run `python generate_results.py [your_model_id]` and then look in `results`.
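
As a rough starting point for picking unused token ids from the vocabulary, something like the following sketch can surface reserved-looking entries. The model id and the name-based heuristic here are illustrative assumptions, not part of this repository's code:

```python
# Sketch: surface candidate unused token ids from a tokenizer vocabulary.
# The model id and the "<...unused...>" naming heuristic are assumptions
# for illustration; adapt both to the model you are analyzing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")  # hypothetical id

candidates = [
    (token_id, token)
    for token, token_id in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])
    # Reserved/placeholder entries often look like <unused_12> and are
    # plausible unused-token candidates.
    if token.startswith("<") and "unused" in token.lower()
]

for token_id, token in candidates[:20]:
    print(token_id, repr(token))
```

To peek at the partially written verification file after killing `magikarp/fishing.py`, a generic JSONL read is enough; the fields in each record depend on what the verification step writes:

```python
import json
from pathlib import Path

# Print the first few records; field names depend on the verification output.
with Path("results/verifications/yourmodel.jsonl").open() as f:
    for line, _ in zip(f, range(5)):
        print(json.loads(line))
```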

If you want to contribute results for additional models, please include:

- An `UNUSED_TOKENS` entry (a hypothetical sketch follows this list). Make sure the tests (`pytest`) pass for the new model; the test suite uses this array as a model registry.
- A `run_verification.sh` entry.
- All files in `results` that are not `.gitignore`'d.
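
Purely as an illustration of what such an entry could look like; the authoritative shape is whatever `magikarp/unused_tokens.py` already contains, so mirror an existing entry rather than this guess:

```python
# Hypothetical shape of an UNUSED_TOKENS entry; the real structure is
# defined in magikarp/unused_tokens.py, so copy an existing entry there.
UNUSED_TOKENS = {
    # Hugging Face model id -> token ids known to be unused for that model
    "your-org/your-model": [0, 1, 2],
}
```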

If you know of a model that may be interesting to analyze, but do not have the resources to run it yourself, feel free to open an issue. Please add the Hugging Face id and some information on why it is interesting in terms of tokenization, and keep in mind that the larger the model, the less likely it is to be prioritized.