This repository contains the code and extended results for the paper *Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models*.

The most interesting thing in this repository is probably the detailed reports and summary table found in `results/`. For each model, there is a 'full' and a 'mini' report. The 'mini' version can always be opened on GitHub, but the full version may require downloading and viewing locally due to file size limitations.

In these reports:

- `▁` is a space (but not `_`)
- `¿entry?` represents tokens with a vocabulary entry which was not encoded as expected.

To set up the environment (the project uses Poetry):

```sh
poetry shell    # make/activate your virtual environment
poetry install  # only the first time or on updates
```

For some newer models you may need to install a newer `transformers` version using `pip install git+https://github.com/huggingface/transformers.git`.

See `run_verification.sh` for some example commands for running new models. The script itself is mainly a reference for reproducibility; running it directly is not recommended.

- For models with tied embeddings, or for nicer visualizations and results, you will need to hard-code some unused token ids in `magikarp/unused_tokens.py`. You can start with `[0]`, or use the tokenizer vocabulary to pick some (see the sketches after this list). A typical workflow is to run the `magikarp/fishing.py` script and kill it when it starts verifying, then look at `results/verifications/yourmodel.jsonl`, which allows you to look at the vocabulary and update suitable tokens.
- `generate_results.py` generates plots and markdown reports. This happens automatically after verification, but to regenerate you can run `python generate_results.py [your_model_id]` and then look in `results`.
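
As a rough starting point for picking unused token ids from the vocabulary, something like the following sketch can surface reserved-looking entries. The model id and the name-based heuristic here are illustrative assumptions, not part of this repository's code:

```python
# Sketch: surface candidate unused token ids from a tokenizer vocabulary.
# The model id and the "<...unused...>" naming heuristic are assumptions
# for illustration; adapt both to the model you are analyzing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")  # hypothetical id

candidates = [
    (token_id, token)
    for token, token_id in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])
    # Reserved/placeholder entries often look like <unused_12> and are
    # plausible unused-token candidates.
    if token.startswith("<") and "unused" in token.lower()
]

for token_id, token in candidates[:20]:
    print(token_id, repr(token))
```

To peek at the partially written verification file after killing `magikarp/fishing.py`, a generic JSONL read is enough; the fields in each record depend on what the verification step writes:

```python
import json
from pathlib import Path

# Print the first few records; field names depend on the verification output.
with Path("results/verifications/yourmodel.jsonl").open() as f:
    for line, _ in zip(f, range(5)):
        print(json.loads(line))
```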

If you want to contribute results for additional models, please include:

- An `UNUSED_TOKENS` entry (a hypothetical sketch follows this list). Make sure the tests (`pytest`) pass for the new model; the test suite uses this array as a model registry.
- A `run_verification.sh` entry.
- All files in `results` that are not `.gitignore`'d.
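
Purely as an illustration of what such an entry could look like; the authoritative shape is whatever `magikarp/unused_tokens.py` already contains, so mirror an existing entry rather than this guess:

```python
# Hypothetical shape of an UNUSED_TOKENS entry; the real structure is
# defined in magikarp/unused_tokens.py, so copy an existing entry there.
UNUSED_TOKENS = {
    # Hugging Face model id -> token ids known to be unused for that model
    "your-org/your-model": [0, 1, 2],
}
```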

If you know of a model that may be interesting to analyze, but do not have the resources to run it yourself, feel free to open an issue. Please add the Hugging Face id and some information on why it is interesting in terms of tokenization, and keep in mind that the larger the model, the less likely it is to be prioritized.