tiktoken download - tiktoken source code download

tiktoken

Other source code

0.8.0

⏳ tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

import tiktoken
enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")

The open source version of tiktoken can be installed from PyPI:

pip install tiktoken

The tokeniser API is documented in tiktoken/core.py.

Example code using tiktoken can be found in the OpenAI Cookbook.

Performance

tiktoken is 3-6x faster than a comparable open source tokeniser.

Performance was measured on 1GB of text using the GPT-2 tokeniser, comparing GPT2TokenizerFast from tokenizers==0.13.2 and transformers==4.24.0 against tiktoken==0.2.0.
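A throughput figure like the one above can be reproduced in spirit with a small timing harness. The sketch below is not the benchmark script used by the tiktoken authors; the function name throughput_mb_per_s and the str.split stand-in tokeniser are assumptions made for illustration. It simply times any callable that maps a string to a token sequence and reports megabytes of UTF-8 input processed per second.

```python
import time
from typing import Callable, Sequence


def throughput_mb_per_s(tokenize: Callable[[str], Sequence], docs: list[str]) -> float:
    """Time `tokenize` over `docs` and return MB of UTF-8 input per second.

    Hypothetical helper for illustration; not part of tiktoken.
    """
    total_bytes = sum(len(doc.encode("utf-8")) for doc in docs)
    start = time.perf_counter()
    for doc in docs:
        tokenize(doc)
    # Guard against a zero reading on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes / 1e6 / elapsed


# Stand-in tokeniser: whitespace splitting. Swap in enc.encode from
# tiktoken (or any other tokeniser) to compare implementations.
sample_docs = ["hello world, this is a small benchmark document"] * 1000
rate = throughput_mb_per_s(str.split, sample_docs)
```

To compare two tokenisers fairly, run each on the same documents, on the same machine, and with the same number of threads.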

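Getting help

Please post questions in the issue tracker.

If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.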

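What is BPE anyway?

Language models don't see text the way you and I do; instead they see a sequence of numbers (known as tokens). Byte pair encoding (BPE) is a way of converting text into tokens. It has several desirable properties:

It is reversible and lossless, so you can convert tokens back into the original text.
It works on arbitrary text, even text that is not in the tokeniser's training data.
It compresses the text: the token sequence is shorter than the bytes corresponding to the original text. In practice, each token corresponds to about 4 bytes on average.
It attempts to let the model see common subwords. For instance, "ing" is a common subword in English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing" (instead of, for example, "enc" and "oding"). Because the model will then see the "ing" token again and again in different contexts, it helps models generalise and better understand grammar.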
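The properties above can be seen in a toy byte-level BPE written in plain Python. This is a minimal sketch, not tiktoken's implementation: it skips the regex pre-splitting that real encodings use, applies merges in the order they were learned, and the function names (train_bpe, merge_pair, encode, decode) are made up for this example.

```python
from collections import Counter


def merge_pair(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int) -> list[tuple[int, int]]:
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    tokens = list(data)  # start from raw bytes, ids 0..255
    merges: list[tuple[int, int]] = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging would not compress
            break
        tokens = merge_pair(tokens, pair, 256 + len(merges))
        merges.append(pair)
    return merges


def encode(text: str, merges: list[tuple[int, int]]) -> list[int]:
    """Convert text to token ids by replaying the learned merges in order."""
    tokens = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        tokens = merge_pair(tokens, pair, 256 + rank)
    return tokens


def decode(tokens: list[int], merges: list[tuple[int, int]]) -> str:
    """Expand each token id back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")


corpus = "encoding decoding training testing " * 20
merges = train_bpe(corpus.encode("utf-8"), 50)
ids = encode("encoding unseen text", merges)
```

Because every token expands back to a fixed byte string, decode(encode(text)) returns the original text even for input that never appeared in the training corpus, and any merge that fires makes the token sequence shorter than the byte sequence.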

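tiktoken contains an educational submodule that is friendlier if you want to learn more about the details of BPE, including code that helps visualise the BPE procedure: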

from tiktoken._educational import *

# Train a BPE tokeniser on a small amount of text
enc = train_simple_encoding()

# Visualise how the GPT-4 encoder encodes text
enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
enc.encode("hello world aaaaaaaaaaaa")

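Extending tiktoken

You may wish to extend tiktoken to support new encodings. There are two ways to do this.

First, create your Encoding object exactly the way you want and simply pass it around.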

cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)

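Second, use the tiktoken_ext plugin mechanism to register your Encoding objects with tiktoken.

This is only useful if you need tiktoken.get_encoding to find your encoding; otherwise prefer the first option.

To do this, you'll need to create a namespace package under tiktoken_ext.

Lay out your project like this, making sure to omit the tiktoken_ext/__init__.py file: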

my_tiktoken_extension
├── tiktoken_ext
│   └── my_encodings.py
└── setup.py

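my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS. This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to tiktoken.Encoding to construct that encoding. For an example, see tiktoken_ext/openai_public.py. For precise details, see tiktoken/registry.py.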
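As an illustration of that shape, here is a sketch of what a my_encodings.py module could look like. The encoding name my_byte_level, its split pattern, and its special token id are invented for this example; it defines a toy encoding with one token per byte and no learned merges, so treat it as a template rather than a useful tokeniser.

```python
# tiktoken_ext/my_encodings.py (hypothetical example module)


def my_byte_level() -> dict:
    """Return keyword arguments for tiktoken.Encoding.

    Toy encoding invented for illustration: one token per byte value.
    """
    # Every single byte needs a rank; ids 0..255 map to the byte values.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    # Special token ids must not collide with the mergeable ranks.
    special_tokens = {"<|endoftext|>": 256}
    return {
        "name": "my_byte_level",
        # Assumed split pattern: runs of non-whitespace or whitespace.
        "pat_str": r"\S+|\s+",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }


# Maps each encoding name to its zero-argument constructor function.
ENCODING_CONSTRUCTORS = {"my_byte_level": my_byte_level}
```

The constructor is a function rather than a ready-made dictionary so that the potentially large rank table is only built when that encoding is actually requested.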

Your setup.py should look something like this:

from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_ext*']),
    install_requires=["tiktoken"],
    ...
)

Then simply pip install ./my_tiktoken_extension and you should be able to use your custom encodings! Make sure not to use an editable install.

Additional information

Version: 0.8.0
Type: Other source code
Updated: 2025-01-08
Size: 50MB
Source: GitHub
