# ai21-tokenizer

v0.12.0

A SentencePiece-based tokenizer for production use with AI21's models.

Note: to use the Jamba 1.5 Mini or Jamba 1.5 Large tokenizers, you will need to request access to the relevant model's HuggingFace repository (see the authentication sketch after the installation commands).

## Installation

### pip

```
pip install ai21-tokenizer
```

### poetry

```
poetry add ai21-tokenizer
```
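Because those repositories are gated, your local environment may also need to be authenticated with HuggingFace before the tokenizer files can be downloaded. A minimal sketch, assuming the standard huggingface_hub login flow (the `huggingface_hub` package is separate from this library, and the token value is a placeholder):

```python
# Sketch: authenticate so gated tokenizer files can be downloaded.
# Requires the huggingface_hub package (`pip install huggingface_hub`).
from huggingface_hub import login

login(token="<your HuggingFace access token>")
```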
## Usage

### Jamba 1.5 Mini Tokenizer

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```
Another way would be to use our Jamba 1.5 Mini tokenizer directly:

```python
from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```
#### Async usage

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```
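Note that `await` only works inside a coroutine. A minimal runnable sketch using asyncio (the `main` entry point is our own scaffolding, not part of the library):

```python
import asyncio

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers


async def main() -> None:
    # The async factory must be awaited inside a running event loop
    tokenizer = await Tokenizer.get_async_tokenizer(
        PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER
    )
    encoded = await tokenizer.encode("apple orange banana")
    print(f"Encoded text: {encoded}")


asyncio.run(main())
```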
### Jamba 1.5 Large Tokenizer

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```
Another way would be to use our Jamba 1.5 Large tokenizer directly:

```python
from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```
#### Async usage

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```
### Jamba Instruct Tokenizer

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```
Another way would be to use our Jamba Instruct tokenizer directly:

```python
from ai21_tokenizer import JambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
```
#### Async usage

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```
Another way would be to use our async Jamba Instruct tokenizer's class method `create` (being async, it must be awaited):

```python
from ai21_tokenizer import AsyncJambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
```
### Jurassic Tokenizer

```python
from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
# Your code here
```
Another way would be to use our Jurassic tokenizer directly:

```python
from ai21_tokenizer import JurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
# Your code here
```
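The `config` argument mirrors the tokenizer's config.json. If you have that file on disk, one way to load it into the expected dict is sketched below (both paths are placeholders):

```python
import json
from pathlib import Path

from ai21_tokenizer import JurassicTokenizer

# Placeholder paths: point these at your local files
model_path = "<Path to your vocabs file>"
config_path = Path("<Path to your config.json file>")

# Parse config.json into the plain dict that JurassicTokenizer expects
config = json.loads(config_path.read_text())

tokenizer = JurassicTokenizer(model_path=model_path, config=config)
```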
#### Async usage

```python
from ai21_tokenizer import Tokenizer

tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
```
Another way would be to use our async Jurassic tokenizer's class method `create` (being async, it must be awaited):

```python
from ai21_tokenizer import AsyncJurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
```
### Encode and decode

These functions let you encode text into a list of token IDs and decode it back to plaintext:

```python
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```
#### Async usage

```python
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```
### Convert token IDs to tokens and back

```python
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
```
#### Async usage

```python
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = await tokenizer.convert_tokens_to_ids(tokens)
```
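Putting the pieces together, here is a small self-contained round-trip sketch using the default synchronous tokenizer. The sample string is our own, and the equality checks assume SentencePiece round-trips plain ASCII input like this cleanly:

```python
from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()

text = "apple orange banana"
ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(ids)

# Decoding the IDs should recover the text, and converting the tokens
# back should reproduce the same IDs
assert tokenizer.decode(ids) == text
assert tokenizer.convert_tokens_to_ids(tokens) == ids
print(f"{len(ids)} tokens: {tokens}")
```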
For more examples, please see our examples folder.