ai21-tokenizer
v0.12.0
A SentencePiece-based tokenizer used in production alongside AI21's models.
To use the tokenizer for Jamba 1.5 Mini or Jamba 1.5 Large, you must first request access to the relevant model's HuggingFace repository.

Install with pip:

```
pip install ai21-tokenizer
```

or with poetry:

```
poetry add ai21-tokenizer
```
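Because the Jamba 1.5 tokenizers live in gated HuggingFace repositories, you may also need to authenticate before first use. A minimal sketch, assuming you have been granted access and use the huggingface_hub package (this step is not part of ai21-tokenizer itself):

```python
# Illustrative only: authenticate with Hugging Face so the gated Jamba repos are reachable.
from huggingface_hub import login

login(token="hf_...")  # replace with your own access token
```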
To create a Jamba 1.5 Mini tokenizer:

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```
Another way would be to use the Jamba 1.5 Mini tokenizer directly:
```python
from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```
For async usage:

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```
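A common use once a tokenizer is created is counting the tokens in a prompt before sending it to the model. A minimal sketch, assuming the sync Jamba 1.5 Mini tokenizer from above:

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)

prompt = "apple orange banana"
token_count = len(tokenizer.encode(prompt))  # encode returns a list of token IDs
print(f"Prompt length in tokens: {token_count}")
```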
To create a Jamba 1.5 Large tokenizer:

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```
Another way would be to use the Jamba 1.5 Large tokenizer directly:
```python
from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```
For async usage:

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```
To create a Jamba Instruct tokenizer:

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```
Another way would be to use the Jamba Instruct tokenizer directly:
```python
from ai21_tokenizer import JambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
```
For async usage:

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```
Another way would be to use the async Jamba tokenizer class method create:
```python
from ai21_tokenizer import AsyncJambaInstructTokenizer

model_path = "<Path to your vocabs file>"
# create is an async classmethod, so it must be awaited
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
```
To create a Jurassic (J2) tokenizer, the default:

```python
from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
# Your code here
```
Another way would be to use the Jurassic tokenizer directly:
```python
from ai21_tokenizer import JurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
```
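If your config.json sits on disk next to the vocab file, a minimal sketch for loading it into the dictionary the constructor expects (the directory and file names below are placeholders, not fixed by the library):

```python
import json
from pathlib import Path

from ai21_tokenizer import JurassicTokenizer

model_dir = Path("<your model directory>")  # placeholder path
config = json.loads((model_dir / "config.json").read_text())  # parse config into a dict

tokenizer = JurassicTokenizer(model_path=str(model_dir / "tokenizer.model"), config=config)
```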
For async usage:

```python
from ai21_tokenizer import Tokenizer

tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
```
Another way would be to use the async Jurassic tokenizer class method create:
```python
from ai21_tokenizer import AsyncJurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
# create is an async classmethod, so it must be awaited
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
```
These functions allow you to encode your text into a list of token IDs and decode it back into plain text.
```python
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```
For async usage:

```python
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```
To convert token IDs to tokens, or vice versa:

```python
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
```
For async usage:

```python
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")

ids = await tokenizer.convert_tokens_to_ids(tokens)
```
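Putting the pieces together, a minimal end-to-end sketch (sync, using the Jamba 1.5 Mini tokenizer; the round trip back to IDs is shown purely for illustration):

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)

# Encode, then map the IDs to their token strings and back again.
ids = tokenizer.encode("apple orange banana")
tokens = tokenizer.convert_ids_to_tokens(ids)
assert tokenizer.convert_tokens_to_ids(tokens) == ids

print(tokenizer.decode(ids))  # "apple orange banana"
```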
For more examples, please see our examples folder.