GPTEncoder
1.0.4
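A Swift BPE encoder/decoder for OpenAI GPT models. It provides a programmatic interface for tokenizing text for the OpenAI GPT API.
The GPT family of models process text using tokens, which are common sequences of characters found in text. The models learn the statistical relationships between these tokens and excel at producing the next token in a sequence of tokens.
You can use the tool below to see how the API would tokenize a piece of text, and the total number of tokens in that text.
This library is based on the nodeJS gpt-3-encoder and the official OpenAI Python GPT encoder/decoder.
I have also created GPTTokenizerUI, an SPM library that you can integrate into your app to provide a GUI for entering text and showing the tokenization results used by the GPT API.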
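To make "common sequences of characters" concrete, here is a minimal sketch of the byte-pair idea: start from single characters and repeatedly fuse the most frequent adjacent pair. This is a toy illustration only. The names `Pair`, `mostFrequentPair`, `merge`, and `toyBPE` are hypothetical and are not part of GPTEncoder, and real GPT tokenizers apply a fixed, pre-trained merge table rather than counting pairs in the input.

```swift
// Toy byte-pair-style merging. Illustration only, not the GPTEncoder implementation.
struct Pair: Hashable {
    let left: String
    let right: String
}

// Returns the adjacent pair that occurs most often (at least twice).
// Ties are broken in favor of the pair that appears first in the input.
func mostFrequentPair(in symbols: [String]) -> Pair? {
    guard symbols.count >= 2 else { return nil }
    var counts: [Pair: Int] = [:]
    for i in 0..<(symbols.count - 1) {
        counts[Pair(left: symbols[i], right: symbols[i + 1]), default: 0] += 1
    }
    var best: Pair? = nil
    var bestCount = 1 // a pair seen only once is not worth merging
    for i in 0..<(symbols.count - 1) {
        let pair = Pair(left: symbols[i], right: symbols[i + 1])
        if let count = counts[pair], count > bestCount {
            best = pair
            bestCount = count
        }
    }
    return best
}

// Replaces every non-overlapping occurrence of `pair` with one fused symbol.
func merge(_ pair: Pair, in symbols: [String]) -> [String] {
    var result: [String] = []
    var i = 0
    while i < symbols.count {
        if i + 1 < symbols.count, symbols[i] == pair.left, symbols[i + 1] == pair.right {
            result.append(pair.left + pair.right)
            i += 2
        } else {
            result.append(symbols[i])
            i += 1
        }
    }
    return result
}

// Splits text into characters, then applies up to `maxMerges` merge rounds.
// `maxMerges` is assumed to be non-negative.
func toyBPE(_ text: String, maxMerges: Int) -> [String] {
    var symbols = text.map { String($0) }
    for _ in 0..<maxMerges {
        guard let pair = mostFrequentPair(in: symbols) else { break }
        symbols = merge(pair, in: symbols)
    }
    return symbols
}

print(toyBPE("abababab", maxMerges: 2)) // ["abab", "abab"]
```

Joining the resulting symbols always gives back the original text, which is the property that lets a tokenizer decode losslessly.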
platform :ios, '15.0'
use_frameworks!

target 'MyApp' do
  pod 'GPTEncoder', '~> 1.0.3'
end
import GPTEncoder // module name assumed to match the pod name

let encoder = SwiftGPTEncoder()

let str = "The GPT family of models process text using tokens, which are common sequences of characters found in text."
let encoded = encoder.encode(text: str)
print("String: \(str)")
print("Encoded this string looks like: \(encoded)")
print("Total number of token(s): \(encoded.count) and character(s): \(str.count)")

// Decode each token on its own to see the text it represents.
print("We can look at each token and what it represents")
encoded.forEach { print("Token: \(encoder.decode(tokens: [$0]))") }

print(encoded)
let decoded = encoder.decode(tokens: encoded)
print("We can decode it back into:\n\(decoded)")
To encode a String into an array of Int tokens, call encode and pass the string.
let encoded = encoder.encode(text: "The GPT family of models process text using tokens, which are common sequences of characters found in text.")
// Output: [464, 402, 11571, 1641, 286, 4981, 1429, 2420, 1262, 16326, 11, 543, 389, 2219, 16311, 286, 3435, 1043, 287, 2420, 13]
To decode an array of Int tokens back into a String, call decode and pass the token array.
let decoded = encoder.decode(tokens: [464, 402, 11571, 1641, 286, 4981, 1429, 2420, 1262, 16326, 11, 543, 389, 2219, 16311, 286, 3435, 1043, 287, 2420, 13])
// Output: "The GPT family of models process text using tokens, which are common sequences of characters found in text."
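A common reason to pair encoding with decoding is to keep text within a token budget. The sketch below trims text to at most a given number of tokens. The `truncate` helper is hypothetical and not part of GPTEncoder; it takes the encode and decode steps as closures so it runs on its own, and the demo uses a stand-in tokenizer (one token per UTF-8 byte) rather than real BPE.

```swift
// Hypothetical helper: trims text so that it encodes to at most `maxTokens` tokens.
func truncate(
    _ text: String,
    maxTokens: Int,
    encode: (String) -> [Int],
    decode: ([Int]) -> String
) -> String {
    let tokens = encode(text)
    guard tokens.count > maxTokens else { return text }
    // Keep only the leading tokens; a negative budget is treated as zero.
    return decode(Array(tokens.prefix(max(0, maxTokens))))
}

// Stand-in tokenizer for the demo: one token per UTF-8 byte (NOT real BPE).
let byteEncode: (String) -> [Int] = { text in text.utf8.map { Int($0) } }
let byteDecode: ([Int]) -> String = { tokens in
    String(decoding: tokens.map { UInt8($0) }, as: UTF8.self)
}

print(truncate("hello world", maxTokens: 5, encode: byteEncode, decode: byteDecode)) // hello
```

With the library itself, the closures would be `{ encoder.encode(text: $0) }` and `{ encoder.decode(tokens: $0) }`, using the encode and decode calls shown above.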
Internally, a cache is used to improve performance when encoding tokens. You can also reset the cache.
encoder.clearCache()
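For intuition about why a cache helps, here is a minimal memoization sketch: the result for each word is stored the first time it is computed and reused afterwards. It is an assumption about the general technique, not a description of how GPTEncoder implements its cache. `MemoizedEncoder`, `encodeWord`, and `misses` are hypothetical names, and the demo encoder simply maps UTF-8 bytes to integers.

```swift
// Hypothetical memoizing wrapper around any per-word encoding function.
final class MemoizedEncoder {
    private var cache: [String: [Int]] = [:]
    private let encodeWord: (String) -> [Int]
    private(set) var misses = 0 // how many times the underlying encoder actually ran

    init(encodeWord: @escaping (String) -> [Int]) {
        self.encodeWord = encodeWord
    }

    func encode(_ word: String) -> [Int] {
        // Reuse a stored result when the same word was seen before.
        if let cached = cache[word] { return cached }
        misses += 1
        let tokens = encodeWord(word)
        cache[word] = tokens
        return tokens
    }

    // Drops all stored results, for example to release memory.
    func clearCache() {
        cache.removeAll()
    }
}

// Stand-in encoder for the demo: one integer per UTF-8 byte (NOT real BPE).
let memo = MemoizedEncoder(encodeWord: { word in word.utf8.map { Int($0) } })
_ = memo.encode("tokens")
_ = memo.encode("tokens") // served from the cache
print(memo.misses) // 1
```

Clearing the cache trades speed for memory: the next call for a previously seen word recomputes its tokens.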