Rust-native state-of-the-art Natural Language Processing models and pipelines. Port of Hugging Face's Transformers library, using tch-rs or onnxruntime bindings and pre-processing from rust-tokenizers. Supports multi-threaded tokenization and GPU inference. This repository exposes the model base architecture, task-specific heads (see below) and ready-to-use pipelines. Benchmarks are available at the end of this document.

Get started with tasks including question answering, named entity recognition, translation, summarization, text generation, conversational agents and more in just a few lines of code:
```rust
let qa_model = QuestionAnsweringModel::new(Default::default())?;

let question = String::from("Where does Amy live ?");
let context = String::from("Amy lives in Amsterdam");

let answers = qa_model.predict(&[QaInput { question, context }], 1, 32);
```
Output:

```
[Answer { score: 0.9976, start: 13, end: 21, answer: "Amsterdam" }]
```
The tasks currently supported include:
| | Sequence classification | Token classification | Question answering | Text Generation | Summarization | Translation | Masked LM | Sentence Embeddings |
|---|---|---|---|---|---|---|---|---|
| DistilBERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| MobileBERT | ✅ | ✅ | ✅ | | | | ✅ | |
| DeBERTa | ✅ | ✅ | ✅ | | | | ✅ | |
| DeBERTa (v2) | ✅ | ✅ | ✅ | | | | ✅ | |
| FNet | ✅ | ✅ | ✅ | | | | ✅ | |
| BERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| RoBERTa | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| GPT | | | | ✅ | | | | |
| GPT2 | | | | ✅ | | | | |
| GPT-Neo | | | | ✅ | | | | |
| GPT-J | | | | ✅ | | | | |
| BART | ✅ | | | ✅ | ✅ | | | |
| Marian | | | | | | ✅ | | |
| MBart | ✅ | | | ✅ | | | | |
| M2M100 | | | | | | ✅ | | |
| NLLB | | | | | | ✅ | | |
| Electra | | ✅ | | | | | ✅ | |
| ALBERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| T5 | | | | ✅ | ✅ | ✅ | | ✅ |
| LongT5 | | | | ✅ | ✅ | | | |
| XLNet | ✅ | ✅ | ✅ | ✅ | | | ✅ | |
| Reformer | ✅ | | ✅ | ✅ | | | ✅ | |
| ProphetNet | | | | ✅ | ✅ | | | |
| Longformer | ✅ | ✅ | ✅ | | | | ✅ | |
| Pegasus | | | | | ✅ | | | |
This library relies on the tch crate for bindings to the C++ Libtorch API. The libtorch library required can be downloaded either automatically or manually. The following provides a reference on how to set up your environment to use these bindings; please refer to tch for detailed information or support.
Additionally, this library relies on a cache folder for downloading pre-trained models. The cache location defaults to `~/.cache/.rustbert`, but can be changed by setting the `RUSTBERT_CACHE` environment variable. Note that the language models used by this library are in the order of 100s of MBs to GBs.
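As an illustration of the lookup order described above, here is a minimal sketch of the cache-directory resolution (this mirrors the documented behaviour only; it is not the library's actual implementation):

```rust
use std::env;
use std::path::PathBuf;

// Resolve the model cache directory: RUSTBERT_CACHE if set,
// otherwise fall back to ~/.cache/.rustbert.
fn rustbert_cache_dir() -> PathBuf {
    match env::var("RUSTBERT_CACHE") {
        Ok(dir) => PathBuf::from(dir),
        Err(_) => {
            let home = env::var("HOME").unwrap_or_else(|_| String::from("."));
            PathBuf::from(home).join(".cache").join(".rustbert")
        }
    }
}

fn main() {
    println!("model cache: {}", rustbert_cache_dir().display());
}
```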
Manually download `libtorch` from https://pytorch.org/get-started/locally/. This package requires `v2.4`: if this version is no longer available on the "get started" page, the file should still be accessible by modifying the target link, for example https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.4.0%2Bcu124.zip for a Linux version with CUDA12. NOTE: when using `rust-bert` as a dependency from crates.io, please check the required `LIBTORCH` version on the published package readme, as it may differ from the version documented here (which applies to the current repository version). Extract the library to a location of your choice, then set the following environment variables. For Linux:

```bash
export LIBTORCH=/path/to/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
```
For Windows:

```powershell
$Env:LIBTORCH = "X:\path\to\libtorch"
$Env:Path += ";X:\path\to\libtorch\lib"
```
For macOS + Homebrew:

```bash
brew install pytorch jq
export LIBTORCH=$(brew --cellar pytorch)/$(brew info --json pytorch | jq -r '.[0].installed[0].version')
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
```
Alternatively, you can let the `build` script download the `libtorch` library automatically for you. The `download-libtorch` feature flag needs to be enabled for this. The CPU version of libtorch is downloaded by default; to download a CUDA version instead, set the environment variable `TORCH_CUDA_VERSION` to `cu124`. Note that the libtorch library is large (in the order of several GBs for the CUDA-enabled version), so the first build may take several minutes to complete.
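For reference, enabling the automatic download in a project might look like the following `Cargo.toml` entry (the version number is illustrative; check crates.io for the current release and its required libtorch version):

```toml
[dependencies]
rust-bert = { version = "0.23", features = ["download-libtorch"] }
```

Exporting `TORCH_CUDA_VERSION=cu124` before the first `cargo build` then selects the CUDA flavour of libtorch as described above.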
Verify your installation (and linking with libtorch) by adding the `rust-bert` dependency to your `Cargo.toml`, or by cloning the rust-bert source and running an example:
```bash
git clone [email protected]:guillaume-be/rust-bert.git
cd rust-bert
cargo run --example sentence_embeddings
```
ONNX support can be enabled via the optional `onnx` feature. The crate then leverages the ort crate with bindings to the onnxruntime C++ library. We refer the user to this project's page for further installation instructions/support.
To enable ONNX support:

1. Enable the rust-bert `onnx` feature. The `rust-bert` crate does not include any optional dependencies for `ort`; the end user should select the set of features that is adequate for pulling the required `onnxruntime` C++ library.
2. The currently recommended installation is to use dynamic linking by pointing to an existing library location: use the `load-dynamic` cargo feature for `ort`.
3. Set `ORT_DYLIB_PATH` to point to the location of the downloaded onnxruntime library (`onnxruntime.dll` / `libonnxruntime.so` / `libonnxruntime.dylib` depending on the operating system). These can be downloaded from the release page of the onnxruntime project.

Most architectures (including encoders, decoders and encoder-decoders) are supported. The library aims at keeping compatibility with models exported using the Optimum library. A detailed guide on how to export a Transformer model to ONNX using Optimum is available at https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model. The resources used to create ONNX models are similar to those based on Pytorch, replacing the pytorch model with the ONNX one. Since ONNX models are less flexible than their Pytorch counterparts in their handling of optional arguments, exporting a decoder or encoder-decoder model to ONNX usually results in multiple files. The following files are expected (but not all are necessary) for use in this library, as per the table below:
| Architecture | Encoder file | Decoder without past file | Decoder with past file |
|---|---|---|---|
| Encoder (e.g. BERT) | required | not used | not used |
| Decoder (e.g. GPT2) | not used | required | optional |
| Encoder-decoder (e.g. BART) | required | required | optional |
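For reference, the Optimum export mentioned above can be run from the command line; a sketch follows (the model name and output directory are illustrative, and the exact flags should be checked against the Optimum documentation):

```bash
pip install "optimum[exporters]"
optimum-cli export onnx --model facebook/bart-large-cnn bart-large-cnn-onnx/
```

For decoder and encoder-decoder models, the exporter writes the encoder/decoder files listed in the table above into the output directory.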
Note that computational efficiency will drop when the `decoder with past` file is optional but not provided, since the model will not use cached past keys and values for the attention mechanism, resulting in a high number of redundant computations. The Optimum library offers export options to ensure such a `decoder with past` file is created. The base encoder and decoder model architectures are available (and exposed for convenience) in the `encoder` and `decoder` modules, respectively.
Generation models (pure decoder or encoder/decoder architectures) are available in the `models` module. Most pipelines are available for ONNX model checkpoints, including sequence classification, zero-shot classification, token classification (including named entity recognition and part-of-speech tagging), question answering, text generation, summarization and translation. When used in a pipeline, these models use the same configuration and tokenizer files as their Pytorch counterparts. Examples leveraging ONNX models are given in the `./examples` directory.
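Putting the ONNX setup above together, a project's dependency section might look like the following sketch (versions and the `ort` feature names are illustrative and should be checked against the `rust-bert` and `ort` releases in use):

```toml
[dependencies]
rust-bert = { version = "0.23", features = ["onnx"] }
ort = { version = "1.16", features = ["load-dynamic"] }
```

with `ORT_DYLIB_PATH` exported to point at the downloaded `libonnxruntime` shared library, as described above.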
Based on Hugging Face's pipelines, ready-to-use end-to-end NLP pipelines are available as part of this crate. The following capabilities are currently available:
Disclaimer: the contributors of this repository are not responsible for any generation produced by third-party use of the pretrained systems proposed herein.
Extractive question answering from a given question and context. DistilBERT model fine-tuned on SQuAD (Stanford Question Answering Dataset):
```rust
let qa_model = QuestionAnsweringModel::new(Default::default())?;

let question = String::from("Where does Amy live ?");
let context = String::from("Amy lives in Amsterdam");

let answers = qa_model.predict(&[QaInput { question, context }], 1, 32);
```
Output:

```
[Answer { score: 0.9976, start: 13, end: 21, answer: "Amsterdam" }]
```
Translation pipeline supporting a broad set of source and target languages. It leverages two main architectures for translation tasks:
Marian-based pretrained models for a number of language pairs are readily available in the library - but users can also import any Pytorch-based model for predictions.
For languages not supported by the proposed pretrained Marian models, users can leverage the M2M100 model, which supports direct translation between 100 languages (without an intermediate English translation).
```rust
use rust_bert::pipelines::translation::{Language, TranslationModelBuilder};

fn main() -> anyhow::Result<()> {
    let model = TranslationModelBuilder::new()
        .with_source_languages(vec![Language::English])
        .with_target_languages(vec![Language::Spanish, Language::French, Language::Italian])
        .create_model()?;

    let input_text = "This is a sentence to be translated";
    let output = model.translate(&[input_text], None, Language::French)?;

    for sentence in output {
        println!("{}", sentence);
    }
    Ok(())
}
```
Output:

```
Il s'agit d'une phrase à traduire
```
Abstractive summarization using a pretrained BART model.
```rust
let summarization_model = SummarizationModel::new(Default::default())?;

let input = ["In findings published Tuesday in Cornell University's arXiv by a team of scientists
from the University of Montreal and a separate report published Wednesday in Nature Astronomy by a team
from University College London (UCL), the presence of water vapour was confirmed in the atmosphere of K2-18b,
a planet circling a star in the constellation Leo. This is the first such discovery in a planet in its star's
habitable zone — not too hot and not too cold for liquid water to exist. The Montreal team, led by Björn Benneke,
used data from the NASA's Hubble telescope to assess changes in the light coming from K2-18b's star as the planet
passed between it and Earth. They found that certain wavelengths of light, which are usually absorbed by water,
weakened when the planet was in the way, indicating not only does K2-18b have an atmosphere, but the atmosphere
contains water in vapour form. The team from UCL then analyzed the Montreal team's data using their own software
and confirmed their conclusion. This was not the first time scientists have found signs of water on an exoplanet,
but previous discoveries were made on planets with high temperatures or other pronounced differences from Earth.
\"This is the first potentially habitable planet where the temperature is right and where we now know there is water,\"
said UCL astronomer Angelos Tsiaras. \"It's the best candidate for habitability right now.\" \"It's a good sign\",
said Ryan Cloutier of the Harvard–Smithsonian Center for Astrophysics, who was not one of either study's authors.
\"Overall,\" he continued, \"the presence of water in its atmosphere certainly improves the prospect of K2-18b being
a potentially habitable planet, but further observations will be required to say for sure.\"
K2-18b was first identified in 2015 by the Kepler space telescope. It is about 110 light-years from Earth and larger
but less dense. Its star, a red dwarf, is cooler than the Sun, but the planet's orbit is much closer, such that a year
on K2-18b lasts 33 Earth days. According to The Guardian, astronomers were optimistic that NASA's James Webb space
telescope — scheduled for launch in 2021 — and the European Space Agency's 2028 ARIEL program, could reveal more
about exoplanets like K2-18b."];

let output = summarization_model.summarize(&input);
```
(example from: WikiNews)
Output:

```
"Scientists have found water vapour on K2-18b, a planet 110 light-years from Earth.
This is the first such discovery in a planet in its star's habitable zone.
The planet is not too hot and not too cold for liquid water to exist."
```
Conversation model based on Microsoft's DialoGPT. This pipeline allows the generation of single- or multi-turn conversations between a human and a model. The DialoGPT page states that:

"The human evaluation results indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test." (DialoGPT repository)
The model uses a `ConversationManager` to keep track of active conversations and generate responses to them.
```rust
use rust_bert::pipelines::conversation::{ConversationModel, ConversationManager};

let conversation_model = ConversationModel::new(Default::default())?;
let mut conversation_manager = ConversationManager::new();

let conversation_id = conversation_manager.create("Going to the movies tonight - any suggestions?");
let output = conversation_model.generate_responses(&mut conversation_manager);
```
Example output:

```
"The Big Lebowski."
```
Generate language based on a prompt. GPT2 and GPT are available as base models. This includes techniques such as beam search, top-k and nucleus sampling, temperature setting and repetition penalty. Batch generation of sentences from several prompts is supported. Sequences will be left-padded with the model's padding token if present, and with the unknown token otherwise. This may impact the results; it is recommended to submit prompts of similar length for best results.
```rust
let model = GPT2Generator::new(Default::default())?;

let input_context_1 = "The dog";
let input_context_2 = "The cat was";

let generate_options = GenerateOptions {
    max_length: 30,
    ..Default::default()
};

let output = model.generate(Some(&[input_context_1, input_context_2]), generate_options);
```
Example output:

```
[
    "The dog's owners, however, did not want to be named. According to the lawsuit, the animal's owner, a 29-year"
    "The dog has always been part of the family. \"He was always going to be my dog and he was always looking out for me"
    "The dog has been able to stay in the home for more than three months now. \"It's a very good dog. She's"
    "The cat was discovered earlier this month in the home of a relative of the deceased. The cat's owner, who wished to remain anonymous,"
    "The cat was pulled from the street by two-year-old Jazmine. \"I didn't know what to do,\" she said"
    "The cat was attacked by two stray dogs and was taken to a hospital. Two other cats were also injured in the attack and are being treated."
]
```
Performs zero-shot classification on input sentences against the provided labels, using a model fine-tuned for Natural Language Inference.
```rust
let sequence_classification_model = ZeroShotClassificationModel::new(Default::default())?;

let input_sentence = "Who are you voting for in 2020?";
let input_sequence_2 = "The prime minister has announced a stimulus package which was widely criticized by the opposition.";
let candidate_labels = &["politics", "public health", "economics", "sports"];

let output = sequence_classification_model.predict_multilabel(
    &[input_sentence, input_sequence_2],
    candidate_labels,
    None,
    128,
);
```
Output:

```
[
    [ Label { "politics", score: 0.972 }, Label { "public health", score: 0.032 }, Label {"economics", score: 0.006 }, Label {"sports", score: 0.004 } ],
    [ Label { "politics", score: 0.975 }, Label { "public health", score: 0.0818 }, Label {"economics", score: 0.852 }, Label {"sports", score: 0.001 } ],
]
```
Predicts the binary sentiment of a sentence. DistilBERT model fine-tuned on SST-2.
```rust
let sentiment_classifier = SentimentModel::new(Default::default())?;

let input = [
    "Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring.",
    "This film tried to be too many things all at once: stinging political satire, Hollywood blockbuster, sappy romantic comedy, family values promo...",
    "If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.",
];

let output = sentiment_classifier.predict(&input);
```
(Examples courtesy of IMDb)
Output:

```
[
    Sentiment { polarity: Positive, score: 0.9981985493795946 },
    Sentiment { polarity: Negative, score: 0.9927982091903687 },
    Sentiment { polarity: Positive, score: 0.9997248985164333 }
]
```
Extracts entities (Person, Location, Organization, Miscellaneous) from text. BERT cased large model fine-tuned on CoNLL03, contributed by the MDZ Digital Library team at the Bavarian State Library. Models are currently available for English, German, Spanish and Dutch.
```rust
let ner_model = NERModel::new(Default::default())?;

let input = [
    "My name is Amy. I live in Paris.",
    "Paris is a city in France.",
];

let output = ner_model.predict(&input);
```
Output:

```
[
    [
        Entity { word: "Amy", score: 0.9986, label: "I-PER" }
        Entity { word: "Paris", score: 0.9985, label: "I-LOC" }
    ],
    [
        Entity { word: "Paris", score: 0.9988, label: "I-LOC" }
        Entity { word: "France", score: 0.9993, label: "I-LOC" }
    ]
]
```
Extracts keywords and keyphrases from input documents.
```rust
fn main() -> anyhow::Result<()> {
    let keyword_extraction_model = KeywordExtractionModel::new(Default::default())?;

    let input = "Rust is a multi-paradigm, general-purpose programming language.
Rust emphasizes performance, type safety, and concurrency. Rust enforces memory safety—that is,
that all references point to valid memory—without requiring the use of a garbage collector or
reference counting present in other memory-safe languages. To simultaneously enforce
memory safety and prevent concurrent data races, Rust's borrow checker tracks the object lifetime
and variable scope of all references in a program during compilation. Rust is popular for
systems programming but also offers high-level features including functional programming constructs.";

    let output = keyword_extraction_model.predict(&[input])?;
    Ok(())
}
```
Output:

```
"rust" - 0.50910604
"programming" - 0.35731024
"concurrency" - 0.33825397
"concurrent" - 0.31229728
"program" - 0.29115444
```
Extracts Part-of-Speech tags (noun, verb, adjective...) from text.
```rust
let pos_model = POSModel::new(Default::default())?;

let input = ["My name is Bob"];

let output = pos_model.predict(&input);
```
Output:

```
[
    Entity { word: "My", score: 0.1560, label: "PRP" }
    Entity { word: "name", score: 0.6565, label: "NN" }
    Entity { word: "is", score: 0.3697, label: "VBZ" }
    Entity { word: "Bob", score: 0.7460, label: "NNP" }
]
```
Generate sentence embeddings (vector representations). These can be used for applications including dense information retrieval.
```rust
let model = SentenceEmbeddingsBuilder::remote(
    SentenceEmbeddingsModelType::AllMiniLmL12V2
).create_model()?;

let sentences = [
    "this is an example sentence",
    "each sentence is converted",
];

let output = model.encode(&sentences)?;
```
Output:

```
[
    [-0.000202666, 0.08148022, 0.03136178, 0.002920636 ...],
    [0.064757116, 0.048519745, -0.01786038, -0.0479775 ...]
]
```
Predicts masked words in input sentences.
```rust
let model = MaskedLanguageModel::new(Default::default())?;

let sentences = [
    "Hello I am a <mask> student",
    "Paris is the <mask> of France. It is <mask> in Europe.",
];

let output = model.predict(&sentences);
```
Output:

```
[
    [MaskedToken { text: "college", id: 2267, score: 8.091}],
    [
        MaskedToken { text: "capital", id: 3007, score: 16.7249},
        MaskedToken { text: "located", id: 2284, score: 9.0452}
    ]
]
```
For simple pipelines (sequence classification, token classification, question answering), the performance between Python and Rust is expected to be comparable. This is because the most expensive part of these pipelines is the language model itself, which shares a common implementation in the Torch backend. The End-to-end NLP Pipelines in Rust post provides a benchmark section covering all pipelines.
For text generation tasks (summarization, translation, conversation, free text generation), significant benefits can be expected (2 to 4 times faster processing depending on the input and application). The Accelerating text generation with Rust article focuses on these text generation applications and provides more details on the performance comparison with Python.
The base models and task-specific heads are also available for users looking to expose their own transformer-based models. Examples of how to prepare the data using the native tokenizers Rust library are available in `./examples` for BERT, DistilBERT, RoBERTa, GPT, GPT2 and BART. Note that when importing models from Pytorch, the parameter naming convention needs to be aligned with the Rust schema. Loading the pre-trained weights will fail if any of the model parameter weights cannot be found in the weight files. To skip this quality check, the alternative method `load_partial` can be invoked from the variable store.
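To illustrate the fallback described above, a minimal sketch follows (assuming a model built on a tch `VarStore`; the exact method locations and signatures should be checked against the `tch` / `rust-bert` versions in use):

```
use tch::nn;

// Build the variable store backing the Rust model.
let mut vs = nn::VarStore::new(tch::Device::cuda_if_available());
// ... construct the model graph against `vs` here ...

// Strict load: fails if any model parameter is missing from the weight file.
vs.load("path/to/model.ot")?;

// Partial load: skips the completeness check and returns the names of
// variables that could not be found in the weight file.
let missing = vs.load_partial("path/to/model.ot")?;
```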
Pretrained models are available on Hugging Face's model hub and can be loaded using the `RemoteResources` defined in this library.
A conversion utility script is included in `./utils` to convert Pytorch weights to a set of weights compatible with this library. This script requires Python and `torch` to be installed.