Rust-native state-of-the-art Natural Language Processing models and pipelines. Port of Hugging Face's Transformers library, using tch-rs or onnxruntime bindings and pre-processing from rust-tokenizers. Supports multi-threaded tokenization and GPU inference. This repository exposes the model base architecture, task-specific heads (see below) and ready-to-use pipelines. Benchmarks are available at the end of this document.
Get started with tasks including question answering, named entity recognition, translation, summarization, text generation, conversational agents and more in just a few lines of code:
```rust
let qa_model = QuestionAnsweringModel::new(Default::default())?;

let question = String::from("Where does Amy live ?");
let context = String::from("Amy lives in Amsterdam");

let answers = qa_model.predict(&[QaInput { question, context }], 1, 32);
```
Output:
```
[Answer { score: 0.9976, start: 13, end: 21, answer: "Amsterdam" }]
```
The tasks currently supported include:
| | Sequence classification | Token classification | Question answering | Text Generation | Summarization | Translation | Masked LM | Sentence Embeddings |
|---|---|---|---|---|---|---|---|---|
| DistilBERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| MobileBERT | ✅ | ✅ | ✅ | | | | ✅ | |
| DeBERTa | ✅ | ✅ | ✅ | | | | ✅ | |
| DeBERTa (v2) | ✅ | ✅ | ✅ | | | | ✅ | |
| FNet | ✅ | ✅ | ✅ | | | | ✅ | |
| BERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| RoBERTa | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| GPT | | | | ✅ | | | | |
| GPT2 | | | | ✅ | | | | |
| GPT-Neo | | | | ✅ | | | | |
| GPT-J | | | | ✅ | | | | |
| BART | ✅ | | | ✅ | ✅ | | | |
| Marian | | | | | | ✅ | | |
| MBart | ✅ | | | ✅ | | | | |
| M2M100 | | | | ✅ | | | | |
| NLLB | | | | ✅ | | | | |
| Electra | | ✅ | | | | | ✅ | |
| ALBERT | ✅ | ✅ | ✅ | | | | ✅ | ✅ |
| T5 | | | | ✅ | ✅ | ✅ | | ✅ |
| LongT5 | | | | ✅ | ✅ | | | |
| XLNet | ✅ | ✅ | ✅ | ✅ | | | ✅ | |
| Reformer | ✅ | | ✅ | ✅ | | | ✅ | |
| ProphetNet | | | | ✅ | ✅ | | | |
| Longformer | ✅ | ✅ | ✅ | | | | ✅ | |
| Pegasus | | | | | ✅ | | | |
This library relies on the tch crate for bindings to the C++ Libtorch API. The libtorch library required can be downloaded either automatically or manually. The following provides a reference on how to set up your environment to use these bindings; please refer to the tch documentation for detailed information or support.
Additionally, this library relies on a cache folder for downloading pre-trained models. This cache location defaults to `~/.cache/.rustbert`, but can be changed by setting the `RUSTBERT_CACHE` environment variable. Note that the language models used by this library are in the order of the hundreds of MBs to GBs.
Manual installation: download `libtorch` from https://pytorch.org/get-started/locally/. This package requires `v2.4`: if this version is no longer available on the "get started" page, the file should be accessible by modifying the target link, for example https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.4.0%2Bcu124.zip for a Linux version with CUDA12. NOTE: when using `rust-bert` as a dependency from crates.io, please check the required `LIBTORCH` on the published package readme, as it may differ from the version documented here (which applies to the current repository version). Extract the library to a location of your choice and set the following environment variables.

For Linux:

```bash
export LIBTORCH=/path/to/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
```

For Windows:

```powershell
$Env:LIBTORCH = "X:\path\to\libtorch"
$Env:Path += ";X:\path\to\libtorch\lib"
```
For macOS (Homebrew):

```bash
brew install pytorch jq
export LIBTORCH=$(brew --cellar pytorch)/$(brew info --json pytorch | jq -r '.[0].installed[0].version')
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
```
Alternatively, you can let the `build` script automatically download the `libtorch` library for you. The `download-libtorch` feature flag needs to be enabled for this. The CPU version of libtorch will be downloaded by default. To download a CUDA version instead, set the environment variable `TORCH_CUDA_VERSION` to `cu124`. Note that the libtorch library is large (in the order of several GBs for the CUDA-enabled version) and the first build may therefore take several minutes to complete.
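Enabling this flag in `Cargo.toml` might look like the following sketch (the version string is illustrative and may not be current; check the latest release on crates.io):

```toml
[dependencies]
# "download-libtorch" lets the build script fetch libtorch automatically.
# Set TORCH_CUDA_VERSION=cu124 before building to download the CUDA variant.
rust-bert = { version = "0.23", features = ["download-libtorch"] }
```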
Verify your installation (and the linking with libtorch) by adding the `rust-bert` dependency to your `Cargo.toml`, or by cloning the rust-bert source and running an example:
```bash
git clone git@github.com:guillaume-be/rust-bert.git
cd rust-bert
cargo run --example sentence_embeddings
```
ONNX support can be enabled via the optional `onnx` feature. This crate then leverages the ort crate with bindings to the onnxruntime C++ library. We refer the user to that project's page for further installation instructions/support.
In order to use ONNX models:
- Enable the optional `onnx` feature. The `rust-bert` crate does not include any optional dependencies for `ort`; the end user should select the set of features that is adequate for pulling the required `onnxruntime` C++ library.
- The current recommended installation is to use dynamic linking by enabling the `load-dynamic` cargo feature for `ort`.
- Set the `ORT_DYLIB_PATH` environment variable to point to the location of the downloaded onnxruntime library (`onnxruntime.dll`/`libonnxruntime.so`/`libonnxruntime.dylib` depending on the operating system). These can be downloaded from the release page of the onnxruntime project.

Most architectures (including encoders, decoders and encoder-decoders) are supported. The library aims at keeping compatibility with models exported using the Optimum library. A detailed guide on how to export a Transformer model to ONNX using Optimum is available at https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model. The resources used to create ONNX models are similar to those based on Pytorch, replacing the Pytorch model with the ONNX one. Since ONNX models are less flexible than their Pytorch counterparts in the handling of optional arguments, exporting a decoder or encoder-decoder model to ONNX will usually result in multiple files. These files are expected (but not all are necessary) for use in this library, as per the table below:
| Architecture | Encoder file | Decoder without past file | Decoder with past file |
|---|---|---|---|
| Encoder (e.g. BERT) | required | not used | not used |
| Decoder (e.g. GPT2) | not used | required | optional |
| Encoder-decoder (e.g. BART) | required | required | optional |
Note that computational efficiency will drop when the `decoder with past` file is optional but not provided, since the model will not use cached past keys and values for the attention mechanism, leading to a high number of redundant computations. The Optimum library offers export options to ensure such a `decoder with past` file is created. The base encoder and decoder model architectures are available (and exposed for convenience) in the `encoder` and `decoder` modules, respectively.
Generation models (pure decoder or encoder/decoder architectures) are available in the `models` module. Most pipelines are available for ONNX model checkpoints, including sequence classification, zero-shot classification, token classification (including named entity recognition and part-of-speech tagging), question answering, text generation, summarization and translation. When used in a pipeline, these models use the same configuration and tokenizer files as their Pytorch counterparts. Examples leveraging ONNX models are given in the `./examples` directory.
Ready-to-use, end-to-end NLP pipelines based on Hugging Face's Pipelines are available as part of this crate. The following capabilities are currently available:
Disclaimer: the contributors of this repository are not responsible for any generation resulting from third-party utilization of the pretrained systems proposed herein.
Extractive question answering from a given question and context. DistilBERT model fine-tuned on SQuAD (Stanford Question Answering Dataset).
```rust
let qa_model = QuestionAnsweringModel::new(Default::default())?;

let question = String::from("Where does Amy live ?");
let context = String::from("Amy lives in Amsterdam");

let answers = qa_model.predict(&[QaInput { question, context }], 1, 32);
```
Output:
```
[Answer { score: 0.9976, start: 13, end: 21, answer: "Amsterdam" }]
```
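The `start` and `end` fields are character offsets into the context string; judging by the output above they appear to be inclusive, so the answer text can be recovered by slicing the context directly. A plain-Rust sketch (offsets copied from the example output; the exact offset convention should be checked against the crate documentation):

```rust
fn main() {
    let context = "Amy lives in Amsterdam";
    // Offsets as reported by the pipeline in the example above.
    let (start, end) = (13usize, 21usize);
    // Inclusive range slicing recovers the answer span.
    let answer = &context[start..=end];
    assert_eq!(answer, "Amsterdam");
}
```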
Translation pipeline supporting a broad set of source and target languages, leveraging two main architectures for translation tasks. Marian-based pretrained models for a number of language pairs are readily available in the library - but the user can import any Pytorch-based model for predictions. For languages not supported by the proposed pretrained Marian models, the user can leverage an M2M100 model supporting direct translation between 100 languages (without intermediate English translation). The full list of supported languages is available in the crate documentation.
```rust
use rust_bert::pipelines::translation::{Language, TranslationModelBuilder};

fn main() -> anyhow::Result<()> {
    let model = TranslationModelBuilder::new()
        .with_source_languages(vec![Language::English])
        .with_target_languages(vec![Language::Spanish, Language::French, Language::Italian])
        .create_model()?;

    let input_text = "This is a sentence to be translated";
    let output = model.translate(&[input_text], None, Language::French)?;

    for sentence in output {
        println!("{}", sentence);
    }
    Ok(())
}
```
Output:
```
Il s'agit d'une phrase à traduire
```
Abstractive summarization using a pretrained BART model.
```rust
let summarization_model = SummarizationModel::new(Default::default())?;

let input = ["In findings published Tuesday in Cornell University's arXiv by a team of scientists
from the University of Montreal and a separate report published Wednesday in Nature Astronomy by a team
from University College London (UCL), the presence of water vapour was confirmed in the atmosphere of K2-18b,
a planet circling a star in the constellation Leo. This is the first such discovery in a planet in its star's
habitable zone — not too hot and not too cold for liquid water to exist. The Montreal team, led by Björn Benneke,
used data from the NASA's Hubble telescope to assess changes in the light coming from K2-18b's star as the planet
passed between it and Earth. They found that certain wavelengths of light, which are usually absorbed by water,
weakened when the planet was in the way, indicating not only does K2-18b have an atmosphere, but the atmosphere
contains water in vapour form. The team from UCL then analyzed the Montreal team's data using their own software
and confirmed their conclusion. This was not the first time scientists have found signs of water on an exoplanet,
but previous discoveries were made on planets with high temperatures or other pronounced differences from Earth.
\"This is the first potentially habitable planet where the temperature is right and where we now know there is water,\"
said UCL astronomer Angelos Tsiaras. \"It's the best candidate for habitability right now.\" \"It's a good sign\",
said Ryan Cloutier of the Harvard–Smithsonian Center for Astrophysics, who was not one of either study's authors.
\"Overall,\" he continued, \"the presence of water in its atmosphere certainly improves the prospect of K2-18b being
a potentially habitable planet, but further observations will be required to say for sure.\"
K2-18b was first identified in 2015 by the Kepler space telescope. It is about 110 light-years from Earth and larger
but less dense. Its star, a red dwarf, is cooler than the Sun, but the planet's orbit is much closer, such that a year
on K2-18b lasts 33 Earth days. According to The Guardian, astronomers were optimistic that NASA's James Webb space
telescope — scheduled for launch in 2021 — and the European Space Agency's 2028 ARIEL program, could reveal more
about exoplanets like K2-18b."];

let output = summarization_model.summarize(&input);
```
(Example from: WikiNews)
Output:
"Scientists have found water vapour on K2-18b, a planet 110 light-years from Earth.
This is the first such discovery in a planet in its star's habitable zone.
The planet is not too hot and not too cold for liquid water to exist."
Conversation model based on Microsoft's DialoGPT. This pipeline allows the generation of single- or multi-turn conversations between a human and a model. The DialoGPT page states that "The human evaluation results indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test." (DialoGPT repository)
The model uses a `ConversationManager` to keep track of active conversations and generate responses to them.
```rust
use rust_bert::pipelines::conversation::{ConversationModel, ConversationManager};

let conversation_model = ConversationModel::new(Default::default())?;
let mut conversation_manager = ConversationManager::new();

let conversation_id = conversation_manager.create("Going to the movies tonight - any suggestions?");
let output = conversation_model.generate_responses(&mut conversation_manager);
```
Example output:
"The Big Lebowski."
Generate language based on a prompt. GPT2 and GPT are available as base models. Includes techniques such as beam search, top-k and nucleus sampling, temperature setting and repetition penalty. Supports batch generation of sentences from several prompts. Sequences will be left-padded with the model's padding token if present, the unknown token otherwise. This may impact the results; it is recommended to submit prompts of similar length for best results.
```rust
let model = GPT2Generator::new(Default::default())?;

let input_context_1 = "The dog";
let input_context_2 = "The cat was";

let generate_options = GenerateOptions {
    max_length: 30,
    ..Default::default()
};

let output = model.generate(Some(&[input_context_1, input_context_2]), generate_options);
```
Example output:
```
[
    "The dog's owners, however, did not want to be named. According to the lawsuit, the animal's owner, a 29-year"
    "The dog has always been part of the family. "He was always going to be my dog and he was always looking out for me"
    "The dog has been able to stay in the home for more than three months now. "It's a very good dog. She's"
    "The cat was discovered earlier this month in the home of a relative of the deceased. The cat's owner, who wished to remain anonymous,"
    "The cat was pulled from the street by two-year-old Jazmine."I didn't know what to do," she said"
    "The cat was attacked by two stray dogs and was taken to a hospital. Two other cats were also injured in the attack and are being treated."
]
```
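The top-k filtering and temperature techniques mentioned above can be illustrated in isolation. The sketch below is not the crate's implementation (which also handles nucleus sampling, repetition penalties and batching); it only shows how logits are restricted to the k largest values and converted into a sampling distribution:

```rust
/// Keep only the k largest logits, apply temperature scaling, and return
/// normalized sampling probabilities (zero for filtered-out tokens).
fn top_k_probs(logits: &[f64], k: usize, temperature: f64) -> Vec<f64> {
    // Indices sorted by descending logit; the first k are kept.
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    let kept: Vec<usize> = order.into_iter().take(k).collect();

    // Softmax with temperature over the kept logits only.
    let weights: Vec<f64> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| if kept.contains(&i) { (l / temperature).exp() } else { 0.0 })
        .collect();
    let z: f64 = weights.iter().sum();
    weights.into_iter().map(|w| w / z).collect()
}

fn main() {
    // Three-token vocabulary; only the two most likely tokens stay samplable.
    let probs = top_k_probs(&[2.0, 1.0, 0.0], 2, 1.0);
    assert_eq!(probs[2], 0.0);
    assert!((probs[0] + probs[1] - 1.0).abs() < 1e-9);
    assert!(probs[0] > probs[1]);
}
```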
Performs zero-shot classification on input sentences with provided labels, using a model fine-tuned for Natural Language Inference.
```rust
let sequence_classification_model = ZeroShotClassificationModel::new(Default::default())?;

let input_sentence = "Who are you voting for in 2020?";
let input_sequence_2 = "The prime minister has announced a stimulus package which was widely criticized by the opposition.";
let candidate_labels = &["politics", "public health", "economics", "sports"];

let output = sequence_classification_model.predict_multilabel(
    &[input_sentence, input_sequence_2],
    candidate_labels,
    None,
    128,
);
```
Output:
```
[
    [ Label { "politics", score: 0.972 }, Label { "public health", score: 0.032 }, Label { "economics", score: 0.006 }, Label { "sports", score: 0.004 } ],
    [ Label { "politics", score: 0.975 }, Label { "public health", score: 0.0818 }, Label { "economics", score: 0.852 }, Label { "sports", score: 0.001 } ],
]
```
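Note that in the multi-label output above the scores for one input need not sum to 1: each candidate label is scored independently. One common way such independent scores arise in NLI-based zero-shot classifiers (shown here as a general illustration, not as this crate's exact code) is a sigmoid over the entailment logit for a hypothesis such as "This example is about {label}":

```rust
/// Sigmoid turning an independent entailment logit into a per-label score.
fn label_score(entailment_logit: f64) -> f64 {
    1.0 / (1.0 + (-entailment_logit).exp())
}

fn main() {
    // A logit of 0 corresponds to complete uncertainty (score 0.5).
    assert!((label_score(0.0) - 0.5).abs() < 1e-12);
    // Large positive logits saturate towards 1, as for "politics" above.
    assert!(label_score(3.5) > 0.95);
}
```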
Predicts the binary sentiment for a sentence. DistilBERT model fine-tuned on SST-2.
```rust
let sentiment_classifier = SentimentModel::new(Default::default())?;

let input = [
    "Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring.",
    "This film tried to be too many things all at once: stinging political satire, Hollywood blockbuster, sappy romantic comedy, family values promo...",
    "If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.",
];

let output = sentiment_classifier.predict(&input);
```
(Example courtesy of IMDb)
Output:
```
[
    Sentiment { polarity: Positive, score: 0.9981985493795946 },
    Sentiment { polarity: Negative, score: 0.9927982091903687 },
    Sentiment { polarity: Positive, score: 0.9997248985164333 }
]
```
Extracts entities (Person, Location, Organization, Miscellaneous) from text. BERT large model fine-tuned on CoNNL03, contributed by the MDZ Digital Library team at the Bavarian State Library. Models are currently available for English, German, Spanish and Dutch.
```rust
let ner_model = NERModel::new(Default::default())?;

let input = [
    "My name is Amy. I live in Paris.",
    "Paris is a city in France.",
];

let output = ner_model.predict(&input);
```
Output:
```
[
    [
        Entity { word: "Amy", score: 0.9986, label: "I-PER" }
        Entity { word: "Paris", score: 0.9985, label: "I-LOC" }
    ],
    [
        Entity { word: "Paris", score: 0.9988, label: "I-LOC" }
        Entity { word: "France", score: 0.9993, label: "I-LOC" }
    ]
]
```
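The output above is token-level: a multi-word entity such as "New York" would be returned as separate tokens sharing a label. A minimal sketch of merging consecutive tokens with the same label into full entities (purely illustrative; it would also wrongly merge two distinct adjacent entities of the same type, which a production implementation must handle):

```rust
/// Merge consecutive (word, label) tokens sharing the same label into
/// single entity strings, e.g. "New" + "York" (both I-LOC) -> "New York".
fn merge_entities(tokens: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut merged: Vec<(String, String)> = Vec::new();
    for &(word, label) in tokens {
        match merged.last_mut() {
            Some((text, last_label)) if last_label == label => {
                text.push(' ');
                text.push_str(word);
            }
            _ => merged.push((word.to_string(), label.to_string())),
        }
    }
    merged
}

fn main() {
    let tokens = [("New", "I-LOC"), ("York", "I-LOC"), ("Amy", "I-PER")];
    let merged = merge_entities(&tokens);
    assert_eq!(merged.len(), 2);
    assert_eq!(merged[0], ("New York".to_string(), "I-LOC".to_string()));
}
```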
Extract keywords and keyphrases from input documents.
```rust
fn main() -> anyhow::Result<()> {
    let keyword_extraction_model = KeywordExtractionModel::new(Default::default())?;

    let input = "Rust is a multi-paradigm, general-purpose programming language.
Rust emphasizes performance, type safety, and concurrency. Rust enforces memory safety—that is,
that all references point to valid memory—without requiring the use of a garbage collector or
reference counting present in other memory-safe languages. To simultaneously enforce
memory safety and prevent concurrent data races, Rust's borrow checker tracks the object lifetime
and variable scope of all references in a program during compilation. Rust is popular for
systems programming but also offers high-level features including functional programming constructs.";

    let output = keyword_extraction_model.predict(&[input])?;
    Ok(())
}
```
Output:
"rust" - 0.50910604
"programming" - 0.35731024
"concurrency" - 0.33825397
"concurrent" - 0.31229728
"program" - 0.29115444
Extracts Part of Speech tags (Noun, Verb, Adjective...) from text.
```rust
let pos_model = POSModel::new(Default::default())?;

let input = ["My name is Bob"];
let output = pos_model.predict(&input);
```
Output:
```
[
    Entity { word: "My", score: 0.1560, label: "PRP" }
    Entity { word: "name", score: 0.6565, label: "NN" }
    Entity { word: "is", score: 0.3697, label: "VBZ" }
    Entity { word: "Bob", score: 0.7460, label: "NNP" }
]
```
Generate sentence embeddings (vector representations). These can be used for applications including dense information retrieval.
```rust
let model = SentenceEmbeddingsBuilder::remote(
    SentenceEmbeddingsModelType::AllMiniLmL12V2
).create_model()?;

let sentences = [
    "this is an example sentence",
    "each sentence is converted",
];

let output = model.encode(&sentences)?;
```
Output:
```
[
    [-0.000202666, 0.08148022, 0.03136178, 0.002920636 ...],
    [0.064757116, 0.048519745, -0.01786038, -0.0479775 ...]
]
```
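A typical way to use these embeddings for dense retrieval is ranking documents by cosine similarity to a query embedding. A small self-contained helper (not part of the crate's API):

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Identical directions score 1.0, orthogonal directions 0.0.
    assert!((cosine_similarity(&[1.0, 0.0], &[2.0, 0.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```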
Predict masked words in input sentences.
```rust
let model = MaskedLanguageModel::new(Default::default())?;

let sentences = [
    "Hello I am a <mask> student",
    "Paris is the <mask> of France. It is <mask> in Europe.",
];

let output = model.predict(&sentences);
```
Output:
```
[
    [MaskedToken { text: "college", id: 2267, score: 8.091 }],
    [
        MaskedToken { text: "capital", id: 3007, score: 16.7249 },
        MaskedToken { text: "located", id: 2284, score: 9.0452 }
    ]
]
```
For simple pipelines (sequence classification, token classification, question answering), the performance between Python and Rust is expected to be comparable. This is because the most expensive part of these pipelines is the language model itself, which shares a common implementation in the Torch backend. The End-to-end NLP Pipelines in Rust post provides a benchmark section covering all pipelines.
For text generation tasks (summarization, translation, conversation, free text generation), significant benefits can be expected (2 to 4 times faster processing depending on the input and application). The article Accelerating text generation with Rust focuses on these text generation applications and provides more details on the performance comparison with Python.
Base models and task-specific heads are also available for users looking to expose their own transformer-based models. Examples of how to prepare the data using the native tokenizers Rust library are available in `./examples` for BERT, DistilBERT, RoBERTa, GPT, GPT2 and BART. Note that when importing models from Pytorch, the convention for parameter naming needs to be aligned with the Rust schema. Loading of the pre-trained weights will fail if any of the model parameter weights cannot be found in the weight files. If this quality check is to be skipped, an alternative method `load_partial` can be invoked from the variable store.
Pretrained models are available on Hugging Face's model hub and can be loaded using the `RemoteResources` defined in this library.
A conversion utility script is included in `./utils` to convert Pytorch weights to a set of weights compatible with this library. This script requires Python and `torch` to be set up.