This is a fork of the INSTRUCTOR model, since the original repository is no longer maintained. I have also made some improvements to its source code: it now works with the current `sentence-transformers` library. This repository contains the code and pre-trained models for our paper "One Embedder, Any Task: Instruction-Finetuned Text Embeddings". Please refer to our project page for a quick overview of the project.

We introduce INSTRUCTOR, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) simply by providing the task instruction, without any further finetuning. INSTRUCTOR achieves outstanding results on 70 diverse embedding tasks!
It is very easy to compute text embeddings with INSTRUCTOR via the `encode` function. You can try it out in a Colab notebook. On your local machine, we recommend first creating a virtual environment:
```
conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt
```
That creates the `instructor` environment we used. To use the embedding tool, first install the `InstructorEmbedding` package from PyPI:
```
pip install InstructorEmbedding
```
or install it directly from our code:

```
pip install -e .
```
Activate the environment by running:

```
conda activate instructor
```
First download a pretrained model (see the model list below for the full list of available models):
```python
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
```
Then provide the sentences together with customized instructions to the model:
```python
# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."},
]

# postprocess
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)
```
And that is it. We now have a list of numpy arrays with the embeddings:
```python
for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("Text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")
```
Users of the model only need the `encode` function:
```
model.encode(sentences,
             batch_size: int = 32,
             show_progress_bar: bool = None,
             output_value: str = 'sentence_embedding',
             convert_to_numpy: bool = True,
             convert_to_tensor: bool = False,
             device: str = None,
             normalize_embeddings: bool = False)
```
- `sentences`: the sentences to be embedded. The format should be `[["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...]`.
- `batch_size` (default: 32): the batch size used for the computation. It determines how many sentences are processed together per batch.
- `show_progress_bar` (default: None): if set to `True`, a progress bar is displayed while encoding, giving a visual indication of the encoding progress.
- `output_value` (default: 'sentence_embedding'): specifies the desired output type. The default 'sentence_embedding' returns sentence embeddings. Setting it to 'token_embeddings' returns word-token embeddings. Setting it to None returns all output values.
- `convert_to_numpy` (default: `True`): if set to `True`, the output is a list of numpy vectors. If set to `False`, the output is a list of PyTorch tensors.
- `convert_to_tensor` (default: `False`): if set to `True`, the function returns a single stacked tensor as output. This parameter overrides any setting specified by `convert_to_numpy`.
- `device` (default: None): specifies the torch.device to use for the computation. If not specified, the function uses the default device.
- `normalize_embeddings` (default: `False`): if set to `True`, the returned vectors have length 1, i.e., they are normalized. In that case, similarity search can use the faster dot product (`util.dot_score`) instead of cosine similarity.

We release a series of INSTRUCTOR checkpoints of different sizes. You can easily load these models with the `InstructorEmbedding` package (see the example after the table below).
| Model | Avg. Score |
|---|---|
| hkunlp/instructor-base | 55.9 |
| hkunlp/instructor-large | 58.4 |
| hkunlp/instructor-xl | 58.8 |
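As a minimal sketch of loading one of these checkpoints and combining it with the `encode` parameters documented above (the parameter values here are purely illustrative, not tuned recommendations):

```python
from InstructorEmbedding import INSTRUCTOR

# load one of the released checkpoints from the table above
model = INSTRUCTOR('hkunlp/instructor-base')

sentences = [
    ["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"],
    ["Represent the Science title:", "Parton energy loss in QCD matter"],
]

# unit-length embeddings returned as a single stacked tensor
embeddings = model.encode(
    sentences,
    batch_size=16,
    show_progress_bar=True,
    convert_to_tensor=True,   # overrides convert_to_numpy
    normalize_embeddings=True,
)
print(embeddings.shape)
```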
We provide some specific use cases below. For more examples and applications, refer to our paper.
If you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions:

    "Represent the `domain` `text_type` for `task_objective`:"

- `domain` is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.
- `text_type` is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
- `task_objective` is optional, and it specifies the objective of the embedding, e.g., retrieve a document, classify the sentence, etc.
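For illustration, here are a few instructions built from this template (the concrete wording is free-form, not a fixed vocabulary; the last entry is a hypothetical example):

```python
# illustrative instructions following "Represent the domain text_type for task_objective:"
instructions = [
    "Represent the Science sentence:",                  # domain + text_type
    "Represent the Financial statement:",               # domain + text_type
    "Represent the Wikipedia document for retrieval:",  # domain + text_type + task_objective
    "Represent the sentence for classification:",       # text_type + task_objective (hypothetical)
]
```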
You can use INSTRUCTOR to compute similarities between two groups of sentences, with customized embeddings:

```python
from sklearn.metrics.pairwise import cosine_similarity

sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]

embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a, embeddings_b)
```
You can also use customized embeddings for information retrieval:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query = [['Represent the Wikipedia question for retrieving supporting documents: ', 'where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ', 'Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ', "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ', 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]

query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
```
Or use customized embeddings for clustering:

```python
import sklearn.cluster

sentences = [['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ', 'QCD corrections to Associated t-tbar-H production at the Tevatron'],
             ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher, Vector States of Charmonium']]

embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
```
We construct Multitask Embeddings Data with Instructions (MEDI), consisting of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformers embedding training data, KILT, and MedMCQA, spanning a wide range of domains and tasks. We construct positive and negative pairs if they are not provided, and store them in a unified format:
```
[
{'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show's renewal for a second season. Critical reviews of the series have been generally positive, citing the show's positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1}
{'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band's most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John's single "Physical" for nine consecutive weeks, and then by Hall & Oates' "I Can't Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band's biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1}
...
{'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini Appletini\nAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.\nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.\nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).\nVarieties of aromatised wine.\nVarieties of aromatised wine Vermouth.\nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]
```
Each instance consists of a query, a positive pair, a negative pair, and a task id, which is used to ensure that data in the same training batch come from the same task. The MEDI data can be downloaded at this link.
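As a small sketch of loading and inspecting the data (assuming the downloaded file is the `medi-data.json` referenced in the training section below):

```python
import json

# load the MEDI training data: a list of instances in the unified format above
with open('medi-data.json') as f:
    medi_data = json.load(f)

instance = medi_data[0]
print(instance['query'])    # [instruction, query text]
print(instance['pos'])      # [instruction, positive document]
print(instance['neg'])      # [instruction, negative document]
print(instance['task_id'])  # batches are built from a single task_id
```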
We provide an example script for training INSTRUCTOR. You may need to first download the MEDI data, unzip the folder, and put `medi-data.json` under `--cache_dir`:
```
python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir
```
We explain the arguments as follows:

- `--model_name_or_path`: the pretrained checkpoint to start from. We support model ids (e.g., `sentence-transformers/gtr-t5-large`, `sentence-transformers/sentence-t5-large`) or checkpoint paths (e.g., a checkpoint saved by the Transformers trainer).
- `--cl_temperature`: the temperature for the contrastive loss (see the sketch below).
- `--cache_dir`: the directory to cache downloaded models and data. The downloaded MEDI data (`medi-data.json`) should be put under the directory `--cache_dir`.
- `--output_dir`: the directory to store the trained models (checkpoints) for evaluation.

All other arguments are standard HuggingFace `transformers` training arguments, such as `--overwrite_output_dir`, `--num_train_epochs`, and `--learning_rate`. For details, refer to the HuggingFace transformers documentation.
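For intuition about `--cl_temperature`, here is a minimal sketch of a temperature-scaled, InfoNCE-style contrastive loss over query/positive/negative embeddings. It is an illustration under our own simplifications, not the exact loss implemented in `train.py`:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, temperature=0.1):
    """Temperature-scaled InfoNCE-style loss (illustrative sketch).

    q, pos, neg: (batch, dim) embeddings of queries, positives, and negatives.
    """
    q, pos, neg = (F.normalize(t, dim=-1) for t in (q, pos, neg))
    # cosine similarity of each query to all positives and negatives in the batch
    logits = torch.cat([q @ pos.T, q @ neg.T], dim=1) / temperature
    # each query's own positive sits on the diagonal of the first block
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# toy usage with random embeddings
q, pos, neg = (torch.randn(8, 768) for _ in range(3))
print(contrastive_loss(q, pos, neg).item())
```

A lower temperature sharpens the softmax over similarities, penalizing hard negatives more strongly.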
We evaluate INSTRUCTOR extensively on 70 diverse tasks, spanning a wide range of task types and domains. Specifically, we build our evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. We explain the details of running the evaluation scripts below.
To evaluate the model performance on the MTEB benchmark datasets, first install the MTEB library:
```
cd evaluation/MTEB
pip install -e .
```
Then run the following command:
```
python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results
```
You can evaluate your trained model checkpoints by specifying `--model_name`, and run all MTEB datasets by changing `--task_name`. Check our paper or the MTEB benchmark for the evaluation metrics of all tasks.
To evaluate the model performance on Billboard, run the following command:
```
cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt
```
You can evaluate your trained model checkpoints by specifying `--model_name`, and run all Billboard datasets by changing `--task`. For all three datasets in Billboard, we report the Pearson correlation.
To evaluate the model performance on Prompt Retrieval, run the following command:
```
cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt
```
You can evaluate your trained model checkpoints by specifying `--embedding_model`, and run other prompt-retrieval datasets by changing `--task`. In order to have a consistent metric, we cast all tasks in Prompt Retrieval into a "text-to-text" format and report the Rouge-L score.
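For reference, Rouge-L scores the longest common subsequence (LCS) between a prediction and a reference. Below is a minimal, self-contained sketch of the F1 variant (the evaluation script may well use a library implementation instead):

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """Rouge-L F1 from LCS length over whitespace tokens (illustrative)."""
    p, r = prediction.split(), reference.split()
    # dynamic-programming LCS table
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pt == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(p)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))  # ~0.833
```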
To quantize the INSTRUCTOR embedding model, run the following code:
```python
# imports
import torch
from InstructorEmbedding import INSTRUCTOR

# load the model
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')  # you can use GPU

# quantize the model
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# inference
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = qmodel.encode([[instruction, sentence]])

# you can also normalize the embeddings: normalize_embeddings=True
print(f"Quantized Embeddings:\n{embeddings}")
```
It reduces the model size by 10x, and the inference time will be lower than for the normal model :)
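To verify the size reduction on your own machine, here is a quick sketch (reusing `model` and `qmodel` from the snippet above; serializing the state dict approximates the on-disk size):

```python
import os
import tempfile
import torch

def size_mb(m):
    # serialize the state dict to a temporary file and measure it
    with tempfile.NamedTemporaryFile(suffix='.pt', delete=False) as f:
        torch.save(m.state_dict(), f.name)
    size = os.path.getsize(f.name) / 1e6
    os.remove(f.name)
    return size

print(f"original:  {size_mb(model):.1f} MB")
print(f"quantized: {size_mb(qmodel):.1f} MB")
```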
If you have any questions about the code or the paper, feel free to email Hongjin ([email protected]) and Weijia ([email protected]). Please try to specify the problem in detail so that we can help you better and more quickly.
If you find our work helpful, please cite us:
```bibtex
@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}
```
We thank the community's efforts in extending INSTRUCTOR! The Haystack integration provides the `InstructorTextEmbedder` and `InstructorDocumentEmbedder` components.