This is a fork of the INSTRUCTOR model, created because the original repository is no longer maintained. It also includes some improvements to the original code, such as making it work with current versions of the `sentence-transformers` library.

This repository contains the code and pre-trained models for our paper *One Embedder, Any Task: Instruction-Finetuned Text Embeddings*. Please refer to our project page for a quick overview.

We introduce INSTRUCTOR, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and any domain (e.g., science, finance, etc.) simply by providing the task instruction, without any further finetuning. INSTRUCTOR achieves state-of-the-art results on 70 diverse embedding tasks.
It is very easy to use the INSTRUCTOR `encode` function for any text embedding. You can try it out in the Colab notebook. On your local machine, we recommend first creating a virtual environment:
```
conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt
```
This will create the `instructor` environment we used. To use the embedding tool, first install the `InstructorEmbedding` package from PyPI:
```
pip install InstructorEmbedding
```
or install it directly from our code:
```
pip install -e .
```
Activate the environment by running:
```
conda activate instructor
```
First download a pretrained model (see the model list below for all available models):
```python
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
```
Then provide the model with sentences and customized instructions:
```python
# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:",
     "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:",
     "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# convert each pair into the [instruction, text] format expected by encode
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)
```
And that's it! We now have a list of numpy arrays with the embeddings.
```python
for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")
```
Users of the model only need to call the `encode` function:
```python
model.encode(sentences,
             batch_size: int = 32,
             show_progress_bar: bool = None,
             output_value: str = 'sentence_embedding',
             convert_to_numpy: bool = True,
             convert_to_tensor: bool = False,
             device: str = None,
             normalize_embeddings: bool = False)
```
- `sentences`: the sentences to be embedded. The format should be `[["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...]`.
- `batch_size` (default: 32): the batch size used for the computation. It determines how many sentences are processed together in each batch.
- `show_progress_bar` (default: None): if set to `True`, a progress bar is displayed while encoding sentences, giving a visual indication of encoding progress.
- `output_value` (default: 'sentence_embedding'): specifies the desired output type. The default, 'sentence_embedding', returns sentence embeddings. Setting it to 'token_embeddings' returns word-token embeddings. Setting it to None returns all output values.
- `convert_to_numpy` (default: True): if set to `True`, the output is a list of numpy vectors. If set to `False`, the output is a list of PyTorch tensors.
- `convert_to_tensor` (default: False): if set to `True`, the function returns a stacked tensor as a single output. This parameter overrides any setting specified by `convert_to_numpy`.
- `device` (default: None): specifies the torch.device to use for the computation. If unspecified, the function uses the default device.
- `normalize_embeddings` (default: False): if set to `True`, the returned vectors have length 1, i.e., they are normalized. In that case, similarity search can use the faster dot product (`util.dot_score`) instead of cosine similarity. A short sketch combining several of these parameters follows the model table below.

We release a series of INSTRUCTOR checkpoints of different sizes. You can easily load these models with the `InstructorEmbedding` package.
| Model | Avg. Score |
|---|---|
| hkunlp/instructor-base | 55.9 |
| hkunlp/instructor-large | 58.4 |
| hkunlp/instructor-xl | 58.8 |
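As a minimal sketch that combines several of the `encode` parameters described above (the checkpoint choice, batch size, and CPU/GPU switch are illustrative assumptions):

```python
import torch
from InstructorEmbedding import INSTRUCTOR

# any checkpoint from the table above works here
model = INSTRUCTOR('hkunlp/instructor-large')

pairs = [["Represent the Science title:",
          "3D ActionSLAM: wearable person tracking in multi-floor environments"]]

# convert_to_tensor overrides convert_to_numpy and returns one stacked tensor;
# normalize_embeddings makes each row unit-length, so a similarity search can
# use the faster dot product (util.dot_score) instead of cosine similarity
emb = model.encode(pairs,
                   batch_size=8,
                   convert_to_tensor=True,
                   normalize_embeddings=True,
                   device='cuda' if torch.cuda.is_available() else 'cpu')
print(emb.shape, float(emb.norm(dim=1)[0]))  # the norm should be ~1.0
```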
We provide a few specific use cases below. For more examples and applications, refer to our paper.
If you want to calculate customized embeddings for specific sentences, you can follow the unified template to write instructions:

Represent the `domain` `text_type` for `task_objective`:

- `domain` is optional; it specifies the domain of the text, e.g., science, finance, medicine, etc.
- `text_type` is required; it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
- `task_objective` is optional; it specifies the objective of the embedding, e.g., retrieve a document, classify the sentence, etc.

You can use INSTRUCTOR to compute similarities between two groups of sentences, with customized embeddings:
```python
from sklearn.metrics.pairwise import cosine_similarity

sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]

embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)

# similarities[i][j] is the cosine similarity between sentences_a[i] and sentences_b[j]
similarities = cosine_similarity(embeddings_a, embeddings_b)
```
You can also use customized embeddings for information retrieval:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query = [['Represent the Wikipedia question for retrieving supporting documents: ', 'where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ', 'Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ', "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ', 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]

query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)

# rank corpus documents by cosine similarity to the query and pick the best one
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
```
Or use customized embeddings for clustering:

```python
import sklearn.cluster

sentences = [['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ', "QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher, Vector States of Charmonium']]

embeddings = model.encode(sentences)

# cluster the embeddings into two groups and print each sentence's cluster label
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
```
We construct Multitask Embeddings Data with Instructions (MEDI), consisting of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformers embedding training data, KILT, and MedMCQA, spanning a wide range of domains and tasks. We construct positive and negative pairs where they are not provided, and store them in a unified format:
```
[
{'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show's renewal for a second season. Critical reviews of the series have been generally positive, citing the show's positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1}
{'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band's most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John's single "Physical" for nine consecutive weeks, and then by Hall & Oates' "I Can't Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band's biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1}
...
{'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini Appletini\nAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.\nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.\nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).\nVarieties of aromatised wine.\nVarieties of aromatised wine Vermouth.\nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]
```
Each instance consists of a query, a positive pair, a negative pair, and a task id, which is used to ensure that data in the same training batch come from the same task. The MEDI data is available for download at this link.
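As a minimal sketch of working with this format (assuming `medi-data.json` has been downloaded and unzipped as described below, and that the file is a single JSON list as in the excerpt above):

```python
import json
from collections import defaultdict

# load the unified-format MEDI training data
with open('medi-data.json') as f:
    instances = json.load(f)

# each instance holds [instruction, text] pairs for query/pos/neg plus a task_id
example = instances[0]
print(example['query'])  # [instruction, query text]
print(example['pos'])    # [instruction, positive passage]
print(example['neg'])    # [instruction, negative passage]

# group instances by task_id, since a training batch must draw from a single task
by_task = defaultdict(list)
for inst in instances:
    by_task[inst['task_id']].append(inst)
print(len(by_task), 'tasks')
```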
We provide an example script for training INSTRUCTOR. You may need to first download the MEDI data, unzip the folder, and place `medi-data.json` under `--cache_dir`.
```
python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir
```
We explain the arguments as follows:
- `--model_name_or_path`: the pretrained checkpoint to start from. We support both model ids (e.g., `sentence-transformers/gtr-t5-large`, `sentence-transformers/sentence-t5-large`) and checkpoint paths (e.g., a checkpoint saved by the transformers trainer).
- `--cl_temperature`: the temperature for the contrastive loss.
- `--cache_dir`: the directory for caching downloaded models and data. The downloaded MEDI data (`medi-data.json`) should be placed under `--cache_dir`.
- `--output_dir`: the directory for storing the trained models (checkpoints) for evaluation.

All other arguments are standard HuggingFace `transformers` training arguments, such as `--overwrite_output_dir`, `--num_train_epochs`, and `--learning_rate`. See the HuggingFace transformers documentation for details.
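For example, to continue training from a locally saved trainer checkpoint instead of a model id (the checkpoint path below is hypothetical, following the `checkpoint-{step}` naming the transformers trainer uses with `--save_steps 500`):

```
python train.py --model_name_or_path {output_directory}/checkpoint-500 --output_dir {new_output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir
```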
We evaluate INSTRUCTOR at scale on 70 diverse tasks, spanning a wide range of task categories and domains. Specifically, we build our evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. We explain the details of running the evaluation scripts below.
To evaluate the model performance on the MTEB benchmark datasets, first install the MTEB library:
```
cd evaluation/MTEB
pip install -e .
```
Then run the following command:
```
python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results
```
You can evaluate your trained model checkpoints by specifying `--model_name` and run all MTEB datasets by changing `--task_name`. Check our paper or the MTEB benchmark for the evaluation metrics of all tasks.
To evaluate the model performance on Billboard, run the following command:
```
cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt
```
You can evaluate your trained model checkpoints by specifying `--model_name` and run all Billboard datasets by changing `--task`. On all three datasets in Billboard, we report the Pearson correlation.
To evaluate the model performance on Prompt Retrieval, run the following command:
```
cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt
```
You can evaluate your trained model checkpoints by specifying `--embedding_model` and run other Prompt Retrieval datasets by changing `--task`. To have consistent metrics, we cast all tasks in Prompt Retrieval into a "text-to-text" format and report the Rouge-L score.
To quantize the INSTRUCTOR embedding model, run the following code:
```python
# imports
import torch
from InstructorEmbedding import INSTRUCTOR

# load the model
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')  # you can also use a GPU

# quantize the model's linear layers to 8-bit integers
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# inference
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = qmodel.encode([[instruction, sentence]])

# you can also normalize the embeddings: normalize_embeddings=True
print(f"Quantized Embeddings:\n{embeddings}")
```
This reduces the model size by 10x, and the inference time will be less than for the normal model :)
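To check the latency claim on your own machine, here is a rough wall-clock timing sketch; it reuses `model` and `qmodel` from the block above, and the number of runs is an arbitrary choice:

```python
import time

def mean_encode_time(m, pairs, n_runs=5):
    # average wall-clock time over n_runs encode calls
    start = time.perf_counter()
    for _ in range(n_runs):
        m.encode(pairs)
    return (time.perf_counter() - start) / n_runs

pairs = [["Represent the Science title:",
          "3D ActionSLAM: wearable person tracking in multi-floor environments"]]
print(f"fp32 model:      {mean_encode_time(model, pairs):.3f}s per call")
print(f"quantized model: {mean_encode_time(qmodel, pairs):.3f}s per call")
```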
If you have any questions about the code or the paper, feel free to email Hongjin ([email protected]) and Weijia ([email protected]). Please try to describe the problem in detail so that we can help you better and quicker!
If you find our work helpful, please cite us:
```
@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}
```
We thank the community for its efforts in extending INSTRUCTOR! For example, the Haystack integration offers `InstructorTextEmbedder` and `InstructorDocumentEmbedder` components.