This is a fork of the INSTRUCTOR model repository, created because the original is no longer maintained. I have also made some fixes to the source code, including support for `sentence-transformers` versions above 2.2.2.

This repository contains the code and pre-trained models for our paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Please refer to our project page for a quick project overview.
We introduce INSTRUCTOR, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) simply by providing the task instruction, without any finetuning. INSTRUCTOR achieves state-of-the-art performance on 70 diverse embedding tasks!
It is easy to use INSTRUCTOR for any text embedding task. You can try it out in a Colab notebook. On your local machine, we recommend first creating a virtual environment:
```bash
conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
pip install -r requirements.txt
```
This will create the `instructor` environment we used. To use the embedding tool, first install the `InstructorEmbedding` package from PyPI:
```bash
pip install InstructorEmbedding
```
or install it directly from our code:
```bash
pip install -e .
```
Activate the environment by running:
```bash
conda activate instructor
```
First download a pretrained checkpoint (see the model list below for the full list of available models):
```python
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
```
Then provide sentences and customized instructions to the model:
```python
# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# pair each instruction with its text
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)
```
And that's it! We now have a list of numpy arrays with the embeddings:
```python
for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")
```
Users of our model only need to use the `encode` function:
```python
model.encode(sentences,
             batch_size: int = 32,
             show_progress_bar: bool = None,
             output_value: str = 'sentence_embedding',
             convert_to_numpy: bool = True,
             convert_to_tensor: bool = False,
             device: str = None,
             normalize_embeddings: bool = False)
```
- `sentences`: The sentences to be embedded. They should be in the format `[["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...]`.
- `batch_size` (default: 32): The batch size used for the computation. It determines how many sentences are processed together in each batch.
- `show_progress_bar` (default: `None`): If set to `True`, a progress bar is displayed while encoding sentences, providing a visual indication of the encoding progress.
- `output_value` (default: `'sentence_embedding'`): Specifies the desired output type. The default value `'sentence_embedding'` returns sentence embeddings. Setting it to `'token_embeddings'` returns wordpiece token embeddings. Setting it to `None` returns all output values.
- `convert_to_numpy` (default: `True`): If set to `True`, the output is a list of numpy vectors. If set to `False`, the output is a list of PyTorch tensors.
- `convert_to_tensor` (default: `False`): If set to `True`, the function returns a stacked tensor as a single output. This parameter overrides any setting specified by `convert_to_numpy`.
- `device` (default: `None`): Specifies the torch device to use for the computation. If not specified, the function uses the default device.
- `normalize_embeddings` (default: `False`): If set to `True`, the returned vectors have length 1, i.e. they are normalized. In that case, similarity search can use the faster dot product (`util.dot_score`) instead of cosine similarity.
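For illustration, here is a minimal sketch combining several of these parameters in one call (the sentences and parameter values are arbitrary examples, not recommendations):

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')

# encode two [instruction, text] pairs, returning normalized numpy vectors
embeddings = model.encode(
    [['Represent the Science title:', 'Parton energy loss in QCD matter'],
     ['Represent the Science title:', 'The Chiral Phase Transition in Dissipative Dynamics']],
    batch_size=2,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,  # enables the faster dot-product similarity search
)
print(embeddings.shape)  # (number of inputs, embedding dimension)
```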
We release a series of INSTRUCTOR checkpoints with different sizes. You can easily load these models with the `InstructorEmbedding` package.
Model | Avg. Score |
---|---|
hkunlp/instructor-base | 55.9 |
hkunlp/instructor-large | 58.4 |
hkunlp/instructor-xl | 58.8 |
We provide a few specific use cases below. For more examples and applications, refer to our paper.
If you want to calculate customized embeddings for specific sentences, you can follow the unified template to write instructions:
Represent the `domain` `text_type` for `task_objective`:
- `domain` is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.
- `text_type` is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
- `task_objective` is optional, and it specifies the objective of the embedding, e.g., retrieve a document, classify the sentence, etc.

You can use INSTRUCTOR to compute similarities between two groups of sentences, with customized embeddings:
```python
from sklearn.metrics.pairwise import cosine_similarity

sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a, embeddings_b)
```
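Each entry `similarities[i][j]` is the cosine similarity between sentence `i` of the first group and sentence `j` of the second. In this example one would expect the diagonal entries to be the largest in their rows, since those pairs share both an instruction and a domain:

```python
print(similarities)  # a 2x2 matrix; similarities[0][0] compares the two science sentences
```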
You can also use the customized embeddings for information retrieval:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query = [['Represent the Wikipedia question for retrieving supporting documents: ', 'where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ', 'Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ', "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ', 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
```
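`retrieved_doc_id` is simply the index into `corpus` of the document whose embedding is most similar to the query embedding.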
Use the customized embeddings for clustering texts:

```python
import sklearn.cluster

sentences = [['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ', "QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher, Vector States of Charmonium']]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
```
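`cluster_assignment` contains one cluster index per input sentence (0 or 1 here, since `n_clusters=2`).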
We construct Multitask Embeddings Data with Instructions (MEDI), consisting of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), the sentence-transformers embedding training data, KILT and MedMCQA, spanning a wide range of domains and tasks. We construct positive and negative pairs where they are not provided, and store them in a unified format:
```
[
{'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show's renewal for a second season. Critical reviews of the series have been generally positive, citing the show's positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1}
{'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band's most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John's single "Physical" for nine consecutive weeks, and then by Hall & Oates' "I Can't Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band's biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1}
...
{'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini AppletininAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).nVarieties of aromatised wine.nVarieties of aromatised wine Vermouth.nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]
```
Each instance consists of a query, a positive pair, a negative pair and the id of the task, which is used to ensure that data in the same training batch come from the same task. The MEDI data can be downloaded at this link.
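As a minimal sketch (assuming `medi-data.json` has been downloaded as described below), each instance can be inspected like this:

```python
import json

# load the MEDI data (a JSON list of instances in the unified format above)
with open('medi-data.json') as f:
    medi_data = json.load(f)

instance = medi_data[0]
print(instance['query'])    # [instruction, query text]
print(instance['pos'])      # [instruction, positive passage]
print(instance['neg'])      # [instruction, negative passage]
print(instance['task_id'])  # integer used to batch instances from the same task
```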
We provide an example script for training INSTRUCTOR. You may need to first download the MEDI data, unzip the folder and put `medi-data.json` under `--cache_dir`:
```bash
python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir
```
We explain the arguments as follows:
- `--model_name_or_path`: Pretrained checkpoint to start from. We support both model ids (e.g., `sentence-transformers/gtr-t5-large`, `sentence-transformers/sentence-t5-large`) and checkpoint paths (e.g., a checkpoint saved by the transformers trainer).
- `--cl_temperature`: Temperature for the contrastive loss.
- `--cache_dir`: The directory for caching downloaded models and data. The downloaded MEDI data (`medi-data.json`) should be placed under `--cache_dir`.
- `--output_dir`: The directory for storing the trained models (checkpoints) for evaluation.

All other arguments are standard Huggingface `transformers` training arguments, such as `--overwrite_output_dir`, `--num_train_epochs` and `--learning_rate`. For details, refer to the Huggingface transformers documentation.
We evaluate INSTRUCTOR extensively on 70 diverse tasks, spanning a wide range of task types and domains. Specifically, we build our evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. We explain the details of running the evaluation scripts below.
To evaluate the model performance on the MTEB benchmark datasets, first install the MTEB library:
```bash
cd evaluation/MTEB
pip install -e .
```
Then run the following command:
```bash
python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results
```
You can evaluate trained model checkpoints by specifying `--model_name`, and run all MTEB datasets by changing `--task_name`. Check our paper or the MTEB benchmark for the evaluation metrics of all tasks.
To evaluate the model performance on Billboard, run the following command:
```bash
cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt
```
You can evaluate trained model checkpoints by specifying `--model_name`, and run all Billboard datasets by changing `--task`. On all three datasets in Billboard, we report the Pearson correlation.
To evaluate the model performance on Prompt Retrieval, run the following command:
```bash
cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt
```
You can evaluate trained model checkpoints by specifying `--embedding_model`, and run the Prompt Retrieval datasets by changing `--task`. To have a consistent metric, we cast all tasks in Prompt Retrieval into a "text-to-text" format, and report the Rouge-L score.
To quantize the INSTRUCTOR embedding model, run the following code:
```python
# imports
import torch
from InstructorEmbedding import INSTRUCTOR

# load the model
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')  # you can use GPU

# quantize the model
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# inference
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = qmodel.encode([[instruction, sentence]])
# you can also normalize the embeddings: normalize_embeddings=True
print(f"Quantized Embeddings:\n{embeddings}")
```
This reduces the model size by a factor of 10, and the inference time will be shorter than with the normal model :)
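If you want to check the size reduction yourself, here is a minimal sketch (continuing from the snippet above; the `model_size_mb` helper is illustrative, not part of the package) that serializes each model's weights and compares the byte counts:

```python
import io
import torch

def model_size_mb(m):
    # serialize the state dict to an in-memory buffer and measure its size
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"original:  {model_size_mb(model):.1f} MB")
print(f"quantized: {model_size_mb(qmodel):.1f} MB")
```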
If you have any questions about the code or the paper, feel free to email Hongjin ([email protected]) and Weijia ([email protected]). Please try to specify the problem in detail so we can help you better and more quickly.
If you find our work helpful, please cite us:
```bibtex
@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}
```
We thank the community's efforts to extend INSTRUCTOR, including the `InstructorTextEmbedder` and `InstructorDocumentEmbedder` components.