
My personal fork

This is a fork of the Instructor model, created because the original repository is no longer maintained. I have also made some improvements to the source code:

Fixed it to work with sentence-transformers versions above 2.2.2.
Models are now downloaded correctly from Hugging Face using the new "snapshot download" API.
Added the ability to specify where the model is downloaded with the "cache_dir" parameter.
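
As a rough illustration of what "snapshot download" with a custom cache location means, here is a minimal sketch. It is not the fork's actual code: the helper name download_instructor_snapshot is hypothetical, and it assumes the huggingface_hub package is installed and exposes snapshot_download with repo_id and cache_dir arguments.

```python
from pathlib import Path


def download_instructor_snapshot(repo_id: str, cache_dir: str) -> Path:
    """Fetch all files of a Hub repository into cache_dir and return the local folder.

    Hypothetical helper for illustration only; the fork may organise this differently.
    """
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(repo_id=repo_id, cache_dir=cache_dir)
    return Path(local_path)


# Example call (needs network access; "./models" is just an example location):
# model_dir = download_instructor_snapshot("hkunlp/instructor-large", "./models")
```

Keeping the import inside the function means nothing is downloaded until the helper is actually called.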

What follows is the README from the original repository. Ignore the quantization section, however, because PyTorch has changed its API since then.

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

This repository contains the code and pre-trained models for our paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings. See our project page for a quick overview of the project.

We introduce Instructor, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) simply by providing the task instruction, without any finetuning. Instructor achieves state-of-the-art results on 70 diverse embedding tasks!

**************************** Updates ****************************

Jan 21: We updated the code structure, which now supports easy package installation.
Dec 28: We updated the checkpoint with hard negatives.
Dec 20: We released our paper, code, project page, and checkpoint. Check them out!

Quick Links

One Embedder, Any Task: Instruction-Finetuned Text Embeddings
- Quick Links
- Installation
  - Environment setup
- Getting Started
  - The encode function
- Model List
- Use Cases
  - Calculate embeddings for your customized texts
  - Compute similarities between texts
  - Use customized embeddings for information retrieval
  - Use customized embeddings for clustering
- Training
  - Data
  - Train INSTRUCTOR
- Evaluation
  - MTEB
  - Billboard
  - Prompt Retrieval
- Quantization
- Bugs or questions?
- Citation
- INSTRUCTOR Elsewhere

Installation

It is very easy to use INSTRUCTOR for any text embeddings. You can easily try it out in the Colab notebook. On your local machine, we recommend first creating a virtual environment:

conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt

This will create the instructor environment we used. To use the embedding tool, first install the InstructorEmbedding package from PyPI

pip install InstructorEmbedding

or install it directly from our code

pip install -e .

Environment setup

Activate the environment by running

conda activate instructor
Getting Started

First download a pre-trained model (see the model list for the full set of available models)

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

Then provide the sentence and the customized instruction to the model.

# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# postprocess: turn each dict into an [instruction, text] pair
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)

And that's it. We now have a list of numpy arrays with the embeddings.

for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")

The `encode` function

Users of the model only need the encode function:

model.encode(sentences,
             batch_size: int = 32,
             show_progress_bar: bool = None,
             output_value: str = 'sentence_embedding',
             convert_to_numpy: bool = True,
             convert_to_tensor: bool = False,
             device: str = None,
             normalize_embeddings: bool = False)

sentences: The sentences to be embedded. They must be in the format [["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...].
batch_size (default: 32): The batch size used for the computation. It determines how many sentences are processed together in each batch.
show_progress_bar (default: None): If set to True, displays a progress bar while encoding sentences.
output_value (default: 'sentence_embedding'): Specifies the desired output type. The default value 'sentence_embedding' returns sentence embeddings. Setting it to 'token_embeddings' returns token embeddings. Setting it to None returns all output values.
convert_to_numpy (default: True): If set to True, the output is a list of numpy vectors. If set to False, the output is a list of PyTorch tensors.
convert_to_tensor (default: False): If set to True, the function returns one stacked tensor as a single output. This parameter overrides any setting specified by convert_to_numpy.
device (default: None): Specifies the torch.device to use for the computation. If not specified, the function uses the default device.
normalize_embeddings (default: False): If set to True, the returned vectors have length 1, i.e. they are normalized. In that case, similarity search can use the faster dot product (util.dot_score) instead of cosine similarity.
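
To see why normalized embeddings allow the dot product to stand in for cosine similarity, the following sketch compares the two on toy vectors. It uses plain Python lists instead of real model output, and the helper names (l2_normalize, dot, cosine) are illustrative, not part of the InstructorEmbedding package.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector so that its Euclidean length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two embeddings.
a = [3.0, 4.0]
b = [1.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # same value, computed as a dot product of unit vectors
```

Once the vectors have length 1, the division inside the cosine formula is a division by 1, so only the dot product remains.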

Model List

We released a series of INSTRUCTOR checkpoints of different sizes. You can easily load these models with the InstructorEmbedding package.

Model	Avg. Score
hkunlp/instructor-base	55.9
hkunlp/instructor-large	58.4
hkunlp/instructor-xl	58.8

Use Cases

We provide a few specific use cases below. For more examples and applications, refer to our paper.

Calculate embeddings for your customized texts

If you want to calculate customized embeddings for specific sentences, you can follow this unified template to write instructions:

Represent the domain text_type for task_objective:

domain is optional and specifies the domain of the text, e.g., science, finance, medicine, etc.
text_type is required and specifies the encoding unit, e.g., sentence, document, paragraph, etc.
task_objective is optional and specifies the objective of the embedding, e.g., retrieve a document, classify the sentence, etc.
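
The template can be filled in programmatically. The sketch below is one possible helper; build_instruction is a hypothetical name that does not exist in the InstructorEmbedding package, and the exact wording of instructions is ultimately up to you.

```python
from typing import Optional


def build_instruction(text_type: str,
                      domain: Optional[str] = None,
                      task_objective: Optional[str] = None) -> str:
    """Assemble 'Represent the [domain] text_type [for task_objective]:'."""
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain)            # optional domain, e.g. "Science"
    parts.append(text_type)             # required unit, e.g. "sentence"
    if task_objective:
        parts.append(f"for {task_objective}")  # optional objective
    return " ".join(parts) + ":"


print(build_instruction("title", domain="Science"))
print(build_instruction("sentence", "Medicine", "retrieving a duplicate sentence"))
```

The two calls reproduce the instructions used in the Getting Started example.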

Compute similarities between texts

You can use INSTRUCTOR to compute similarities between two groups of sentences, with customized embeddings.

from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
# similarities[i][j] compares sentences_a[i] with sentences_b[j]
similarities = cosine_similarity(embeddings_a, embeddings_b)

Use customized embeddings for information retrieval

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query = [['Represent the Wikipedia question for retrieving supporting documents: ', 'where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ', 'Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ', "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ', 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
# index of the corpus document most similar to the single query
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
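
The example above returns only the single best document for one query. If you want a ranked list instead, a small helper like the following sketch can be applied to one row of the similarity matrix. The name top_k and the scores are made up for illustration; with several queries, each row would be ranked separately.

```python
def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up similarity scores of one query against three documents.
example_scores = [0.71, 0.64, 0.83]
print(top_k(example_scores, 2))  # [(2, 0.83), (0, 0.71)]
```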

Use customized embeddings for clustering

import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ', "QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
# one cluster label per input sentence, in input order
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
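
Because the labels are aligned with the input order, the clustered texts can be grouped with a few lines of plain Python. This is a minimal sketch with placeholder texts and labels; group_by_cluster is a hypothetical helper, not part of scikit-learn or InstructorEmbedding.

```python
from collections import defaultdict


def group_by_cluster(texts, labels):
    """Map each cluster label to the texts assigned to it, preserving input order."""
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[int(label)].append(text)  # int() also accepts numpy integer labels
    return dict(clusters)


# Placeholder data standing in for the sentences and the labels_ array.
example_texts = ["a", "b", "c", "d"]
example_labels = [0, 1, 0, 1]
print(group_by_cluster(example_texts, example_labels))
```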

Training

Data

We construct Multitask Embeddings Data with Instructions (MEDI), a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformer embedding training data, KILT, and MedMCQA, spanning a wide range of domains and tasks. We construct positive and negative pairs where they are not provided, and store them in a unified format:

 [
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show's renewal for a second season. Critical reviews of the series have been generally positive, citing the show's positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1}
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band's most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John's single "Physical" for nine consecutive weeks, and then by Hall & Oates' "I Can't Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band's biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1}
    ...
    {'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini AppletininAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).nVarieties of aromatised wine.nVarieties of aromatised wine Vermouth.nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]

Each instance consists of a query, a positive pair, a negative pair, and the task ID, which is used to ensure that the data in the same training batch come from the same task. The MEDI data is available for download at this link.
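
The role of the task ID can be illustrated with a small batching sketch. This is not the repository's actual sampler; task_homogeneous_batches is a hypothetical function that only assumes each instance is a dict with a 'task_id' key, as in the format above.

```python
import random
from collections import defaultdict


def task_homogeneous_batches(instances, batch_size, seed=0):
    """Split instances into batches so that every batch holds a single task_id."""
    by_task = defaultdict(list)
    for instance in instances:
        by_task[instance["task_id"]].append(instance)

    rng = random.Random(seed)
    batches = []
    for task_instances in by_task.values():
        rng.shuffle(task_instances)  # shuffle within a task
        for start in range(0, len(task_instances), batch_size):
            batches.append(task_instances[start:start + batch_size])
    rng.shuffle(batches)  # mix tasks across training steps
    return batches


# Toy instances carrying only the field that matters here.
example_instances = [{"task_id": t} for t in [1, 1, 1, 2, 2, 300]]
for batch in task_homogeneous_batches(example_instances, batch_size=2):
    print([instance["task_id"] for instance in batch])
```

The last batch of a task may be smaller than batch_size; a real training loop might drop or pad such batches.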

Train INSTRUCTOR

We provide an example script for training INSTRUCTOR. You may first need to download the MEDI data, unzip the folder, and put medi-data.json under --cache_dir.

python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir

We explain the arguments below:

--model_name_or_path: The pre-trained checkpoint to start from. We support both a model ID (e.g., sentence-transformers/gtr-t5-large, sentence-transformers/sentence-t5-large) and a checkpoint path (e.g., a checkpoint saved by the transformers Trainer).
--cl_temperature: Temperature for the contrastive loss.
--cache_dir: The directory used to cache downloaded models and data. The downloaded MEDI data (medi-data.json) must be placed in the --cache_dir directory.
--output_dir: The directory where the trained models (checkpoints) are stored for evaluation.

All other arguments are standard Hugging Face transformers training arguments, such as --overwrite_output_dir, --num_train_epochs, and --learning_rate. For details, refer to Hugging Face transformers.
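
To give a feel for what a contrastive-loss temperature does, here is a generic InfoNCE-style sketch on precomputed similarity scores. It is an assumption-laden illustration, not the training script's implementation: the exact objective used by train.py (for example, which negatives it includes) may differ, and contrastive_loss is a hypothetical name.

```python
import math


def contrastive_loss(pos_sim, neg_sims, temperature=0.1):
    """Negative log-probability of the positive among positive and negatives.

    Similarities are divided by the temperature before the softmax, so a
    smaller temperature sharpens the distribution.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


# Toy similarities: the positive is clearly closer than the negative.
print(contrastive_loss(0.9, [0.1], temperature=0.1))
print(contrastive_loss(0.9, [0.1], temperature=1.0))
```

With the positive already ranked first, the lower temperature yields the smaller loss, which shows how the setting controls the penalty on near-ties.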

Evaluation

We evaluate INSTRUCTOR extensively on 70 diverse tasks, spanning a wide range of tasks and domains. Specifically, we build our evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. We explain how to run the evaluation scripts below.

MTEB

To evaluate model performance on the MTEB benchmark datasets, first install the MTEB library

cd evaluation/MTEB
pip install -e .

Then run the following command:

python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results

You can evaluate your trained model checkpoints by specifying --model_name, and run all MTEB datasets by changing --task_name. Check our paper or the MTEB benchmark for the evaluation metrics of all tasks.

Billboard

To evaluate model performance on Billboard, run the following command:

cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt

You can evaluate your trained model checkpoints by specifying --model_name, and run all Billboard datasets by changing --task. On all three Billboard datasets, we report the Pearson correlation.
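
For reference, the Pearson correlation between two score lists can be computed as in the sketch below. This is the textbook formula written in plain Python, not the evaluation script's code, and the example numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up model scores and human judgments.
model_scores = [0.2, 0.5, 0.9]
human_scores = [1.0, 2.5, 4.5]
print(pearson(model_scores, human_scores))
```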

Prompt Retrieval

To evaluate model performance on Prompt Retrieval, run the following command:

cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt

You can evaluate your trained model checkpoints by specifying --embedding_model, and run the Prompt Retrieval datasets by changing --task. To have a consistent metric, we cast all Prompt Retrieval tasks into a "text-to-text" format and report the Rouge-L score.
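
As a rough picture of what a Rouge-L score measures, the sketch below computes an F1 score from the longest common subsequence (LCS) of two token sequences. It is a simplified version that assumes lowercase whitespace tokenization; the evaluation script may rely on a dedicated package with different tokenization, so the numbers are not guaranteed to match.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, candidate):
    """LCS-based F1 between a reference and a candidate string."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat on the mat"))
```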

Quantization

To quantize the Instructor embedding model, run the following code:

# imports
import torch
from InstructorEmbedding import INSTRUCTOR

# load the model
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')  # you can use GPU

# quantize the model
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# inference
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

embeddings = qmodel.encode([[instruction, sentence]])
# you can also normalize the embeddings: normalize_embeddings=True

print(f"Quantized Embeddings:\n{embeddings}")

This reduces the model size by 10x, and inference time will be lower than with the normal model :)

Bugs or questions?

If you have any questions about the code or the paper, feel free to email Hongjin ( [email protected] ) and Weijia ( [email protected] ). Please describe the problem in detail so that we can help you better and faster.

Citation

If you find our work helpful, please cite us:

@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}

INSTRUCTOR Elsewhere

We thank the community for its efforts in extending INSTRUCTOR!

LangChain supports InstructEmbeddings, which uses the INSTRUCTOR model.
MosaicML has included Instructor-Large and Instructor-XL.
embaas has integrated Instructor-Large.
Haystack includes the InstructorTextEmbedder and InstructorDocumentEmbedder components.
