instructor-embedding

My personal fork

This is a fork of the Instructor model, because the original repository is no longer maintained. I have also made some improvements to their source code:

Fixed compatibility with sentence-transformers versions above 2.2.2.
Models are now downloaded correctly from Huggingface using the new snapshot download API.
You can specify where the model is downloaded with the cache_dir parameter.

The readme of the original repository follows. Ignore the quantization section, however, since PyTorch has changed its API since then.

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

This repository contains the code and pre-trained models for our paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Please refer to our project page for a quick project overview.

We introduce Instructor, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor achieves state-of-the-art results on 70 diverse embedding tasks!

**************************** Updates ****************************

01/21: We updated the code structure, which now supports easy package installation.
12/28: We updated the checkpoint with hard negatives.
12/20: We released our paper, code, project page and checkpoint. Check them out!

Quick Links

One Embedder, Any Task: Instruction-Finetuned Text Embeddings
- Quick Links
- Installation
  - Environment setup
- Getting Started
  - The encode function
- Model List
- Use Cases
  - Calculate embeddings for your customized texts
  - Compute similarities between texts
  - Use customized embeddings for information retrieval
  - Use customized embeddings for clustering
- Training
  - Data
  - Train INSTRUCTOR
- Evaluation
  - MTEB
  - Billboard
  - Prompt Retrieval
- Quantization
- Bugs or questions?
- Citation
- INSTRUCTOR Elsewhere

Installation

INSTRUCTOR is very easy to use for any text embedding. You can easily try it out in the Colab notebook. On your local machine, we recommend creating a virtual environment first:

conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt

This creates the instructor environment that we used. To use the embedding tool, first install the InstructorEmbedding package from PyPI:

pip install InstructorEmbedding

or install it directly from our code:

pip install -e .

Environment setup

Activate the environment by running

conda activate instructor

Getting Started

First download a pre-trained model (see the model list for the full list of available models).

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

Then provide the sentence and a customized instruction to the model.

# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# postprocess
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)

And that's it already. We now have a list of numpy arrays with the embeddings.

for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")

The `encode` function

Users of the model only need to use the encode function:

model.encode(sentences,
             batch_size: int = 32,
             show_progress_bar: bool = None,
             output_value: str = 'sentence_embedding',
             convert_to_numpy: bool = True,
             convert_to_tensor: bool = False,
             device: str = None,
             normalize_embeddings: bool = False)

sentences: the sentences to embed, in the format [["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...].
batch_size (default: 32): the batch size used for the computation, i.e. the number of sentences processed together in each batch.
show_progress_bar (default: None): if set to True, a progress bar is displayed while the sentences are encoded.
output_value (default: 'sentence_embedding'): the desired output type. The default 'sentence_embedding' returns sentence embeddings. 'token_embeddings' returns wordpiece token embeddings. None returns all output values.
convert_to_numpy (default: True): if True, the output is a list of numpy vectors. If False, the output is a list of PyTorch tensors.
convert_to_tensor (default: False): if True, the function returns a single stacked tensor. This parameter overrides any setting of convert_to_numpy.
device (default: None): the torch.device to use for the computation. If not specified, the function uses the default device.
normalize_embeddings (default: False): if True, the returned vectors have length 1, i.e. they are normalized. In that case, similarity search can use the faster dot product (util.dot_score) instead of cosine similarity.
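
The remark about normalize_embeddings can be checked on toy vectors. The following is a minimal sketch in plain Python, not the library's own code: the helper names l2_normalize, dot and cosine are made up for illustration, and the two vectors are invented stand-ins for real embeddings. It shows that for unit-length vectors the dot product and the cosine similarity give the same number.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper, not part of InstructorEmbedding)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # dot product of the normalized vectors, same value
```

Because the division by the vector lengths is done once at encoding time, a later similarity search over normalized embeddings only needs the dot product.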

Model List

We released a series of INSTRUCTOR checkpoints of different sizes. You can easily load these models with the InstructorEmbedding package.

Model	Avg. Score
hkunlp/instructor-base	55.9
hkunlp/instructor-large	58.4
hkunlp/instructor-xl	58.8

Use Cases

Below we provide a few specific use cases. For more examples and applications, refer to our paper.

Calculate embeddings for your customized texts

If you want to calculate customized embeddings for specific sentences, you can follow the unified template to write instructions:

Represent the domain text_type for task_objective:

domain is optional and specifies the domain of the text, e.g., science, finance, medicine, etc.
text_type is required and specifies the encoding unit, e.g., sentence, document, paragraph, etc.
task_objective is optional and specifies the objective of the embedding, e.g., retrieve a document, classify the sentence, etc.
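
The template can be filled in programmatically. The sketch below is only an illustration: build_instruction and pair_with_instruction are hypothetical helpers that do not exist in the InstructorEmbedding package, and they simply assemble the string described above and pair it with texts in the [instruction, text] format.

```python
def build_instruction(text_type, domain=None, task_objective=None):
    """Fill the 'Represent the domain text_type for task_objective:' template.

    Hypothetical helper: text_type is required, the other two parts are optional.
    """
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append("for " + task_objective)
    return " ".join(parts) + ":"

def pair_with_instruction(instruction, texts):
    """Return [[instruction, text], ...], the input format described for encode."""
    return [[instruction, text] for text in texts]

print(build_instruction("title", domain="Science"))
# Represent the Science title:
print(build_instruction("sentence", domain="Medicine",
                        task_objective="retrieving a duplicate sentence"))
# Represent the Medicine sentence for retrieving a duplicate sentence:
```

The resulting pairs can then be passed to model.encode exactly like the hand-written lists in the other examples.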

Compute similarities between texts

You can use INSTRUCTOR to compute similarities between two groups of sentences, with customized embeddings.

from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a, embeddings_b)

Use customized embeddings for information retrieval

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query = [['Represent the Wikipedia question for retrieving supporting documents: ', 'where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ', 'Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ', "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ', 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
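
With a single query, np.argmax over the similarity matrix gives the index of the best document. With several queries the matrix has one row per query, and a per-row ranking is needed instead. The following is a small sketch of that ranking step in plain Python; top_k is a hypothetical helper and the scores are invented, so it runs without loading a model.

```python
def top_k(similarities, k=1):
    """For each row (one query), return the k best (doc_id, score) pairs, best first."""
    results = []
    for row in similarities:
        ranked = sorted(enumerate(row), key=lambda item: item[1], reverse=True)
        results.append(ranked[:k])
    return results

# Made-up similarity scores: 2 queries x 3 documents.
toy_similarities = [
    [0.12, 0.81, 0.33],
    [0.74, 0.10, 0.69],
]
print(top_k(toy_similarities, k=2))
# [[(1, 0.81), (2, 0.33)], [(0, 0.74), (2, 0.69)]]
```

The same function should also accept the array returned by cosine_similarity, since it only iterates over rows.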

Use customized embeddings for clustering

import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ', "QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
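
The printed cluster_assignment is just an array of integer labels, one per input sentence. A possible way to make it readable is to group the texts by label, as sketched below. group_by_cluster is a hypothetical helper, and the labels used here are invented for illustration rather than produced by the model.

```python
from collections import defaultdict

def group_by_cluster(texts, labels):
    """Map each cluster label to the list of texts assigned to it."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[int(label)].append(text)  # int() also accepts numpy integer labels
    return dict(clusters)

# Titles from the clustering example; the labels are made up, not model output.
titles = [
    "Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity",
    "Comparison of Atmospheric Neutrino Flux Calculations at Low Energies",
    "Fermion Bags in the Massive Gross-Neveu Model",
]
hypothetical_labels = [0, 1, 0]

for label, members in sorted(group_by_cluster(titles, hypothetical_labels).items()):
    print(label, members)
```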

Training

Data

We construct Multitask Embeddings Data with Instructions (MEDI), consisting of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformer embedding training data, KILT and MedMCQA, spanning a wide range of domains and tasks. We construct positive and negative pairs if they are not provided, and store them in a unified format:

[
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show's renewal for a second season. Critical reviews of the series have been generally positive, citing the show's positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1},
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band's most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John's single "Physical" for nine consecutive weeks, and then by Hall & Oates' "I Can't Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band's biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1},
    ...
    {'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini Appletini\nAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.\nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.\nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).\nVarieties of aromatised wine.\nVarieties of aromatised wine Vermouth.\nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]

Each instance consists of a query, a positive pair, a negative pair and the id of the task, which is used to ensure that data in the same training batch are from the same task. The MEDI data is available to be downloaded at this link.
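
To illustrate how task_id can keep every batch within one task, here is a minimal sketch. It is not the repository's data loader: validate and single_task_batches are hypothetical names, the batching strategy (shuffle within a task, chunk, then shuffle the batches) is an assumption, and the records are tiny invented stand-ins for MEDI entries.

```python
import random
from collections import defaultdict

def validate(record):
    """Check the shape described above: three [instruction, text] pairs plus an integer task_id."""
    for key in ("query", "pos", "neg"):
        pair = record.get(key)
        if not (isinstance(pair, list) and len(pair) == 2):
            raise ValueError(f"{key!r} must be an [instruction, text] pair")
    if not isinstance(record.get("task_id"), int):
        raise ValueError("'task_id' must be an integer")

def single_task_batches(records, batch_size, seed=0):
    """Group records by task_id, then cut each group into batches of at most batch_size."""
    by_task = defaultdict(list)
    for record in records:
        validate(record)
        by_task[record["task_id"]].append(record)
    rng = random.Random(seed)
    batches = []
    for task_records in by_task.values():
        rng.shuffle(task_records)  # shuffle within a task
        for start in range(0, len(task_records), batch_size):
            batches.append(task_records[start:start + batch_size])
    rng.shuffle(batches)  # mix the order of tasks across training steps
    return batches

# Invented toy records in the unified format (two tasks, five records).
toy_records = [
    {"query": ["Represent the question:", f"q{i}"],
     "pos": ["Represent the document:", f"p{i}"],
     "neg": ["Represent the document:", f"n{i}"],
     "task_id": i % 2}
    for i in range(5)
]
batches = single_task_batches(toy_records, batch_size=2)
for batch in batches:
    print([r["task_id"] for r in batch])
```

Every printed list contains a single task id, which is the property the task_id field is described as guaranteeing.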

Train INSTRUCTOR

We provide an example script for training INSTRUCTOR. You may need to first download the MEDI data, unzip the folder and put medi-data.json under --cache_dir.

python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir

We explain the arguments as follows:

--model_name_or_path: the pre-trained checkpoint to start with. We support both a model id (e.g., sentence-transformers/gtr-t5-large, sentence-transformers/sentence-t5-large) and a checkpoint path (e.g., a checkpoint saved by the transformers trainer).
--cl_temperature: the temperature for the contrastive loss.
--cache_dir: the directory to cache downloaded models and data. The downloaded MEDI data (medi-data.json) should be put under --cache_dir.
--output_dir: the directory to store the trained models (checkpoints) for evaluation.

All the other arguments are standard Huggingface's transformers training arguments, such as --overwrite_output_dir, --num_train_epochs and --learning_rate. For details, refer to Huggingface transformers.
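
To give a feel for what --cl_temperature controls, the sketch below computes a simplified contrastive loss for one query with one positive and a few negatives. This is an assumption-laden illustration, not the training code: the function names are hypothetical, the vectors are invented, and the real objective in train.py may differ (for example by using in-batch negatives).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(query, positive, negatives, temperature=0.1):
    """Negative log of the softmax probability assigned to the positive candidate.

    Similarities are divided by the temperature before the softmax, so a
    smaller temperature sharpens the distribution over candidates.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]

# Made-up 2-dimensional embeddings: the positive is close to the query.
query = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

for t in (1.0, 0.1):
    print(t, contrastive_loss(query, positive, negatives, temperature=t))
```

In this toy setup the loss at temperature 0.1 is much smaller than at 1.0, because the already-closest positive receives almost all of the probability mass once the similarities are scaled up.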

Evaluation

We evaluate INSTRUCTOR on 70 diverse tasks, spanning a wide range of tasks and domains. Specifically, we build our evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. We explain the details about running the evaluation scripts in the following.

MTEB

To evaluate the model performance on the MTEB benchmark dataset, first install the MTEB library:

cd evaluation/MTEB
pip install -e .

Then run the following command:

python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results

You can evaluate your trained model checkpoints by specifying --model_name and run all MTEB datasets by changing --task_name. Check our paper or the MTEB benchmark for the evaluation metrics of all tasks.

Billboard

To evaluate the model performance on Billboard, run the following command:

cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt

You can evaluate your trained model checkpoints by specifying --model_name and run all Billboard datasets by changing --task. In all of the three datasets in Billboard, we report the Pearson correlation.
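
For readers unfamiliar with the metric, the sketch below computes a Pearson correlation between two lists of scores in plain Python. It is only an illustration of the formula: the pearson function is written here for the example, the evaluation script may well rely on a library instead, and the scores are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Invented scores: model similarities vs. human judgments on four examples.
model_scores = [0.2, 0.5, 0.7, 0.9]
human_scores = [1.0, 2.5, 3.5, 4.5]
print(pearson(model_scores, human_scores))
```

The human scores in this toy example are an exact linear function of the model scores, so the correlation comes out at (numerically almost exactly) 1.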

Prompt Retrieval

To evaluate the model performance on Prompt Retrieval, run the following command:

cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt

You can evaluate your trained model checkpoints by specifying --embedding_model and run the Prompt Retrieval datasets by changing --task. To keep the metric consistent, we cast all tasks in Prompt Retrieval into a text-to-text format and report the Rouge-L score.
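
Rouge-L is based on the longest common subsequence (LCS) between a prediction and a reference. The following is a simplified sketch of that idea, not the scorer used by the evaluation script: lcs_length and rouge_l_f1 are hypothetical helpers, and lowercasing, whitespace tokenization and the balanced F-measure are assumptions made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """Balanced F-measure over LCS-based precision and recall (simplified Rouge-L)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat", "the cat sat"))
print(rouge_l_f1("the cat", "the cat sat on the mat"))
```

Casting every task to text-to-text means a classification label such as "entailment" is scored the same way as a free-form answer, which is what makes a single metric usable across tasks.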

Quantization

To quantize the INSTRUCTOR embedding model, run the following code:

# imports
import torch
from InstructorEmbedding import INSTRUCTOR

# load the model
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')  # you can use GPU

# quantize the model
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# inference
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

embeddings = qmodel.encode([[instruction, sentence]])
# you can also normalize the embeddings: normalize_embeddings=True

print(f"Quantized Embeddings:\n {embeddings}")

This reduces the model size by 10x, and the inference time will be lower than for the normal model :)

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Hongjin ([email protected]) and Weijia ([email protected]). Please try to describe the problem in detail so we can help you better and quicker.

Citation

If you find our work helpful, please cite us:

@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}

INSTRUCTOR Elsewhere

We thank the community for its efforts in extending INSTRUCTOR!

LangChain supports InstructEmbeddings using the INSTRUCTOR model.
MosaicML includes Instructor-Large and Instructor-XL.
Embaas has integrated instructor-large.
Haystack includes the InstructorTextEmbedder and InstructorDocumentEmbedder components.
