instructor embedding

My personal fork

This is a fork of the INSTRUCTOR model, created because the original repository is no longer maintained. I have also made some improvements to the source code:

- Fixed it to work with versions of the sentence-transformers library above 2.2.2
- Models are now downloaded from Huggingface properly, using the new "snapshot download" API
- Added the ability to specify where you want the models to be downloaded, via the "cache_dir" parameter
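As a rough illustration of the last two points, the sketch below pre-fetches a checkpoint into a chosen directory with `snapshot_download` from the `huggingface_hub` package. The helper name `fetch_instructor_snapshot` and the `./models` path are made up for this example, and this is not necessarily how the fork calls the API internally.

```python
def fetch_instructor_snapshot(repo_id: str, cache_dir: str) -> str:
    """Download all files of a Hub repository into cache_dir and return the local path.

    Hypothetical helper for illustration only. It needs `pip install huggingface_hub`
    and network access at the moment it is called.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, cache_dir=cache_dir)


# Example call (downloads the full checkpoint, so it is left commented out):
# local_path = fetch_instructor_snapshot("hkunlp/instructor-large", "./models")
# print(local_path)
```

Keeping the download in its own step makes it easy to control where the large checkpoint files end up before any model is loaded.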

What follows is the readme of the original repository. However, ignore the quantization section, because PyTorch has changed its API since then.

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

This repository contains the code and pre-trained models for our paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Please refer to our project page for a quick project overview.

We introduce Instructor, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor achieves state-of-the-art results on 70 diverse embedding tasks!

**************************** Updates ****************************

21/01: We updated the code structure, which now supports easy package installation.
28/12: We updated the checkpoint with hard negatives.
20/12: We released our paper, code, project page and checkpoint. Check them out!

Quick Links

One Embedder, Any Task: Instruction-Finetuned Text Embeddings
- Quick Links
- Installation
  - Environment setup
- Getting Started
  - The encode function
- Model List
- Use Cases
  - Calculate embeddings for your customized texts
  - Compute similarities between texts
  - Use customized embeddings for information retrieval
  - Use customized embeddings for clustering
- Training
  - Data
  - Train INSTRUCTOR
- Evaluation
  - MTEB
  - Billboard
  - Prompt Retrieval
- Quantization
- Bugs or Questions?
- Citation
- INSTRUCTOR Elsewhere

Installation

It is very easy to use INSTRUCTOR for any text embeddings. You can easily try it out in the Colab notebook. On your local machine, we recommend first creating a virtual environment:

conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt

That will create the instructor environment we used. To use the embedding tool, first install the InstructorEmbedding package from PyPI

pip install InstructorEmbedding

Or install it directly from our code

pip install -e .

Environment setup

Activate the environment by running

conda activate instructor

Getting Started

First, download a pretrained model (see the Model List for the full list of available models)

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

Then provide the sentences and customized instructions to the model.

# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# convert each dict into the [instruction, text] pair that encode() expects
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)

And that's it. We now have a list of numpy arrays with the embeddings.

for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")

The `encode` function

Users of the model only need to use the encode function:

model.encode(sentences,
             batch_size: int = 32,
             show_progress_bar: bool = None,
             output_value: str = 'sentence_embedding',
             convert_to_numpy: bool = True,
             convert_to_tensor: bool = False,
             device: str = None,
             normalize_embeddings: bool = False)

- sentences: The sentences to be embedded. They should be in the format [["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...].
- batch_size (default: 32): The batch size used for the computation. It determines how many sentences are processed together in each batch.
- show_progress_bar (default: None): If set to True, a progress bar is displayed while the sentences are being encoded.
- output_value (default: 'sentence_embedding'): Specifies the desired output type. The default value 'sentence_embedding' returns sentence embeddings. Setting it to 'token_embeddings' returns wordpiece token embeddings. Setting it to None returns all output values.
- convert_to_numpy (default: True): If set to True, the output is a list of numpy vectors. If set to False, the output is a list of PyTorch tensors.
- convert_to_tensor (default: False): If set to True, the function returns one stacked tensor as a single output. This parameter overrides any setting specified by convert_to_numpy.
- device (default: None): Specifies the torch.device to use for the computation. If not specified, the function uses the default device.
- normalize_embeddings (default: False): If set to True, the returned vectors have length 1, i.e. they are normalized. In that case, similarity search can use the faster dot product (util.dot_score) instead of cosine similarity.
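To see why normalized embeddings allow the dot-product shortcut, here is a small self-contained sketch in plain Python. The vectors are made-up toy numbers rather than real model output, and the helper names are only for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length, as normalize_embeddings=True is described to do."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 3-dimensional "embeddings" (made-up numbers).
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = l2_normalize(u), l2_normalize(v)
print(cosine_similarity(u, v))  # cosine similarity on the raw vectors
print(dot(u_hat, v_hat))        # dot product on the normalized vectors
```

Both printed values agree up to floating-point rounding, because dividing each vector by its length once up front removes the need to divide by the lengths for every comparison.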

Model List

We released a series of INSTRUCTOR checkpoints of different sizes. You can easily load these models with the InstructorEmbedding package.

Model	Avg. Score
hkunlp/instructor-base	55.9
hkunlp/instructor-large	58.4
hkunlp/instructor-xl	58.8

Use Cases

We provide a few specific use cases below. For more examples and applications, please refer to our paper.

Calculate embeddings for your customized texts

If you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions:

Represent the domain text_type for task_objective:

- domain is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.
- text_type is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
- task_objective is optional, and it specifies the objective of the embedding, e.g., retrieve a document, classify the sentence, etc.
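The template can be filled in mechanically. The helper below is a minimal sketch, with `build_instruction` being a hypothetical name that is not part of the InstructorEmbedding package. It joins the optional and required parts in the order given by the template.

```python
from typing import Optional


def build_instruction(text_type: str,
                      domain: Optional[str] = None,
                      task_objective: Optional[str] = None) -> str:
    """Assemble "Represent the domain text_type for task_objective:"."""
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append(f"for {task_objective}")
    return " ".join(parts) + ":"


print(build_instruction("title", domain="Science"))
print(build_instruction("sentence", domain="Medicine",
                        task_objective="retrieving a duplicate sentence"))
print(build_instruction("document"))
```

The first two calls reproduce the instructions used in the Getting Started example, and the third shows the shortest valid form with only text_type set.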

Compute similarities between texts

You can use INSTRUCTOR to compute similarities between two groups of sentences, with customized embeddings.

from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
# rows follow sentences_a, columns follow sentences_b
similarities = cosine_similarity(embeddings_a, embeddings_b)
print(similarities)

Use customized embeddings for information retrieval

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query = [['Represent the Wikipedia question for retrieving supporting documents: ', 'where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ', 'Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ', "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans - and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ', 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
# index of the corpus document most similar to the query
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
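The snippet above returns only the single best document. If a ranked list is more useful, the same idea extends to a top-k search. The following is a dependency-free sketch on made-up 2-dimensional vectors; `top_k` is a hypothetical helper, and in practice the vectors would come from model.encode.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, corpus_embeddings, k=2):
    """Return (doc_id, score) pairs for the k most similar documents, best first."""
    scored = [(doc_id, cosine(query_embedding, embedding))
              for doc_id, embedding in enumerate(corpus_embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings.
toy_query = [1.0, 0.0]
toy_corpus = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = top_k(toy_query, toy_corpus, k=2)
print(ranking)
```

Here document 2 points in the same direction as the query and ranks first, document 1 follows, and the orthogonal document 0 is dropped.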

Use customized embeddings for clustering

import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ', "QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

Training

Data

We construct Multitask Embeddings Data with Instructions (MEDI), consisting of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformer embedding training data, KILT and MedMCQA, spanning a wide range of domains and tasks. We construct positive and negative pairs if they are not provided, and store them in a unified format:

 [
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show's renewal for a second season. Critical reviews of the series have been generally positive, citing the show's positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1}
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band's most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John's single "Physical" for nine consecutive weeks, and then by Hall & Oates' "I Can't Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band's biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1}
    ...
    {'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini Appletini\nAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.\nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.\nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).\nVarieties of aromatised wine.\nVarieties of aromatised wine Vermouth.\nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]

Each instance consists of a query, a positive pair, a negative pair and the id of the task, which is used to ensure that the data in the same training batch come from the same task. The MEDI data is available to download at this link.
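One simple way to honor the "same task per batch" constraint is to bucket instances by task_id before slicing them into batches. The sketch below shows that idea on made-up records. It is an illustration under that assumption, not the sampler used by the training script, and `task_homogeneous_batches` is a hypothetical name.

```python
import random
from collections import defaultdict


def task_homogeneous_batches(instances, batch_size, seed=0):
    """Return a list of batches in which every instance shares the same task_id."""
    by_task = defaultdict(list)
    for instance in instances:
        by_task[instance["task_id"]].append(instance)

    rng = random.Random(seed)
    batches = []
    for task_instances in by_task.values():
        rng.shuffle(task_instances)
        # Slice each task's instances into chunks of at most batch_size.
        for start in range(0, len(task_instances), batch_size):
            batches.append(task_instances[start:start + batch_size])
    rng.shuffle(batches)  # mix the tasks across training steps
    return batches


# Tiny made-up instances with the same keys as the MEDI records.
toy_data = [
    {"query": ["instruction for queries", f"question {i}"],
     "pos": ["instruction for documents", f"relevant text {i}"],
     "neg": ["instruction for documents", f"irrelevant text {i}"],
     "task_id": 1 if i < 5 else 300}
    for i in range(8)
]

toy_batches = task_homogeneous_batches(toy_data, batch_size=2)
for batch in toy_batches:
    print([item["task_id"] for item in batch])
```

Each printed list contains a single repeated task id, so negatives drawn from within a batch always belong to the same task as the query.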

Train INSTRUCTOR

We provide an example script for training INSTRUCTOR. You may need to first download the MEDI data, unzip the folder and put medi-data.json under --cache_dir.

python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir

We explain the arguments in the following:

- --model_name_or_path: The pretrained checkpoint to start from. We support both model ids (e.g., sentence-transformers/gtr-t5-large, sentence-transformers/sentence-t5-large) and checkpoint paths (e.g., a checkpoint saved by the transformers trainer).
- --cl_temperature: The temperature for the contrastive loss.
- --cache_dir: The directory in which downloaded models and data are cached. The downloaded MEDI data (medi-data.json) should be put under the --cache_dir directory.
- --output_dir: The directory in which the trained models (checkpoints) are stored for evaluation.
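To give a feel for what a contrastive-loss temperature does, here is a generic InfoNCE-style loss for a single query in plain Python. This is a common textbook form and only an assumption about the general shape of the objective; the exact loss in train.py may differ. The similarity values are made up.

```python
import math


def contrastive_loss(pos_similarity, neg_similarities, temperature=0.1):
    """InfoNCE-style loss for one query: negative log-softmax of the positive's score."""
    logits = [pos_similarity / temperature]
    logits += [s / temperature for s in neg_similarities]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_denominator - logits[0]


# Made-up cosine similarities: one positive (0.8) and two negatives (0.5, 0.2).
for tau in (1.0, 0.1):
    print(tau, contrastive_loss(0.8, [0.5, 0.2], temperature=tau))
```

A smaller temperature stretches the gaps between similarities before the softmax, so with these numbers the loss at 0.1 is much lower than at 1.0 because the positive already scores highest.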

All the other arguments are standard Huggingface transformers training arguments, such as --overwrite_output_dir, --num_train_epochs and --learning_rate. For details, refer to Huggingface transformers.

Evaluation

We evaluate INSTRUCTOR extensively on 70 diverse tasks, spanning a wide range of task types and domains. Specifically, we build our evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. We explain how to run the evaluation scripts in the following.

MTEB

To evaluate the model performance on the MTEB benchmark datasets, first install the MTEB library

cd evaluation/MTEB
pip install -e .

Then run the following command:

python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results

You can evaluate your own trained model checkpoints by specifying --model_name, and run all MTEB datasets by changing --task_name. Check our paper or the MTEB benchmark for the evaluation metrics of all tasks.

Billboard

To evaluate the model performance on Billboard, run the following command:

cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt

You can evaluate your own trained model checkpoints by specifying --model_name, and run all Billboard datasets by changing --task. On all three datasets in Billboard, we report the Pearson correlation.
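For readers unfamiliar with the metric, the Pearson correlation measures how linearly two score lists move together. The snippet below is a minimal plain-Python sketch on made-up numbers; it is not the evaluation script's implementation.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up numbers: model similarity scores against human quality judgments.
model_scores = [0.2, 0.4, 0.6, 0.8]
human_scores = [1.0, 2.0, 3.0, 4.0]
print(pearson_correlation(model_scores, human_scores))
```

Because the toy model scores are an exact linear function of the human scores, the result is 1 up to floating-point rounding; real evaluations land somewhere between -1 and 1.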

Prompt Retrieval

To evaluate the model performance on Prompt Retrieval, run the following command:

cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt

You can evaluate your own trained model checkpoints by specifying --embedding_model, and run the other Prompt Retrieval datasets by changing --task. In order to have a consistent metric, we cast all tasks in Prompt Retrieval into a "text-to-text" format and report the Rouge-L score.
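Rouge-L scores a prediction by the longest common subsequence (LCS) it shares with the reference. The sketch below is a simplified version that splits on whitespace and reports a plain F1. Standard ROUGE implementations may tokenize, stem, or weight precision and recall differently, so treat this only as an illustration of the idea.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

In this toy pair the shared subsequence is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall and F1 all come out at 5/6.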

Quantization

To quantize the instructor embedding model, run the following code:

# imports
import torch
from InstructorEmbedding import INSTRUCTOR

# load the model
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')  # you can use GPU

# quantize the model
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Inference
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

embeddings = qmodel.encode([[instruction, sentence]])
# you can also normalize the embeddings:  normalize_embeddings=True

print(f"Quantized Embeddings:\n {embeddings}")

This reduces the model size by 10x, and the inference time will be lower than with the regular model :)

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Hongjin ( [email protected] ) and Weijia ( [email protected] ). Please try to describe the problem in detail so that we can help you better and more quickly.

Citation

If you find our work helpful, please cite us:

@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}

INSTRUCTOR Elsewhere

We thank the community for its efforts in extending INSTRUCTOR!

- LangChain supports InstructEmbeddings, which uses INSTRUCTOR models.
- MosaicML has included Instructor-Large and Instructor-XL.
- embaas has integrated Instructor-Large.
- Haystack includes the InstructorTextEmbedder and InstructorDocumentEmbedder components.
