Language Model Perplexity (LM-PPL)

Perplexity measures how predictable a text is to a language model (LM), and it is often used to evaluate the fluency or prototypicality of a text (the lower the perplexity, the more fluent or prototypical the text). LM-PPL is a Python library for calculating the perplexity of a text with any type of pre-trained LM. We compute ordinary perplexity for autoregressive LMs such as GPT3 (Brown et al., 2020) and the perplexity of the decoder for encoder-decoder LMs such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2020), while we compute pseudo-perplexity (Wang and Cho, 2018) for masked LMs.
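
For reference, perplexity is the exponential of the average negative log-likelihood per token; pseudo-perplexity applies the same formula to the log-probability of each token when that token alone is masked. The snippet below is a minimal sketch of that arithmetic on hand-written log-probabilities. `perplexity_from_logprobs` is a hypothetical helper for illustration only and is not part of lmppl, which obtains the log-probabilities from the model itself.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Return exp(-(1/N) * sum(log p_i)) for N natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has a perplexity of 4
# (up to floating-point rounding), i.e. it is as uncertain as a fair 4-way choice.
uniform = [math.log(0.25)] * 5
print(perplexity_from_logprobs(uniform))
```

Lower values mean the model found the tokens more predictable, which is why the examples in this document pick the text with the smallest perplexity.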

Get Started

Install via pip.

pip install lmppl

Examples

Let's solve sentiment analysis with perplexity as an example! Remember that a text with lower perplexity is better, so we compare two texts (positive and negative) and choose the one with the lower perplexity as the model prediction.

Autoregressive LMs, including GPT variants

import lmppl

# Load an autoregressive LM and score both candidate texts.
scorer = lmppl.LM('gpt2')
text = [
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.',
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.'
]
ppl = scorer.get_perplexity(text)
print(list(zip(text, ppl)))
# Output:
# [
#   ('sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.', 136.64255272925908),
#   ('sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.', 139.2400838400971)
# ]

# The text with the lowest perplexity is taken as the prediction.
print(f"prediction: {text[ppl.index(min(ppl))]}")
# Output:
# prediction: sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.

Masked LMs, including BERT variants

import lmppl

# Load a masked LM; get_perplexity returns pseudo-perplexity for this model type.
scorer = lmppl.MaskedLM('microsoft/deberta-v3-small')
text = [
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.',
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.'
]
ppl = scorer.get_perplexity(text)
print(list(zip(text, ppl)))
# Output:
# [
#   ('sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.', 1190212.1699246117),
#   ('sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.', 1152767.482071837)
# ]

# The text with the lowest perplexity is taken as the prediction.
print(f"prediction: {text[ppl.index(min(ppl))]}")
# Output:
# prediction: sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.

Encoder-decoder LMs, including T5 and BART variants

import lmppl

# Load an encoder-decoder LM; perplexity is computed on the decoder side.
scorer = lmppl.EncoderDecoderLM('google/flan-t5-small')
# The same prompt is fed to the encoder for each candidate output.
inputs = [
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee.',
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee.'
]
outputs = [
    'I am happy.',
    'I am sad.'
]
ppl = scorer.get_perplexity(input_texts=inputs, output_texts=outputs)
print(list(zip(outputs, ppl)))
# Output:
# [
#   ('I am happy.', 4138.748977714201),
#   ('I am sad.', 2991.629250051472)
# ]

# The output with the lowest perplexity is taken as the prediction.
print(f"prediction: {outputs[ppl.index(min(ppl))]}")
# Output:
# prediction: I am sad.
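
All three examples end with the same selection step, so it can be factored into a small function. The sketch below is one possible way to do that; `pick_lowest_perplexity` is a hypothetical helper, not something lmppl provides, and the perplexity values are copied from the flan-t5-small example above rather than recomputed.

```python
def pick_lowest_perplexity(candidates, perplexities):
    """Return the candidate whose perplexity is smallest."""
    if not candidates or len(candidates) != len(perplexities):
        raise ValueError("candidates and perplexities must be non-empty and equal in length")
    # Index of the minimum perplexity; ties resolve to the earliest candidate.
    best_index = min(range(len(perplexities)), key=perplexities.__getitem__)
    return candidates[best_index]


# Values taken from the encoder-decoder example above.
outputs = ['I am happy.', 'I am sad.']
ppl = [4138.748977714201, 2991.629250051472]
print(pick_lowest_perplexity(outputs, ppl))  # prints: I am sad.
```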

Models

Below are some examples of popular models and the corresponding model type to use within the lmppl package.

Model	HuggingFace ID	Model Type
BERT	google-bert/bert-base-uncased	MaskedLM
RoBERTa	roberta-large	MaskedLM
GPT-2	gpt2-xl	LM
flan-ul2	google/flan-ul2	EncoderDecoderLM
GPT-NeoX	EleutherAI/gpt-neox-20b	LM
OPT	facebook/opt-30b	LM
Mixtral	mistralai/Mixtral-8x22B-v0.1	LM
Llama 3	meta-llama/Meta-Llama-3-8B	LM

Tips

Max Token Length: Each LM has its own maximum token length (max_length for autoregressive/masked LMs, and max_length_encoder and max_length_decoder for encoder-decoder LMs). Limiting the maximum number of tokens reduces the time needed to process the text, but it may affect the accuracy of the perplexity, so please experiment on your texts and decide on an optimal token length.
Batch Size: A batch size can be passed to the get_perplexity function (e.g. get_perplexity(text, batch_size=32)). By default, all texts are processed at once, which may cause a memory error if the number of texts is too large.
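
To illustrate why a batch size bounds memory use, the sketch below splits a list of texts into fixed-size chunks so that only one chunk is handled at a time. `iter_batches` is a hypothetical helper written for this illustration; lmppl does its own batching when `batch_size` is passed, and its internal implementation may differ.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"sentence {i}" for i in range(7)]
batches = list(iter_batches(texts, batch_size=3))
print([len(batch) for batch in batches])  # prints: [3, 3, 1]
```

A smaller batch size lowers peak memory at the cost of more forward passes, so the largest value that fits on the available hardware is usually a reasonable starting point.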

More Information

Version: 1.0.0
Type: Other source code
Updated: 2024-11-30
Size: 15.16KB
Source: GitHub
