ekphrasis

A collection of lightweight text tools, geared towards text from social networks such as Twitter or Facebook, for tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia, Twitter - 330 million English tweets).

ekphrasis was developed as part of the text processing pipeline for the DataStories team's submission to SemEval-2017 Task 4 (English), Sentiment Analysis in Twitter.

If you use the library in your research project, please cite the paper "DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis".

Citation:

@InProceedings{baziotis-pelekis-doulkeridis:2017:SemEval2,
  author    = {Baziotis, Christos  and  Pelekis, Nikos  and  Doulkeridis, Christos},
  title     = {DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
  month     = {August},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {747--754}
}

Disclaimer: The library is no longer actively developed. I will try to resolve important issues, but I cannot make any promises.

Installation

Build from source

pip install git+https://github.com/cbaziotis/ekphrasis.git

or install from PyPI

pip install ekphrasis -U

Overview

ekphrasis offers the following functionality:

Social Tokenizer. A text tokenizer geared towards social networks (Facebook, Twitter...), which understands complex emoticons, emojis and other unstructured expressions such as dates, times and more.
Word Segmentation. You can split a long string into its constituent words. Suitable for hashtag segmentation.
Spell Correction. You can replace a misspelled word with the most probable candidate word.
Customization. Tailor the word segmentation, spell correction and term identification to suit your needs.
The word segmentation and spell correction mechanisms operate on top of word statistics collected from a given corpus. We provide word statistics from 2 big corpora (Wikipedia and Twitter), but you can also generate word statistics from your own corpus. You may need to do that if you are working with domain-specific texts, such as biomedical documents. For example, a word describing a technique or a chemical compound may be treated as a misspelled word when using the word statistics from a general-purpose corpus.
ekphrasis tokenizes the text based on a list of regular expressions. You can easily enable ekphrasis to identify new entities by simply adding a new entry to the dictionary of regular expressions ( ekphrasis/regexes/expressions.txt ).
Pre-Processing Pipeline. You can combine all the above steps in an easy way, in order to prepare the text files in your dataset for some kind of analysis or for machine learning. In addition to the aforementioned actions, you can perform text normalization, word annotation (labeling) and more.
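
The idea of identifying entities through a dictionary of named regular expressions can be sketched in plain Python. The patterns below are simplified, hypothetical stand-ins and are not the expressions shipped in ekphrasis/regexes/expressions.txt. Likewise, `normalize` and the `ticket` entry are illustrative names, not part of the library's API.

```python
import re

# Hypothetical, simplified patterns (assumptions for illustration only);
# the library's own expressions are far more thorough.
EXPRESSIONS = {
    "url": r"https?://\S+",
    "user": r"@\w+",
    "money": r"[$€£]\d+(?:[.,]\d+)?[KMB]?",
}


def normalize(text, expressions=EXPRESSIONS):
    """Replace every match of each named pattern with a <name> tag."""
    for name, pattern in expressions.items():
        text = re.sub(pattern, f"<{name}>", text)
    return text


# Supporting a new entity only takes one more dictionary entry.
custom = dict(EXPRESSIONS, ticket=r"\b[A-Z]{2,5}-\d+\b")

print(normalize("@bob paid $50K, see https://example.com/x"))
print(normalize("fixed in PROJ-123", custom))
```

Keeping the patterns in a dictionary means the matching code never changes when a new entity type is needed. Only the data does.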

Text Pre-Processing Pipeline

You can easily define a preprocessing pipeline by using the TextPreProcessor.

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens

    # corpus from which the word statistics are going to be used
    # for word segmentation
    segmenter="twitter",

    # corpus from which the word statistics are going to be used
    # for spell correction
    corrector="twitter",

    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words

    # select a tokenizer. You can use SocialTokenizer, or pass your own.
    # The tokenizer should take a string as input and return a list of tokens.
    tokenizer=SocialTokenizer(lowercase=True).tokenize,

    # list of dictionaries, for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)

sentences = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies :/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
]

for s in sentences:
    print(" ".join(text_processor.pre_process_doc(s)))

Output:

cant <allcaps> wait <allcaps> for the new season of <hashtag> twin peaks </hashtag> ＼(^o^)／ ! <repeated> <hashtag> david lynch </hashtag> <hashtag> tv series </hashtag> <happy>

i saw the new <hashtag> john doe </hashtag> movie and it sucks <elongated> ! <repeated> waisted <allcaps> <money> . <repeated> <hashtag> bad movies </hashtag> <annoyed>

<user> : can not wait for the <date> <hashtag> sentiment </hashtag> talks ! yay <allcaps> <elongated> ! <repeated> <laugh> <url>

Notes:

Elongated words are automatically normalized.
Spell correction affects performance.
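
To make tags such as `<allcaps>` and `<elongated>` in the output above more concrete, here is a minimal sketch of how a single token could be annotated and an elongated word normalized. It is an assumption-laden illustration, not the library's implementation. `annotate_token` is a hypothetical helper, and collapsing every run to one character is a deliberately naive rule.

```python
import re

# Three or more repetitions of the same character in a row.
ELONGATED = re.compile(r"(.)\1{2,}")


def annotate_token(token):
    """Return the lowercased token followed by any annotation tags."""
    tags = []
    if token.isalpha() and token.isupper() and len(token) > 1:
        tags.append("<allcaps>")
    if token.isalpha() and ELONGATED.search(token):
        # Naive normalization: collapse each long run to a single character.
        token = ELONGATED.sub(r"\1", token)
        tags.append("<elongated>")
    return [token.lower()] + tags


print(annotate_token("YAAAAAAY"))
print(annotate_token("suuuuucks"))
print(annotate_token("season"))
```

The naive rule would turn "coooool" into "col", so a practical normalizer needs a vocabulary lookup to choose between candidates such as "col" and "cool".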

Word Statistics

ekphrasis provides word statistics (unigrams and bigrams) from 2 big corpora:

the English Wikipedia
a collection of 330 million English Twitter messages

These word statistics are needed for word segmentation and spell correction. Moreover, you can generate word statistics from your own corpus. You can use ekphrasis/tools/generate_stats.py to generate statistics from a text file, or from a directory that contains a collection of text files. For example, in order to generate word statistics for text8 (http://mattmahoney.net/dc/text8.zip), you can run:

python generate_stats.py --input text8.txt --name text8 --ngrams 2 --mincount 70 30

input: the path to the file, or to the directory containing the files, from which the statistics are calculated.
name: the name of the corpus.
ngrams: the maximum n-gram order for which statistics are calculated.
mincount: the minimum number of occurrences an n-gram needs in order to be included. In this case, the minimum count is 70 for unigrams and 30 for bigrams.

After you run the script, you will see a new directory inside ekphrasis/stats/ with the statistics of your corpus. In the case of the above example, ekphrasis/stats/text8/.
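
The kind of statistics described above (n-gram counts pruned by a per-order minimum count) can be sketched with the standard library. This is not the code of generate_stats.py, and it says nothing about the format in which the script stores its results. `ngram_counts` and `build_stats` are hypothetical helpers, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def build_stats(text, mincounts=(2, 1)):
    """Return [unigram_counts, bigram_counts, ...], pruned by mincounts."""
    tokens = text.lower().split()
    stats = []
    for n, mincount in enumerate(mincounts, start=1):
        counts = ngram_counts(tokens, n)
        stats.append({g: c for g, c in counts.items() if c >= mincount})
    return stats


unigrams, bigrams = build_stats("the cat sat on the mat the cat ran")
print(unigrams)
print(bigrams)
```

Here `mincounts=(2, 1)` plays the role of `--mincount 70 30` on a toy scale: one threshold per n-gram order.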

Word Segmentation

The word segmentation implementation uses the Viterbi algorithm and is based on CH14 from the book Beautiful Data (Segaran and Hammerbacher, 2009). The implementation requires word statistics in order to identify and separate the words in a string. You can use the word statistics from one of the 2 provided corpora, or from your own corpus.

Example: In order to perform word segmentation, first you have to instantiate a segmenter with a given corpus, and then use the segment() method:

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter(corpus="mycorpus")
print(seg.segment("smallandinsignificant"))

Output:

> small and insignificant

You can test the output using statistics from the different corpora:

from ekphrasis.classes.segmenter import Segmenter

# segmenter using the word statistics from English Wikipedia
seg_eng = Segmenter(corpus="english")

# segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")

words = ["exponentialbackoff", "gamedev", "retrogaming", "thewatercooler", "panpsychism"]
for w in words:
    print(w)
    print("(eng):", seg_eng.segment(w))
    print("(tw):", seg_tw.segment(w))
    print()

Output:

exponentialbackoff
(eng): exponential backoff
(tw): exponential back off

gamedev
(eng): gamedev
(tw): game dev

retrogaming
(eng): retrogaming
(tw): retro gaming

thewatercooler
(eng): the water cooler
(tw): the watercooler

panpsychism
(eng): panpsychism
(tw): pan psych is m
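
The dependence on word statistics seen in these results can be illustrated with a small dynamic-programming segmenter in the spirit of the Beautiful Data chapter. This is a sketch under stated assumptions, not the library's Segmenter. It uses unigram counts only, the counts in `COUNTS` are invented for the example, and the unknown-word penalty is one common choice among many.

```python
import math
from functools import lru_cache

# Toy unigram counts (assumed for illustration; real statistics come from a corpus).
COUNTS = {"small": 50, "and": 500, "insignificant": 10, "in": 400,
          "significant": 20, "a": 600, "all": 100}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log-probability of a word; unknown words are penalized by length."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    return max(candidates)


print(" ".join(segment("smallandinsignificant")[1]))
```

Changing the counts changes the result. With a much lower count for "insignificant", the split "in significant" would score higher, which mirrors how the Wikipedia and Twitter statistics disagree above.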

Finally, if the word is camelCased or PascalCased, then the algorithm splits the words based on the case of the characters.

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter()
print(seg.segment("camelCased"))
print(seg.segment("PascalCased"))

Output:

> camel cased
> pascal cased
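
Case-based splitting of this kind needs no word statistics at all. A minimal sketch, assuming a split at every lowercase-to-uppercase boundary (`split_on_case` is a hypothetical helper, and the library's exact rule may differ):

```python
import re


def split_on_case(word):
    """Split camelCase / PascalCase at lower-to-upper boundaries, then lowercase."""
    parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word)
    return parts.lower()


print(split_on_case("camelCased"))
print(split_on_case("PascalCased"))
```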

Spell Correction

The spell corrector is based on Peter Norvig's spell corrector. Just like the segmentation algorithm, it uses word statistics in order to find the most probable candidate. Besides the provided statistics, you can use your own.

Example:

You can perform spell correction just like word segmentation. First you have to instantiate a SpellCorrector object that uses the statistics from the corpus of your choice, and then use one of the available methods.

from ekphrasis.classes.spellcorrect import SpellCorrector
sp = SpellCorrector(corpus="english")
print(sp.correct("korrect"))

Output:

> correct
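
For readers unfamiliar with Norvig's approach, the core idea is to generate every string within one or two edits of the input and keep the known word with the highest count. The sketch below follows that idea with invented toy counts in `WORDS`. It is not the SpellCorrector class, whose internals are not shown here.

```python
from collections import Counter

# Toy word counts (assumed); real statistics would come from a large corpus.
WORDS = Counter({"correct": 40, "collect": 25, "connect": 30, "the": 900})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the statistics."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Pick the most frequent known candidate, preferring fewer edits."""
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    return max(candidates, key=lambda w: WORDS[w])


print(correct("korrect"))
```

Because the candidates are ranked by corpus counts, a domain term missing from the statistics can be "corrected" to a more frequent neighbor. This is why domain-specific statistics matter.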

Social Tokenizer

The difficulty in tokenization is to avoid splitting expressions or words that should be kept intact (as one token). This is more important in texts from social networks, with their "creative" writing and expressions such as emoticons, hashtags and so on. Although there are some tokenizers geared towards Twitter [1],[2] that recognize the Twitter markup and some basic sentiment expressions or simple emoticons, our tokenizer is able to identify almost all emoticons, emojis and many complex expressions.

Especially for tasks such as sentiment analysis, there are many expressions that play a decisive role in identifying the sentiment expressed in the text. Expressions like these are:

Censored words, such as f**k , s**t .
Emphasized words, such as a *great* time , I don't *think* I ... .
Emoticons, such as >:( , :)) , o/ .
Dash-delimited words, such as over-consumption , anti-american , mind-blowing .

In addition, ekphrasis can identify information-bearing expressions. Depending on the task, you may want to preserve or extract them as one token (IR) and then normalize them, since this information may be irrelevant for the task (sentiment analysis). Expressions like these are:

Dates, such as Feb 18th ; December 2, 2016 ; December 2-2016 ; 10/17/94 ; 3 December 2016 ; April 25, 1995 ; 11.15.16 ; November 24th 2016 ; January 21st .
Times, such as 5:45pm , 11:36 AM , 2:45 pm , 5:30 .
Currencies, such as $220M , $2B , $65.000 , €10 , $50K .
Phone numbers.
URLs, such as http://www.cs.unipi.gr , https://t.co/Wfw5Z1iSEt .

Example:
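
One way to keep such expressions intact is to tokenize with a single regular expression whose alternatives are ordered from most specific to most generic. The sketch below only illustrates that technique. The patterns are simplified assumptions that cover a fraction of the cases listed above, and neither `TOKEN_RE` nor `tokenize` is taken from the library.

```python
import re

# Ordered alternatives: earlier patterns win, so special expressions are
# matched before the generic word and punctuation fallbacks.
# These patterns are simplified assumptions, not the library's expressions.
TOKEN_RE = re.compile(r"""
    https?://\S+                   # URLs
  | [#@]\w+                        # hashtags and user mentions
  | [$€£]\d+(?:[.,]\d+)?[KMB]?     # money amounts
  | \d{1,2}:\d{2}(?:\s?[ap]m)?     # times such as 5:45pm
  | \w+(?:\*+\w+)+                 # censored words such as f**k
  | [<>]?[:;=][-o^]?[()DPp/\\|]+   # simple western emoticons
  | \w+(?:[-']\w+)*                # words, incl. hyphenated and contractions
  | \S                             # any other single character
""", re.VERBOSE | re.IGNORECASE)


def tokenize(text):
    """Return all tokens, keeping special expressions as single tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("f**k, over-consumption costs $50K at 5:45pm >:( #sad https://t.co/x"))
```

The order of the alternatives is the design decision that matters. If the generic word pattern came first, "f**k" and "5:45pm" would be broken into pieces before the specific patterns had a chance to match.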

import nltk
from ekphrasis.classes.tokenizer import SocialTokenizer


def wsp_tokenizer(text):
    return text.split(" ")

puncttok = nltk.WordPunctTokenizer().tokenize

social_tokenizer = SocialTokenizer(lowercase=False).tokenize

sents = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! >:-D http://sentimentsymposium.com/.",
]

for s in sents:
    print()
    print("ORG: ", s)  # original sentence
    print("WSP : ", wsp_tokenizer(s))  # whitespace tokenizer
    print("WPU : ", puncttok(s))  # WordPunct tokenizer
    print("SC : ", social_tokenizer(s))  # social tokenizer

Output:

ORG:  CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))
WSP :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay!!!', '#davidlynch', '#tvseries', ':)))']
WPU :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#', 'TwinPeaks', '＼(^', 'o', '^)／', 'yaaaay', '!!!', '#', 'davidlynch', '#', 'tvseries', ':)))']
SC :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay', '!', '!', '!', '#davidlynch', '#tvseries', ':)))']

ORG:  I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/
WSP :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks!!!', 'WAISTED', '$10...', '#badmovies', '>3:/']
WPU :  ['I', 'saw', 'the', 'new', '#', 'johndoe', 'movie', 'and', 'it', 'suuuuucks', '!!!', 'WAISTED', '$', '10', '...', '#', 'badmovies', '>', '3', ':/']
SC :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks', '!', '!', '!', 'WAISTED', '$10', '.', '.', '.', '#badmovies', '>', '3:/']

References

[1] K. Gimpel et al., "Part-of-speech tagging for Twitter: Annotation, features, and experiments," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, 2011, pp. 42–47.

[2] C. Potts, "Sentiment Symposium Tutorial: Tokenizing," Sentiment Symposium Tutorial, 2011. [Online]. Available: http://sentiment.christopherpotts.net/tokenizing.html
