ekphrasis

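A collection of lightweight text tools, geared towards text from social networks such as Twitter or Facebook, for tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and a collection of 330 million English tweets).

ekphrasis was developed as part of the text processing pipeline for the DataStories team's submission to SemEval-2017 Task 4 (English), "Sentiment Analysis in Twitter".

If you use the library in your research project, please cite the paper "DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis".

Citation: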

 @InProceedings{baziotis-pelekis-doulkeridis:2017:SemEval2,
  author    = {Baziotis, Christos  and  Pelekis, Nikos  and  Doulkeridis, Christos},
  title     = {DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
  month     = {August},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {747--754}
}

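Disclaimer: The library is no longer actively developed. I will try to resolve important issues, but I can't make any promises.

Installation

Build from source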

pip install git+https://github.com/cbaziotis/ekphrasis.git

or install from PyPI

 pip install ekphrasis -U

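Overview

ekphrasis offers the following functionality:

Social Tokenizer. A text tokenizer geared towards social networks (Facebook, Twitter...), which understands complex emoticons, emojis and other unstructured expressions such as dates, times and more.
Word Segmentation. You can split a long string into its constituent words. Suitable for hashtag segmentation.
Spell Correction. You can replace a misspelled word with the most probable candidate word.
Customization. Tailor the word segmentation, spell correction and term identification to suit your needs.
The word segmentation and spell correction mechanisms operate on top of word statistics collected from a given corpus. We provide word statistics from 2 big corpora (Wikipedia and Twitter), but you can also generate word statistics from your own corpus. You may need to do that if you are working with domain-specific texts, such as biomedical documents. For example, a word describing a technique or a chemical compound may be treated as a misspelled word when using the word statistics from a general-purpose corpus.
ekphrasis tokenizes text based on a list of regular expressions. You can easily enable ekphrasis to identify new entities by adding a new entry to the dictionary of regular expressions (ekphrasis/regexes/expressions.txt).
Pre-Processing Pipeline. You can combine all the above steps in an easy way, in order to prepare the text files in your dataset for some kind of analysis or for machine learning. In addition to the actions above, you can perform text normalization, word annotation (labeling) and more.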

Text Pre-Processing Pipeline

You can easily define a preprocessing pipeline by using the TextPreProcessor.

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens

    # corpus from which the word statistics are going to be used
    # for word segmentation
    segmenter="twitter",

    # corpus from which the word statistics are going to be used
    # for spell correction
    corrector="twitter",

    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words

    # select a tokenizer. You can use SocialTokenizer, or pass your own.
    # The tokenizer should take a string as input and return a list of tokens.
    tokenizer=SocialTokenizer(lowercase=True).tokenize,

    # list of dictionaries, for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)

sentences = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies :/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
]

for s in sentences:
    print(" ".join(text_processor.pre_process_doc(s)))

Output:

 cant <allcaps> wait <allcaps> for the new season of <hashtag> twin peaks </hashtag> ＼(^o^)／ ! <repeated> <hashtag> david lynch </hashtag> <hashtag> tv series </hashtag> <happy>

i saw the new <hashtag> john doe </hashtag> movie and it sucks <elongated> ! <repeated> waisted <allcaps> <money> . <repeated> <hashtag> bad movies </hashtag> <annoyed>

<user> : can not wait for the <date> <hashtag> sentiment </hashtag> talks ! yay <allcaps> <elongated> ! <repeated> <laugh> <url>

Notes:

Elongated words are automatically normalized.
Spell correction affects performance.
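
As a rough illustration of what normalizing an elongated word can involve, the sketch below caps repeated characters and then checks the result against a vocabulary. It is not the ekphrasis implementation: the function name `normalize_elongated`, the toy vocabularies and the two-step strategy are assumptions made for this example.

```python
import re

# A character repeated three or more times in a row, e.g. the "uuuuu" in "suuuuucks".
_LONG_RUN = re.compile(r"(.)\1{2,}")


def normalize_elongated(word, vocabulary):
    """Shorten an elongated word, preferring forms found in `vocabulary`.

    Hypothetical helper for illustration only.
    """
    # Step 1: cap every long run at two characters ("coooool" -> "cool").
    capped = _LONG_RUN.sub(r"\1\1", word)
    if capped in vocabulary:
        return capped
    # Step 2: collapse doubled characters to one ("suucks" -> "sucks").
    single = re.sub(r"(.)\1", r"\1", capped)
    return single if single in vocabulary else capped


print(normalize_elongated("suuuuucks", {"sucks"}))  # sucks
print(normalize_elongated("coooool", {"cool"}))     # cool
```

Checking the capped form first keeps legitimate double letters ("cool", "good") intact, which a blanket collapse to single characters would destroy.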

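Word Statistics

ekphrasis provides word statistics (unigrams and bigrams) from 2 big corpora:

English Wikipedia
a collection of 330 million English Twitter messages

These word statistics are needed for word segmentation and spell correction. Moreover, you can generate word statistics from your own corpus. You can use ekphrasis/tools/generate_stats.py to generate the statistics from a text file, or from a directory that contains a collection of text files. For example, in order to generate word statistics for text8 (http://mattmahoney.net/dc/text8.zip), you can run: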

 python generate_stats.py --input text8.txt --name text8 --ngrams 2 --mincount 70 30

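input: path to the file, or to the directory containing the files, used for calculating the statistics.
name: the name of the corpus.
ngrams: the maximum n-gram order for which statistics are calculated.
mincount: the minimum count an n-gram needs in order to be included. In this example, the minimum count is 70 for unigrams and 30 for bigrams.

After you run the script, you will see a new directory inside ekphrasis/stats/ containing the statistics of your corpus. For the example above, this is ekphrasis/stats/text8/.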
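
To make the roles of `ngrams` and `mincount` concrete, here is a minimal sketch of counting n-grams and dropping the rare ones. It is not the code of generate_stats.py: the function `count_ngrams`, its parameters and the whitespace tokenization are assumptions chosen for illustration.

```python
from collections import Counter


def count_ngrams(lines, max_n=2, mincounts=(70, 30)):
    """Count 1..max_n-grams and keep those that reach the minimum count.

    Hypothetical helper: `mincounts` needs one threshold per n-gram order.
    """
    counters = [Counter() for _ in range(max_n)]
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenization
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counters[n - 1][" ".join(tokens[i:i + n])] += 1
    # Filter each order with its own threshold.
    return [
        {gram: c for gram, c in counter.items() if c >= mincounts[order]}
        for order, counter in enumerate(counters)
    ]


corpus = ["the cat sat", "the cat ran", "the dog sat"]
unigrams, bigrams = count_ngrams(corpus, max_n=2, mincounts=(2, 2))
print(unigrams)  # {'the': 3, 'cat': 2, 'sat': 2}
print(bigrams)   # {'the cat': 2}
```

Bigrams are sparser than unigrams, which is one reason to give them a lower threshold, as the example command does with 70 and 30.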

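Word Segmentation

The word segmentation implementation uses the Viterbi algorithm and is based on CH14 of the book Beautiful Data (Segaran and Hammerbacher, 2009). The implementation requires word statistics in order to identify and separate the words in a string. You can use the word statistics from one of the 2 provided corpora, or from your own corpus.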
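
The idea can be sketched as a memoized dynamic-programming search over all split points, in the spirit of the approach described above. This is a simplified, unigram-only sketch and not the ekphrasis code: the toy `COUNTS` table, the unknown-word penalty and the function names are assumptions made for this example.

```python
import math
from functools import lru_cache

# Toy unigram counts; a real run would load corpus statistics instead.
COUNTS = {"small": 60, "and": 500, "insignificant": 20,
          "in": 400, "significant": 20}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log10 probability of a word; unknown words are penalized by length."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def best_split(text, max_word_len=20):
    """Return (score, words) for the most probable split of `text`."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), max_word_len) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = best_split(rest, max_word_len)
        candidates.append((word_logprob(first) + rest_score,
                           (first,) + rest_words))
    return max(candidates)


def segment(text):
    return " ".join(best_split(text)[1])


print(segment("smallandinsignificant"))  # small and insignificant
```

With these toy counts, "insignificant" as one word scores higher than "in" followed by "significant", so the result depends directly on the statistics supplied. That is why different corpora can produce different segmentations.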

Example: in order to perform word segmentation, first instantiate a segmenter with a given corpus, and then use the segment() method:

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter(corpus="mycorpus")
print(seg.segment("smallandinsignificant"))

Output:

 > small and insignificant

You can compare the output produced with the statistics from the different corpora:

from ekphrasis.classes.segmenter import Segmenter

# segmenter using the word statistics from English Wikipedia
seg_eng = Segmenter(corpus="english")

# segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")

words = ["exponentialbackoff", "gamedev", "retrogaming", "thewatercooler", "panpsychism"]
for w in words:
    print(w)
    print("(eng):", seg_eng.segment(w))
    print("(tw):", seg_tw.segment(w))
    print()

Output:

 exponentialbackoff
(eng): exponential backoff
(tw): exponential back off

gamedev
(eng): gamedev
(tw): game dev

retrogaming
(eng): retrogaming
(tw): retro gaming

thewatercooler
(eng): the water cooler
(tw): the watercooler

panpsychism
(eng): panpsychism
(tw): pan psych is m

Finally, if the word is camelCased or PascalCased, the algorithm splits it based on the case of the characters.

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter()
print(seg.segment("camelCased"))
print(seg.segment("PascalCased"))

Output:

 > camel cased
> pascal cased
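
Case-based splitting of this kind can be approximated with a single regular expression. The sketch below is an assumption about how such a rule might look, not the pattern ekphrasis uses; `split_on_case` is a hypothetical name.

```python
import re

# An optional capital followed by lowercase letters, a run of capitals
# that is not the start of a capitalized word, or a run of digits.
_CASE_PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def split_on_case(word):
    """Split a camelCased or PascalCased word and lowercase the parts."""
    return " ".join(part.lower() for part in _CASE_PARTS.findall(word))


print(split_on_case("camelCased"))   # camel cased
print(split_on_case("PascalCased"))  # pascal cased
```

A word with no internal capitals comes back as a single part, so a statistical segmenter would still be needed for hashtags such as "davidlynch".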

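Spell Correction

The spell corrector is based on Peter Norvig's spell corrector. Just like the segmentation algorithm, it uses word statistics in order to find the most probable candidate. Besides the provided statistics, you can use your own.

Example:

You can perform spell correction in the same way as word segmentation. First instantiate a SpellCorrector object that uses the statistics from the corpus of your choice, and then use one of the available methods.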

from ekphrasis.classes.spellcorrect import SpellCorrector
sp = SpellCorrector(corpus="english")
print(sp.correct("korrect"))

Output:

 > correct
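
For readers unfamiliar with the Norvig-style approach, the following is a compact sketch of the general technique: generate every string within one or two edits of the input and pick the known word with the highest count. The toy `COUNTS` table and the function names are assumptions for illustration and do not reproduce the SpellCorrector class.

```python
import string

# Toy word counts; a real corrector would use corpus statistics.
COUNTS = {"correct": 120, "collect": 40, "connect": 35, "the": 1000}


def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the statistics."""
    return {w for w in words if w in COUNTS}


def correct(word):
    """Return the most frequent known word within two edits, else the input."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    return max(candidates, key=lambda w: COUNTS.get(w, 0))


print(correct("korrect"))  # correct
print(correct("teh"))      # the
```

Generating the two-edit candidates is the expensive step, which is consistent with the note that spell correction affects performance.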

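Social Tokenizer

The difficulty in tokenization is to avoid splitting expressions or words that should be kept intact (as one token). This matters even more in texts from social networks, with their "creative" writing and expressions such as emoticons, hashtags and so on. Although there are some tokenizers geared towards Twitter [1], [2] that recognize the Twitter markup and some basic sentiment expressions or simple emoticons, our tokenizer is able to identify almost all emoticons, emojis and many complex expressions.

Especially for tasks such as sentiment analysis, there are many expressions that play a decisive role in identifying the sentiment expressed in a text. Expressions like these are:

Censored words, such as f**k, s**t.
Words with emphasis, such as a *great* time, I don't *think* I ... .
Emoticons, such as >:(, :)), o/.
Dash-delimited words, such as over-consumption, anti-american, mind-blowing.

Moreover, ekphrasis can identify information-bearing expressions. Depending on the task, you may want to preserve or extract them as one token (e.g. for IR), or to normalize them, since this information may be irrelevant for the task (e.g. sentiment analysis). Expressions like these are:

Dates, such as Feb 18th, December 2, 2016, December 2-2016, 10/17/94, 3 December 2016, April 25, 1995, 11.15.16, October 24th 1995, November 24th 2016, January 21st.
Times, such as 5:45pm, 11:36 AM, 2:45 pm, 5:30.
Currencies, such as $220M, $2B, $65.000, €10, $50K.
Phone numbers.
URLs, such as http://www.cs.unipi.gr, https://t.co/Wfw5Z1iSEt.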
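
The general technique behind keeping such expressions intact is to try a prioritized list of patterns before falling back to plain words. The sketch below illustrates this with a handful of deliberately simplified patterns. They are assumptions written for this example and are far less complete than the expressions shipped in ekphrasis/regexes/expressions.txt.

```python
import re

# Ordered from most to least specific; the first alternative that matches wins.
TOKEN_PATTERNS = [
    ("url", r"https?://\S+"),
    ("hashtag", r"#\w+"),
    ("user", r"@\w+"),
    ("money", r"[$€£]\d+(?:[.,]\d+)?[KMB]?"),
    ("time", r"\d{1,2}:\d{2}(?:\s?[aApP][mM]\b)?"),
    ("censored", r"\w+\*+\w+"),
    ("emoticon", r"[>]?[:;=][\-']?[()\[\]DPp/\\|]+"),
    ("word", r"\w+(?:[-']\w+)*"),
    ("other", r"\S"),  # any remaining non-space character
]

TOKENIZER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def tokenize(text):
    """Return the matched tokens in order; whitespace is skipped."""
    return [match.group() for match in TOKENIZER.finditer(text)]


print(tokenize("WAISTED $10 at 5:45pm on #badmovies :)) f**k"))
# ['WAISTED', '$10', 'at', '5:45pm', 'on', '#badmovies', ':))', 'f**k']
```

Because alternatives are tried in order, placing the specific patterns ahead of the generic `word` and `other` fallbacks is what stops "$10" or ":))" from being broken into separate characters.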

Example:

import nltk
from ekphrasis.classes.tokenizer import SocialTokenizer


def wsp_tokenizer(text):
    return text.split(" ")

puncttok = nltk.WordPunctTokenizer().tokenize

social_tokenizer = SocialTokenizer(lowercase=False).tokenize

sents = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! >:-D http://sentimentsymposium.com/.",
]

for s in sents:
    print()
    print("ORG: ", s)  # original sentence
    print("WSP : ", wsp_tokenizer(s))  # whitespace tokenizer
    print("WPU : ", puncttok(s))  # WordPunct tokenizer
    print("SC : ", social_tokenizer(s))  # social tokenizer

Output:

 ORG:  CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))
WSP :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay!!!', '#davidlynch', '#tvseries', ':)))']
WPU :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#', 'TwinPeaks', '＼(^', 'o', '^)／', 'yaaaay', '!!!', '#', 'davidlynch', '#', 'tvseries', ':)))']
SC :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay', '!', '!', '!', '#davidlynch', '#tvseries', ':)))']

ORG:  I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/
WSP :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks!!!', 'WAISTED', '$10...', '#badmovies', '>3:/']
WPU :  ['I', 'saw', 'the', 'new', '#', 'johndoe', 'movie', 'and', 'it', 'suuuuucks', '!!!', 'WAISTED', '$', '10', '...', '#', 'badmovies', '>', '3', ':/']
SC :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks', '!', '!', '!', 'WAISTED', '$10', '.', '.', '.', '#badmovies', '>', '3:/']