ekphrasis

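A collection of lightweight text tools, geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and a collection of 330 million English tweets).

ekphrasis was developed as part of the text processing pipeline for the DataStories team's submission to SemEval-2017 Task 4 (English), Sentiment Analysis in Twitter.

If you use the library in your research project, please cite the paper "DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis".

Citation: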

 @InProceedings{baziotis-pelekis-doulkeridis:2017:SemEval2,
  author    = {Baziotis, Christos  and  Pelekis, Nikos  and  Doulkeridis, Christos},
  title     = {DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
  month     = {August},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {747--754}
}

Disclaimer: The library is no longer actively developed. I will try to resolve important issues, but I can't make any promises.

Installation

Build from source

pip install git+https://github.com/cbaziotis/ekphrasis.git

or install it from PyPI

 pip install ekphrasis -U

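Overview

ekphrasis offers the following functionality:

Social Tokenizer. A text tokenizer geared towards social networks (Facebook, Twitter, ...), which understands complex emoticons, emojis, and other unstructured expressions such as dates and times.
Word Segmentation. You can split a long string into its constituent words. Suitable for hashtag segmentation.
Spell Correction. You can replace a misspelled word with the most probable candidate word.
Customization. Tailor the word segmentation, spell correction, and term identification to suit your needs.
The word segmentation and spell correction mechanisms operate on top of word statistics collected from a given corpus. ekphrasis provides word statistics from two large corpora (Wikipedia and Twitter), but you can also generate word statistics from your own corpus. You may need to do this when working with domain-specific text, such as biomedical documents. For example, a word describing a technique or a chemical compound may be treated as a misspelled word when using word statistics from a general-purpose corpus.
ekphrasis tokenizes the text based on a list of regular expressions. You can easily enable ekphrasis to identify new entities by adding a new entry to the dictionary of regular expressions (ekphrasis/regexes/expressions.txt).
Pre-Processing Pipeline. You can combine all the above steps in an easy way, in order to prepare the text files in your dataset for some kind of analysis or for machine learning. In addition to the aforementioned actions, you can perform text normalization, word annotation (labeling), and more.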
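
To make the idea of tokenizing from a dictionary of regular expressions more concrete, here is a minimal sketch using only the standard library. It is not how ekphrasis is implemented: the `EXPRESSIONS` dictionary, the `build_tokenizer` helper, and the `ticker` entity are hypothetical names chosen for illustration, and the patterns are far simpler than the ones shipped in ekphrasis/regexes/expressions.txt.

```python
import re

# Hypothetical mini-dictionary of named expressions (illustration only).
# Order matters: earlier entries win when several could match at a position.
EXPRESSIONS = {
    "url": r"https?://\S+",
    "hashtag": r"#\w+",
    "user": r"@\w+",
    "word": r"\w+(?:'\w+)?",
    "other": r"\S",
}


def build_tokenizer(expressions):
    """Combine the named patterns into one alternation of named groups."""
    combined = re.compile(
        "|".join(f"(?P<{name}>{pattern})" for name, pattern in expressions.items())
    )

    def tokenize(text):
        # m.lastgroup is the name of the pattern that produced the token.
        return [(m.group(), m.lastgroup) for m in combined.finditer(text)]

    return tokenize


tokenize = build_tokenizer(EXPRESSIONS)
print(tokenize("@bob loves #nlp http://x.io !"))

# Identifying a new entity only requires one more dictionary entry.
with_ticker = build_tokenizer({"ticker": r"\$[A-Z]{1,5}\b", **EXPRESSIONS})
print(with_ticker("buy $AAPL"))
```

Keeping the patterns in a single ordered mapping means that a new entity type is a data change rather than a code change, which is the property the paragraph above describes.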

Text Pre-Processing Pipeline

You can easily define a preprocessing pipeline by using the TextPreProcessor.

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens

    # corpus from which the word statistics are going to be used
    # for word segmentation
    segmenter="twitter",

    # corpus from which the word statistics are going to be used
    # for spell correction
    corrector="twitter",

    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words

    # select a tokenizer. You can use SocialTokenizer, or pass your own.
    # The tokenizer should take a string as input and return a list of tokens.
    tokenizer=SocialTokenizer(lowercase=True).tokenize,

    # list of dictionaries, for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)

sentences = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies :/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
]

for s in sentences:
    print(" ".join(text_processor.pre_process_doc(s)))

Output:

 cant <allcaps> wait <allcaps> for the new season of <hashtag> twin peaks </hashtag> ＼(^o^)／ ! <repeated> <hashtag> david lynch </hashtag> <hashtag> tv series </hashtag> <happy>

i saw the new <hashtag> john doe </hashtag> movie and it sucks <elongated> ! <repeated> waisted <allcaps> <money> . <repeated> <hashtag> bad movies </hashtag> <annoyed>

<user> : can not wait for the <date> <hashtag> sentiment </hashtag> talks ! yay <allcaps> <elongated> ! <repeated> <laugh> <url>

Notes:

Elongated words are normalized automatically.
Spell correction affects performance.
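
The `<allcaps>` and `<elongated>` annotations in the output above can be approximated with a few lines of standard-library Python. This is a simplified sketch, not the library's implementation: the `annotate_token` helper is a hypothetical name, and collapsing every long run to a single character is an assumption that would mangle words with legitimate double letters (a real normalizer would also need to consult a vocabulary).

```python
import re

# The same character three or more times in a row, e.g. "uuuuu" in "suuuuucks".
ELONGATED = re.compile(r"(\w)\1{2,}")


def annotate_token(token):
    """Return the normalized token followed by any annotation tags."""
    tags = []
    if len(token) > 1 and token.isalpha() and token.isupper():
        tags.append("<allcaps>")
    word = token.lower()
    if ELONGATED.search(word):
        # Simplification: collapse each long run to one character.
        word = ELONGATED.sub(r"\1", word)
        tags.append("<elongated>")
    return [word, *tags]


for tok in ["CANT", "suuuuucks", "YAAAAAAY", "movie"]:
    print(" ".join(annotate_token(tok)))
```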

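Word Statistics

ekphrasis provides word statistics (unigrams and bigrams) from two large corpora:

English Wikipedia
a collection of 330 million English Twitter messages

These word statistics are needed for word segmentation and spell correction. You can also generate word statistics from your own corpus, using ekphrasis/tools/generate_stats.py, from a text file or from a directory that contains a collection of text files. For example, to generate the word statistics for text8 (http://mattmahoney.net/dc/text8.zip), run: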

 python generate_stats.py --input text8.txt --name text8 --ngrams 2 --mincount 70 30

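input: path to a file, or to a directory containing files, from which the statistics are computed.
name: the name of the corpus.
ngrams: the maximum n-gram order for which statistics are computed.
mincount: the minimum count for an n-gram to be included, one value per order. In this case, the minimum count is 70 for unigrams and 30 for bigrams.

After you run the script, you will see a new directory inside ekphrasis/stats/ containing the statistics of your corpus. For the example above: ekphrasis/stats/text8/.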
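
To illustrate what n-gram statistics with per-order minimum counts look like, here is a small standard-library sketch. It is not the code of generate_stats.py: the `ngram_stats` function is a hypothetical helper, and it assumes plain lowercase whitespace tokenization and n-grams that do not cross line boundaries.

```python
from collections import Counter


def ngram_stats(lines, max_n=2, mincount=(70, 30)):
    """Count 1..max_n-grams and drop those below the per-order minimum."""
    counters = [Counter() for _ in range(max_n)]
    for line in lines:
        tokens = line.lower().split()
        for n in range(1, max_n + 1):
            # Zipping n shifted copies of the token list yields every n-gram.
            for gram in zip(*(tokens[i:] for i in range(n))):
                counters[n - 1][" ".join(gram)] += 1
    return [
        {gram: count for gram, count in counter.items() if count >= mincount[order]}
        for order, counter in enumerate(counters)
    ]


unigrams, bigrams = ngram_stats(
    ["the cat sat", "the cat ran", "the dog"], max_n=2, mincount=(2, 2)
)
print(unigrams)
print(bigrams)
```

The two thresholds play the same role as the two values passed to mincount in the command above: rare n-grams are discarded so that the resulting tables stay small.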

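Word Segmentation

The word segmentation implementation uses the Viterbi algorithm and is based on CH14 of the book Beautiful Data (Segaran and Hammerbacher, 2009). The implementation requires word statistics in order to identify and separate the words in a string. You can use the word statistics from one of the two provided corpora, or from your own corpus.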
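
The general approach can be sketched as a memoized dynamic-programming search over all split points, scored with unigram probabilities. The sketch below is only in the spirit of that chapter and is not the ekphrasis implementation: the counts in `COUNTS`, the corpus size `TOTAL`, the `MAX_WORD_LEN` limit, and the length-based penalty for unknown words are all assumptions made for illustration, and bigram statistics are ignored.

```python
import math
from functools import lru_cache

# Toy unigram counts and corpus size, invented for this example.
COUNTS = {"small": 3000, "and": 30000, "insignificant": 100,
          "in": 20000, "significant": 500}
TOTAL = 1_000_000
MAX_WORD_LEN = 24  # assumed upper bound on the length of a single word


def log_prob(word):
    """Log10 probability of a word under the toy unigram model."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    # Unknown words are penalized more heavily the longer they are.
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    return max(candidates)


score, words = segment("smallandinsignificant")
print(" ".join(words))
```

With these toy counts, "insignificant" as one word scores higher than "in" followed by "significant", because every additional word multiplies in another probability below one. Different statistics can flip that decision, which is why the choice of corpus matters.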

Example: To perform word segmentation, first instantiate a segmenter with a given corpus, and then just use the segment() method.

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter(corpus="mycorpus")
print(seg.segment("smallandinsignificant"))

Output:

 > small and insignificant

You can test the output using statistics from the different corpora:

from ekphrasis.classes.segmenter import Segmenter

# segmenter using the word statistics from English Wikipedia
seg_eng = Segmenter(corpus="english")

# segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")

words = ["exponentialbackoff", "gamedev", "retrogaming", "thewatercooler", "panpsychism"]
for w in words:
    print(w)
    print("(eng):", seg_eng.segment(w))
    print("(tw):", seg_tw.segment(w))
    print()

Output:

 exponentialbackoff
(eng): exponential backoff
(tw): exponential back off

gamedev
(eng): gamedev
(tw): game dev

retrogaming
(eng): retrogaming
(tw): retro gaming

thewatercooler
(eng): the water cooler
(tw): the watercooler

panpsychism
(eng): panpsychism
(tw): pan psych is m

Finally, if the word is camelCased or PascalCased, then the algorithm splits the word based on the case of the characters.

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter()
print(seg.segment("camelCased"))
print(seg.segment("PascalCased"))

Output:

 > camel cased
> pascal cased
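
Case-based splitting of this kind can be approximated with a single regular expression. The following is a minimal sketch under the assumption that a word boundary is any lowercase letter or digit followed by an uppercase letter; `split_on_case` is a hypothetical helper, not a function of the library.

```python
import re


def split_on_case(word):
    """Split camelCase / PascalCase at lower-to-upper transitions."""
    # Insert a space between a lowercase letter (or digit) and the
    # uppercase letter that follows it, then lowercase the result.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", word).lower()


print(split_on_case("camelCased"))
print(split_on_case("PascalCased"))
```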

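Spell Correction

The spell corrector is based on Peter Norvig's spell corrector. Like the segmentation algorithm, it uses word statistics to find the most probable candidate. Besides the provided statistics, you can use your own.

Example:

As with word segmentation, you first instantiate a SpellCorrector object that uses the statistics from the corpus of your choice, and then use one of the available methods.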

from ekphrasis.classes.spellcorrect import SpellCorrector
sp = SpellCorrector(corpus="english")
print(sp.correct("korrect"))

Output:

 > correct
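
The idea behind a Norvig-style corrector can be sketched in a few lines: generate every string within one edit of the input, keep those that appear in the word statistics, and pick the most frequent one. The sketch below is a simplified illustration rather than the SpellCorrector class itself: the counts in `COUNTS` are invented, and only candidates at edit distance one are considered.

```python
import string

# Toy word counts, invented for this example.
COUNTS = {"correct": 120, "collect": 40, "corset": 3}


def edits1(word):
    """All strings that are one delete, transpose, replace, or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the statistics."""
    return {w for w in words if w in COUNTS}


def correct(word):
    """Prefer the word itself, then known words one edit away, else give up."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: COUNTS.get(w, 0))


print(correct("korrect"))
```

Generating the edits is the expensive part, since the number of candidates grows quickly with word length and edit distance. That cost is one reason spell correction slows down a preprocessing pipeline.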

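Social Tokenizer

The difficulty in tokenization is to avoid splitting expressions or words that should be kept intact (as one token). This matters even more in texts from social networks, which contain "creative" writing and expressions such as emoticons and hashtags. There are some tokenizers geared towards Twitter [1], [2] that recognize the Twitter markup and some basic sentiment expressions or simple emoticons, but our tokenizer is able to identify almost all emoticons, emojis, and many complex expressions.

Especially for tasks such as sentiment analysis, there are many expressions that play a decisive role in identifying the sentiment expressed in the text. Expressions like these are:

Censored words, such as f**k, s**t.
Words with emphasis, such as a *great* time, I don't *think* I ....
Emoticons, such as >:(, :)), o/.
Dash-delimited words, such as over-consumption, anti-american, mind-blowing.

Moreover, ekphrasis can identify information-bearing expressions. Depending on the task, you may want to preserve or extract them as one token (IR) and then normalize them, since this information may be irrelevant for the task (sentiment analysis). Expressions like these are:

Dates, such as Feb 18th, December 2, 2016, December 2-2016, 10/17/94, 3 December 2016, April 25, 1995, 11.15.16, November 24th 2016, January 21st.
Times, such as 5:45pm, 11:36 AM, 2:45 pm, 5:30.
Currencies, such as $220M, $2B, $65.000, €10, $50K.
Phone numbers.
URLs, such as http://www.cs.unipi.gr, https://t.co/Wfw5Z1iSEt.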
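
Normalizing such expressions amounts to replacing each match with a placeholder tag. The following sketch shows the idea for three of the categories, using deliberately simple patterns that are assumptions made for illustration; the `PATTERNS` list and the `normalize` helper are hypothetical, and the expressions used by ekphrasis cover many more formats.

```python
import re

# Hypothetical, deliberately simple patterns. URLs are replaced first so
# that digits inside a URL are not mistaken for a time or an amount.
PATTERNS = [
    ("url", re.compile(r"https?://\S+")),
    ("time", re.compile(r"\b\d{1,2}:\d{2}(?:\s?[ap]m)?\b", re.IGNORECASE)),
    ("money", re.compile(r"[$€]\d+(?:[.,]\d+)*[KMB]?\b")),
]


def normalize(text):
    """Replace every matched expression with a <name> placeholder."""
    for name, pattern in PATTERNS:
        text = pattern.sub(f"<{name}>", text)
    return text


print(normalize("paid $50K at 5:45pm, see https://t.co/Wfw5Z1iSEt"))
```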

Example:

import nltk
from ekphrasis.classes.tokenizer import SocialTokenizer


def wsp_tokenizer(text):
    return text.split(" ")

puncttok = nltk.WordPunctTokenizer().tokenize

social_tokenizer = SocialTokenizer(lowercase=False).tokenize

sents = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! >:-D http://sentimentsymposium.com/.",
]

for s in sents:
    print()
    print("ORG: ", s)  # original sentence
    print("WSP : ", wsp_tokenizer(s))  # whitespace tokenizer
    print("WPU : ", puncttok(s))  # WordPunct tokenizer
    print("SC : ", social_tokenizer(s))  # social tokenizer

Output:

 ORG:  CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))
WSP :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay!!!', '#davidlynch', '#tvseries', ':)))']
WPU :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#', 'TwinPeaks', '＼(^', 'o', '^)／', 'yaaaay', '!!!', '#', 'davidlynch', '#', 'tvseries', ':)))']
SC :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay', '!', '!', '!', '#davidlynch', '#tvseries', ':)))']

ORG:  I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/
WSP :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks!!!', 'WAISTED', '$10...', '#badmovies', '>3:/']
WPU :  ['I', 'saw', 'the', 'new', '#', 'johndoe', 'movie', 'and', 'it', 'suuuuucks', '!!!', 'WAISTED', '$', '10', '...', '#', 'badmovies', '>', '3', ':/']
SC :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks', '!', '!', '!', 'WAISTED', '$10', '.', '.', '.', '#badmovies', '>', '3:/']