ekphrasis

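ekphrasis is a collection of lightweight text tools geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and a Twitter corpus of 330 million English tweets).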

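ekphrasis was developed as part of the text processing pipeline for the DataStories team's submission to SemEval-2017 Task 4 (English), Sentiment Analysis in Twitter.

If you use the library in your research project, please cite the paper "DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis".

Citation: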

 @InProceedings{baziotis-pelekis-doulkeridis:2017:SemEval2,
  author    = {Baziotis, Christos  and  Pelekis, Nikos  and  Doulkeridis, Christos},
  title     = {DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
  month     = {August},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {747--754}
}

Disclaimer: The library is no longer actively developed. I will try to resolve important issues, but I can't make any promises.

Installation

Build from source

 pip install git+https://github.com/cbaziotis/ekphrasis.git

or install from PyPI

 pip install ekphrasis -U

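Overview

ekphrasis offers the following functionality:

Social Tokenizer. A text tokenizer geared towards social networks (Facebook, Twitter...), which understands complex emoticons, emojis and other unstructured expressions such as dates, times and more.
Word Segmentation. You can split a long string into its constituent words. Suitable for hashtag segmentation.
Spell Correction. You can replace a misspelled word with the most probable candidate word.
Customization. Tailor the word segmentation, spell correction and term identification to suit your needs.
The word segmentation and spell correction mechanisms operate on top of word statistics collected from a given corpus. We provide word statistics from 2 big corpora (Wikipedia and Twitter), but you can also generate word statistics from your own corpus. You may need to do that if you are working with domain-specific texts, such as biomedical documents. For example, a word describing a technique or a chemical compound may be treated as a misspelled word when the word statistics come from a general-purpose corpus.
ekphrasis tokenizes the text based on a list of regular expressions. You can easily enable ekphrasis to identify new entities by simply adding a new entry to the dictionary of regular expressions (ekphrasis/regexes/expressions.txt).
Pre-Processing Pipeline. You can combine all the above steps in an easy way, in order to prepare the text files in your dataset for some kind of analysis or for machine learning. In addition to the aforementioned actions, you can perform text normalization, word annotation (labeling) and more.

Text Pre-Processing Pipeline

You can easily define a preprocessing pipeline by using the TextPreProcessor.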

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'url', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens

    # corpus from which the word statistics are going to be used
    # for word segmentation
    segmenter="twitter",

    # corpus from which the word statistics are going to be used
    # for spell correction
    corrector="twitter",

    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words

    # select a tokenizer. You can use SocialTokenizer, or pass your own.
    # The tokenizer should take a string as input and return a list of tokens.
    tokenizer=SocialTokenizer(lowercase=True).tokenize,

    # list of dictionaries, for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)

sentences = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies :/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
]

for s in sentences:
    print(" ".join(text_processor.pre_process_doc(s)))

Output:

 cant <allcaps> wait <allcaps> for the new season of <hashtag> twin peaks </hashtag> ＼(^o^)／ ! <repeated> <hashtag> david lynch </hashtag> <hashtag> tv series </hashtag> <happy>

i saw the new <hashtag> john doe </hashtag> movie and it sucks <elongated> ! <repeated> waisted <allcaps> <money> . <repeated> <hashtag> bad movies </hashtag> <annoyed>

<user> : can not wait for the <date> <hashtag> sentiment </hashtag> talks ! yay <allcaps> <elongated> ! <repeated> <laugh> <url>

Notes:

Elongated words are automatically normalized.
Spell correction affects performance.
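
To make the first note concrete, here is a minimal sketch of how an elongated word could be collapsed and flagged, matching the `sucks <elongated>` pattern in the output above. The function name `normalize_elongated` and the regular expression are assumptions made for illustration; this is not the library's own implementation, which may also consult word statistics.

```python
import re

def normalize_elongated(token):
    """Collapse runs of 3+ identical letters and flag the token if it changed.

    Hypothetical helper for illustration only, not part of ekphrasis.
    """
    collapsed = re.sub(r"([a-zA-Z])\1{2,}", r"\1", token)
    if collapsed != token:
        return [collapsed, "<elongated>"]
    return [token]

print(normalize_elongated("suuuuucks"))
print(normalize_elongated("good"))
```

A purely regex-based rule like this one cannot tell whether the collapsed letter should stay doubled (for example, "cooool" becomes "col" rather than "cool"), which is one reason a real normalizer would want word statistics to pick the right candidate.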

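Word Statistics

ekphrasis provides word statistics (unigrams and bigrams) from 2 big corpora:

English Wikipedia
a collection of 330 million English Twitter messages

These word statistics are needed for word segmentation and spell correction. You can also generate word statistics from your own corpus, using ekphrasis/tools/generate_stats.py, from a text file or from a directory that contains a collection of text files. For example, to generate word statistics for text8 (http://mattmahoney.net/dc/text8.zip), you can run: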

 python generate_stats.py --input text8.txt --name text8 --ngrams 2 --mincount 70 30

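input: the path to the file, or to the directory containing the files, from which the statistics are calculated.
name: the name of the corpus.
ngrams: the maximum n-gram order for which statistics are calculated.
mincount: the minimum count an n-gram needs in order to be included. In this case, the minimum count is 70 for unigrams and 30 for bigrams.

After you run the script, you will see a new directory inside ekphrasis/stats/ containing the statistics of your corpus. In the example above, that is ekphrasis/stats/text8/.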
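
As a rough illustration of what the `ngrams` and `mincount` options mean, the sketch below counts unigrams and bigrams in a toy sentence and prunes the rare ones. The helper names `ngram_counts` and `prune`, the whitespace tokenization and the thresholds are assumptions for this example; the actual script may tokenize and store its statistics differently.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune(counts, mincount):
    """Keep only the n-grams seen at least `mincount` times."""
    return {gram: c for gram, c in counts.items() if c >= mincount}

# Toy corpus, split on whitespace (an assumption made for brevity).
tokens = "the cat sat on the mat and the cat ran".split()

# One threshold per n-gram order, like "--mincount 70 30" in the command above.
unigrams = prune(ngram_counts(tokens, 1), mincount=2)
bigrams = prune(ngram_counts(tokens, 2), mincount=2)

print(unigrams)
print(bigrams)
```

Higher thresholds keep the statistics small and drop noisy, rare n-grams, at the cost of treating more legitimate words as unknown.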

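Word Segmentation

The word segmentation implementation uses the Viterbi algorithm and is based on CH14 of the book Beautiful Data (Segaran and Hammerbacher, 2009). The implementation requires word statistics in order to identify and separate the words in a string. You can use the word statistics from one of the 2 provided corpora, or from your own corpus.

Example: to perform word segmentation, first instantiate a segmenter with a given corpus, and then use the segment() method.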
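
The idea can be sketched with a memoized dynamic-programming search over unigram probabilities, in the spirit of that chapter. This is a simplified illustration and not the library's code: the counts in `COUNTS` are made up, only unigrams are used, and the length-based penalty for unknown words is an assumption.

```python
import math
from functools import lru_cache

# Toy unigram counts, invented for illustration; real counts come from a corpus.
COUNTS = {"small": 50, "and": 500, "insignificant": 30, "in": 400,
          "significant": 20, "a": 600, "all": 100, "sign": 10}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20

def log_prob(word):
    """Log-probability of a word; unknown words are penalized by their length."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of `text`."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)  # memoized sub-problem
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    return max(candidates)

score, words = segment("smallandinsignificant")
print(" ".join(words))
```

With these toy counts, "insignificant" as one word scores higher than "in" followed by "significant", which shows how the result depends entirely on the statistics of the chosen corpus.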

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter(corpus="mycorpus")
print(seg.segment("smallandinsignificant"))

Output:

 > small and insignificant

You can test the output using statistics from the different corpora:

from ekphrasis.classes.segmenter import Segmenter

# segmenter using the word statistics from English Wikipedia
seg_eng = Segmenter(corpus="english")

# segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")

words = ["exponentialbackoff", "gamedev", "retrogaming", "thewatercooler", "panpsychism"]
for w in words:
    print(w)
    print("(eng):", seg_eng.segment(w))
    print("(tw):", seg_tw.segment(w))
    print()

Output:

 exponentialbackoff
(eng): exponential backoff
(tw): exponential back off

gamedev
(eng): gamedev
(tw): game dev

retrogaming
(eng): retrogaming
(tw): retro gaming

thewatercooler
(eng): the water cooler
(tw): the watercooler

panpsychism
(eng): panpsychism
(tw): pan psych is m

Finally, if the word is camelCased or PascalCased, the algorithm splits it based on the case of the characters.

from ekphrasis.classes.segmenter import Segmenter
seg = Segmenter()
print(seg.segment("camelCased"))
print(seg.segment("PascalCased"))

Output:

 > camel cased
> pascal cased
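
A case-based split of this kind can be approximated with a single regular expression. The sketch below is an assumption about how such a rule might look, with a hypothetical helper name `split_on_case`; it is not taken from the library's source.

```python
import re

def split_on_case(word):
    """Insert a space at each lowercase-to-uppercase boundary, then lowercase.

    Hypothetical helper for illustration only.
    """
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", word)
    return spaced.lower()

print(split_on_case("camelCased"))
print(split_on_case("PascalCased"))
```

Because the casing already marks the word boundaries, a rule like this needs no word statistics, which is helpful for hashtags such as #TwinPeaks.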

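Spell Correction

The spell corrector is based on Peter Norvig's spell corrector. Like the segmentation algorithm, it uses word statistics to find the most probable candidate. Besides the provided statistics, you can use your own.

Example:

You can perform spell correction just like word segmentation. First instantiate a SpellCorrector object that uses the statistics of the corpus of your choice, and then use one of the available methods.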

from ekphrasis.classes.spellcorrect import SpellCorrector
sp = SpellCorrector(corpus="english")
print(sp.correct("korrect"))

Output:

 > correct
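
For readers unfamiliar with Norvig's approach, here is a compact sketch of the core idea: generate every string one edit away from the input, keep those that appear in the word statistics, and pick the most frequent. The counts in `COUNTS` are invented for illustration, and this sketch looks only one edit away; it is not the SpellCorrector implementation.

```python
import string

# Toy unigram counts, invented for illustration only.
COUNTS = {"correct": 120, "collect": 40, "connect": 60, "the": 1000}

def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """The subset of `words` that appears in the word statistics."""
    return {w for w in words if w in COUNTS}

def correct(word):
    """Most frequent known candidate; falls back to the word itself."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: COUNTS.get(w, 0))

print(correct("korrect"))
```

A domain-specific term that is missing from the statistics would be "corrected" towards a frequent general word by this procedure, which is why custom word statistics matter for specialized text.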

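Social Tokenizer

The difficulty in tokenization is to avoid splitting expressions or words that should be kept intact as one token. This matters even more in texts from social networks, with their "creative" writing and expressions such as emoticons, hashtags and so on. Although there are some tokenizers geared towards Twitter [1],[2] that recognize the Twitter markup and some basic sentiment expressions or simple emoticons, our tokenizer is able to identify almost all emoticons, emojis and many complex expressions.

Especially for tasks such as sentiment analysis, many expressions play a decisive role in identifying the sentiment expressed in the text. Expressions like these are:

Censored words, such as f**k , s**t .
Words with emphasis, such as a *great* time , I don't *think* I ... .
Emoticons, such as >:( , :)) , o/ .
Dash-separated words, such as over-consumption , anti-american , mind-blowing .

Moreover, ekphrasis can identify information-bearing expressions. Depending on the task, you may want to preserve or extract them as one token (IR), or to normalize them because the information may be irrelevant for the task (sentiment analysis). Expressions like these are:

Dates, such as Feb 18th , December 2, 2016 , December 2-2016 , 10/17/94 , 3 December 2016 , April 25, 1995 , 11.15.16 , November 24th 2016 , January 21st .
Times, such as 5:45pm , 11:36 AM , 2:45 pm , 5:30 .
Currencies, such as $220M , $2B , $65.000 , €10 , $50K .
Phone numbers.
URLs, such as http://www.cs.unipi.gr , https://t.co/Wfw5Z1iSEt .

Example: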
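
The general technique of tokenizing from a list of regular expressions can be sketched as follows. The patterns in `PATTERNS` are simplified, hypothetical stand-ins written for this example; they are not the entries of ekphrasis/regexes/expressions.txt and cover far fewer cases than the real tokenizer.

```python
import re

# Hypothetical patterns, ordered from most to least specific.
PATTERNS = [
    r"https?://\S+",                 # URLs
    r"[@#]\w+",                      # user mentions and hashtags
    r"\$\d+(?:[.,]\d+)?[KMB]?",      # dollar amounts such as $10 or $2B
    r"\d{1,2}:\d{2}(?:\s?[ap]m)?",   # times such as 5:45pm
    r"[:;=]-?[)(DPp/]+",             # a few simple emoticons
    r"\w+(?:[-']\w+)*",              # words, including dash-separated ones
    r"\S",                           # any other non-space character
]
TOKEN_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)

def tokenize(text):
    """Return the matches of the combined pattern, in order of appearance."""
    return TOKEN_RE.findall(text)

print(tokenize("WAISTED $10... #badmovies :/"))
print(tokenize("at 5:45pm see http://t.co/x :)"))
```

Since the alternatives are tried in order, the specific patterns claim their text before the generic word and single-character patterns can split it, so recognizing a new kind of entity comes down to adding one more pattern in the right position.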

import nltk
from ekphrasis.classes.tokenizer import SocialTokenizer


def wsp_tokenizer(text):
    return text.split(" ")

puncttok = nltk.WordPunctTokenizer().tokenize

social_tokenizer = SocialTokenizer(lowercase=False).tokenize

sents = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! >:-D http://sentimentsymposium.com/.",
]

for s in sents:
    print()
    print("ORG: ", s)  # original sentence
    print("WSP : ", wsp_tokenizer(s))  # whitespace tokenizer
    print("WPU : ", puncttok(s))  # WordPunct tokenizer
    print("SC : ", social_tokenizer(s))  # social tokenizer

Output:

 ORG:  CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))
WSP :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay!!!', '#davidlynch', '#tvseries', ':)))']
WPU :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#', 'TwinPeaks', '＼(^', 'o', '^)／', 'yaaaay', '!!!', '#', 'davidlynch', '#', 'tvseries', ':)))']
SC :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay', '!', '!', '!', '#davidlynch', '#tvseries', ':)))']

ORG:  I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies >3:/
WSP :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks!!!', 'WAISTED', '$10...', '#badmovies', '>3:/']
WPU :  ['I', 'saw', 'the', 'new', '#', 'johndoe', 'movie', 'and', 'it', 'suuuuucks', '!!!', 'WAISTED', '$', '10', '...', '#', 'badmovies', '>', '3', ':/']
SC :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks', '!', '!', '!', 'WAISTED', '$10', '.', '.', '.', '#badmovies', '>', '3:/']