A library to interface with the OpenNLP (Open Natural Language Processing) library of functions. Not all functions are implemented yet.
Additional information/documentation:
Read the Marginalia documentation
[clojure-opennlp "0.5.0"] ;; uses OpenNLP 1.9.0
clojure-opennlp works with Clojure 1.5+
(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)
(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here
You will need to make the processing functions using the model files. These assume you are running from the root project directory. You can also download the model files from the OpenNLP project at http://opennlp.sourceforge.net/models-1.5
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
The tool creators are multimethods, so you can also create any of the tools using a model instead of a filename (you can create a model with the training tools in src/opennlp/tools/train.clj):
(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc
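For example, a minimal sketch of building such a model with the training tools and then passing it to a tool creator. This assumes a train-tokenizer function in opennlp.tools.train and a local training file named training/tokenizer.train; both are illustrative, so check src/opennlp/tools/train.clj for the actual names and arguments:

(use 'opennlp.tools.train)
;; hypothetical training-data path; see TRAINING docs for the expected file format
(def my-tokenizer-model (train-tokenizer "training/tokenizer.train"))
(def tokenize (make-tokenizer my-tokenizer-model))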
Then, use the functions you have created to perform operations on text.
Detecting sentences:
(pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence. ", "Second sentence? ", "Here is another one. ",
 "And so on and so forth - you get the idea..."]
Tokenizing:
(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]
Detokenizing:
(detokenize ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"])
"Mr. Smith gave a car to his son on Friday."
Ideally, s == (detokenize (tokenize s)). The detokenization model XML file is a work in progress; please let me know if you run into something that doesn't detokenize correctly in English.
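For example, a quick round-trip check along those lines, using only the functions defined above (the exact result depends on the models, so this is illustrative rather than guaranteed):

(let [s "Mr. Smith gave a car to his son on Friday."]
  ;; true when detokenization exactly reverses tokenization for this sentence
  (= s (detokenize (tokenize s))))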
Part-of-speech tagging:
(pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["gave" "VBD"]
 ["a" "DT"]
 ["car" "NN"]
 ["to" "TO"]
 ["his" "PRP$"]
 ["son" "NN"]
 ["on" "IN"]
 ["Friday." "NNP"])
Name finding:
(name-find (tokenize "My name is Lee, not John."))
("Lee" "John")
Treebank-chunking splits and tags phrases from a POS-tagged sentence. A notable difference is that it returns a list of structs with :phrase and :tag keys, as seen below:
(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["when"], :tag "ADVP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"}
 {:phrase ["is" "pressed"], :tag "VP"})
For just the phrases:
(phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
(["The" "override" "system"] ["is" "meant" "to" "deactivate"] ["the" "accelerator"] ["when"] ["the" "brake" "pedal"] ["is" "pressed"])
And with just strings:
(phrase-strings (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
("The override system" "is meant to deactivate" "the accelerator" "when" "the brake pedal" "is pressed")
Document categorization:
See opennlp.test.tools.train for better usage examples.
(def doccat (make-document-categorizer "my-doccat-model"))
(doccat "This is some good text")
"Happy"
The probabilities OpenNLP supplies for a given operation are available as metadata on the result, where applicable:
(meta (get-sentences "This is a sentence. This is also one."))
{:probabilities (0.9999054310803004 0.9941126097177366)}
(meta (tokenize "This is a sentence."))
{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}
(meta (pos-tag ["This" "is" "a" "sentence" "."]))
{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}
(meta (chunker (pos-tag ["This" "is" "a" "sentence" "."])))
{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}
(meta (name-find ["My" "name" "is" "John"]))
{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}
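As a small sketch of putting that metadata to use, here is one way to pair each POS tag with its probability; it only uses the functions defined above, and the exact numbers will depend on the model:

(let [tags (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))]
  ;; conj each probability onto its [token tag] vector, e.g. ["Mr." "NNP" 0.96...]
  (map conj tags (:probabilities (meta tags))))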
You can re-bind opennlp.nlp/*beam-size* (the default is 3) for the pos-tagger and treebank-parser with:
(binding [*beam-size* 1]
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin")))
You can re-bind opennlp.treebank/*advance-percentage* (the default is 0.95) for the treebank-parser with:
(binding [*advance-percentage* 0.80]
  (def parser (make-treebank-parser "parser-model/en-parser-chunking.bin")))
Note: treebank parsing is very memory intensive; make sure your JVM has a sufficient amount of memory available (using something like -Xmx512m) or you will run out of heap space when using a treebank parser.
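For instance, if you are using Leiningen, one way to do this is via :jvm-opts in project.clj (a minimal sketch; pick a heap size appropriate for your data):

;; project.clj (fragment)
:jvm-opts ["-Xmx512m"]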
Treebank parsing gets its own section due to how complex it is.
Note: the treebank-parser model is not included in the git repository; you will have to download it separately from the OpenNLP project.
Create it:
(def treebank-parser (make-treebank-parser "parser-model/en-parser-chunking.bin"))
To use the treebank-parser, pass a sequence of sentences with their tokens separated by whitespace (preferably using tokenize):
(treebank-parser ["This is a sentence ."])
["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]
To transform the treebank-parser string into something a little easier for Clojure to work with, use the (make-tree ...) function:
(make-tree (first (treebank-parser ["This is a sentence ."])))
{:chunk {:chunk ({:chunk {:chunk "This", :tag DT}, :tag NP} {:chunk ({:chunk "is", :tag VBZ} {:chunk ({:chunk "a", :tag DT} {:chunk "sentence", :tag NN}), :tag NP}), :tag VP} {:chunk ".", :tag .}), :tag S}, :tag TOP}
Here is the same data structure split out into a somewhat more readable format:
{:tag TOP
 :chunk {:tag S
         :chunk ({:tag NP
                  :chunk {:tag DT
                          :chunk "This"}}
                 {:tag VP
                  :chunk ({:tag VBZ
                           :chunk "is"}
                          {:tag NP
                           :chunk ({:tag DT
                                    :chunk "a"}
                                   {:tag NN
                                    :chunk "sentence"})})}
                 {:tag .
                  :chunk "."})}}
Hopefully that makes it a little clearer; it's a nested map. If anyone has suggestions for a better way to represent this information, send me an email or a patch.
Treebank parsing should be considered beta at this point.
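As a sketch of walking the nested map shown above, here is a small helper (tree-leaves is a hypothetical name, not part of the library) that collects the leaf strings back out of a parse tree:

(defn tree-leaves
  "Collect the leaf strings of a tree produced by make-tree."
  [node]
  (let [c (:chunk node)]
    (cond
      (string? c) [c]                       ; a leaf: the chunk is the token itself
      (map? c)    (tree-leaves c)           ; a single child node
      :else       (mapcat tree-leaves c)))) ; a sequence of child nodes

(tree-leaves (make-tree (first (treebank-parser ["This is a sentence ."]))))
;; => ("This" "is" "a" "sentence" ".")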
(use 'opennlp.tools.filters)
(pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["car" "NN"]
 ["son" "NN"]
 ["Friday." "NNP"])
(pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])
(use 'opennlp.tools.filters)
(pprint (noun-phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed")))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"})
You can also create your own filters with (pos-filter ...):
(pos-filter determiners #"^DT")
#'user/determiners
(doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
Given a list of pos-tagged elements, return only the determiners in a list.
(pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
You can also create treebank-chunk filters using (chunk-filter ...):
(chunk-filter fragments #"^FRAG$")
(doc fragments)
-------------------------
opennlp.nlp/fragments
([elements__178__auto__])
Given a list of treebank-chunked elements, return only the fragments in a list.
There are some methods to help you be lazy when tagging text. Depending on the operation desired, use the corresponding method:
#'opennlp.tools.lazy/lazy-get-sentences
#'opennlp.tools.lazy/lazy-tokenize
#'opennlp.tools.lazy/lazy-tag
#'opennlp.tools.lazy/lazy-chunk
#'opennlp.tools.lazy/sentence-seq
Here's how to use them:
(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.lazy)
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
(lazy-get-sentences ["This body of text has three sentences. This is the first. This is the third." "This body has only two. Here's the last one."] get-sentences)
;; will lazily return:
(["This body of text has three sentences." "This is the first." "This is the third."] ["This body has only two." "Here's the last one."])
(lazy-tokenize ["This is a sentence." "This is another sentence." "This is the third."] tokenize)
;; will lazily return:
(["This" "is" "a" "sentence" "."] ["This" "is" "another" "sentence" "."] ["This" "is" "the" "third" "."])
(lazy-tag ["This is a sentence." "This is another sentence."] tokenize pos-tag)
;; will lazily return:
((["This" "DT"] ["is" "VBZ"] ["a" "DT"] ["sentence" "NN"] ["." "."]) (["This" "DT"] ["is" "VBZ"] ["another" "DT"] ["sentence" "NN"] ["." "."]))
(lazy-chunk ["This is a sentence." "This is another sentence."] tokenize pos-tag chunker)
;; will lazily return:
(({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["a" "sentence"], :tag "NP"}) ({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["another" "sentence"], :tag "NP"}))
Feel free to use the lazy functions, but I'm still not 100% set on the layout, so they may change in the future. (Maybe chaining them so that instead of taking a sequence of sentences it looks like (lazy-chunk (lazy-tag (lazy-tokenize (lazy-get-sentences ...))))).
Generating a lazy sequence of sentences from a file using opennlp.tools.lazy/sentence-seq:
(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))
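Building on that, here is a sketch of combining sentence-seq with the lazy helpers above (the file path is just a placeholder, and the lazy sequence is fully consumed inside the with-open so the reader stays valid):

(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (doseq [tokens (lazy-tokenize (sentence-seq rdr get-sentences) tokenize)]
    ;; each `tokens` is one lazily tokenized sentence
    (println (count tokens) "tokens")))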
There is code to allow training of models for each of the tools; please see the documentation in TRAINING.markdown.
Copyright (c) 2010 Matthew Lee Hinman
Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.