A library for interacting with the OpenNLP (Open Natural Language Processing) library of functions. Not all functions are implemented yet.
Additional information/documentation:
Read the source from Marginalia
[clojure-opennlp "0.5.0"] ;; uses OpenNLP 1.9.0
Clojure-opennlp works with Clojure 1.5+
(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)
(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here
You will need to make the processing functions using the model files. This assumes you are running from the root project directory. You can also download the model files from the OpenNLP project at http://opennlp.sourceforge.net/models-1.5
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
The tool creators are multimethods, so you can also create any of the tools using a model instead of a filename (you can create a model with the training tools in src/opennlp/tools/train.clj):
(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc
Then, use the functions you created to perform operations on text:
Detecting sentences:
(pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence.", "Second sentence?", "Here is another one.",
 "And so on and so forth - you get the idea..."]
Tokenizing:
(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]
Detokenizing:
(detokenize ["Mr." "Smith" "gave" "a" "car" "to" "his" "son" "on" "Friday"])
"Mr. Smith gave a car to his son on Friday."
Ideally, s == (detokenize (tokenize s)). The detokenization model XML file is a work in progress; please let me know if you run into something that doesn't detokenize correctly in English.
Part-of-speech tagging:
(pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["gave" "VBD"]
 ["a" "DT"]
 ["car" "NN"]
 ["to" "TO"]
 ["his" "PRP$"]
 ["son" "NN"]
 ["on" "IN"]
 ["Friday." "NNP"])
Name finding:
(name-find (tokenize "My name is Lee, not John."))
("Lee" "John")
Treebank-chunking splits and tags phrases from a pos-tagged sentence. A notable difference is that it returns a list of structs with :phrase and :tag keys, as seen below:
(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["when"], :tag "ADVP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"}
 {:phrase ["is" "pressed"], :tag "VP"})
For just the phrases:
(phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
(["The" "override" "system"] ["is" "meant" "to" "deactivate"] ["the" "accelerator"] ["when"] ["the" "brake" "pedal"] ["is" "pressed"])
And with just strings:
(phrase-strings (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
("The override system" "is meant to deactivate" "the accelerator" "when" "the brake pedal" "is pressed")
Document categorization:
See opennlp.test.tools.train for better usage examples.
(def doccat (make-document-categorizer "my-doccat-model"))
(doccat "This is some good text")
"Happy"
The probabilities OpenNLP supplies for a given operation are available as metadata on the result, where applicable:
(meta (get-sentences "This is a sentence. This is also one."))
{:probabilities (0.9999054310803004 0.9941126097177366)}
(meta (tokenize "This is a sentence."))
{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}
(meta (pos-tag ["This" "is" "a" "sentence" "."]))
{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}
(meta (chunker (pos-tag ["This" "is" "a" "sentence" "."])))
{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}
(meta (name-find ["My" "name" "is" "John"]))
{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}
You can rebind opennlp.nlp/*beam-size* (the default is 3) for the pos-tagger and treebank-parser with:
(binding [*beam-size* 1]
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin")))
You can rebind opennlp.treebank/*advance-percentage* (the default is 0.95) for the treebank-parser with:
(binding [*advance-percentage* 0.80]
  (def parser (make-treebank-parser "parser-model/en-parser-chunking.bin")))
Note: Treebank parsing is very memory intensive; make sure your JVM has a sufficient amount of memory available (using something like -Xmx512m) or you will run out of heap space when using a treebank parser.
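If you build with Leiningen, one way to supply that option is :jvm-opts in project.clj. This is only a sketch; the project name and versions below are hypothetical, adjust them to your setup:

```clojure
;; project.clj (hypothetical project; pick versions appropriate for you)
(defproject my-nlp-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [clojure-opennlp "0.5.0"]]
  ;; give the JVM enough heap for treebank parsing
  :jvm-opts ["-Xmx512m"])
```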
Treebank parsing gets its own section due to how complex it is.
NOTE: There is no treebank-parser model included in the git repo; you will have to download it separately from the OpenNLP project.
Creating it:
(def treebank-parser (make-treebank-parser "parser-model/en-parser-chunking.bin"))
To use the treebank-parser, pass an array of sentences with their tokens separated by whitespace (preferably using tokenize):
(treebank-parser ["This is a sentence ."])
["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]
To make the treebank-parser strings into something a little easier for Clojure to deal with, use the (make-tree ...) function:
(make-tree (first (treebank-parser ["This is a sentence ."])))
{:chunk {:chunk ({:chunk {:chunk "This", :tag DT}, :tag NP} {:chunk ({:chunk "is", :tag VBZ} {:chunk ({:chunk "a", :tag DT} {:chunk "sentence", :tag NN}), :tag NP}), :tag VP} {:chunk ".", :tag .}), :tag S}, :tag TOP}
Here's the same data structure split into a more readable format:
{:tag TOP
 :chunk {:tag S
         :chunk ({:tag NP
                  :chunk {:tag DT
                          :chunk "This"}}
                 {:tag VP
                  :chunk ({:tag VBZ
                           :chunk "is"}
                          {:tag NP
                           :chunk ({:tag DT
                                    :chunk "a"}
                                   {:tag NN
                                    :chunk "sentence"})})}
                 {:tag .
                  :chunk "."})}}
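Since that structure is just nested maps and seqs, plain recursion gets the tokens back out. A minimal sketch (leaves is a hypothetical helper, not part of clojure-opennlp) that collects the leaf strings of such a tree, left to right:

```clojure
;; A make-tree style result, quoted so the tags read as symbols.
(def tree
  '{:tag TOP
    :chunk {:tag S
            :chunk ({:tag NP :chunk {:tag DT :chunk "This"}}
                    {:tag VP :chunk ({:tag VBZ :chunk "is"}
                                     {:tag NP :chunk ({:tag DT :chunk "a"}
                                                      {:tag NN :chunk "sentence"})})}
                    {:tag . :chunk "."})}})

(defn leaves
  "Walk a nested :chunk/:tag tree and return its leaf token strings in order."
  [node]
  (cond
    (string? node)     [node]
    (map? node)        (leaves (:chunk node))
    (sequential? node) (mapcat leaves node)))
```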
Hopefully that makes it a little clearer: nested maps. If anyone has any suggestions for a better way to represent this information, feel free to send me an email or a patch.
Treebank parsing is considered beta at this point.
(use 'opennlp.tools.filters)
(pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["car" "NN"]
 ["son" "NN"]
 ["Friday" "NNP"])
(pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])
(use 'opennlp.tools.filters)
(pprint (noun-phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed")))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"})
(pos-filter determiners #"^DT")
#'user/determiners
(doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
  Given a list of pos-tagged elements, return only the determiners in a list.
(pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
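Under the hood, a filter like determiners is just a regex match against the tag of each [token tag] pair. A hedged plain-function sketch (determiners* is a hypothetical stand-in, not the macro's literal expansion):

```clojure
(defn determiners*
  "Keep only the pos-tagged [token tag] pairs whose tag matches #\"^DT\"."
  [elements]
  (filter #(re-find #"^DT" (second %)) elements))

;; Filtering a small pos-tagged sample keeps just the DT pair.
(def result
  (determiners* [["Mr." "NNP"] ["gave" "VBD"] ["a" "DT"] ["his" "PRP$"]]))
```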
You can also create treebank-chunk filters using (chunk-filter ...):
(chunk-filter fragments #"^FRAG$")
(doc fragments)
-------------------------
opennlp.nlp/fragments
([elements__178__auto__])
  Given a list of treebank-chunked elements, return only the fragments in a list.
There are a few methods to help you be lazy when tagging text; depending on the operation desired, use the corresponding method:
#'opennlp.tools.lazy/lazy-get-sentences
#'opennlp.tools.lazy/lazy-tokenize
#'opennlp.tools.lazy/lazy-tag
#'opennlp.tools.lazy/lazy-chunk
#'opennlp.tools.lazy/sentence-seq
Here's how to use them:
(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.lazy)
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
(lazy-get-sentences ["This body of text has three sentences. This is the first. This is the third." "This body has only two. Here's the last one."] get-sentences)
;; will lazily return:
(["This body of text has three sentences." "This is the first." "This is the third."] ["This body has only two." "Here's the last one."])
(lazy-tokenize ["This is a sentence." "This is another sentence." "This is the third."] tokenize)
;; will lazily return:
(["This" "is" "a" "sentence" "."] ["This" "is" "another" "sentence" "."] ["This" "is" "the" "third" "."])
(lazy-tag ["This is a sentence." "This is another sentence."] tokenize pos-tag)
;; will lazily return:
((["This" "DT"] ["is" "VBZ"] ["a" "DT"] ["sentence" "NN"] ["." "."]) (["This" "DT"] ["is" "VBZ"] ["another" "DT"] ["sentence" "NN"] ["." "."]))
(lazy-chunk ["This is a sentence." "This is another sentence."] tokenize pos-tag chunker)
;; will lazily return:
(({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["a" "sentence"], :tag "NP"}) ({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["another" "sentence"], :tag "NP"}))
Feel free to use the lazy functions, but I'm still not 100% settled on the layout, so they may change in the future.
Generate a lazy sequence of sentences from a file using opennlp.tools.lazy/sentence-seq:
(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))
There is code to allow for training models for each of the tools. Please see the documentation in TRAINING.markdown
Copyright (C) 2010 Matthew Lee Hinman
Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.