A library to interface with the OpenNLP (Open Natural Language Processing) library of functions. Not all functions are implemented yet.
Additional information/documentation:
Read the source documentation from Marginalia.
[clojure-opennlp "0.5.0"] ;; uses opennlp 1.9.0
clojure-opennlp works with Clojure 1.5+.
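For reference, a minimal Leiningen project.clj pulling in the coordinate above might look like the following sketch (the project name and Clojure version are placeholders, not part of this library):
(defproject my-nlp-app "0.1.0-SNAPSHOT"
  ;; hypothetical project; only the clojure-opennlp coordinate comes from this README
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [clojure-opennlp "0.5.0"]])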
(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)
(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here
You will need to make the processing functions using the model files. These assume you are running from the root project directory. You can also download the model files from the OpenNLP project at http://opennlp.sourceforge.net/models-1.5
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
The tool creators are multimethods, so you can also create any of the tools using a model instead of a filename (you can create a model with the training tools in src/opennlp/tools/train.clj):
(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc
Then, use the functions you've created to perform operations on some text:
Detecting sentences:
(pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence.", "Second sentence?", "Here is another one.",
 "And so on and so forth - you get the idea..."]
Tokenizing:
(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]
Detokenizing:
(detokenize ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"])
"Mr. Smith gave a car to his son on Friday."
Ideally, s == (detokenize (tokenize s)). The detokenization XML model file is a work in progress; please let me know if you run across anything that doesn't detokenize correctly in English.
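For example, a quick round-trip check with the functions defined above (a sketch; whether it returns true depends on the detokenizer model, which may not restore every string exactly):
(let [s "Mr. Smith gave a car to his son on Friday."]
  ;; ideally true, but the detokenizer model is still a work in progress
  (= s (detokenize (tokenize s))))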
Part-of-speech tagging:
(pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["gave" "VBD"]
 ["a" "DT"]
 ["car" "NN"]
 ["to" "TO"]
 ["his" "PRP$"]
 ["son" "NN"]
 ["on" "IN"]
 ["Friday." "NNP"])
Name finding:
(name-find (tokenize "My name is Lee, not John."))
("Lee" "John")
Treebank-chunking splits and tags phrases from a pos-tagged sentence. A notable difference is that it returns a list of structs with :phrase and :tag keys, as seen below:
(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["when"], :tag "ADVP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"}
 {:phrase ["is" "pressed"], :tag "VP"})
Just the phrases:
(phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
(["The" "override" "system"] ["is" "meant" "to" "deactivate"] ["the" "accelerator"] ["when"] ["the" "brake" "pedal"] ["is" "pressed"])
And with just the strings:
(phrase-strings (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
("The override system" "is meant to deactivate" "the accelerator" "when" "the brake pedal" "is pressed")
Document categorization:
See opennlp.test.tools.train for better usage examples.
(def doccat (make-document-categorizer "my-doccat-model"))
(doccat "This is some good text")
"Happy"
The probabilities OpenNLP supplies for a given operation are available as metadata on the result, where applicable:
(meta (get-sentences "This is a sentence. This is also one."))
{:probabilities (0.9999054310803004 0.9941126097177366)}
(meta (tokenize "This is a sentence."))
{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}
(meta (pos-tag ["This" "is" "a" "sentence" "."]))
{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}
(meta (chunker (pos-tag ["This" "is" "a" "sentence" "."])))
{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}
(meta (name-find ["My" "name" "is" "John"]))
{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}
You can rebind opennlp.nlp/*beam-size* (the default is 3) for the pos-tagger and treebank-parser with:
(binding [*beam-size* 1]
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin")))
You can rebind opennlp.treebank/*advance-percentage* (the default is 0.95) for the treebank-parser with:
(binding [*advance-percentage* 0.80]
  (def parser (make-treebank-parser "parser-model/en-parser-chunking.bin")))
Note: treebank parsing is very memory intensive; make sure your JVM has a sufficient amount of memory available (using something like -Xmx512m) or you will run out of heap space when using the treebank parser.
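With Leiningen, for example, the heap size could be raised in the hypothetical project.clj sketched near the top of this document (the value below is only a suggestion, not a requirement of the library):
;; in project.clj
:jvm-opts ["-Xmx512m"]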
Treebank parsing gets its own section due to how complex it is.
Note that none of the treebank-parser models are included in the git repo; you will have to download them separately from the OpenNLP project.
Creating it:
(def treebank-parser (make-treebank-parser "parser-model/en-parser-chunking.bin"))
To use the treebank-parser, pass in an array of sentences with their tokens separated by whitespace (preferably using tokenize):
(treebank-parser ["This is a sentence ."])
["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]
In order to transform the treebank-parser string into something a little easier for Clojure to work with, use the (make-tree ...) function:
(make-tree (first (treebank-parser ["This is a sentence ."])))
{:chunk {:chunk ({:chunk {:chunk "This", :tag DT}, :tag NP} {:chunk ({:chunk "is", :tag VBZ} {:chunk ({:chunk "a", :tag DT} {:chunk "sentence", :tag NN}), :tag NP}), :tag VP} {:chunk ".", :tag .}), :tag S}, :tag TOP}
Here's the same data structure split into a more readable format:
{:tag TOP
 :chunk {:tag S
         :chunk ({:tag NP
                  :chunk {:tag DT
                          :chunk "This"}}
                 {:tag VP
                  :chunk ({:tag VBZ
                           :chunk "is"}
                          {:tag NP
                           :chunk ({:tag DT
                                    :chunk "a"}
                                   {:tag NN
                                    :chunk "sentence"})})}
                 {:tag .
                  :chunk "."})}}
Hopefully that makes it a little clearer: it's just a nested map. If anyone has suggestions for better ways to represent this information, feel free to send me an email or a patch.
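If you just want the leaf tokens back out of that nested map, a small helper like the following works (tree-leaves is a hypothetical name, not part of the library; it relies on :chunk being either a string, a single map, or a sequence of maps, as shown above):
(defn tree-leaves
  "Collect the leaf strings of a make-tree result, left to right."
  [node]
  (let [c (:chunk node)]
    (cond
      (string? c) [c]                    ; leaf token
      (map? c) (tree-leaves c)           ; single child node
      :else (mapcat tree-leaves c))))    ; sequence of child nodes

(tree-leaves (make-tree (first (treebank-parser ["This is a sentence ."]))))
;; => ("This" "is" "a" "sentence" ".")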
Treebank parsing is considered experimental at this point.
Filtering pos-tagged sequences:
(use 'opennlp.tools.filters)
(pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["car" "NN"]
 ["son" "NN"]
 ["Friday" "NNP"])
(pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])
Filtering treebank-chunks:
(use 'opennlp.tools.filters)
(pprint (noun-phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed")))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"})
You can also create your own filters with the (pos-filter ...) macro:
(pos-filter determiners #"^DT")
#'user/determiners
(doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
Given a list of pos-tagged elements, return only the determiners in a list.
(pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
You can also create treebank-chunk filters using (chunk-filter ...):
(chunk-filter fragments #"^FRAG$")
(doc fragments)
-------------------------
opennlp.nlp/fragments
([elements__178__auto__])
Given a list of treebank-chunked elements, return only the fragments in a list.
There are some methods to help you be lazy when tagging; depending on the operation desired, use the corresponding method:
#'opennlp.tools.lazy/lazy-get-sentences
#'opennlp.tools.lazy/lazy-tokenize
#'opennlp.tools.lazy/lazy-tag
#'opennlp.tools.lazy/lazy-chunk
#'opennlp.tools.lazy/sentence-seq
Here's how to use them:
(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.lazy)
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
(lazy-get-sentences ["This body of text has three sentences. This is the first. This is the third." "This body has only two. Here's the last one."] get-sentences)
;; will lazily return:
(["This body of text has three sentences." "This is the first." "This is the third."] ["This body has only two." "Here's the last one."])
(lazy-tokenize ["This is a sentence." "This is another sentence." "This is the third."] tokenize)
;; will lazily return:
(["This" "is" "a" "sentence" "."] ["This" "is" "another" "sentence" "."] ["This" "is" "the" "third" "."])
(lazy-tag ["This is a sentence." "This is another sentence."] tokenize pos-tag)
;; will lazily return:
((["This" "DT"] ["is" "VBZ"] ["a" "DT"] ["sentence" "NN"] ["." "."]) (["This" "DT"] ["is" "VBZ"] ["another" "DT"] ["sentence" "NN"] ["." "."]))
(lazy-chunk ["This is a sentence." "This is another sentence."] tokenize pos-tag chunker)
;; will lazily return:
(({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["a" "sentence"], :tag "NP"}) ({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["another" "sentence"], :tag "NP"}))
Feel free to use the lazy functions, but I'm still not 100% set on the layout, so they may change in the future. (Maybe chaining them so that, instead of passing a sequence of sentences each time, it looks like (lazy-chunk (lazy-tag (lazy-tokenize (lazy-get-sentences ...)))).)
Generate a lazy sequence of sentences from a file using opennlp.tools.lazy/sentence-seq:
(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))
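The lazy helpers combine with sentence-seq as well; for example, a sketch (with /tmp/bigfile as a placeholder path) that pos-tags a large file one sentence at a time without holding it all in memory:
(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (doseq [tagged (lazy-tag (sentence-seq rdr get-sentences) tokenize pos-tag)]
    ;; each item is a seq of [token tag] pairs for one sentence
    (println tagged)))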
There is code to allow for training models for each of the tools. Please see the documentation in TRAINING.markdown.
Copyright (C) 2010 Matthew Lee Hinman
Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.