
Clojure library interface to OpenNLP - https://opennlp.apache.org/

A library to interface with the OpenNLP (Open Natural Language Processing) library of functions. Not all functions are implemented yet.

Additional information/documentation:

Natural Language Processing in Clojure with clojure-opennlp
Context searching using clojure-opennlp

Read the source from Marginalia:

http://dakrone.github.com/clojure-opennlp/

Known Issues

When using the treebank-chunker on a sentence, make sure the sentence ends with a period. Without one, the chunker gets confused and drops the last word. Besides, if your text is grammatically correct, all your sentences should end with punctuation anyway, right?

Usage from Leiningen:

[clojure-opennlp "0.5.0"] ;; uses OpenNLP 1.9.0

clojure-opennlp works with Clojure 1.5+
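For readers new to Leiningen, the dependency vector above goes inside the :dependencies key of project.clj. The fragment below is only a sketch: the project name, project version, and Clojure version are placeholders, not values taken from this project.

```clojure
;; project.clj (my-nlp-app and the Clojure version are placeholders)
(defproject my-nlp-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [clojure-opennlp "0.5.0"]])
```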

Basic example usage (from a REPL):

(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)
(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here

You will need to make the processing functions using the model files. The examples assume you are running from the root project directory. You can also download the model files from the OpenNLP project at http://opennlp.sourceforge.net/models-1.5

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

The tool-creators are multimethods, so you can also create any of the tools using a model instead of a filename (you can create a model with the training tools in src/opennlp/tools/train.clj):

(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc
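To illustrate what "the tool-creators are multimethods" means in practice, here is a minimal sketch of a creator that dispatches on the class of its argument. FakeModel and make-fake-tokenizer are hypothetical names invented for this example, and the whitespace split stands in for real tokenization; this is not the library's implementation.

```clojure
(require '[clojure.string :as str])

;; FakeModel and make-fake-tokenizer are hypothetical stand-ins,
;; not part of clojure-opennlp.
(defrecord FakeModel [source])

(defmulti make-fake-tokenizer
  "Builds a tokenizer function from a file name or from a loaded model."
  class)

(defmethod make-fake-tokenizer String
  [filename]
  ;; A real tool would read the model file here; this just records the name.
  (make-fake-tokenizer (->FakeModel filename)))

(defmethod make-fake-tokenizer FakeModel
  [model]
  ;; The returned function closes over the model, like the tools above.
  (fn [text] (str/split text #"\s+")))

((make-fake-tokenizer "models/en-token.bin") "Mr. Smith gave a car")
;; => ["Mr." "Smith" "gave" "a" "car"]
```

Either entry point ends up in the model-based method, so callers can pass whichever they have at hand.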

Then, use the functions you've created to perform operations on text:

Detecting sentences:

(pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence.", "Second sentence?", "Here is another one.",
 "And so on and so forth - you get the idea..."]

Tokenizing:

(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]

Detokenizing:

(detokenize ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"])
"Mr. Smith gave a car to his son on Friday."

Ideally, s == (detokenize (tokenize s)). The detokenization model XML file is a work in progress; please let me know if you run into something that doesn't detokenize correctly in English.

Part-of-speech tagging:

(pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["gave" "VBD"]
 ["a" "DT"]
 ["car" "NN"]
 ["to" "TO"]
 ["his" "PRP$"]
 ["son" "NN"]
 ["on" "IN"]
 ["Friday." "NNP"])
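Since the tagger result is a sequence of [token tag] pairs, ordinary sequence functions and destructuring are enough to work with it. The sketch below uses a hand-written copy of part of the output above rather than a live tagger; tokens-with-tag and tag-frequencies are names made up for this example.

```clojure
;; Hand-copied sample of pos-tag output, so no model file is needed.
(def tagged
  [["Mr." "NNP"] ["Smith" "NNP"] ["gave" "VBD"] ["a" "DT"] ["car" "NN"]])

(defn tokens-with-tag
  "Returns the tokens whose tag equals the given tag string."
  [tag tagged]
  (for [[token t] tagged :when (= t tag)] token))

(defn tag-frequencies
  "Counts how often each tag occurs."
  [tagged]
  (frequencies (map second tagged)))

(tokens-with-tag "NNP" tagged)
;; => ("Mr." "Smith")

(tag-frequencies tagged)
;; => {"NNP" 2, "VBD" 1, "DT" 1, "NN" 1}
```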

Name finding:

(name-find (tokenize "My name is Lee, not John."))
("Lee" "John")

Treebank-chunking splits and tags phrases from a pos-tagged sentence. A notable difference is that it returns a list of structs with the :phrase and :tag keys, as seen below:

(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["when"], :tag "ADVP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"}
 {:phrase ["is" "pressed"], :tag "VP"})

For just the phrases:

(phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
(["The" "override" "system"] ["is" "meant" "to" "deactivate"] ["the" "accelerator"] ["when"] ["the" "brake" "pedal"] ["is" "pressed"])

And with just strings:

(phrase-strings (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
("The override system" "is meant to deactivate" "the accelerator" "when" "the brake pedal" "is pressed")
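Because each chunk is a plain map with :phrase and :tag keys, the same kind of result can be derived by hand. The following sketch works on a hand-copied sample of the chunker output; phrases-tagged is a hypothetical helper, not a function from the library.

```clojure
(require '[clojure.string :as str])

;; Hand-copied sample of chunker output, so no model file is needed.
(def chunks
  [{:phrase ["The" "override" "system"] :tag "NP"}
   {:phrase ["is" "meant" "to" "deactivate"] :tag "VP"}
   {:phrase ["the" "accelerator"] :tag "NP"}])

(defn phrases-tagged
  "Returns the phrases carrying the given tag, each joined into one string."
  [tag chunks]
  (->> chunks
       (filter #(= tag (:tag %)))
       (map #(str/join " " (:phrase %)))))

(phrases-tagged "NP" chunks)
;; => ("The override system" "the accelerator")
```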

Document categorization:

See opennlp.test.tools.train for better usage examples.

(def doccat (make-document-categorizer "my-doccat-model"))

(doccat "This is some good text")
"Happy"

Probabilities of confidence

The probabilities OpenNLP supplies for a given operation are available as metadata on the result, where applicable:

(meta (get-sentences "This is a sentence. This is also one."))
{:probabilities (0.9999054310803004 0.9941126097177366)}

(meta (tokenize "This is a sentence."))
{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}

(meta (pos-tag ["This" "is" "a" "sentence" "."]))
{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}

(meta (chunker (pos-tag ["This" "is" "a" "sentence" "."])))
{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}

(meta (name-find ["My" "name" "is" "John"]))
{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}
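One way to use these numbers is to pair each item with its probability and flag the uncertain ones. The sketch below builds a stand-in result with with-meta instead of calling a real tool; with-probabilities and low-confidence are hypothetical helper names. Note that Clojure metadata does not survive most transformations, so read it from the original result before mapping or filtering.

```clojure
;; Stand-in for a tokenizer result: a vector carrying :probabilities metadata.
(def tokens
  (with-meta ["This" "is" "a" "sentence" "."]
    {:probabilities [1.0 1.0 1.0 0.9956236737394807 1.0]}))

(defn with-probabilities
  "Pairs each item of a result with the probability at the same position."
  [result]
  (map vector result (:probabilities (meta result))))

(defn low-confidence
  "Returns the items whose probability is below the threshold."
  [threshold result]
  (for [[item p] (with-probabilities result) :when (< p threshold)] item))

(low-confidence 0.999 tokens)
;; => ("sentence")

;; The lazy seq returned by map no longer carries the metadata.
(meta (map identity tokens))
;; => nil
```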

Beam Size

You can rebind opennlp.nlp/*beam-size* (the default is 3) for the pos-tagger and treebank-parser with:

(binding [*beam-size* 1]
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin")))
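Wrapping the def in binding only helps if the value is read while the tool is being created. The sketch below shows that general mechanism with a hypothetical dynamic var and a hypothetical make-fake-tagger; it assumes, rather than confirms, that the library reads its var at creation time.

```clojure
;; Hypothetical stand-in for opennlp.nlp/*beam-size*.
(def ^:dynamic *beam-size* 3)

(defn make-fake-tagger
  "Returns a function that reports the beam size captured at creation time."
  []
  (let [beam *beam-size*] ; read once, while any binding is still in effect
    (fn [] beam)))

(def default-tagger (make-fake-tagger))

(def narrow-tagger
  (binding [*beam-size* 1]
    (make-fake-tagger)))

(default-tagger) ;; => 3
(narrow-tagger)  ;; => 1, even though the binding has ended
```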

Advance Percentage

You can rebind opennlp.treebank/*advance-percentage* (the default is 0.95) for the treebank-parser with:

(binding [*advance-percentage* 0.80]
  (def parser (make-treebank-parser "parser-model/en-parser-chunking.bin")))

Treebank-parsing

Note: Treebank parsing is very memory intensive. Make sure your JVM has a sufficient amount of memory available (using something like -Xmx512m), or you will run out of heap space when using a treebank parser.

Treebank parsing gets its own section due to how complex it is.

Note: no treebank-parser model is included in the git repo; you will have to download it separately from the OpenNLP project.
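If the project is run through Leiningen, one way to pass a heap setting to the JVM is the :jvm-opts key in project.clj. This is a sketch with a placeholder project name and Clojure version; the right heap size depends on the parser model you load.

```clojure
;; project.clj (my-nlp-app and the Clojure version are placeholders)
(defproject my-nlp-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [clojure-opennlp "0.5.0"]]
  ;; Raise the maximum heap for memory-hungry treebank parsing.
  :jvm-opts ["-Xmx512m"])
```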

Creating it:

(def treebank-parser (make-treebank-parser "parser-model/en-parser-chunking.bin"))

To use the treebank-parser, pass an array of sentences with their tokens separated by whitespace (preferably using tokenize):

(treebank-parser ["This is a sentence ."])
["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]

In order to transform the treebank-parser string into something a little easier for Clojure to work with, use the (make-tree ...) function:

(make-tree (first (treebank-parser ["This is a sentence ."])))
{:chunk {:chunk ({:chunk {:chunk "This", :tag DT}, :tag NP} {:chunk ({:chunk "is", :tag VBZ} {:chunk ({:chunk "a", :tag DT} {:chunk "sentence", :tag NN}), :tag NP}), :tag VP} {:chunk ".", :tag .}), :tag S}, :tag TOP}

Here's the data structure split into a little more readable format:

{:tag TOP
 :chunk {:tag S
         :chunk ({:tag NP
                  :chunk {:tag DT
                          :chunk "This"}}
                 {:tag VP
                  :chunk ({:tag VBZ
                           :chunk "is"}
                          {:tag NP
                           :chunk ({:tag DT
                                    :chunk "a"}
                                   {:tag NN
                                    :chunk "sentence"})})}
                 {:tag .
                  :chunk "."})}}

Hopefully that makes it a little clearer: it is a nested map. If anyone has suggestions for better ways to represent this information, feel free to send me an email or a patch.
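A nested map like this can be walked with a small recursive function. The sketch below assumes the shape shown above: :chunk holds either a string (a leaf), a single map, or a sequence of maps, and tags are symbols as the printed output suggests. The sample tree is hand-written and trimmed (the final "." node is left out); leaves and tags are hypothetical helper names.

```clojure
;; Hand-written, trimmed copy of the make-tree output above.
(def tree
  '{:tag TOP
    :chunk {:tag S
            :chunk ({:tag NP :chunk {:tag DT :chunk "This"}}
                    {:tag VP
                     :chunk ({:tag VBZ :chunk "is"}
                             {:tag NP
                              :chunk ({:tag DT :chunk "a"}
                                      {:tag NN :chunk "sentence"})})})}})

(defn leaves
  "Collects the leaf strings of the tree, left to right."
  [node]
  (cond
    (string? node)     [node]
    (map? node)        (leaves (:chunk node))
    (sequential? node) (mapcat leaves node)
    :else              []))

(defn tags
  "Collects every :tag in depth-first order."
  [node]
  (cond
    (map? node)        (cons (:tag node) (tags (:chunk node)))
    (sequential? node) (mapcat tags node)
    :else              []))

(leaves tree)
;; => ("This" "is" "a" "sentence")

(tags tree)
;; => (TOP S NP DT VP VBZ NP DT NN)
```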

Treebank parsing is considered beta at this point.

Filters

Filtering pos-tagged sequences

(use 'opennlp.tools.filters)

(pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["car" "NN"]
 ["son" "NN"]
 ["Friday" "NNP"])

(pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])

Filtering treebank-chunks

(use 'opennlp.tools.filters)

(pprint (noun-phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed")))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"})

Creating your own filters:

(pos-filter determiners #"^DT")
#'user/determiners
(doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
  Given a list of pos-tagged elements, return only the determiners in a list.

(pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
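Conceptually, a filter like this keeps the pairs whose tag matches a regular expression. The following plain-function sketch mimics that behaviour on hand-written data; tag-filter is a hypothetical name and this is not the pos-filter macro itself.

```clojure
(defn tag-filter
  "Returns a function that keeps only the [token tag] pairs whose tag matches re."
  [re]
  (fn [tagged]
    (filter (fn [[_ tag]] (re-find re tag)) tagged)))

(def only-determiners (tag-filter #"^DT"))
(def only-nouns (tag-filter #"^NN"))

(def tagged-sample
  [["a" "DT"] ["car" "NN"] ["to" "TO"] ["Smith" "NNP"]])

(only-determiners tagged-sample)
;; => (["a" "DT"])

(only-nouns tagged-sample)
;; => (["car" "NN"] ["Smith" "NNP"])
```

Anchoring the pattern with ^ but not $ lets one filter cover a family of tags, for example NN and NNP.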

You can also create treebank-chunk filters using (chunk-filter ...)

(chunk-filter fragments #"^FRAG$")

(doc fragments)
-------------------------
opennlp.nlp/fragments
([elements__178__auto__])
  Given a list of treebank-chunked elements, return only the fragments in a list.

Being Lazy

There are some methods to help you be lazy when tagging. Depending on the operation desired, use the corresponding method:

#'opennlp.tools.lazy/lazy-get-sentences
#'opennlp.tools.lazy/lazy-tokenize
#'opennlp.tools.lazy/lazy-tag
#'opennlp.tools.lazy/lazy-chunk
#'opennlp.tools.lazy/sentence-seq

Here's how to use them:

(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.lazy)

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

(lazy-get-sentences ["This body of text has three sentences. This is the first. This is the third." "This body has only two. Here's the last one."] get-sentences)
;; will lazily return:
(["This body of text has three sentences." "This is the first." "This is the third."] ["This body has only two." "Here's the last one."])

(lazy-tokenize ["This is a sentence." "This is another sentence." "This is the third."] tokenize)
;; will lazily return:
(["This" "is" "a" "sentence" "."] ["This" "is" "another" "sentence" "."] ["This" "is" "the" "third" "."])

(lazy-tag ["This is a sentence." "This is another sentence."] tokenize pos-tag)
;; will lazily return:
((["This" "DT"] ["is" "VBZ"] ["a" "DT"] ["sentence" "NN"] ["." "."]) (["This" "DT"] ["is" "VBZ"] ["another" "DT"] ["sentence" "NN"] ["." "."]))

(lazy-chunk ["This is a sentence." "This is another sentence."] tokenize pos-tag chunker)
;; will lazily return:
(({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["a" "sentence"], :tag "NP"}) ({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["another" "sentence"], :tag "NP"}))
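The practical benefit of laziness is that work happens only for the elements you actually consume. The sketch below demonstrates that idea with Clojure's own map and a whitespace-splitting stand-in for a tokenizer; it assumes the lazy helpers behave roughly like a lazy map over their input, which is an assumption and not a description of their implementation.

```clojure
(require '[clojure.string :as str])

(defn fake-tokenize
  "Whitespace-splitting stand-in for a real tokenizer function."
  [sentence]
  (str/split sentence #"\s+"))

;; An endless supply of input: this would never finish if processed eagerly.
(def endless-sentences (repeat "This is a sentence ."))

;; map is lazy, so nothing is tokenized until elements are requested.
(def token-seqs (map fake-tokenize endless-sentences))

(take 2 token-seqs)
;; => (["This" "is" "a" "sentence" "."] ["This" "is" "a" "sentence" "."])
```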

Feel free to use the lazy functions, but I'm still not 100% set on the layout, so they may change in the future. (Maybe chaining them, so that instead of a sequence of sentences it looks like (lazy-chunk (lazy-tag (lazy-tokenize (lazy-get-sentences ...))))).

Generating a lazy sequence of sentences from a file using opennlp.tools.lazy/sentence-seq:

(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))

Training

There is code to allow for training models for each of the tools. Please see the documentation in TRAINING.markdown

License

Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.

Contributors

Rob Zinkov - Zaxtax
Alexandre Patry - Apatry

TODO

~~Add a method to generate a lazy sequence of sentences from a file~~ (done!)
~~Detokenizer~~ (still more work to do, but it works for now)
Do something with parsing for Treebank parse trees
~~Split up the treebank stuff into its own namespace~~ (done!)
~~Treebank chunker~~ (done!)
~~Treebank parser~~ (done!)
~~Laziness~~ (done! for now)
Treebank linker (WIP)
~~Phrase helpers for the chunker~~ (done!)
~~Figure out what license to use~~ (done!)
Filters for the treebank-parser
Return multiple probability results for the treebank-parser
~~Explore including probability numbers~~ (probability numbers added as metadata)
~~Model training/trainer~~ (done!)
Revisit the data structure format for tagged sentences
Document the *beam-size* functionality
Document the *advance-percentage* functionality
Build a full test suite: ~~core tools~~ (done), ~~filters~~ (done), ~~laziness~~ (done), training (pretty much done except for tagging)