A library to interface with the OpenNLP (Open Natural Language Processing) library of functions. Not all functions are implemented yet.
Additional information/documentation:
Read the Marginalia source docs
[clojure-opennlp "0.5.0"] ;; uses OpenNLP 1.9.0
clojure-opennlp works with Clojure 1.5+.
(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)
(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here
You will need to make the processing functions using the model files. These assume you are running from the root project directory. You can also download the model files from the OpenNLP project at http://opennlp.sourceforge.net/models-1.5
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
The tool creators are multimethods, so you can also create any of the tools using a model instead of a filename (you can create a model with the training tools in src/opennlp/tools/train.clj):
(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc
Then, use the functions you have created to perform operations on text:
Detecting sentences:
(pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence.", "Second sentence?", "Here is another one.",
 "And so on and so forth - you get the idea..."]
Tokenizing:
(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]
Detokenizing:
(detokenize ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"])
"Mr. Smith gave a car to his son on Friday."
Ideally, s == (detokenize (tokenize s)). The detokenization model XML file is a work in progress; please let me know if you run into anything that doesn't detokenize correctly in English.
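The round-trip property above can be checked with a small sketch (assuming the tokenize and detokenize functions defined earlier; the sample sentence is just an illustration):

```clojure
;; Sketch: check the round-trip property s == (detokenize (tokenize s)).
;; Assumes tokenize and detokenize were created from the models as above.
(let [s "Mr. Smith gave a car to his son on Friday."]
  (= s (detokenize (tokenize s))))
;; should be true when the detokenizer model handles the sentence correctly
```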
Part-of-speech tagging:
(pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["gave" "VBD"]
 ["a" "DT"]
 ["car" "NN"]
 ["to" "TO"]
 ["his" "PRP$"]
 ["son" "NN"]
 ["on" "IN"]
 ["Friday." "NNP"])
Name finding:
(name-find (tokenize "My name is Lee, not John."))
("Lee" "John")
Treebank-chunking splits and tags phrases from a pos-tagged sentence. One notable difference is that it returns a list of structs with the :phrase and :tag keys, as seen below:
(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["when"], :tag "ADVP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"}
 {:phrase ["is" "pressed"], :tag "VP"})
For just the phrases:
(phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
(["The" "override" "system"] ["is" "meant" "to" "deactivate"] ["the" "accelerator"] ["when"] ["the" "brake" "pedal"] ["is" "pressed"])
And with just strings:
(phrase-strings (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
("The override system" "is meant to deactivate" "the accelerator" "when" "the brake pedal" "is pressed")
Document categorization:
See opennlp.test.tools.train for better usage examples.
(def doccat (make-document-categorizer "my-doccat-model"))
(doccat "This is some good text")
"Happy"
The probabilities OpenNLP supplies for a given operation are available as metadata on the result, where applicable:
(meta (get-sentences "This is a sentence. This is also one."))
{:probabilities (0.9999054310803004 0.9941126097177366)}
(meta (tokenize "This is a sentence."))
{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}
(meta (pos-tag ["This" "is" "a" "sentence" "."]))
{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}
(meta (chunker (pos-tag ["This" "is" "a" "sentence" "."])))
{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}
(meta (name-find ["My" "name" "is" "John"]))
{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}
You can rebind opennlp.nlp/*beam-size* (the default is 3) for the pos-tagger and treebank-parser with:
(binding [*beam-size* 1]
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin")))
You can rebind opennlp.treebank/*advance-percentage* (the default is 0.95) for the treebank-parser with:
(binding [*advance-percentage* 0.80]
  (def parser (make-treebank-parser "parser-model/en-parser-chunking.bin")))
Note: Treebank parsing is very memory intensive; make sure your JVM has a sufficient amount of memory available (using something like -Xmx512m) or you will run out of heap space when using a treebank parser.
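For example, in a Leiningen project the heap size can be set via :jvm-opts in project.clj. This is just a sketch (the project name and dependency versions are placeholders; -Xmx512m matches the note above, adjust to taste):

```clojure
;; project.clj -- give the JVM more heap for treebank parsing.
;; "my-nlp-project" is a hypothetical project name.
(defproject my-nlp-project "0.1.0"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [clojure-opennlp "0.5.0"]]
  :jvm-opts ["-Xmx512m"])
```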
Treebank parsing gets its own section due to how complex it is.
Note: none of the treebank-parser models are included in the git repo; you will have to download them separately from the OpenNLP project.
Creating it:
(def treebank-parser (make-treebank-parser "parser-model/en-parser-chunking.bin"))
To use the treebank-parser, pass an array of sentences with their tokens separated by whitespace (preferably using tokenize):
(treebank-parser ["This is a sentence ."])
["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]
To transform the treebank-parser string into something a little easier for Clojure to work with, use the (make-tree ...) function:
(make-tree (first (treebank-parser ["This is a sentence ."])))
{:chunk {:chunk ({:chunk {:chunk "This", :tag DT}, :tag NP} {:chunk ({:chunk "is", :tag VBZ} {:chunk ({:chunk "a", :tag DT} {:chunk "sentence", :tag NN}), :tag NP}), :tag VP} {:chunk ".", :tag .}), :tag S}, :tag TOP}
Here is the datastructure split into a more readable format:
{ :tag TOP
:chunk { :tag S
:chunk ({ :tag NP
:chunk { :tag DT
:chunk " This " }}
{ :tag VP
:chunk ({ :tag VBZ
:chunk " is " }
{ :tag NP
:chunk ({ :tag DT
:chunk " a " }
{ :tag NN
:chunk " sentence " })})}
{ :tag .
:chunk " . " })}}
Hopefully that makes it a little clearer: it's a nested map. If anyone has any suggestions for better ways to represent this information, feel free to send me an email or a patch.
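As an illustration of working with that shape, a small helper (hypothetical, not part of the library; it relies only on the :chunk/:tag map structure shown above) can walk the nested map and collect the leaf tokens:

```clojure
;; Sketch: collect the leaf strings from a make-tree result.
;; `leaves` is a hypothetical helper, not part of clojure-opennlp.
(defn leaves [node]
  (let [c (:chunk node)]
    (cond
      (string? c) [c]                   ; a leaf chunk holds the token itself
      (map? c)    (leaves c)            ; a single nested chunk
      :else       (mapcat leaves c))))  ; a seq of nested chunks

(leaves (make-tree (first (treebank-parser ["This is a sentence ."]))))
;; => ("This" "is" "a" "sentence" ".")
```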
Treebank parsing is considered beta at this point.
(use 'opennlp.tools.filters)
(pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["car" "NN"]
 ["son" "NN"]
 ["Friday" "NNP"])
(pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])
(use 'opennlp.tools.filters)
(pprint (noun-phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed")))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"})
(pos-filter determiners #"^DT")
#'user/determiners
(doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
Given a list of pos-tagged elements, return only the determiners in a list.
(pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
You can also create treebank-chunk filters using (chunk-filter ...):
(chunk-filter fragments #"^FRAG$")
(doc fragments)
-------------------------
opennlp.nlp/fragments
([elements__178__auto__])
Given a list of treebank-chunked elements, return only the fragments in a list.
There are some methods to help you be lazy when tagging methods; depending on the operation desired, use the corresponding method:
#'opennlp.tools.lazy/lazy-get-sentences
#'opennlp.tools.lazy/lazy-tokenize
#'opennlp.tools.lazy/lazy-tag
#'opennlp.tools.lazy/lazy-chunk
#'opennlp.tools.lazy/sentence-seq
Here's how to use them:
(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.lazy)
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))
(lazy-get-sentences ["This body of text has three sentences. This is the first. This is the third." "This body has only two. Here's the last one."] get-sentences)
;; will lazily return:
(["This body of text has three sentences." "This is the first." "This is the third."] ["This body has only two." "Here's the last one."])
(lazy-tokenize ["This is a sentence." "This is another sentence." "This is the third."] tokenize)
;; will lazily return:
(["This" "is" "a" "sentence" "."] ["This" "is" "another" "sentence" "."] ["This" "is" "the" "third" "."])
(lazy-tag ["This is a sentence." "This is another sentence."] tokenize pos-tag)
;; will lazily return:
((["This" "DT"] ["is" "VBZ"] ["a" "DT"] ["sentence" "NN"] ["." "."]) (["This" "DT"] ["is" "VBZ"] ["another" "DT"] ["sentence" "NN"] ["." "."]))
(lazy-chunk ["This is a sentence." "This is another sentence."] tokenize pos-tag chunker)
;; will lazily return:
(({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["a" "sentence"], :tag "NP"}) ({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["another" "sentence"], :tag "NP"}))
Feel free to use the lazy functions, but I'm not 100% settled on the design yet, so they may change in the future. (Maybe chaining them, so instead of taking a seq of sentences, it would look like (lazy-chunk (lazy-tag (lazy-tokenize (lazy-get-sentences ...)))).)
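In the meantime, a similar chained, lazy pipeline can be sketched with Clojure's own lazy map over the eager tools defined earlier (the text string here is just a placeholder input):

```clojure
;; Sketch: a lazily-evaluated per-sentence pipeline using plain `map`,
;; an alternative to the lazy-* helpers while their design settles.
;; Assumes get-sentences, tokenize, pos-tag and chunker as defined above.
(let [text "This is a sentence. This is another sentence."]
  (->> (get-sentences text)
       (map tokenize)   ; lazily tokenize each sentence
       (map pos-tag)    ; lazily pos-tag each token seq
       (map chunker)))  ; lazily chunk each tagged sentence
```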
Generating a lazy sequence of sentences from a file using opennlp.tools.lazy/sentence-seq:
(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))
There is code to allow for training models for each of the tools. Please see the documentation in TRAINING.markdown.
Copyright (c) 2010 Matthew Lee Hinman
Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.