LAC (Lexical Analysis of Chinese) is a joint lexical analysis tool developed by Baidu's Natural Language Processing Department. It performs Chinese word segmentation, part-of-speech tagging, and proper-noun recognition in a single pass. The tool has the following features and advantages:
This section mainly covers installation and usage from Python; usage from other languages is described later.
- Code compatible with both Python 2 and Python 3
- Fully automatic installation: `pip install lac`
- Semi-automatic installation: download the package from http://pypi.python.org/pypi/lac/, decompress it, and run `python setup.py install`
After installation, run `lac`, `lac --segonly`, or `lac --rank` on the command line to start the service for a quick hands-on experience.
Users in mainland China can install from the Baidu mirror for faster downloads:

```shell
pip install lac -i https://mirror.baidu.com/pypi/simple
```
```python
from LAC import LAC

# Load the segmentation model
lac = LAC(mode='seg')

# Single sample: the input is a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
seg_result = lac.run(text)

# Batch input: a list of sentences; the average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
seg_result = lac.run(texts)
```
Single sample: seg_result = [LAC, 是, 个, 优秀, 的, 分词, 工具]
Batch samples: seg_result = [[LAC, 是, 个, 优秀, 的, 分词, 工具], [百度, 是, 一家, 高科技, 公司]]
```python
from LAC import LAC

# Load the full lexical analysis (LAC) model
lac = LAC(mode='lac')

# Single sample: the input is a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
lac_result = lac.run(text)

# Batch input: a list of sentences; the average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
lac_result = lac.run(texts)
```
For each sentence, the output is its segmentation result `word_list` together with the tag of each word in `tags_list`, in the format `(word_list, tags_list)`:
Single sample: lac_result = ([百度, 是, 一家, 高科技, 公司], [ORG, v, m, n, n])
Batch samples: lac_result = [
    ([百度, 是, 一家, 高科技, 公司], [ORG, v, m, n, n]),
    ([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n])
]
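As an illustration of how a `(word_list, tags_list)` tuple can be consumed, the sketch below pairs each word with its tag. The data is copied verbatim from the documented example output, so this runs without the LAC model itself:

```python
# A LAC-style single-sample result, copied from the example output above
lac_result = (["百度", "是", "一家", "高科技", "公司"],
              ["ORG", "v", "m", "n", "n"])

word_list, tags_list = lac_result
pairs = list(zip(word_list, tags_list))
for word, tag in pairs:
    print(f"{word}/{tag}")  # e.g. 百度/ORG
```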
The set of part-of-speech and proper-noun category tags is as follows; the four most commonly used proper-noun categories are marked in uppercase:
Label | Meaning | Label | Meaning | Label | Meaning | Label | Meaning |
---|---|---|---|---|---|---|---|
n | common noun | f | directional noun | s | place noun | nw | work title |
nz | other proper noun | v | common verb | vd | verb used adverbially | vn | verbal noun |
a | adjective | ad | adjective used adverbially | an | adjective used nominally | d | adverb |
m | numeral | q | measure word | r | pronoun | p | preposition |
c | conjunction | u | particle | xc | other function word | w | punctuation |
PER | person name | LOC | place name | ORG | organization name | TIME | time |
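Because the four uppercase tags mark proper-noun categories, a simple post-processing step can pull named entities out of a tagged result. This is an illustrative sketch, not part of the LAC API; the input lists are taken from the example output above:

```python
# Uppercase tags for the four common proper-noun categories
PROPER_TAGS = {"PER", "LOC", "ORG", "TIME"}

word_list = ["百度", "是", "一家", "高科技", "公司"]
tags_list = ["ORG", "v", "m", "n", "n"]

# Keep only words whose tag is a proper-noun category
entities = [(w, t) for w, t in zip(word_list, tags_list) if t in PROPER_TAGS]
print(entities)  # [('百度', 'ORG')]
```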
```python
from LAC import LAC

# Load the word-importance (rank) model
lac = LAC(mode='rank')

# Single sample: the input is a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
rank_result = lac.run(text)

# Batch input: a list of sentences; the average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
rank_result = lac.run(texts)
```
Single sample: rank_result = [['LAC', '是', '个', '优秀', '的', '分词', '工具'],
    [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]]
Batch samples: rank_result = [
    (['LAC', '是', '个', '优秀', '的', '分词', '工具'],
     [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]),
    (['百度', '是', '一家', '高科技', '公司'],
     [ORG, v, m, n, n], [3, 0, 2, 3, 1])
]
The word-importance labels are graded on a four-level scale, as follows:
Label | Meaning | Common part-of-speech tags |
---|---|---|
0 | redundant words in the query | p, w, xc ... |
1 | weakly qualifying words in the query | r, c, u ... |
2 | strongly qualifying words in the query | n, s, v ... |
3 | core words in the query | nz, nw, LOC ... |
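A common use of the rank output is keyword extraction: keep only the words whose importance level is 3 (core words). The sketch below operates on the documented example output rather than calling the model:

```python
# Rank-mode output for the example sentence, copied from above
words = ['LAC', '是', '个', '优秀', '的', '分词', '工具']
ranks = [3, 0, 0, 2, 0, 3, 1]

# Words at importance level 3 are the query's core words
core_words = [w for w, r in zip(words, ranks) if r == 3]
print(core_words)  # ['LAC', '分词']
```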
Beyond the model output, LAC lets users configure customized segmentation results and proper-noun types. When a model prediction matches an item in the dictionary, the customized result replaces the original one. To make matching more precise, an item may be a long fragment composed of multiple consecutive words.

This feature is implemented by loading a dictionary file. Each line of the file is one customized item, consisting of a single word or several consecutive words; a '/' after a word introduces its label, and if the '/' label is omitted the model's default label is used. The more words an item contains, the more precise the intervention.
Dictionary file example
This example merely illustrates the results under various configurations. Wildcard support for dictionary configuration will be added in a future release, so stay tuned.
春天/SEASON
花/n 开/v
秋天的风
落 阳
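The dictionary file can also be generated programmatically. The sketch below writes the example entries above to a UTF-8 file; the filename `custom.txt` matches the one used in the loading example below, but any path works:

```python
# The customized items from the example above, one per line
entries = [
    "春天/SEASON",
    "花/n 开/v",
    "秋天的风",
    "落 阳",
]

with open("custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(entries) + "\n")
```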
```python
from LAC import LAC

lac = LAC()

# Load the customization dictionary; sep is the delimiter used in the
# dictionary file (None means whitespace, i.e. spaces or tabs '\t')
lac.load_customization('custom.txt', sep=None)

# Result after intervention
custom_result = lac.run(u"春天的花开秋天的风以及冬天的落阳")
```

Default output: 春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n
Output with the dictionary loaded: 春天/SEASON 的/u 花/n 开/v 秋天的风/n 以及/c 冬天/TIME 的/u 落/n 阳/n
LAC also provides an incremental training interface, so users can continue training with their own data. The data must first be converted into the model's input format, and all data files must be UTF-8 encoded:
Data sample
Consistent with most open-source word segmentation datasets, spaces are used as word boundary markers, as shown below:
LAC 是 个 优秀 的 分词 工具 。
百度 是 一家 高科技 公司 。
春天 的 花开 秋天 的 风 以及 冬天 的 落阳 。
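For instance, the sample lines above could be written to a UTF-8 training file as follows; the path `./data/seg_train.tsv` is only illustrative and matches the one used in the training example:

```python
import os

# Space-separated segmentation samples, in the format shown above
samples = [
    "LAC 是 个 优秀 的 分词 工具 。",
    "百度 是 一家 高科技 公司 。",
    "春天 的 花开 秋天 的 风 以及 冬天 的 落阳 。",
]

os.makedirs("./data", exist_ok=True)
with open("./data/seg_train.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(samples) + "\n")
```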
```python
from LAC import LAC

# Use the segmentation model
lac = LAC(mode='seg')

# Training and test datasets share the same format
train_file = "./data/seg_train.tsv"
test_file = "./data/seg_test.tsv"
lac.train(model_save_dir='./my_seg_model/',
          train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_seg_model')
```
Data sample

On top of the segmentation data, each word is annotated with its part of speech or entity category in the form "/type". Note that lexical analysis training currently only supports data whose label set is consistent with ours. Support for custom tag sets will be opened in the future, so stay tuned.
LAC/nz 是/v 个/q 优秀/a 的/u 分词/n 工具/n 。/w
百度/ORG 是/v 一家/m 高科技/n 公司/n 。/w
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n 。/w
```python
from LAC import LAC

# Use the default lexical analysis model
lac = LAC()

# Training and test datasets share the same format
train_file = "./data/lac_train.tsv"
test_file = "./data/lac_test.tsv"
lac.train(model_save_dir='./my_lac_model/',
          train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_lac_model')
```
```
.
├── python        # Python interface scripts
├── c++           # C++ interface code
├── java          # Java interface code
├── Android       # Android usage example
├── README.md     # this file
└── CMakeList.txt # build script for the C++ and Java interfaces
```
If you use LAC in your academic work, please add the following citation. We are delighted that LAC can help you in your research.
@article{jiao2018LAC,
title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
journal={arXiv preprint arXiv:1807.01882},
year={2018},
url={https://arxiv.org/abs/1807.01882}
}
We welcome developers to contribute code to LAC. If you develop new features or find bugs, please submit pull requests and issues on GitHub.