LAC (Lexical Analysis of Chinese) is a joint lexical analysis tool developed by Baidu's Natural Language Processing Department. It performs Chinese word segmentation, part-of-speech tagging, and proper-noun recognition in a single pass. The tool has the following features and advantages:
This section mainly covers installation and usage from Python; usage from other languages is described later.
- Code compatible with both Python 2 and Python 3
- Fully automatic installation: `pip install lac`
- Semi-automatic installation: download the package from http://pypi.python.org/pypi/lac/, decompress it, and run `python setup.py install`
After installation, run `lac`, `lac --segonly`, or `lac --rank` on the command line to start the service for a quick hands-on experience.
Users in mainland China can install from the Baidu mirror for faster downloads:

```shell
pip install lac -i https://mirror.baidu.com/pypi/simple
```
```python
from LAC import LAC

# Load the segmentation model
lac = LAC(mode='seg')

# Single sample: the input is a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
seg_result = lac.run(text)

# Batch input: a list of sentences; the average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
seg_result = lac.run(texts)
```
Single sample: seg_result = [LAC, 是, 个, 优秀, 的, 分词, 工具]
Batch samples: seg_result = [[LAC, 是, 个, 优秀, 的, 分词, 工具], [百度, 是, 一家, 高科技, 公司]]
```python
from LAC import LAC

# Load the full lexical analysis (LAC) model
lac = LAC(mode='lac')

# Single sample: the input is a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
lac_result = lac.run(text)

# Batch input: a list of sentences; the average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
lac_result = lac.run(texts)
```
For each sentence, the output is its segmentation result `word_list` together with the tag of each word in `tags_list`, in the format `(word_list, tags_list)`:
Single sample: lac_result = ([百度, 是, 一家, 高科技, 公司], [ORG, v, m, n, n])
Batch samples: lac_result = [
    ([百度, 是, 一家, 高科技, 公司], [ORG, v, m, n, n]),
    ([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n])
]
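As an illustration of how a `(word_list, tags_list)` tuple can be consumed, the sketch below pairs each word with its tag. The data is copied verbatim from the documented example output, so this runs without the LAC model itself:

```python
# A LAC-style single-sample result, copied from the example output above
lac_result = (["百度", "是", "一家", "高科技", "公司"],
              ["ORG", "v", "m", "n", "n"])

word_list, tags_list = lac_result
pairs = list(zip(word_list, tags_list))
for word, tag in pairs:
    print(f"{word}/{tag}")  # e.g. 百度/ORG
```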
The set of part-of-speech and proper-noun category tags is as follows; the four most commonly used proper-noun categories are marked in uppercase:
Label | Meaning | Label | Meaning | Label | Meaning | Label | Meaning |
---|---|---|---|---|---|---|---|
n | common noun | f | directional noun | s | place noun | nw | work title |
nz | other proper noun | v | common verb | vd | verb used adverbially | vn | verbal noun |
a | adjective | ad | adjective used adverbially | an | adjective used nominally | d | adverb |
m | numeral | q | measure word | r | pronoun | p | preposition |
c | conjunction | u | particle | xc | other function word | w | punctuation |
PER | person name | LOC | place name | ORG | organization name | TIME | time |
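Because the four uppercase tags mark proper-noun categories, a simple post-processing step can pull named entities out of a tagged result. This is an illustrative sketch, not part of the LAC API; the input lists are taken from the example output above:

```python
# Uppercase tags for the four common proper-noun categories
PROPER_TAGS = {"PER", "LOC", "ORG", "TIME"}

word_list = ["百度", "是", "一家", "高科技", "公司"]
tags_list = ["ORG", "v", "m", "n", "n"]

# Keep only words whose tag is a proper-noun category
entities = [(w, t) for w, t in zip(word_list, tags_list) if t in PROPER_TAGS]
print(entities)  # [('百度', 'ORG')]
```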
```python
from LAC import LAC

# Load the word-importance (rank) model
lac = LAC(mode='rank')

# Single sample: the input is a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
rank_result = lac.run(text)

# Batch input: a list of sentences; the average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
rank_result = lac.run(texts)
```
Single sample: rank_result = [['LAC', '是', '个', '优秀', '的', '分词', '工具'],
    [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]]
Batch samples: rank_result = [
    (['LAC', '是', '个', '优秀', '的', '分词', '工具'],
     [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]),
    (['百度', '是', '一家', '高科技', '公司'],
     [ORG, v, m, n, n], [3, 0, 2, 3, 1])
]
The word-importance labels are graded on a four-level scale, as follows:
Label | Meaning | Common part-of-speech tags |
---|---|---|
0 | redundant words in the query | p, w, xc ... |
1 | weakly qualifying words in the query | r, c, u ... |
2 | strongly qualifying words in the query | n, s, v ... |
3 | core words in the query | nz, nw, LOC ... |
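A common use of the rank output is keyword extraction: keep only the words whose importance level is 3 (core words). The sketch below operates on the documented example output rather than calling the model:

```python
# Rank-mode output for the example sentence, copied from above
words = ['LAC', '是', '个', '优秀', '的', '分词', '工具']
ranks = [3, 0, 0, 2, 0, 3, 1]

# Words at importance level 3 are the query's core words
core_words = [w for w, r in zip(words, ranks) if r == 3]
print(core_words)  # ['LAC', '分词']
```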
Beyond the model output, LAC lets users configure customized segmentation results and proper-noun types. When a model prediction matches an item in the dictionary, the customized result replaces the original one. To make matching more precise, an item may be a long fragment composed of multiple consecutive words.

This feature is implemented by loading a dictionary file. Each line of the file is one customized item, consisting of a single word or several consecutive words; a '/' after a word introduces its label, and if the '/' label is omitted the model's default label is used. The more words an item contains, the more precise the intervention.
Dictionary file example
This example merely illustrates the results under various configurations. Wildcard support for dictionary configuration will be added in a future release, so stay tuned.
春天/SEASON
花/n 开/v
秋天的风
落 阳
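The dictionary file can also be generated programmatically. The sketch below writes the example entries above to a UTF-8 file; the filename `custom.txt` matches the one used in the loading example below, but any path works:

```python
# The customized items from the example above, one per line
entries = [
    "春天/SEASON",
    "花/n 开/v",
    "秋天的风",
    "落 阳",
]

with open("custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(entries) + "\n")
```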
```python
from LAC import LAC

lac = LAC()

# Load the customization dictionary; sep is the delimiter used in the
# dictionary file (None means whitespace, i.e. spaces or tabs '\t')
lac.load_customization('custom.txt', sep=None)

# Result after intervention
custom_result = lac.run(u"春天的花开秋天的风以及冬天的落阳")
```

Default output: 春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n
Output with the dictionary loaded: 春天/SEASON 的/u 花/n 开/v 秋天的风/n 以及/c 冬天/TIME 的/u 落/n 阳/n
LAC also provides an incremental training interface, so users can continue training with their own data. The data must first be converted into the model's input format, and all data files must be UTF-8 encoded:
Data sample
Consistent with most open-source word segmentation datasets, spaces are used as word boundary markers, as shown below:
LAC 是 个 优秀 的 分词 工具 。
百度 是 一家 高科技 公司 。
春天 的 花开 秋天 的 风 以及 冬天 的 落阳 。
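For instance, the sample lines above could be written to a UTF-8 training file as follows; the path `./data/seg_train.tsv` is only illustrative and matches the one used in the training example:

```python
import os

# Space-separated segmentation samples, in the format shown above
samples = [
    "LAC 是 个 优秀 的 分词 工具 。",
    "百度 是 一家 高科技 公司 。",
    "春天 的 花开 秋天 的 风 以及 冬天 的 落阳 。",
]

os.makedirs("./data", exist_ok=True)
with open("./data/seg_train.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(samples) + "\n")
```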
```python
from LAC import LAC

# Use the segmentation model
lac = LAC(mode='seg')

# Training and test datasets share the same format
train_file = "./data/seg_train.tsv"
test_file = "./data/seg_test.tsv"
lac.train(model_save_dir='./my_seg_model/',
          train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_seg_model')
```
Data sample

On top of the segmentation data, each word is annotated with its part of speech or entity category in the form "/type". Note that lexical analysis training currently only supports data whose label set is consistent with ours. Support for custom tag sets will be opened in the future, so stay tuned.
LAC/nz 是/v 个/q 优秀/a 的/u 分词/n 工具/n 。/w
百度/ORG 是/v 一家/m 高科技/n 公司/n 。/w
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n 。/w
```python
from LAC import LAC

# Use the default lexical analysis model
lac = LAC()

# Training and test datasets share the same format
train_file = "./data/lac_train.tsv"
test_file = "./data/lac_test.tsv"
lac.train(model_save_dir='./my_lac_model/',
          train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_lac_model')
```
```
.
├── python        # Python interface scripts
├── c++           # C++ interface code
├── java          # Java interface code
├── Android       # Android usage example
├── README.md     # this file
└── CMakeList.txt # build script for the C++ and Java interfaces
```
If you use LAC in your academic work, please add the following citation. We are delighted that LAC can help you in your research.
@article{jiao2018LAC,
title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
journal={arXiv preprint arXiv:1807.01882},
year={2018},
url={https://arxiv.org/abs/1807.01882}
}
We welcome developers to contribute code to LAC. If you develop new features or find bugs, please submit pull requests and issues on GitHub.