The corpus contains questions and answers collected from the website Insurance Library.
To the best of our knowledge, at the time this dataset was released in 2017, it was the first open QA corpus in the insurance field:
The content of this corpus was generated by real-world users, and the high-quality answers were provided by professionals with deep domain knowledge. So this is a corpus of real value, not a toy.
In the original paper, the corpus was used for the answer selection task. Other uses of this corpus are also possible: for example, a system could learn independently by reading and understanding the answers, or by observation, and eventually produce its own answers to unseen questions.
The dataset is divided into two parts: the "QA corpus" and the "QA pairs corpus". The QA corpus was translated from the original English data without further processing. The QA pairs corpus is built on top of the QA corpus: it additionally performs word segmentation, removes punctuation and stopwords, and adds labels. The QA pairs corpus can therefore be fed directly into machine learning tasks. If you are not satisfied with its data format or segmentation quality, you can instead process the QA corpus with other methods to obtain data suitable for training a model.
pip install -U insuranceqa_data
Go to the certificate store and purchase a certificate. After purchase, open [Certificate - Details] and click [Copy Certificate Identity].
Then set the environment variable INSQA_DL_LICENSE, for example from a command-line terminal:
# Linux / macOS
export INSQA_DL_LICENSE=YOUR_LICENSE
# e.g. if your license id is `FOOBAR`, run `export INSQA_DL_LICENSE=FOOBAR`
# Windows
# 1/2 Command Prompt
set INSQA_DL_LICENSE=YOUR_LICENSE
# 2/2 PowerShell
$env:INSQA_DL_LICENSE='YOUR_LICENSE'
Finally, execute the following command to complete the data download.
python -c "import insuranceqa_data; insuranceqa_data.download_corpus()"
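If you prefer to stay inside Python, the license can also be placed into the environment right before the download. This is a minimal sketch; it assumes, as the shell examples above imply, that download_corpus() reads INSQA_DL_LICENSE from the environment at call time.

```python
import os

# Put the license into the environment before downloading.
# Replace YOUR_LICENSE with the certificate identity copied above.
os.environ["INSQA_DL_LICENSE"] = "YOUR_LICENSE"

# Then download as usual:
# import insuranceqa_data
# insuranceqa_data.download_corpus()
```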
The data comes in two formats: POOL and PAIR. Of the two, the PAIR format is better suited to training machine learning models.
import insuranceqa_data as insuranceqa

train_data = insuranceqa.load_pool_train()  # training set
test_data = insuranceqa.load_pool_test()    # test set
valid_data = insuranceqa.load_pool_valid()  # validation set

# valid_data, test_data and train_data share the same properties
for x in train_data:  # print the data
    print('index %s value: %s ++$++ %s ++$++ %s ++$++ %s' %
          (x, train_data[x]['zh'], train_data[x]['en'],
           train_data[x]['answers'], train_data[x]['negatives']))

answers_data = insuranceqa.load_pool_answers()
for x in answers_data:  # answer data
    print('index %s: %s ++$++ %s' % (x, answers_data[x]['zh'], answers_data[x]['en']))
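The answers and negatives fields hold indices into the answer corpus, so a typical first step is joining them back to answer text. A minimal sketch of that join, using small hand-written dicts that imitate the structures returned by load_pool_train() and load_pool_answers() (the real data is much larger):

```python
# Toy stand-ins for the POOL structures: questions point at answer ids.
train_data = {
    "1": {
        "zh": "...", "en": "What does renters insurance cover?",
        "answers": ["100"],           # positive answer ids
        "negatives": ["200", "201"],  # negative answer ids
    },
}
answers_data = {
    "100": {"zh": "...", "en": "It covers personal property."},
    "200": {"zh": "...", "en": "An unrelated answer."},
    "201": {"zh": "...", "en": "Another unrelated answer."},
}

def positive_answers(qid):
    """Return the English text of every positive answer for one question."""
    return [answers_data[aid]["en"] for aid in train_data[qid]["answers"]]

print(positive_answers("1"))  # ['It covers personal property.']
```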
| - | Questions | Answers | Vocabulary (English) |
|---|---|---|---|
| train | 12,889 | 21,325 | 107,889 |
| valid | 2,000 | 3,354 | 16,931 |
| test | 2,000 | 3,308 | 16,815 |
Each entry includes the question in Chinese and English, positive answer examples, and negative answer examples. There is at least one positive example per question (typically 1 to 5), all of which are correct answers. Each question also has 200 negative examples; the negatives were retrieved by search based on the question, so they are related to the question but are not correct answers.
{
  "INDEX": {
    "zh": "Chinese question",
    "en": "English question",
    "domain": "insurance category",
    "answers": [""],   # list of positive answer indices
    "negatives": [""]  # list of negative answer indices
  },
  more ...
}
Training: corpus/pool/train.json.gz
Validation: corpus/pool/valid.json.gz
Test: corpus/pool/test.json.gz
Answers: corpus/pool/answers.json, which contains 27,413 answers in total, in the following JSON format:
{
  "INDEX": {
    "zh": "Chinese answer",
    "en": "English answer"
  },
  more ...
}
Format: INDEX ++$++ insurance category ++$++ Chinese ++$++ English
corpus/pool/train.txt.gz, corpus/pool/valid.txt.gz, corpus/pool/test.txt.gz.
Format: INDEX ++$++ Chinese ++$++ English
corpus/pool/answers.txt.gz
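Both plain-text layouts split cleanly on the " ++$++ " separator. A small parser sketch for the answer lines (the question files work the same way, with one extra insurance-category field):

```python
SEP = " ++$++ "

def parse_answer_line(line):
    """Split one answers.txt line: INDEX ++$++ Chinese ++$++ English."""
    index, zh, en = line.rstrip("\n").split(SEP)
    return {"index": index, "zh": zh, "en": en}

sample = "100 ++$++ 中文答案 ++$++ an English answer"
print(parse_answer_line(sample))  # {'index': '100', 'zh': '中文答案', 'en': 'an English answer'}
```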
The corpus files are gzip-compressed to reduce their size; the data can be accessed with commands such as zmore, zless, zcat, and zgrep.
zmore pool/test.txt.gz
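The same files can be read directly from Python with the standard gzip module, without unpacking them first. A self-contained sketch (it writes a one-line sample file so the snippet runs anywhere; with the real corpus you would open corpus/pool/test.txt.gz instead):

```python
import gzip

# Write a tiny sample file in the corpus' line format.
with gzip.open("sample.txt.gz", "wt", encoding="utf-8") as f:
    f.write("1 ++$++ 中文 ++$++ English\n")

# Read it back line by line, much like zcat/zmore would.
with gzip.open("sample.txt.gz", "rt", encoding="utf-8") as f:
    for line in f:
        print(line.rstrip("\n"))
```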
Starting from the POOL data, considerable work is still needed before it can be fed into a machine learning model: word segmentation, stopword removal, punctuation removal, and adding label tags. We could keep processing on top of the POOL data ourselves, but choices such as which word segmentation tool to use affect model training. To make the data quickly usable, insuranceqa-corpus-zh provides a dataset that has already been segmented with HanLP, stripped of punctuation and stopwords, and labeled. This dataset is derived entirely from the POOL data.
import insuranceqa_data as insuranceqa

train_data = insuranceqa.load_pairs_train()
test_data = insuranceqa.load_pairs_test()
valid_data = insuranceqa.load_pairs_valid()

# valid_data, test_data and train_data share the same properties
for x in test_data:
    print('index %s value: %s ++$++ %s ++$++ %s' %
          (x['qid'], x['question'], x['utterance'], x['label']))

vocab_data = insuranceqa.load_pairs_vocab()
vocab_data['word2id']['UNKNOWN']
vocab_data['id2word'][0]
vocab_data['tf']
vocab_data['total']
vocab_data contains word2id (dict, mapping word to id), id2word (dict, mapping id to word), tf (dict, word frequency statistics) and total (total number of words). The token for out-of-vocabulary words is UNKNOWN, and its id is 0.
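A common use of this vocabulary is mapping segmented tokens to ids, with unseen words falling back to UNKNOWN (id 0). A sketch using a tiny hand-made dict standing in for vocab_data['word2id']:

```python
# Toy stand-in for vocab_data['word2id']; id 0 is reserved for UNKNOWN.
word2id = {"UNKNOWN": 0, "保险": 1, "理赔": 2}

def encode(tokens):
    """Map tokens to ids, falling back to UNKNOWN for unseen words."""
    return [word2id.get(t, word2id["UNKNOWN"]) for t in tokens]

print(encode(["保险", "理赔", "没见过的词"]))  # [1, 2, 0]
```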
train_data, test_data and valid_data share the same format. qid is the question ID, question is the question, and utterance is the reply. A label of [1,0] means the reply is a correct answer, and [0,1] means it is not, so utterance contains both positive and negative examples. Each question has 10 negative examples and 1 positive example.
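When training a binary classifier, the one-hot labels can be reduced to a boolean. A minimal helper illustrating the convention just described:

```python
def is_correct(label):
    """[1, 0] marks a correct reply, [0, 1] an incorrect one."""
    return label == [1, 0]

print(is_correct([1, 0]))  # True
print(is_correct([0, 1]))  # False
```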
train_data contains 12,889 questions and 141,779 examples; positives : negatives = 1:10
test_data contains 2,000 questions and 22,000 examples; positives : negatives = 1:10
valid_data contains 2,000 questions and 22,000 examples; positives : negatives = 1:10
Sentence length:
max len of valid question: 31, average: 5(max)
max len of valid utterance: 878(max), average: 165(max)
max len of test question: 33, average: 5
max len of test utterance: 878, average: 161
max len of train question: 42(max), average: 5
max len of train utterance: 878, average: 162
vocab size: 24997
You can use this corpus with the following open-source projects:
deep-qa-1: Baseline model
InsuranceQA TensorFlow: CNN with TensorFlow
n-grams-get-started: N-gram model
word2vec-get-started: word vector model
Statement 1: insuranceqa-corpus-zh
This dataset was produced by translating insuranceQA, and the code is released under the Chunsong Public License, version 1.0. The data is for research purposes only, and any publication in any media, journal, magazine or blog must cite and link the following:
InsuranceQA Corpus, Chatopera Inc., https://github.com/chatopera/insuranceqa-corpus-zh, July 27, 2017
Any data derived from insuranceqa-corpus must also remain open, and must include declarations consistent with "Statement 1" and "Statement 2".
Statement 2: insuranceQA
This dataset is provided for research purposes only. If you publish anything using these data, please cite our paper: Applying Deep Learning to Answer Selection: A Study and An Open Task. Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015