The corpus contains questions and answers collected from the website Insurance Library.
To the best of our knowledge, at the time this dataset was released in 2017, it was the first open QA corpus in the insurance field:
The content of this corpus was generated by real-world users, and the high-quality answers were provided by professionals with deep domain knowledge. So this is a corpus of real value, not a toy.
In the original paper, the corpus was used for the answer selection task. Other uses of this corpus are also possible: for example, a system could learn independently by reading and understanding the answers, or by observation, and eventually produce its own answers to unseen questions.
The dataset is divided into two parts: the "QA corpus" and the "QA pairs corpus". The QA corpus was translated from the original English data without further processing. The QA pairs corpus is built on top of the QA corpus: it additionally performs word segmentation, removes punctuation and stopwords, and adds labels. The QA pairs corpus can therefore be fed directly into machine learning tasks. If you are not satisfied with its data format or segmentation quality, you can instead process the QA corpus with other methods to obtain data suitable for training a model.
pip install -U insuranceqa_data
Go to the certificate store and purchase a certificate. After purchase, open [Certificate - Details] and click [Copy Certificate Identity].
Then set the environment variable INSQA_DL_LICENSE, for example from a command-line terminal:
# Linux / macOS
export INSQA_DL_LICENSE=YOUR_LICENSE
# e.g. if your license id is `FOOBAR`, run `export INSQA_DL_LICENSE=FOOBAR`
# Windows
# 1/2 Command Prompt
set INSQA_DL_LICENSE=YOUR_LICENSE
# 2/2 PowerShell
$env:INSQA_DL_LICENSE='YOUR_LICENSE'
Finally, execute the following command to complete the data download.
python -c "import insuranceqa_data; insuranceqa_data.download_corpus()"
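If you prefer to stay inside Python, the license can also be placed into the environment right before the download. This is a minimal sketch; it assumes, as the shell examples above imply, that download_corpus() reads INSQA_DL_LICENSE from the environment at call time.

```python
import os

# Put the license into the environment before downloading.
# Replace YOUR_LICENSE with the certificate identity copied above.
os.environ["INSQA_DL_LICENSE"] = "YOUR_LICENSE"

# Then download as usual:
# import insuranceqa_data
# insuranceqa_data.download_corpus()
```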
The data comes in two formats: POOL and PAIR. Of the two, the PAIR format is better suited to training machine learning models.
import insuranceqa_data as insuranceqa

train_data = insuranceqa.load_pool_train()  # training set
test_data = insuranceqa.load_pool_test()    # test set
valid_data = insuranceqa.load_pool_valid()  # validation set

# valid_data, test_data and train_data share the same properties
for x in train_data:  # print the data
    print('index %s value: %s ++$++ %s ++$++ %s ++$++ %s' %
          (x, train_data[x]['zh'], train_data[x]['en'],
           train_data[x]['answers'], train_data[x]['negatives']))

answers_data = insuranceqa.load_pool_answers()
for x in answers_data:  # answer data
    print('index %s: %s ++$++ %s' % (x, answers_data[x]['zh'], answers_data[x]['en']))
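The answers and negatives fields hold indices into the answer corpus, so a typical first step is joining them back to answer text. A minimal sketch of that join, using small hand-written dicts that imitate the structures returned by load_pool_train() and load_pool_answers() (the real data is much larger):

```python
# Toy stand-ins for the POOL structures: questions point at answer ids.
train_data = {
    "1": {
        "zh": "...", "en": "What does renters insurance cover?",
        "answers": ["100"],           # positive answer ids
        "negatives": ["200", "201"],  # negative answer ids
    },
}
answers_data = {
    "100": {"zh": "...", "en": "It covers personal property."},
    "200": {"zh": "...", "en": "An unrelated answer."},
    "201": {"zh": "...", "en": "Another unrelated answer."},
}

def positive_answers(qid):
    """Return the English text of every positive answer for one question."""
    return [answers_data[aid]["en"] for aid in train_data[qid]["answers"]]

print(positive_answers("1"))  # ['It covers personal property.']
```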
| - | Questions | Answers | Vocabulary (English) |
|---|---|---|---|
| train | 12,889 | 21,325 | 107,889 |
| valid | 2,000 | 3,354 | 16,931 |
| test | 2,000 | 3,308 | 16,815 |
Each entry includes the question in Chinese and English, positive answer examples, and negative answer examples. There is at least one positive example per question (typically 1 to 5), all of which are correct answers. Each question also has 200 negative examples; the negatives were retrieved by search based on the question, so they are related to the question but are not correct answers.
{
  "INDEX": {
    "zh": "Chinese question",
    "en": "English question",
    "domain": "insurance category",
    "answers": [""],   # list of positive answer indices
    "negatives": [""]  # list of negative answer indices
  },
  more ...
}
Training: corpus/pool/train.json.gz
Validation: corpus/pool/valid.json.gz
Test: corpus/pool/test.json.gz
Answers: corpus/pool/answers.json, which contains 27,413 answers in total, in the following JSON format:
{
  "INDEX": {
    "zh": "Chinese answer",
    "en": "English answer"
  },
  more ...
}
Format: INDEX ++$++ insurance category ++$++ Chinese ++$++ English
corpus/pool/train.txt.gz, corpus/pool/valid.txt.gz, corpus/pool/test.txt.gz.
Format: INDEX ++$++ Chinese ++$++ English
corpus/pool/answers.txt.gz
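Both plain-text layouts split cleanly on the " ++$++ " separator. A small parser sketch for the answer lines (the question files work the same way, with one extra insurance-category field):

```python
SEP = " ++$++ "

def parse_answer_line(line):
    """Split one answers.txt line: INDEX ++$++ Chinese ++$++ English."""
    index, zh, en = line.rstrip("\n").split(SEP)
    return {"index": index, "zh": zh, "en": en}

sample = "100 ++$++ 中文答案 ++$++ an English answer"
print(parse_answer_line(sample))  # {'index': '100', 'zh': '中文答案', 'en': 'an English answer'}
```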
The corpus files are gzip-compressed to reduce their size; the data can be accessed with commands such as zmore, zless, zcat, and zgrep.
zmore pool/test.txt.gz
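The same files can be read directly from Python with the standard gzip module, without unpacking them first. A self-contained sketch (it writes a one-line sample file so the snippet runs anywhere; with the real corpus you would open corpus/pool/test.txt.gz instead):

```python
import gzip

# Write a tiny sample file in the corpus' line format.
with gzip.open("sample.txt.gz", "wt", encoding="utf-8") as f:
    f.write("1 ++$++ 中文 ++$++ English\n")

# Read it back line by line, much like zcat/zmore would.
with gzip.open("sample.txt.gz", "rt", encoding="utf-8") as f:
    for line in f:
        print(line.rstrip("\n"))
```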
Starting from the POOL data, considerable work is still needed before it can be fed into a machine learning model: word segmentation, stopword removal, punctuation removal, and adding label tags. We could keep processing on top of the POOL data ourselves, but choices such as which word segmentation tool to use affect model training. To make the data quickly usable, insuranceqa-corpus-zh provides a dataset that has already been segmented with HanLP, stripped of punctuation and stopwords, and labeled. This dataset is derived entirely from the POOL data.
import insuranceqa_data as insuranceqa

train_data = insuranceqa.load_pairs_train()
test_data = insuranceqa.load_pairs_test()
valid_data = insuranceqa.load_pairs_valid()

# valid_data, test_data and train_data share the same properties
for x in test_data:
    print('index %s value: %s ++$++ %s ++$++ %s' %
          (x['qid'], x['question'], x['utterance'], x['label']))

vocab_data = insuranceqa.load_pairs_vocab()
vocab_data['word2id']['UNKNOWN']
vocab_data['id2word'][0]
vocab_data['tf']
vocab_data['total']
vocab_data contains word2id (dict, mapping word to id), id2word (dict, mapping id to word), tf (dict, word frequency statistics) and total (total number of words). The token for out-of-vocabulary words is UNKNOWN, and its id is 0.
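A common use of this vocabulary is mapping segmented tokens to ids, with unseen words falling back to UNKNOWN (id 0). A sketch using a tiny hand-made dict standing in for vocab_data['word2id']:

```python
# Toy stand-in for vocab_data['word2id']; id 0 is reserved for UNKNOWN.
word2id = {"UNKNOWN": 0, "保险": 1, "理赔": 2}

def encode(tokens):
    """Map tokens to ids, falling back to UNKNOWN for unseen words."""
    return [word2id.get(t, word2id["UNKNOWN"]) for t in tokens]

print(encode(["保险", "理赔", "没见过的词"]))  # [1, 2, 0]
```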
train_data, test_data and valid_data share the same format. qid is the question ID, question is the question, and utterance is the reply. A label of [1,0] means the reply is a correct answer, and [0,1] means it is not, so utterance contains both positive and negative examples. Each question has 10 negative examples and 1 positive example.
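When training a binary classifier, the one-hot labels can be reduced to a boolean. A minimal helper illustrating the convention just described:

```python
def is_correct(label):
    """[1, 0] marks a correct reply, [0, 1] an incorrect one."""
    return label == [1, 0]

print(is_correct([1, 0]))  # True
print(is_correct([0, 1]))  # False
```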
train_data contains 12,889 questions and 141,779 examples; positives : negatives = 1:10
test_data contains 2,000 questions and 22,000 examples; positives : negatives = 1:10
valid_data contains 2,000 questions and 22,000 examples; positives : negatives = 1:10
Sentence length:
max len of valid question: 31, average: 5(max)
max len of valid utterance: 878(max), average: 165(max)
max len of test question: 33, average: 5
max len of test utterance: 878, average: 161
max len of train question: 42(max), average: 5
max len of train utterance: 878, average: 162
vocab size: 24997
You can use this corpus with the following open-source projects:
deep-qa-1: Baseline model
InsuranceQA TensorFlow: CNN with TensorFlow
n-grams-get-started: N-gram model
word2vec-get-started: word vector model
Statement 1: insuranceqa-corpus-zh
This dataset was produced by translating insuranceQA, and the code is released under the Chunsong Public License, version 1.0. The data is for research purposes only, and any publication in any media, journal, magazine or blog must cite and link the following:
InsuranceQA Corpus, Chatopera Inc., https://github.com/chatopera/insuranceqa-corpus-zh, July 27, 2017
Any data derived from insuranceqa-corpus must also remain open, and must include declarations consistent with "Statement 1" and "Statement 2".
Statement 2: insuranceQA
This dataset is provided for research purposes only. If you publish anything using these data, please cite our paper: Applying Deep Learning to Answer Selection: A Study and An Open Task. Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015