insuranceqa-corpus-zh

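Insurance Industry Corpus

This corpus contains questions and answers collected from the website Insurance Library.

To the best of our knowledge, when this dataset was released in 2017 it was the first open QA corpus in the insurance domain:

The questions were asked by real-world users, and the high-quality answers were written by professionals with deep domain knowledge. This makes it a corpus of real value, not a toy.
In the original insuranceQA paper (cited in Declaration 2), the corpus is used for the answer selection task. Other uses are also possible: for example, autonomous learning through reading comprehension of the answers or observational learning, so that a system can eventually produce its own answers to unseen questions.
The dataset has two parts: the "QA corpus" and the "QA pair corpus". The QA corpus is translated from the original English data with no further processing. The QA pair corpus is built on the QA corpus, with word segmentation, punctuation removal, stop-word removal, and labels added, so it can be plugged directly into machine learning tasks. If you are not satisfied with the data format or the segmentation quality, you can process the QA corpus with other methods to obtain data suitable for training models.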

Installation and Usage

1/3 Dependencies

Python: 2.x, 3.x
pip

2/3 Install the script package

 pip install -U insuranceqa_data

3/3 Install the corpus package

Go to the license store and purchase a license. After purchase, open [License - Details] and click [Copy License ID].

Then set the environment variable INSQA_DL_LICENSE, for example from a command-line terminal:

# Linux / macOS
export INSQA_DL_LICENSE=YOUR_LICENSE
# # e.g. if your license id is `FOOBAR`, run `export INSQA_DL_LICENSE=FOOBAR`

# Windows
# # 1/2 Command Prompt
set INSQA_DL_LICENSE=YOUR_LICENSE
# # 2/2 PowerShell
$env:INSQA_DL_LICENSE='YOUR_LICENSE'

Finally, run the following command to download the data.

python -c "import insuranceqa_data; insuranceqa_data.download_corpus()"
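
Before starting the download, it can help to confirm that the license variable is actually visible to Python. The following is a minimal sketch, assuming only that the download step reads `INSQA_DL_LICENSE` from the environment as described above; `license_from_env` is a hypothetical helper written for illustration and is not part of `insuranceqa_data`.

```python
import os


def license_from_env(environ=os.environ):
    """Return the license id from INSQA_DL_LICENSE, or None if unset or blank.

    Hypothetical helper for illustration; not part of insuranceqa_data.
    """
    value = environ.get("INSQA_DL_LICENSE", "").strip()
    return value or None


if license_from_env() is None:
    print("INSQA_DL_LICENSE is not set; export it before downloading.")
else:
    print("License id found; the download command can be run.")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the real environment.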

Data Format

The data comes in two formats: POOL and PAIR. The PAIR format is better suited to training machine learning models.

Loading POOL data

import insuranceqa_data as insuranceqa

train_data = insuranceqa.load_pool_train()  # training set
test_data = insuranceqa.load_pool_test()    # test set
valid_data = insuranceqa.load_pool_valid()  # validation set

# valid_data, test_data and train_data share the same properties
for x in train_data:  # print each question record
    # five placeholders, one per value in the tuple below
    print('index %s value: %s ++$++ %s ++$++ %s ++$++ %s' %
          (x, train_data[x]['zh'], train_data[x]['en'],
           train_data[x]['answers'], train_data[x]['negatives']))

answers_data = insuranceqa.load_pool_answers()
for x in answers_data:  # print each answer record
    print('index %s: %s ++$++ %s' % (x, answers_data[x]['zh'], answers_data[x]['en']))

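Data design

Split	Questions	Answers	Vocabulary (English)
Train	12,889	21,325	107,889
Validation	2,000	3,354	16,931
Test	2,000	3,308	16,815

Each record includes the question in Chinese and English, its positive answers, and its negative answers. Every question has at least one positive answer, typically 1 to 5, all of which are correct. Each question has 200 negative answers; they are built by retrieval against the question, so they are related to it but are not correct answers.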

{
    "INDEX": {
        "zh": "Chinese text",
        "en": "English text",
        "domain": "insurance type",
        "answers": [""],  # list of positive answers
        "negatives": [""]  # list of negative answers
    },
    more ...
}

Train: corpus/pool/train.json.gz
Validation: corpus/pool/valid.json.gz
Test: corpus/pool/test.json.gz
Answers: corpus/pool/answers.json

There are 27,413 answers in total, stored as JSON:

{
    "INDEX": {
        "zh": "Chinese text",
        "en": "English text"
    },
    more ...
}
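
The two structures above can be joined into labeled question/answer pairs. The sketch below is illustrative only: it assumes that the entries in `answers` and `negatives` are keys into the answer pool, which the schema above does not state explicitly, and it runs on a tiny made-up sample rather than the real corpus. `pool_to_pairs` is a hypothetical helper, not part of `insuranceqa_data`.

```python
def pool_to_pairs(questions, answers):
    """Yield (question_zh, answer_zh, label); label is 1 for positives, 0 for negatives.

    Assumption: 'answers' and 'negatives' hold keys into the answer pool.
    """
    for item in questions.values():
        for label, key in ((1, "answers"), (0, "negatives")):
            for answer_id in item[key]:
                # Skip ids missing from the pool instead of raising KeyError.
                if answer_id in answers:
                    yield item["zh"], answers[answer_id]["zh"], label


# Made-up sample in the documented shape (not real corpus data).
sample_questions = {
    "0": {"zh": "问题", "en": "question", "domain": "example-domain",
          "answers": ["10"], "negatives": ["11", "12"]},
}
sample_answers = {
    "10": {"zh": "正确答案", "en": "correct answer"},
    "11": {"zh": "错误答案一", "en": "wrong answer 1"},
    "12": {"zh": "错误答案二", "en": "wrong answer 2"},
}

pairs = list(pool_to_pairs(sample_questions, sample_answers))
for question, answer, label in pairs:
    print(label, question, answer)
```

With the real data, the two arguments would be the objects returned by `load_pool_train()` and `load_pool_answers()`, provided the assumption about answer ids holds.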

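Chinese-English parallel files

Question-answer pairs

Format: INDEX ++$++ insurance type ++$++ Chinese ++$++ English

corpus/pool/train.txt.gz, corpus/pool/valid.txt.gz, corpus/pool/test.txt.gz.

Answers

Format: INDEX ++$++ Chinese ++$++ English

corpus/pool/answers.txt.gz

The corpus is compressed with gzip to reduce its size. You can read the data with commands such as zmore, zless, zcat, and zgrep.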

 zmore pool/test.txt.gz
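
The same files can also be read from Python with the standard `gzip` module. This is a rough sketch under stated assumptions: the files are UTF-8, hold one record per line, and use ` ++$++ ` with a single space on each side as the separator. `read_parallel_file` is a hypothetical helper, and the demo writes its own one-line file so it runs without the real corpus.

```python
import gzip
import os
import tempfile

SEPARATOR = " ++$++ "  # assumed: one space on each side of ++$++


def read_parallel_file(path, fields):
    """Parse a gzip-compressed, ++$++-delimited text file into a list of dicts."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Limit the split so a separator inside the last field is kept.
            parts = line.split(SEPARATOR, len(fields) - 1)
            rows.append(dict(zip(fields, parts)))
    return rows


# Write a made-up one-line answers file in the documented format.
demo_path = os.path.join(tempfile.mkdtemp(), "answers.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("1 ++$++ 中文答案 ++$++ English answer\n")

rows = read_parallel_file(demo_path, ["index", "zh", "en"])
print(rows)
```

For the question files, the field list would have four names to match the `INDEX ++$++ insurance type ++$++ Chinese ++$++ English` layout.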

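Loading PAIR data

The QA corpus still needs a lot of work before it can be fed into a machine learning model: word segmentation, stop-word removal, punctuation removal, and labeling. You can do this processing yourself on top of the QA corpus, but different segmentation tools give different results, and that choice affects model training. To make the data usable quickly, insuranceqa-corpus-zh provides a dataset that has been segmented with HanLP, stripped of punctuation and stop words, and labeled. This dataset is derived entirely from the QA corpus.

Loading data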

import insuranceqa_data as insuranceqa

train_data = insuranceqa.load_pairs_train()
test_data = insuranceqa.load_pairs_test()
valid_data = insuranceqa.load_pairs_valid()

# valid_data, test_data and train_data share the same properties
for x in test_data:
    print('index %s value: %s ++$++ %s ++$++ %s' %
          (x['qid'], x['question'], x['utterance'], x['label']))

vocab_data = insuranceqa.load_pairs_vocab()
vocab_data['word2id']['UNKNOWN']
vocab_data['id2word'][0]
vocab_data['tf']
vocab_data['total']

Data design

vocab_data contains word2id (dict, word to id), id2word (dict, id to word), tf (dict, term frequency counts), and total (total number of words). The out-of-vocabulary token is UNKNOWN, and its id is 0.
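
As an illustration of how the UNKNOWN entry is meant to be used, the sketch below maps tokens to ids and falls back to the UNKNOWN id for words missing from the vocabulary. The `encode` helper and the three-word vocabulary are made up for this example; with the real data, `word2id` would be `vocab_data['word2id']`.

```python
def encode(tokens, word2id, unknown_token="UNKNOWN"):
    """Map tokens to ids, using the UNKNOWN id for out-of-vocabulary words."""
    unknown_id = word2id[unknown_token]
    return [word2id.get(token, unknown_id) for token in tokens]


# Made-up vocabulary following the documented convention (UNKNOWN has id 0).
sample_word2id = {"UNKNOWN": 0, "保险": 1, "理赔": 2}

ids = encode(["保险", "火星", "理赔"], sample_word2id)
print(ids)  # [1, 0, 2]
```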

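train_data, test_data, and valid_data share the same format. qid is the question id, question is the question, and utterance is the reply. A label of [1,0] means the reply is a correct answer and [0,1] means it is not, so utterance covers both positive and negative examples. Each question has 10 negative examples and 1 positive example.

train_data contains 12,889 questions and 141,779 records, positive:negative = 1:10
test_data contains 2,000 questions and 22,000 records, positive:negative = 1:10
valid_data contains 2,000 questions and 22,000 records, positive:negative = 1:10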
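
The label encoding and the 1:10 ratio can be checked by grouping records by `qid`. The following sketch is only an illustration: `is_correct` and `count_by_question` are hypothetical helpers, and the sample records are made up (in particular, the `question` and `utterance` values are placeholders, since their exact type is not described here).

```python
from collections import defaultdict


def is_correct(label):
    """[1, 0] marks a correct answer; [0, 1] marks an incorrect one."""
    return list(label) == [1, 0]


def count_by_question(pairs):
    """Return {qid: (num_positive, num_negative)} for a list of PAIR records."""
    counts = defaultdict(lambda: [0, 0])
    for record in pairs:
        slot = 0 if is_correct(record["label"]) else 1
        counts[record["qid"]][slot] += 1
    return {qid: tuple(pair) for qid, pair in counts.items()}


# Made-up records: one positive and ten negatives for a single question.
sample = [{"qid": "q1", "question": [1], "utterance": [2], "label": [1, 0]}]
sample += [{"qid": "q1", "question": [1], "utterance": [3], "label": [0, 1]}
           for _ in range(10)]

print(count_by_question(sample))  # {'q1': (1, 10)}
```

The record counts are consistent with this layout: 12,889 questions with 11 records each gives 141,779, and 2,000 questions gives 22,000.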

Sentence lengths:

 max len of valid question : 31, average: 5(max)
max len of valid utterance: 878(max), average: 165(max)
max len of test question : 33, average: 5
max len of test utterance: 878, average: 161
max len of train question : 42(max), average: 5
max len of train utterance: 878, average: 162
vocab size: 24997

Machine learning projects

This corpus can be used together with the following open-source projects:

deep-qa-1: Baseline model

InsuranceQA TensorFlow: CNN with TensorFlow

n-grams-get-started: N-gram model

word2vec-get-started: Word vector model

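Declarations

Declaration 1: insuranceqa-corpus-zh

This dataset was generated by translating insuranceQA. The code is released under the Chunsong Public License, version 1.0. The data is for research use only. If it is used in any published media, journal, magazine, or blog content, the citation and address must be included: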

 InsuranceQA Corpus, Chatopera Inc., https://github.com/chatopera/insuranceqa-corpus-zh, 07 27, 2017

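Any data derived from insuranceqa-corpus must also be made open and must carry statements consistent with Declaration 1 and Declaration 2.

Declaration 2: insuranceQA

This dataset is provided for research purposes only. If you publish anything using this data, please cite our paper: Applying Deep Learning to Answer Selection: A Study and An Open Task. Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015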
