xlnet_zh

XLNet for Chinese, TensorFlow & PyTorch

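Chinese pre-trained XLNet model

XLNet is a new pre-training model proposed by CMU and Google Brain in June 2019. It outperforms BERT on multiple tasks. While keeping the form of autoregressive language modeling, it incorporates the advantages of autoencoding language modeling by introducing permutation language modeling. It is also built on Transformer-XL, which gives it a better ability to handle long text.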
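For reference, the permutation language modeling objective can be written as below, following the notation of the XLNet paper. Here Z_T is the set of all permutations of the index sequence [1, ..., T], z is one sampled factorization order, z_t is its t-th element, and z_{<t} are the elements before it. The model is still autoregressive, but because the order is sampled, each position learns to use context from both sides.

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```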

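Following the work of [2] and using a large amount of data, this project trained a 24-layer Chinese model, xlnet_zh_Large, with more than 300 million parameters.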

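Training Corpus & Training Details

The training data includes news, interactive discussions, and encyclopedia text: more than 30 GB of raw text and nearly 10 billion Chinese characters. This project uses the same training data as RoBERTa_zh, the Chinese pre-trained RoBERTa project.

Training took 2 days on a Google TPU v3-256, which consists of 32 v3-8 machines with 128 GB of memory each. The model was trained for 200,000 steps with a sequence length (sequence_length) of 512 and a batch size (batch_size) of 512.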
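The figures above imply a few derived quantities, sketched below as plain arithmetic. The 8 cores per v3-8 machine come from the --num_core_per_host=8 flag used in the training command; everything else is taken from the paragraph above. The token count is in token positions processed, not distinct characters of the corpus.

```python
# Back-of-the-envelope numbers derived from the training setup described above.
NUM_HOSTS = 32        # v3-8 machines in the v3-256 slice
CORES_PER_HOST = 8    # TPU cores per v3-8 machine (--num_core_per_host=8)
TRAIN_STEPS = 200_000
BATCH_SIZE = 512      # sequences per training step
SEQ_LEN = 512         # token positions per sequence

total_cores = NUM_HOSTS * CORES_PER_HOST
sequences_per_core = BATCH_SIZE // total_cores
tokens_per_step = BATCH_SIZE * SEQ_LEN
tokens_seen = TRAIN_STEPS * tokens_per_step

print("total TPU cores:       ", total_cores)
print("sequences per core:    ", sequences_per_core)
print("token positions / step:", tokens_per_step)
print("token positions total: ", tokens_seen)
```

With these settings each core handles only 2 sequences per step, and the run processes roughly 52 billion token positions in total.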

Notices

xlnet_zh_Large has not been fully tested. It may perform very well on your task, or poorly on some tasks. We expect both good news and bad news; so far, the sentence-pair task (LCQMC) is bad news.

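Share Your Test Comparisons (Performance)

If you use the Chinese pre-trained model from this project, please tell us how it compares in your tests. You can send a pull request that adds the comparison for your task to README.md, or post it in an issue.

You can also join the Chinese pre-trained model transformers discussion group (QQ: 836811304) and share your comparison there.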

Download Pre-trained XLNet, for Chinese tasks

xlnet_zh_Large: Baidu Netdisk or Google Drive, TensorFlow version

The Adam parameters have not been removed yet; once they are removed, the model will shrink to about 1.3 GB.

xlnet_zh_Large_L-24_H-1024_A-16.zip
  |- xlnet_model.ckpt    # model weights
  |- xlnet_model.index   # checkpoint index
  |- xlnet_model.meta    # graph meta information
  |- xlnet_config.json   # configuration file
  |- spiece.model        # vocabulary (SentencePiece model)
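A rough size estimate shows why dropping the Adam parameters matters. The sketch below assumes float32 weights and that Adam stores two extra slot tensors (first and second moment) per weight; the parameter count of 325 million is a hypothetical value chosen to be consistent with "more than 300 million parameters" and the 1.3 GB figure, not a number reported by the project.

```python
BYTES_PER_FLOAT32 = 4


def checkpoint_size_gb(num_params, adam_slots=0):
    """Approximate checkpoint size in GB (10**9 bytes).

    num_params: number of model weights (assumed float32).
    adam_slots: extra per-weight tensors kept by the optimizer
                (2 for Adam's first and second moments, 0 if stripped).
    """
    return num_params * (1 + adam_slots) * BYTES_PER_FLOAT32 / 1e9


NUM_PARAMS = 325_000_000  # hypothetical, for illustration only

print("weights only:     %.1f GB" % checkpoint_size_gb(NUM_PARAMS))
print("weights plus Adam: %.1f GB" % checkpoint_size_gb(NUM_PARAMS, adam_slots=2))
```

Under these assumptions a checkpoint that still carries the optimizer state is about three times the size of the weights alone.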

For a PyTorch version, you can convert the checkpoint with a command like the one below; see the pytorch_transformers project for details:

python -u -m pytorch_transformers.convert_tf_checkpoint_to_pytorch --tf_checkpoint_path XLNet-zh-Large-PyTorch/ --bert_config_file XLNet-zh-Large-PyTorch/config.json --pytorch_dump_path XLNet-zh-Large-PyTorch/xlnet_zh_large_pytorch_model.bin

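How can the model keep predicting from left to right (like a traditional language model) while still using information from the following context?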

1.input_list:   [1, 2, 3, 4, 5, 6]
2.sampled_list: [2, 4, 6, 5, 3, 1]
3.array_2d:
                [[0. 1. 1. 1. 1. 1.]
                 [0. 0. 0. 0. 0. 0.]
                 [0. 1. 0. 1. 1. 1.]
                 [0. 1. 0. 0. 0. 0.]
                 [0. 1. 0. 1. 0. 1.]
                 [0. 1. 0. 1. 0. 0.]]

import random

import numpy as np


def xlnet_mask(input_list):
    """
    Take a list (e.g. [x1, x2, x3, x4]), sample a new order (e.g. [x3, x2, x4, x1]),
    and return a matrix in which each word xi can only see the words that come
    before it in the sampled order.
    For the order [x3, x2, x4, x1]:
        x2 can see x3
        x4 can see x3, x2
        x1 can see x3, x2, x4
        x3 can see nothing
    In the matrix, 1 means visible and 0 means not visible.
    :param input_list: list of the integers 1..n
    :return: n x n matrix, e.g.
    [[0,1,1,1],  # x1
     [0,0,1,0],  # x2
     [0,0,0,0],  # x3
     [0,1,1,0]]  # x4
    """
    print("1.input_list:", input_list)
    sampled_list = list(input_list)  # copy so the caller's list is not modified
    random.shuffle(sampled_list)  # shuffle the order
    print("2.sampled_list:", sampled_list)
    num_size = len(input_list)

    array_2d = np.zeros((num_size, num_size))
    for index, current_element in enumerate(sampled_list):
        # words that come before the current element in the sampled order
        previous_element_list = sampled_list[0:index]
        for previous_element in previous_element_list:
            array_2d[current_element - 1][previous_element - 1] = 1

    print("3.array_2d:\n", array_2d)
    return array_2d


input_list = [1, 2, 3, 4, 5, 6]
array_2d = xlnet_mask(input_list)
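Because xlnet_mask shuffles randomly, its output changes on every run. The dependency-free sketch below builds the same kind of matrix from a fixed order, which makes the sample output shown earlier reproducible. The function name mask_from_order is introduced here for illustration and is not part of the project.

```python
def mask_from_order(order):
    """Build the visibility matrix for a fixed factorization order.

    `order` is a permutation of 1..n. Entry [i-1][j-1] is 1 when token j
    comes before token i in `order` (so i may attend to j), otherwise 0.
    """
    n = len(order)
    # rank of each token in the sampled order, e.g. {2: 0, 4: 1, ...}
    position = {token: rank for rank, token in enumerate(order)}
    return [
        [1 if position[j] < position[i] else 0 for j in range(1, n + 1)]
        for i in range(1, n + 1)
    ]


mask = mask_from_order([2, 4, 6, 5, 3, 1])
for row in mask:
    print(row)
```

For the order [2, 4, 6, 5, 3, 1] this reproduces the array_2d printed in the sample output: token 2 comes first and sees nothing, while token 1 comes last and sees every other token.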

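Performance Tests and Comparison

Please report your results and add them here.

Any dataset or task is welcome, including XNLI, LCQMC, the reading comprehension dataset CMRC, CCF-Sentiment-Analysis, and so on.

Loading the model (using Sentence Pair Matching with LCQMC as an example)

Pre-training

1. Generate tfrecords: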

SAVE_DIR=gs://xlnet_zh/tf_records_xlnet
INPUT=gs://raw_text/data_2019_raw/*.txt
nohup python -u data_utils.py \
    --bsz_per_host=32 \
    --num_core_per_host=8 \
    --seq_len=512 \
    --reuse_len=256 \
    --input_glob=${INPUT} \
    --save_dir=${SAVE_DIR} \
    --num_passes=20 \
    --bi_data=True \
    --sp_path=spiece.model \
    --mask_alpha=6 \
    --mask_beta=1 \
    --num_predict=85 \
    --uncased=False \
    --num_task=200 \
    --task=1 &
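The --num_task and --task flags split preprocessing across independent jobs. The sketch below assumes data_utils.py assigns input files with an interleaved split, as the upstream XLNet code does (sorted file list, every num_task-th file starting at index task). The file names are hypothetical and the helper files_for_task is introduced only for illustration.

```python
def files_for_task(file_paths, task, num_task):
    """Return the input files one preprocessing job would handle.

    Assumes an interleaved split: sort the paths, then take every
    num_task-th path starting at index `task`.
    """
    return sorted(file_paths)[task::num_task]


# Hypothetical input files, for illustration only.
all_files = ["part-%03d.txt" % i for i in range(1000)]

shard = files_for_task(all_files, task=1, num_task=200)
print(shard)
```

If the split works this way, the command above with --task=1 covers only one of 200 shards, and the remaining task indices would each need their own run to cover the whole corpus.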

The first step assumes you already have a vocabulary (the vocabulary for this project is at src/spiece.model). If you need to build your own, see the command below; for more information, refer to SentencePiece.

Generate the vocabulary:

spm_train \
    --input=gs://raw_text/data_2019_raw/*.txt \
    --model_prefix=sp10m.cased.v3 \
    --vocab_size=32000 \
    --character_coverage=0.99995 \
    --model_type=unigram \
    --control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
    --user_defined_symbols='<eop>,.,(,),",-,–,£,€' \
    --shuffle_input_sentence \
    --input_sentence_size=200000000

2. Train the model:

DATA=gs://xlnet_zh/tf_records_xlnet/tfrecords/
MODEL_DIR=gs://xlnet_zh/xlnet_zh_large
TPU_NAME=xlnet-zh-large-v3-256
TPU_ZONE=europe-west4-a
nohup python train.py \
    --record_info_dir=$DATA \
    --model_dir=$MODEL_DIR \
    --train_batch_size=512 \
    --num_hosts=32 \
    --num_core_per_host=8 \
    --seq_len=512 \
    --reuse_len=256 \
    --mem_len=384 \
    --perm_size=256 \
    --n_layer=24 \
    --d_model=1024 \
    --d_embed=1024 \
    --n_head=16 \
    --d_head=64 \
    --d_inner=4096 \
    --untie_r=True \
    --mask_alpha=6 \
    --mask_beta=1 \
    --num_predict=85 \
    --uncased=False \
    --train_steps=200000 \
    --save_steps=3000 \
    --warmup_steps=10000 \
    --max_save=30 \
    --weight_decay=0.01 \
    --adam_epsilon=1e-6 \
    --learning_rate=1e-5 \
    --dropout=0.1 \
    --dropatt=0.1 \
    --tpu=$TPU_NAME \
    --tpu_zone=$TPU_ZONE \
    --use_tpu=True \
    --track_mean=True &
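A few relationships between the flags above can be checked with simple arithmetic, sketched below. The values are copied from the command; the interpretation of --save_steps as "write a checkpoint every N steps" is an assumption based on the flag name.

```python
# Flag values copied from the training command above.
flags = {
    "n_head": 16, "d_head": 64, "d_model": 1024, "d_inner": 4096,
    "train_steps": 200_000, "save_steps": 3_000, "warmup_steps": 10_000,
}

# The attention heads together span the full model width.
attention_width = flags["n_head"] * flags["d_head"]

# The feed-forward layer is four times as wide as the model.
ffn_ratio = flags["d_inner"] // flags["d_model"]

# Fraction of training spent in learning-rate warmup.
warmup_fraction = flags["warmup_steps"] / flags["train_steps"]

# Periodic checkpoints, assuming one is written every save_steps steps.
periodic_checkpoints = flags["train_steps"] // flags["save_steps"]
last_periodic_step = periodic_checkpoints * flags["save_steps"]

print("n_head * d_head      =", attention_width)
print("d_inner / d_model    =", ffn_ratio)
print("warmup fraction      =", warmup_fraction)
print("periodic checkpoints =", periodic_checkpoints)
print("last periodic step   =", last_periodic_step)
```

The head dimensions multiply out to d_model (16 x 64 = 1024), and warmup covers the first 5% of the 200,000 training steps.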

Fine-tuning (using the LCQMC task as an example)

XLNET_DIR=gs://xlnet_zh/xlnet_zh_large
MODEL_DIR=gs://xlnet_zh/fine_tuning_test/lcqmc_01
DATA_DIR=gs://xlnet_zh/fine_tuning_test/lcqmc_01/lcqmc_tfrecords
RAW_DIR=gs://roberta_zh/compare_model_performance/lcqmc
TPU_NAME=grpc://03.06.08.09:8470
TPU_ZONE=us-central1-a
nohup python -u run_classifier.py \
    --spiece_model_file=./spiece.model \
    --model_config_path=${XLNET_DIR}/config.json \
    --init_checkpoint=${XLNET_DIR}/model.ckpt-192000 \
    --task_name=lcqmc \
    --do_train=True \
    --do_eval=True \
    --eval_all_ckpt=True \
    --uncased=False \
    --data_dir=${RAW_DIR} \
    --output_dir=${DATA_DIR} \
    --model_dir=${MODEL_DIR} \
    --train_batch_size=128 \
    --eval_batch_size=8 \
    --num_hosts=1 \
    --num_core_per_host=8 \
    --num_train_epochs=3 \
    --max_seq_length=128 \
    --learning_rate=2e-5 \
    --save_steps=1000 \
    --use_tpu=True \
    --tpu=${TPU_NAME} \
    --tpu_zone=${TPU_ZONE} >> xlnet_large_lcqmc_1.out &

Note: TPU_NAME is a dummy value; replace the IP address with your real one.
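To get a feel for how long the fine-tuning run is, the sketch below estimates the number of training steps from the flags above. It assumes the step count is derived as examples x epochs / batch size, which may differ from what run_classifier.py does internally; the example count of 240,000 is a hypothetical placeholder, not the actual size of the LCQMC training set.

```python
def fine_tuning_steps(num_examples, num_epochs, batch_size):
    """Estimate training steps as examples * epochs / batch size (rounded down)."""
    return num_examples * num_epochs // batch_size


NUM_EXAMPLES = 240_000   # hypothetical training-set size, for illustration only
NUM_EPOCHS = 3           # --num_train_epochs=3
BATCH_SIZE = 128         # --train_batch_size=128
SAVE_STEPS = 1_000       # --save_steps=1000
CORES = 8                # --num_hosts=1 x --num_core_per_host=8

steps = fine_tuning_steps(NUM_EXAMPLES, NUM_EPOCHS, BATCH_SIZE)
print("training steps:       ", steps)
print("periodic checkpoints: ", steps // SAVE_STEPS)
print("sequences per core:   ", BATCH_SIZE // CORES)
```

Since --eval_all_ckpt=True evaluates every saved checkpoint, the number of periodic checkpoints also gives a rough idea of how many evaluation passes to expect.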