xlnet_zh

XLNet for Chinese, TensorFlow & PyTorch

XLNet Chinese pre-training model

XLNet is a pre-training model proposed by CMU and Google Brain in June 2019 that outperforms BERT on multiple tasks. It keeps the form of an autoregressive language model (Autoregressive Language Modeling) while incorporating the advantages of Autoencoding Language Modeling, through an objective called Permutation Language Modeling. Because it is built on Transformer-XL, it also handles long text better.
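
For reference, the Permutation Language Modeling objective can be written as below. This is a sketch in the notation of the XLNet paper rather than something stated in this README: $\mathcal{Z}_T$ is assumed to denote the set of all permutations of the index sequence $[1, \dots, T]$, $z_t$ the $t$-th index of a sampled order $\mathbf{z}$, and $\mathbf{x}_{\mathbf{z}_{<t}}$ the tokens that precede position $z_t$ in that order.

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

Each token is still predicted autoregressively, but because the order is sampled, the conditioning set can contain tokens from both sides of the target position.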

This project builds on the work of [2] and uses a large corpus to train xlnet_zh_Large, a 24-layer Chinese model with more than 300 million parameters.

Training Corpus & Training Details

The training data covers news, interactive discussions, and encyclopedias: more than 30 GB of raw text, or nearly 10 billion Chinese characters. This is the same training data that the RoBERTa_zh project used to pre-train the Chinese RoBERTa model.

The model was obtained after 2 days of training on a Google TPU v3-256, which consists of 32 v3-8 machines with 128 GB of device memory each. It was trained for 200,000 steps with a sequence length (sequence_length) of 512 and a batch size (batch_size) of 512.
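
The figures above can be combined into a rough picture of the training scale. The following is only back-of-the-envelope arithmetic on the numbers quoted in this section; the variable names are illustrative, and the count is of token positions processed, not of unique tokens in the corpus.

```python
# Figures quoted in the training details above.
num_hosts = 32             # v3-8 machines in the v3-256 slice
cores_per_host = 8         # cores per v3-8 machine
memory_per_host_gb = 128   # device memory per v3-8 machine
train_steps = 200_000
seq_len = 512
batch_size = 512

total_cores = num_hosts * cores_per_host              # cores in the slice
total_memory_gb = num_hosts * memory_per_host_gb      # memory across the slice
sequences_per_core = batch_size // total_cores        # sequences per core per step
tokens_per_step = batch_size * seq_len                # token positions per step
total_token_positions = train_steps * tokens_per_step

print("cores:", total_cores)                          # 256
print("memory (GB):", total_memory_gb)                # 4096
print("sequences per core:", sequences_per_core)      # 2
print("token positions per step:", tokens_per_step)   # 262144
print("token positions overall:", total_token_positions)  # 52428800000
```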

Notices

xlnet_zh_Large has not been fully tested. It may perform extremely well on your tasks, or it may perform poorly on some of them. We expect both good news and bad news; so far, on the sentence pair task (LCQMC), it is bad news.

Share Your Test Comparison

If you use the Chinese pre-trained model from this project, please tell us how it compares in your tests. You can open a pull request that adds the comparison for your task to README.md, or post it in an issue.

You can also join the Chinese pre-trained model transformers discussion group (QQ: 836811304) and share your test comparison there.

Download Pre-trained XLNet for Chinese Tasks

xlnet_zh_Large (TensorFlow version): available from Baidu Netdisk or Google Drive.

The Adam parameters have not been removed yet; once they are removed, the model will shrink to about 1.3 GB.

xlnet_zh_Large_L-24_H-1024_A-16.zip
  |- xlnet_model.ckpt    # model weights
  |- xlnet_model.index   # model index information
  |- xlnet_model.meta    # model meta information
  |- xlnet_config.json   # configuration file
  |- spiece.model        # vocabulary
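
After unpacking the archive, it can be useful to confirm that nothing is missing before starting a long job. The following is a minimal sketch, assuming the files are extracted into a single directory under exactly the names shown in the listing above. The helper `missing_files` and the directory name are hypothetical, and real TensorFlow checkpoints may carry extra suffixes, in which case the names need adjusting.

```python
from pathlib import Path

# File names copied from the archive listing above.
EXPECTED_FILES = (
    "xlnet_model.ckpt",
    "xlnet_model.index",
    "xlnet_model.meta",
    "xlnet_config.json",
    "spiece.model",
)


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Hypothetical extraction directory; replace it with your own path.
print(missing_files("xlnet_zh_Large_L-24_H-1024_A-16"))
```

An empty list means every file from the listing was found.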

The PyTorch version can be obtained by converting the checkpoint with a command like the following; see the pytorch_transformers project for details:

python -u -m pytorch_transformers.convert_tf_checkpoint_to_pytorch \
    --tf_checkpoint_path XLNet-zh-Large-PyTorch/ \
    --bert_config_file XLNet-zh-Large-PyTorch/config.json \
    --pytorch_dump_path XLNet-zh-Large-PyTorch/xlnet_zh_large_pytorch_model.bin

How can a model keep predicting left to right (like a traditional language model) while also using information from the context that follows?

1.input_list:   [1, 2, 3, 4, 5, 6]
2.sampled_list: [2, 4, 6, 5, 3, 1]
3.array_2d:
 [[0. 1. 1. 1. 1. 1.]
  [0. 0. 0. 0. 0. 0.]
  [0. 1. 0. 1. 1. 1.]
  [0. 1. 0. 0. 0. 0.]
  [0. 1. 0. 1. 0. 1.]
  [0. 1. 0. 1. 0. 0.]]

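import random

import numpy as np


def xlnet_mask(input_list):
    """
    Take a list (e.g. [x1, x2, x3, x4]), sample a new ordering of it
    (e.g. [x3, x2, x4, x1]), and return a visibility matrix.
    The current word xi may only see the words that come before it in the new order.
    That is, for the order [x3, x2, x4, x1]:
        x2 can see x3
        x4 can see x3, x2
        x1 can see x3, x2, x4
        x3 can see nothing
    In the matrix, 1 means visible and 0 means not visible.
    :param input_list: list of 1-based positions, e.g. [1, 2, 3, 4]
    :return: matrix in which row i marks what element i+1 can see
    e.g.
    [[0,1,1,1],  # x1
     [0,0,1,0],  # x2
     [0,0,0,0],  # x3
     [0,1,1,0]]  # x4
    """
    print("1.input_list:", input_list)
    # Sample a shuffled order without modifying the caller's list.
    sampled_list = random.sample(input_list, len(input_list))
    print("2.sampled_list:", sampled_list)
    num_size = len(input_list)

    array_2d = np.zeros((num_size, num_size))
    for index, current_element in enumerate(sampled_list):
        # Words that come before the current element in the sampled order.
        previous_element_list = sampled_list[0:index]
        for previous_element in previous_element_list:
            # Elements are 1-based, so subtract 1 to index the matrix.
            array_2d[current_element - 1][previous_element - 1] = 1

    print("3.array_2d:\n", array_2d)
    return array_2d


input_list = [1, 2, 3, 4, 5, 6]
array_2d = xlnet_mask(input_list)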
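
Because the function above samples a random order, its output changes from run to run. To reproduce one specific case, such as the printed example with the order [2, 4, 6, 5, 3, 1], the same matrix can be built from an explicit order. The following is a minimal sketch in plain Python; the helper name `permutation_mask` is hypothetical, and it uses 0-based positions, so the printed order becomes [1, 3, 5, 4, 2, 0].

```python
def permutation_mask(order):
    """Build a visibility matrix from an explicit factorization order.

    order is a permutation of the positions 0..n-1. Entry [i][j] is 1 when
    position j comes before position i in that order, and 0 otherwise.
    """
    n = len(order)
    rank = [0] * n
    for step, position in enumerate(order):
        rank[position] = step  # when each position is predicted
    return [[1 if rank[j] < rank[i] else 0 for j in range(n)] for i in range(n)]


# The sampled order [2, 4, 6, 5, 3, 1] from the example, shifted to 0-based.
mask = permutation_mask([1, 3, 5, 4, 2, 0])
for row in mask:
    print(row)
# [0, 1, 1, 1, 1, 1]
# [0, 0, 0, 0, 0, 0]
# [0, 1, 0, 1, 1, 1]
# [0, 1, 0, 0, 0, 0]
# [0, 1, 0, 1, 0, 1]
# [0, 1, 0, 1, 0, 0]
```

Each row sum equals the number of positions predicted earlier in the order, so the position predicted first has an all-zero row and the one predicted last sees every other position.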

Performance Testing and Comparison

Please report your results so that they can be added here.

Any dataset or task is welcome, including XNLI, LCQMC, the reading comprehension dataset CMRC, and CCF-Sentiment-Analysis.

Model loading (using the sentence pair matching task, LCQMC, as an example)

Pre-training

1. Generate tfrecords:

SAVE_DIR=gs://xlnet_zh/tf_records_xlnet
INPUT=gs://raw_text/data_2019_raw/*.txt
nohup python -u data_utils.py \
    --bsz_per_host=32 \
    --num_core_per_host=8 \
    --seq_len=512 \
    --reuse_len=256 \
    --input_glob=${INPUT} \
    --save_dir=${SAVE_DIR} \
    --num_passes=20 \
    --bi_data=True \
    --sp_path=spiece.model \
    --mask_alpha=6 \
    --mask_beta=1 \
    --num_predict=85 \
    --uncased=False \
    --num_task=200 \
    --task=1 &

The first step assumes that you already have a vocabulary (this project's vocabulary is located at src/spiece.model). If you need to generate your own vocabulary, see below. For more information, refer to SentencePiece.

Generate vocabulary:

spm_train \
    --input=gs://raw_text/data_2019_raw/*.txt \
    --model_prefix=sp10m.cased.v3 \
    --vocab_size=32000 \
    --character_coverage=0.99995 \
    --model_type=unigram \
    --control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
    --user_defined_symbols='<eop>,.,(,),",-,–,£,€' \
    --shuffle_input_sentence \
    --input_sentence_size=200000000

2. Train the model:

DATA=gs://xlnet_zh/tf_records_xlnet/tfrecords/
MODEL_DIR=gs://xlnet_zh/xlnet_zh_large
TPU_NAME=xlnet-zh-large-v3-256
TPU_ZONE=europe-west4-a
nohup python train.py \
    --record_info_dir=$DATA \
    --model_dir=$MODEL_DIR \
    --train_batch_size=512 \
    --num_hosts=32 \
    --num_core_per_host=8 \
    --seq_len=512 \
    --reuse_len=256 \
    --mem_len=384 \
    --perm_size=256 \
    --n_layer=24 \
    --d_model=1024 \
    --d_embed=1024 \
    --n_head=16 \
    --d_head=64 \
    --d_inner=4096 \
    --untie_r=True \
    --mask_alpha=6 \
    --mask_beta=1 \
    --num_predict=85 \
    --uncased=False \
    --train_steps=200000 \
    --save_steps=3000 \
    --warmup_steps=10000 \
    --max_save=30 \
    --weight_decay=0.01 \
    --adam_epsilon=1e-6 \
    --learning_rate=1e-5 \
    --dropout=0.1 \
    --dropatt=0.1 \
    --tpu=$TPU_NAME \
    --tpu_zone=$TPU_ZONE \
    --use_tpu=True \
    --track_mean=True &
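
With this many flags, a quick consistency check before launching can catch typos. The following is a sketch under stated assumptions: the flag values are copied from the command above, but the checks themselves are common Transformer conventions chosen for illustration, not validations that train.py is known to perform, and the helper `check_flags` is hypothetical.

```python
# Values copied from the train.py flags above.
flags = {
    "train_batch_size": 512,
    "num_hosts": 32,
    "num_core_per_host": 8,
    "seq_len": 512,
    "reuse_len": 256,
    "d_model": 1024,
    "n_head": 16,
    "d_head": 64,
    "train_steps": 200000,
    "save_steps": 3000,
    "warmup_steps": 10000,
}


def check_flags(f):
    """Return a dict of named sanity checks (assumed conventions, see lead-in)."""
    total_cores = f["num_hosts"] * f["num_core_per_host"]
    return {
        # The attention heads together span the model width.
        "heads_fill_model_width": f["n_head"] * f["d_head"] == f["d_model"],
        # The global batch splits evenly across all cores.
        "batch_divides_over_cores": f["train_batch_size"] % total_cores == 0,
        # The reused segment is no longer than the sequence itself.
        "reuse_len_within_seq_len": f["reuse_len"] <= f["seq_len"],
        # Warmup ends before training does.
        "warmup_shorter_than_training": f["warmup_steps"] < f["train_steps"],
    }


results = check_flags(flags)
checkpoints_written = flags["train_steps"] // flags["save_steps"]

for name, ok in results.items():
    print(name, "OK" if ok else "FAILED")
print("checkpoints written during training:", checkpoints_written)  # 66
```

All four checks pass for the values above.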

Fine-tuning (using the LCQMC task as an example)

XLNET_DIR=gs://xlnet_zh/xlnet_zh_large
MODEL_DIR=gs://xlnet_zh/fine_tuning_test/lcqmc_01
DATA_DIR=gs://xlnet_zh/fine_tuning_test/lcqmc_01/lcqmc_tfrecords
RAW_DIR=gs://roberta_zh/compare_model_performance/lcqmc
TPU_NAME=grpc://03.06.08.09:8470
TPU_ZONE=us-central1-a
nohup python -u run_classifier.py \
    --spiece_model_file=./spiece.model \
    --model_config_path=${XLNET_DIR}/config.json \
    --init_checkpoint=${XLNET_DIR}/model.ckpt-192000 \
    --task_name=lcqmc \
    --do_train=True \
    --do_eval=True \
    --eval_all_ckpt=True \
    --uncased=False \
    --data_dir=${RAW_DIR} \
    --output_dir=${DATA_DIR} \
    --model_dir=${MODEL_DIR} \
    --train_batch_size=128 \
    --eval_batch_size=8 \
    --num_hosts=1 \
    --num_core_per_host=8 \
    --num_train_epochs=3 \
    --max_seq_length=128 \
    --learning_rate=2e-5 \
    --save_steps=1000 \
    --use_tpu=True \
    --tpu=${TPU_NAME} \
    --tpu_zone=${TPU_ZONE} >> xlnet_large_lcqmc_1.out &

Note: the TPU_NAME value above is a placeholder; replace the IP address with that of your own TPU.