XLNet for Chinese, TensorFlow & PyTorch
XLNet is a new pre-training model proposed by CMU and Google Brain in June 2019. Outperforms Bert in multiple tasks. It is in the form of retaining the autoregressive language model (Autoregressive Language Modeling).
Combining the advantages of Autoencoding Language Modeling, the Permutation Language Modeling is proposed. And it is based on Transformer-XL,
Has better ability to handle long text.
This project refers to the work of [2] and combines massive data to train a 24-layer Chinese xlnet_zh _Large model with more than 300 million parameters.
Training data, including news, interactive discussions, encyclopedias, more than 30G original text, nearly 10 billion Chinese characters; This project uses the same training data as the RoBERTa_zh project for pre-training the Chinese RoBERTa model.
Obtained after 2 days of training using Google TPU v3-256; including 32 v3-8 machines, each v3-8 machine contains 128G of video memory; trained for 200,000 steps, using sequence length (sequence_length) 512, batch (batch_size) is 512.
xlnet_zh _Large has not been fully tested. It may perform extremely well in your tasks, or it may perform poorly in some tasks. We expected there to be both good news and bad news; but currently in the sentence pair task (LCQMC task) it is bad news.
If you use the Chinese pre-training model of this project, please tell us your test comparison effect: you can directly make a pull request and add the test comparison in your task to README.md, or post it in an issue;
You can also join the Chinese pre-training model transformers discussion group (QQ: 836811304) and inform us of the test comparison.
xlnet_zh _Large, Baidu Netdisk, or Google drive, TensorFlow version
暂时没有去掉adam参数,去掉后模型会变成1.3G左右。
xlnet_zh _Large_L-24_H-1024_A-16.zip
|- xlnet_model.ckpt # 模型权重
|- xlnet_model.index # 模型meta信息
|- xlnet_model.meta # 模型index新
|- xlnet_config.json: # 配置文件
|- spiece.model: # 词汇表
PyTorch version can be converted using similar naming, specifically create the pytorch_transformers project:
python -u -m pytorch_transformers.convert_tf_checkpoint_to_pytorch --tf_checkpoint_path XLNet-zh-Large-PyTorch/ --bert_config_file XLNet-zh-Large-PyTorch/config.json --pytorch_dump_path XLNet-zh-Large-PyTorch/ xlnet_zh _large_pytorch_model.bin
1.input_list: [1, 2, 3, 4, 5, 6]
2.sampled_list: [2, 4, 6, 5, 3, 1]
3.array_2d:
[[0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0.]
[0. 1. 0. 1. 1. 1.]
[0. 1. 0. 0. 0. 0.]
[0. 1. 0. 1. 0. 1.]
[0. 1. 0. 1. 0. 0.]]
import numpy as np
import random
def xlnet_mask(input_list):
"""
输入一个列表(如:[x1,x2,x3,x4]),采样到一个新的组合(如:[x3,x2,x4,x1])返回一个矩阵
要实现的是让当前单词Xi只能看到这个新顺序中自己前面的单词
即:对于序列[x3,x2,x4,x1]
x2能看到x3;
x4能看到x3,x2
x1能看到x3,x2,x4
x3什么也看不到
看到在程序里,是1,看不到是0.
:param input_list:
:return: matrix
e.g
[[0,1,1,1], # x1
[0,0,1,0], # x2
[0,0,0,0], # x3
[0,1,1,0]] # x4
"""
print("1.input_list:",input_list)
random.shuffle(input_list) # 打乱循序
sampled_list=input_list
print("2.sampled_list:",sampled_list)
num_size=len(input_list)
array_2d=np.zeros((num_size,num_size))
for index,current_element in enumerate(sampled_list):
previous_element_list=sampled_list[0:index] # 被采样的组合中当前元素中自己前面的单词
for previous_element in previous_element_list:
array_2d[current_element-1][previous_element-1]=1
print("3.array_2d:n",array_2d)
return array_2d
input_list=[1,2,3,4,5,6]
array_2d=xlnet_mask(input_list)
Please report and add it.
There is no limit to data sets or tasks, including XNLI, LCQMC, reading comprehension data set CMRC, CCF-Sentiment-Analysis, etc.
1. Generate tfrecords:
SAVE_DIR=gs:// xlnet_zh /tf_records_xlnet
INPUT=gs://raw_text/data_2019_raw/*.txt
nohup python -u data_utils.py
--bsz_per_host=32
--num_core_per_host=8
--seq_len=512
--reuse_len=256
--input_glob=${INPUT}
--save_dir=${SAVE_DIR}
--num_passes=20
--bi_data=True
--sp_path=spiece.model
--mask_alpha=6
--mask_beta=1
--num_predict=85
--uncased=False
--num_task=200
--task=1 &
The first step assumes that you already have a vocabulary (the vocabulary in this project is located in src/spiece.model); if you need to create and generate your own vocabulary, see below. For more information, refer to: SentencePiece
Generate vocabulary: spm_train
--input=gs://raw_text/data_2019_raw/*.txt
--model_prefix=sp10m.cased.v3
--vocab_size=32000
--character_coverage=0.99995
--model_type=unigram
--control_symbols=<cls>,<sep>,<pad>,<mask>,<eod>
--user_defined_symbols=<eop>,.,(,),",-,–,£,€
--shuffle_input_sentence
--input_sentence_size=200000000
2. Training model:
DATA=gs:// xlnet_zh /tf_records_xlnet/tfrecords/
MODEL_DIR=gs:// xlnet_zh / xlnet_zh _large
TPU_NAME=xlnet-zh-large-v3-256
TPU_ZONE=europe-west4-a
nohup python train.py
--record_info_dir=$DATA
--model_dir=$MODEL_DIR
--train_batch_size=512
--num_hosts=32
--num_core_per_host=8
--seq_len=512
--reuse_len=256
--mem_len=384
--perm_size=256
--n_layer=24
--d_model=1024
--d_embed=1024
--n_head=16
--d_head=64
--d_inner=4096
--untie_r=True
--mask_alpha=6
--mask_beta=1
--num_predict=85
--uncased=False
--train_steps=200000
--save_steps=3000
--warmup_steps=10000
--max_save=30
--weight_decay=0.01
--adam_epsilon=1e-6
--learning_rate=1e-5
--dropout=0.1
--dropatt=0.1
--tpu=$TPU_NAME
--tpu_zone=$TPU_ZONE
--use_tpu=True
--track_mean=True &
XLNET_DIR=gs:// xlnet_zh / xlnet_zh _large
MODEL_DIR=gs:// xlnet_zh /fine_tuning_test/lcqmc_01
DATA_DIR=gs:// xlnet_zh /fine_tuning_test/lcqmc_01/lcqmc_tfrecords
RAW_DIR=gs://roberta_zh/compare_model_performance/lcqmc
TPU_NAME=grpc://03.06.08.09:8470
TPU_ZONE=us-central1-a
nohup python -u run_classifier.py
--spiece_model_file=./spiece.model
--model_config_path=${XLNET_DIR}/config.json
--init_checkpoint=${XLNET_DIR}/model.ckpt-192000
--task_name=lcqmc
--do_train=True
--do_eval=True
--eval_all_ckpt=True
--uncased=False
--data_dir=${RAW_DIR}
--output_dir=${DATA_DIR}
--model_dir=${MODEL_DIR}
--train_batch_size=128
--eval_batch_size=8
--num_hosts=1
--num_core_per_host=8
--num_train_epochs=3
--max_seq_length=128
--learning_rate=2e-5
--save_steps=1000
--use_tpu=True
--tpu=${TPU_NAME}
--tpu_zone=${TPU_ZONE} >> xlnet_large_lcqmc_1.out &
注: TPU_NAME is dummy, you should change IP to real one
[1] XLNet: Generalized Autoregressive Pretraining for Language Understanding
[2] Chinese-PreTrained-XLNet
[3] XLNet: Operating mechanism and comparison of similarities and differences with Bert