keras bert
1.0.0
An implementation of BERT. Official pre-trained models can be loaded for feature extraction and prediction.
pip install keras-bert
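In the feature extraction demo, you should be able to get the same extraction results as the official model `chinese_L-12_H-768_A-12`. In the prediction demo, the missing word in a sentence can be predicted.
The extraction demo shows how to convert the model to one that runs on TPU.
The classification demo shows how to apply the model to simple classification tasks.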
The `Tokenizer` class is used for splitting texts and generating indices:
from keras_bert import Tokenizer

token_dict = {
    '[CLS]': 0,
    '[SEP]': 1,
    'un': 2,
    '##aff': 3,
    '##able': 4,
    '[UNK]': 5,
}
tokenizer = Tokenizer(token_dict)
print(tokenizer.tokenize('unaffable'))  # The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]']`
indices, segments = tokenizer.encode('unaffable')
print(indices)  # Should be `[0, 2, 3, 4, 1]`
print(segments)  # Should be `[0, 0, 0, 0, 0]`

print(tokenizer.tokenize(first='unaffable', second='钢'))
# The result should be `['[CLS]', 'un', '##aff', '##able', '[SEP]', '钢', '[SEP]']`
indices, segments = tokenizer.encode(first='unaffable', second='钢', max_len=10)
print(indices)  # Should be `[0, 2, 3, 4, 1, 5, 1, 0, 0, 0]`
print(segments)  # Should be `[0, 0, 0, 0, 0, 1, 1, 0, 0, 0]`
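To make the expected values above easier to follow, here is a minimal pure-Python sketch of greedy longest-match WordPiece splitting plus pair encoding. It is not keras-bert's implementation: the helpers `wordpiece` and `encode_pair` are hypothetical names introduced for illustration, and the sketch assumes whitespace-separated words, padding index 0, and no truncation.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def encode_pair(first, second, vocab, max_len):
    """Hypothetical helper: padded indices and segment ids for a sentence pair."""
    first_tokens = ['[CLS]'] + [p for w in first.split() for p in wordpiece(w, vocab)] + ['[SEP]']
    second_tokens = [p for w in second.split() for p in wordpiece(w, vocab)] + ['[SEP]']
    indices = [vocab[t] for t in first_tokens + second_tokens]
    segments = [0] * len(first_tokens) + [1] * len(second_tokens)
    pad = max_len - len(indices)  # assumed non-negative; truncation is left out of this sketch
    return indices + [0] * pad, segments + [0] * pad


vocab = {'[CLS]': 0, '[SEP]': 1, 'un': 2, '##aff': 3, '##able': 4, '[UNK]': 5}
pieces = wordpiece('unaffable', vocab)
indices, segments = encode_pair('unaffable', '钢', vocab, max_len=10)
print(pieces)
print(indices)
print(segments)
```

Under these assumptions, `钢` is absent from the toy vocabulary, so it maps to the `[UNK]` index 5, and the padded positions receive index 0 and segment 0.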
To train a model and use it:
from tensorflow import keras
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

# A toy input example
sentence_pairs = [
    [['all', 'work', 'and', 'no', 'play'], ['makes', 'jack', 'a', 'dull', 'boy']],
    [['from', 'the', 'day', 'forth'], ['my', 'arm', 'changed']],
    [['and', 'a', 'voice', 'echoed'], ['power', 'give', 'me', 'more', 'power']],
]

# Build token dictionary
token_dict = get_base_dict()  # A dict that contains some special tokens
for pairs in sentence_pairs:
    for token in pairs[0] + pairs[1]:
        if token not in token_dict:
            token_dict[token] = len(token_dict)
token_list = list(token_dict.keys())  # Used for selecting a random word

# Build & train the model
model = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
)
compile_model(model)
model.summary()

def _generator():
    # Yield an endless stream of randomly masked training batches
    while True:
        yield gen_batch_inputs(
            sentence_pairs,
            token_dict,
            token_list,
            seq_len=20,
            mask_rate=0.3,
            swap_sentence_rate=1.0,
        )

# Note: on TensorFlow 2.1 and later, `model.fit` accepts generators directly
# and `fit_generator` is deprecated.
model.fit_generator(
    generator=_generator(),
    steps_per_epoch=1000,
    epochs=100,
    validation_data=_generator(),
    validation_steps=100,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    ],
)

# Use the trained model
inputs, output_layer = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
    training=False,  # The input layers and output layer will be returned if `training` is `False`
    trainable=False,  # Whether the model is trainable. The default value is the same as `training`
    output_layer_num=4,  # The number of layers whose outputs will be concatenated as a single output.
                         # Only available when `training` is `False`.
)
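The `mask_rate` argument passed to `gen_batch_inputs` in the training code controls how many tokens are hidden for the masked-language-model task. The sketch below illustrates the basic idea only. It is a simplification, not keras-bert's code: the function `mask_tokens` is a hypothetical name, and it always substitutes `[MASK]`, whereas BERT-style training also sometimes keeps the original token or swaps in a random one.

```python
import random


def mask_tokens(tokens, mask_rate, rng, special=('[CLS]', '[SEP]')):
    """Replace each non-special token with '[MASK]' with probability `mask_rate`.

    Returns the masked sequence and, per position, the original token to
    predict (or None where nothing was masked).
    """
    masked, targets = [], []
    for token in tokens:
        if token not in special and rng.random() < mask_rate:
            masked.append('[MASK]')
            targets.append(token)
        else:
            masked.append(token)
            targets.append(None)
    return masked, targets


rng = random.Random(0)  # seeded so repeated runs give the same masking
tokens = ['[CLS]', 'all', 'work', 'and', 'no', 'play', '[SEP]',
          'makes', 'jack', 'a', 'dull', 'boy', '[SEP]']
masked, targets = mask_tokens(tokens, mask_rate=0.3, rng=rng)
print(masked)
print(targets)
```

The model is then trained to recover the entries of `targets` that are not `None` from the masked sequence.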
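The `AdamWarmup` optimizer is provided for warmup and decay. The learning rate will reach `lr` in `warmup_steps` steps, and decay to `min_lr` in `decay_steps` steps. There is a helper function `calc_train_steps` for calculating these two step counts: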
import numpy as np
from keras_bert import AdamWarmup, calc_train_steps

train_x = np.random.standard_normal((1024, 100))

total_steps, warmup_steps = calc_train_steps(
    num_example=train_x.shape[0],
    batch_size=32,
    epochs=10,
    warmup_proportion=0.1,
)

optimizer = AdamWarmup(total_steps, warmup_steps, lr=1e-3, min_lr=1e-5)
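The arithmetic behind these step counts, and the shape of a warmup-then-decay schedule, can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: `train_steps` and `scheduled_lr` are hypothetical names, steps per epoch are assumed to be rounded up, and both the warmup and the decay are assumed to be linear. The exact formula used by `AdamWarmup` may differ.

```python
import math


def train_steps(num_example, batch_size, epochs, warmup_proportion=0.1):
    """Hypothetical stand-in: total optimizer steps and the warmup share of them."""
    steps_per_epoch = math.ceil(num_example / batch_size)
    total = steps_per_epoch * epochs
    return total, int(total * warmup_proportion)


def scheduled_lr(step, decay_steps, warmup_steps, lr=1e-3, min_lr=1e-5):
    """Assumed schedule: linear warmup to `lr`, then linear decay to `min_lr`."""
    if warmup_steps > 0 and step <= warmup_steps:
        return lr * step / warmup_steps
    if step >= decay_steps:
        return min_lr
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    return lr - (lr - min_lr) * progress


total_steps, warmup_steps = train_steps(1024, 32, 10)
print(total_steps, warmup_steps)
for step in (0, warmup_steps, (warmup_steps + total_steps) // 2, total_steps):
    print(step, scheduled_lr(step, total_steps, warmup_steps))
```

With 1024 examples, a batch size of 32, and 10 epochs, this gives 32 steps per epoch, 320 steps in total, and 32 warmup steps.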
Several download URLs have been added. You can get the download and extraction paths of a checkpoint with:
from keras_bert import get_pretrained, PretrainedList, get_checkpoint_paths

model_path = get_pretrained(PretrainedList.multi_cased_base)
paths = get_checkpoint_paths(model_path)
print(paths.config, paths.checkpoint, paths.vocab)
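If you need the features of tokens or sentences (without further tuning), you can use the helper function `extract_embeddings`. To extract the features of all tokens: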
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = ['all work and no play', 'makes jack a dull boy~']

embeddings = extract_embeddings(model_path, texts)
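The returned result is a list with the same length as `texts`. Each item in the list is a numpy array truncated to the length of the input. The shapes of the outputs in this example are `(7, 768)` and `(8, 768)`.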
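The truncation can be pictured with a small numpy sketch. The padded array below is only a stand-in for model output, and the token counts are assumptions: each word is taken to be a single vocabulary entry, `~` is taken to be its own token, and `[CLS]` and `[SEP]` add two more, which gives 7 and 8.

```python
import numpy as np

hidden = 768
lengths = [7, 8]  # assumed token counts, including [CLS] and [SEP]

# Stand-in for a padded batch of model outputs: (batch, max_len, hidden)
padded = np.zeros((len(lengths), max(lengths), hidden))

# Keep only the rows that correspond to real tokens of each text
embeddings = [padded[i, :n] for i, n in enumerate(lengths)]
shapes = [e.shape for e in embeddings]
print(shapes)
```

Because the items have different lengths, the result is a list of arrays rather than one stacked array.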
When the inputs are paired sentences and you need the outputs of `NSP` and max-pooling of the last 4 layers:
from keras_bert import extract_embeddings, POOL_NSP, POOL_MAX

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'
texts = [
    ('all work and no play', 'makes jack a dull boy'),
    ('makes jack a dull boy', 'all work and no play'),
]

embeddings = extract_embeddings(model_path, texts, output_layer_num=4, poolings=[POOL_NSP, POOL_MAX])
There are no token features in the results. The outputs of `NSP` and max-pooling are concatenated, giving a final shape of `(768 x 4 x 2,)`.
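The shape arithmetic can be checked with a numpy sketch. It uses random numbers in place of real model output, and it assumes that `NSP` pooling takes the vector at the first (`[CLS]`) position while max-pooling takes the element-wise maximum over all tokens. The sequence length of 12 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, layer_num, seq_len = 768, 4, 12

# Stand-in for the concatenated outputs of the last 4 layers: (tokens, 768 * 4)
token_features = rng.standard_normal((seq_len, hidden * layer_num))

nsp_pooled = token_features[0]           # assumed: vector at the [CLS] position
max_pooled = token_features.max(axis=0)  # element-wise maximum over tokens

sentence_vector = np.concatenate([nsp_pooled, max_pooled])
print(sentence_vector.shape)  # (6144,), i.e. 768 * 4 * 2
```

Each pooling collapses the token axis, which is why no per-token features remain in the result.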
The second argument of the helper function can be a generator. To extract features from a file:
import codecs
from keras_bert import extract_embeddings

model_path = 'xxx/yyy/uncased_L-12_H-768_A-12'

with codecs.open('xxx.txt', 'r', 'utf8') as reader:
    texts = map(lambda x: x.strip(), reader)
    embeddings = extract_embeddings(model_path, texts)
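Note that `map` is lazy, so the lines are only read when the consumer iterates over them. That is why the extraction call has to stay inside the `with` block. The sketch below demonstrates this with a temporary file; `count_texts` is a hypothetical stand-in for any consumer of the generator, such as `extract_embeddings`.

```python
import os
import tempfile


def count_texts(texts):
    """Hypothetical consumer: iterates over the texts once and counts them."""
    return sum(1 for _ in texts)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'texts.txt')
    with open(path, 'w', encoding='utf8') as writer:
        writer.write('all work and no play\nmakes jack a dull boy\n')

    # Consumed while the file is still open: works as intended
    with open(path, 'r', encoding='utf8') as reader:
        texts = map(str.strip, reader)
        inside = count_texts(texts)

    # Consumed after the file was closed: reading fails
    with open(path, 'r', encoding='utf8') as reader:
        late_texts = map(str.strip, reader)
    try:
        count_texts(late_texts)
        failed_after_close = False
    except ValueError:
        failed_after_close = True

print(inside, failed_after_close)
```

If the texts are needed after the file is closed, one option is to materialize them first with `list(texts)`, at the cost of holding every line in memory.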