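You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.
If you are working locally, make sure the following software requirements are met: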
python 3.6.9
docker-ce > 19.03.5
docker-API 1.40
nvidia-container-toolkit > 1.3.0-1
nvidia-container-runtime > 3.4.0-1
nvidia-docker2 > 2.5.0-1
nvidia-driver >= 455.23
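As a rough aid for checking a local machine against the list above, the following sketch compares version strings numerically. The thresholds and operators are copied from the list; the function names and the way installed versions are obtained are assumptions for illustration (you would supply the strings yourself, for example from your package manager).

```python
import operator
import re

# Thresholds and comparison operators copied from the requirements list above.
REQUIREMENTS = {
    "docker-ce": (">", "19.03.5"),
    "nvidia-container-toolkit": (">", "1.3.0-1"),
    "nvidia-container-runtime": (">", "3.4.0-1"),
    "nvidia-docker2": (">", "2.5.0-1"),
    "nvidia-driver": (">=", "455.23"),
}

_OPS = {">": operator.gt, ">=": operator.ge}


def parse_version(text):
    """Turn '19.03.5' or '1.3.0-1' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def satisfies(package, installed):
    """Return True if the installed version string meets the listed requirement."""
    op, threshold = REQUIREMENTS[package]
    return _OPS[op](parse_version(installed), parse_version(threshold))


# Hypothetical installed versions, for illustration only.
print(satisfies("docker-ce", "20.10.7"))
print(satisfies("nvidia-driver", "450.80"))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically (where "9" would sort after "19").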
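The SQuAD2.0 dataset combines the 100,000 questions in SQuAD1.1 with more than 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
Below is the format of the SQuAD question-answering dataset: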
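Each title has one or more paragraph entries, and each paragraph entry consists of a context and question-answer entries (qas).
Each question-answer entry has a question, a globally unique id, and a Boolean flag is_impossible that shows whether the question is answerable. If the question is answerable, one answer entry contains the text span and its starting character index in the context. If the question is not answerable, an empty answers list is provided.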
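The rules just described can be checked mechanically. The sketch below walks a SQuAD-style dictionary and reports entries that break them; the function name `validate_squad` and the wording of the messages are illustrative assumptions, not part of any toolkit.

```python
def validate_squad(squad):
    """Return a list of problems found in a SQuAD-style dict (empty list means none found)."""
    problems = []
    seen_ids = set()
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                qid = qa["id"]
                # ids must be globally unique across the whole file
                if qid in seen_ids:
                    problems.append(f"{qid}: duplicate id")
                seen_ids.add(qid)
                answers = qa.get("answers", [])
                if qa.get("is_impossible", False):
                    # unanswerable questions carry an empty answers list
                    if answers:
                        problems.append(f"{qid}: unanswerable but has answers")
                    continue
                if not answers:
                    problems.append(f"{qid}: answerable but has no answers")
                for answer in answers:
                    start, text = answer["answer_start"], answer["text"]
                    # answer_start is a character index into the context
                    if context[start:start + len(text)] != text:
                        problems.append(f"{qid}: answer_start does not point at the text")
    return problems
```

Running a check like this before training can catch off-by-one character offsets, which are easy to introduce when converting data by hand.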
For the QA task, the NVIDIA toolkit accepts data in SQuAD JSON format. If your data is in any other format, be sure to convert it to the SQuAD format shown below.
{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": []
                        }
                    ]
                }
            ]
        }
    ]
}
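If your data is not in this layout yet, a conversion along the following lines may be a starting point. It is a minimal sketch that assumes a hypothetical input: a list of records with `context`, `question`, and an optional `answer` string. The function name, the input layout, the generated ids, and the `version` value are all assumptions for illustration.

```python
import json


def records_to_squad(records, title="converted", version="v2.0"):
    """Convert hypothetical flat records into a SQuAD-style dict.

    Each record is a dict with 'context', 'question' and an optional 'answer'.
    A record whose answer is missing or not found in the context is marked unanswerable.
    """
    paragraphs = []
    for index, record in enumerate(records):
        context = record["context"]
        answer = record.get("answer")
        # character offset of the first occurrence, or -1 if absent
        start = context.find(answer) if answer else -1
        answerable = start >= 0
        qa = {
            "id": str(index),  # must be unique across the file
            "question": record["question"],
            "is_impossible": not answerable,
            "answers": [{"answer_start": start, "text": answer}] if answerable else [],
        }
        paragraphs.append({"context": context, "qas": [qa]})
    return {"version": version, "data": [{"title": title, "paragraphs": paragraphs}]}


# Example usage with made-up records.
records = [
    {"context": "The game was played at Santa Clara, California.",
     "question": "Where was the game played?",
     "answer": "Santa Clara, California"},
    {"context": "The game was played at Santa Clara, California.",
     "question": "Who sang the anthem?"},
]
print(json.dumps(records_to_squad(records), indent=2))
```

Note that `str.find` returns the first occurrence only; if the answer text appears several times in a context, you would need the true offset from your source data.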
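With the toolkit, it takes essentially one command each to run data preprocessing, training, fine-tuning, evaluation, inference, and export. All configuration happens through YAML spec files. Sample spec files are already available for you to use directly or as a reference when creating your own. Through these spec files you can tune many knobs, such as the model, dataset, hyperparameters, and optimizer.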
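To make the idea of "tuning a knob" concrete, the sketch below shows how a dotted key such as `model.language_model.pretrained_model_name` maps onto a nested spec. It uses a plain Python dict as a stand-in for a loaded YAML file; the helper `set_by_path` and the starting values are assumptions for illustration and are not the toolkit's own configuration API.

```python
def set_by_path(config, dotted_key, value):
    """Set config['a']['b']['c'] = value for dotted_key 'a.b.c', creating levels as needed."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config


# Hypothetical spec fragment standing in for a loaded YAML file.
spec = {"model": {"language_model": {"pretrained_model_name": "bert-large-uncased"}}}

set_by_path(spec, "model.language_model.pretrained_model_name", "bert-base-uncased")
set_by_path(spec, "model.tokenizer.tokenizer_name", "bert-base-uncased")
print(spec)
```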
To train a QA model on data in this format, we use the following code:
import pytorch_lightning as pl
# import path as used in NeMo's Question_Answering tutorial; adjust it to your NeMo version
from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel

# `config` (the loaded YAML spec) and WORK_DIR are assumed to be defined earlier in the notebook
# set language model and tokenizer to be used
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"
# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)
trainer.fit(model)
trainer.test(model)
model.save_to(config.model.nemo_path)
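For more details on these parameters, see Question_Answering.ipynb.
The evaluation files (for validation and testing) follow the format above, except that they can provide more than one answer to the same question. The inference file also follows the format above, except that it does not require the answers and is_impossible keys.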
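When a question has several reference answers, a common convention (used by the official SQuAD evaluation script) is to score the prediction against each reference and keep the best score. The sketch below illustrates that convention for exact match and token-level F1; the function names are illustrative, and the metrics the toolkit reports may differ in detail.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # an empty prediction only matches an empty (unanswerable) reference
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """Score against every reference answer and keep the maximum.

    An empty reference list (unanswerable question) is treated as one empty answer.
    """
    return max(metric(prediction, ref) for ref in (references or [""]))


references = ["Santa Clara, California", "Santa Clara"]
print(best_over_references(exact_match, "santa clara", references))
print(best_over_references(f1_score, "Santa Clara", ["Santa Clara, California"]))
```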
from nemo.utils.exp_manager import exp_manager

# Load saved model
model = BERTQAModel.restore_from(config.model.nemo_path)
# evaluate on a single device: the first configured one if a list is given
eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)
config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)
import json

def dump_json(filepath, data):
    with open(filepath, "w") as f:
        json.dump(data, f)

def create_inference_data_format(context, question):
    # wrap one context/question pair in the SQuAD layout (no answers or is_impossible needed)
    squad_data = {"data": [{"title": "inference", "paragraphs": []}], "version": "v2.1"}
    squad_data["data"][0]["paragraphs"].append(
        {
            "context": context,
            "qas": [
                {"id": 0, "question": question},
            ],
        }
    )
    return squad_data

context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."
question = "Which country has the most?"
inference_filepath = "inference.json"
inference_data = create_inference_data_format(context, question)
dump_json(inference_filepath, inference_data)
predictions = model.inference(inference_filepath)
# predictions[1][0][0] is the top-ranked candidate for the question with id 0
question = predictions[1][0][0]["question"]
answer = predictions[1][0][0]["text"]
probability = predictions[1][0][0]["probability"]
print(f"\n> Question: {question}\n> Answer: {answer}\n Probability: {probability}")
100%|██████████| 1/1 [00:00<00:00, 184.29it/s]
100%|██████████| 1/1 [00:00<00:00, 8112.77it/s]
> Question: Which country has the most?
> Answer: Brazil
 Probability: 0.9688649039578262
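The helper `create_inference_data_format` wraps a single question. If you want to ask several questions about the same context in one file, a variant along these lines could work. This is a sketch: the function name is hypothetical, and it simply gives every question its own id within one paragraph entry, following the SQuAD layout described earlier.

```python
import json


def create_multi_question_inference_data(context, questions):
    """Build a SQuAD-style inference dict with one paragraph and one qas entry per question."""
    qas = [{"id": index, "question": question} for index, question in enumerate(questions)]
    return {
        "data": [{"title": "inference", "paragraphs": [{"context": context, "qas": qas}]}],
        "version": "v2.1",
    }


# Example usage with a shortened context.
example_context = "The majority of the forest is contained within Brazil, followed by Peru and Colombia."
example_questions = ["Which country has the most?", "Which country comes second?"]
data = create_multi_question_inference_data(example_context, example_questions)
print(json.dumps(data, indent=2))
```

Because ids must be unique, results for each question can then be looked up by the id assigned here.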
The code in this walkthrough comes from Question-answering-training-final.ipynb, where you can find more information.