!!!您可以在本地运行此笔记本(如果您拥有所有依赖项和 GPU),也可以在 Google Colab 上运行。
如果您要在本地工作,请确保满足以下软件要求:
python 3.6 .9
docker - ce > 19.03 . 5
docker - API 1.40
nvidia - container - toolkit > 1.3 . 0 - 1
nvidia - container - runtime > 3.4 . 0 - 1
nvidia - docker2 > 2.5 . 0 - 1
nvidia - driver >= 455.23
SQuAD2.0 数据集结合了 SQuAD1.1 中的 100,000 个问题和众包工作者对抗性编写的 50,000 多个无法回答的问题,看起来与可回答的问题相似。
以下是小队问答数据集的示例格式:
每个title
都有一个或多个paragraph
条目,每个段落条目由context
和question-answer entries (qas)
组成。
每个问答条目都有一个question
和一个全局唯一的id
布尔标志is_impossible
,显示问题是否可回答:如果问题可回答,则一个answer
条目包含文本范围及其在上下文中的起始字符索引。如果问题无法回答,则提供空answers
列表。
!!!对于 QA 任务,NVIDIA 工具包接受 SQuAD JSON 格式的数据。如果您有任何其他格式的数据,请务必将其转换为 SQuAD 格式,如下所示。
{
"data" : [
{
"title" : "Super_Bowl_50" ,
"paragraphs" : [
{
"context" : "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24 u2013 10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the " golden anniversary " with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as " Super Bowl L " ), so that the logo could prominently feature the Arabic numerals 50." ,
"qas" : [
{
"question" : "Where did Super Bowl 50 take place?" ,
"is_impossible" : "false" ,
"id" : "56be4db0acb8001400a502ee" ,
"answers" : [
{
"answer_start" : "403" ,
"text" : "Santa Clara, California"
}
]
},
{
"question" : "What was the winning score of the Super Bowl 50?" ,
"is_impossible" : "true" ,
"id" : "56be4db0acb8001400a502ez" ,
"answers" : [
]
}
]
}
]
}
]
}
它本质上只是一个命令来运行数据预处理、训练、微调、评估、推理和导出!所有配置都通过 YAML 规范文件进行。您已经可以直接使用示例规范文件或作为参考来创建自己的规范文件。通过这些规范文件,您可以调整许多旋钮,例如模型、数据集、超参数、优化器等。
为了训练 NVIDIA 格式的 QA 模型,我们使用以下命令:
# set language model and tokenizer to be used
config . model . language_model . pretrained_model_name = "bert-base-uncased"
config . model . tokenizer . tokenizer_name = "bert-base-uncased"
# path where model will be saved
config . model . nemo_path = f" { WORK_DIR } /checkpoints/bert_squad_v2_0.nemo"
trainer = pl . Trainer ( ** config . trainer )
model = BERTQAModel ( config . model , trainer = trainer )
trainer . fit ( model )
trainer . test ( model )
model . save_to ( config . model . nemo_path )
有关这些参数的更多详细信息,请参阅 Question_Answering.ipynb
!!!评估文件(用于验证和测试)遵循上述格式,但它可以为同一问题提供多个答案。 !!!推理文件遵循上述格式,但不需要answers
和is_impossible
关键字。
# Load saved model
model = BERTQAModel . restore_from ( config . model . nemo_path )
eval_device = [ config . trainer . devices [ 0 ]] if isinstance ( config . trainer . devices , list ) else 1
model . trainer = pl . Trainer (
devices = eval_device ,
accelerator = config . trainer . accelerator ,
precision = 16 ,
logger = False ,
)
config . exp_manager . create_checkpoint_callback = False
exp_dir = exp_manager ( model . trainer , config . exp_manager )
def dump_json ( filepath , data ):
with open ( filepath , "w" ) as f :
json . dump ( data , f )
def create_inference_data_format ( context , question ):
squad_data = { "data" : [{ "title" : "inference" , "paragraphs" : []}], "version" : "v2.1" }
squad_data [ "data" ][ 0 ][ "paragraphs" ]. append (
{
"context" : context ,
"qas" : [
{ "id" : 0 , "question" : question ,}
],
}
)
return squad_data
context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."
question = "Which country has the most?"
inference_filepath = "inference.json"
inference_data = create_inference_data_format ( context , question )
dump_json ( inference_filepath , inference_data )
predictions = model . inference ( "inference.json" )
question = predictions [ 1 ][ 0 ][ 0 ][ "question" ]
answer = predictions [ 1 ][ 0 ][ 0 ][ "text" ]
probability = predictions [ 1 ][ 0 ][ 0 ][ "probability" ]
print ( f" n > Question: { question } n > Answer: { answer } n Probability: { probability } " )
100 % | ██████████ | 1 / 1 [ 00 : 00 < 00 : 00 , 184.29 it / s ]
100 % | ██████████ | 1 / 1 [ 00 : 00 < 00 : 00 , 8112.77 it / s ]
> Question : Which country has the most ?
> Answer : Brazil
Probability : 0.9688649039578262
在本研究中,它来自 Question-answering-training-final.ipynb。您可以在这里找到更多信息。