Question Answering BERT

Extractive Question Answering with BERT on SQuAD v2.0 (Stanford Question Answering Dataset)

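The main goal of extractive question answering is to find the most relevant and accurate answer to a given question within the provided text passage. In other words, the model extracts the answer directly from the passage rather than generating a new one.
This yields fast and accurate answers.
Note, however, that extractive question answering is limited by the information contained in the provided passage and may be unable to produce novel or creative answers. The approach is used in customer-service chatbots, search engines, voice assistants, and similar applications.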
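To make "extracting rather than generating" concrete, here is a minimal sketch of how an extractive reader might choose an answer span: it scores every (start, end) token pair and returns the best contiguous slice of the passage. The function name `best_span`, the `max_len` limit, and the toy scores are illustrative assumptions, not code from this repository; in practice a fine-tuned BERT model would supply the start and end scores.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices that maximize
    start_scores[s] + end_scores[e], subject to s <= e < s + max_len."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


tokens = ["The", "game", "was", "played", "at", "Santa", "Clara", ",", "California", "."]
# Toy scores standing in for model outputs (hypothetical values).
start_scores = [0.1, 0.0, 0.0, 0.2, 0.3, 4.0, 0.5, 0.0, 0.4, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.1, 0.0, 0.6, 1.0, 0.0, 3.5, 0.2]

start, end = best_span(start_scores, end_scores)
answer_tokens = tokens[start:end + 1]  # the answer is a slice of the passage itself
print(answer_tokens)
```

Because the answer is always a slice of the input tokens, the model cannot return text that is absent from the passage, which is exactly the limitation described above.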

Prerequisites

!!! You can run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

If you are going to work locally, make sure the following software requirements are met:

python 3.6.9
docker-ce > 19.03.5
docker-API 1.40
nvidia-container-toolkit > 1.3.0-1
nvidia-container-runtime > 3.4.0-1
nvidia-docker2 > 2.5.0-1
nvidia-driver >= 455.23
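When checking a local machine against the list above, version strings need to be compared numerically rather than as text. The following is a small sketch of such a comparison; the helpers `parse_version` and `satisfies` and the "installed" versions in the example are hypothetical, and the code does not query docker or the NVIDIA driver itself.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as "19.03.5" or "1.3.0-1" into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def satisfies(installed: str, op: str, required: str) -> bool:
    """Compare two version strings numerically using the operator `op`."""
    a, b = parse_version(installed), parse_version(required)
    if op == ">":
        return a > b
    if op == ">=":
        return a >= b
    if op == "==":
        return a == b
    raise ValueError(f"unsupported operator: {op}")


# (name, installed version, operator, required version); installed values are made up.
checks = [
    ("docker-ce", "20.10.7", ">", "19.03.5"),
    ("nvidia-container-toolkit", "1.3.0-1", ">", "1.3.0-1"),
    ("nvidia-driver", "455.23.05", ">=", "455.23"),
]
for name, installed, op, required in checks:
    status = "ok" if satisfies(installed, op, required) else "too old"
    print(f"{name}: {installed} (needs {op} {required}) -> {status}")
```

Note that a strict `>` requirement is not met by an installed version that is exactly equal to the listed one.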

SQuAD v2.0: Data Format and Conversion

The SQuAD2.0 dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

Below is the sample format of the SQuAD question-answering dataset:

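Each title has one or more paragraph entries, and each paragraph entry consists of a context and question-answer entries (qas).
Each question-answer entry has a question and a globally unique id.
A Boolean flag, is_impossible, shows whether the question is answerable: if it is, one answer entry contains the text span and its starting character index in the context. If it is not, an empty answers list is provided.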

!!! For the QA task, the NVIDIA toolkit accepts data in the SQuAD JSON format. If your data is in any other format, be sure to convert it to the SQuAD format, as shown below.

{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": []
                        }
                    ]
                }
            ]
        }
    ]
}
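For data that starts out in another shape, the conversion can be sketched as follows. This is a minimal example that assumes each record is a passage plus question/answer pairs and that the answer text occurs verbatim in the passage; the function `to_squad` and its input layout are hypothetical, not part of the toolkit. It locates `answer_start` with `str.find`, which only finds the first occurrence.

```python
import json
from typing import Dict, List, Optional


def to_squad(title: str, context: str,
             qa_pairs: List[Dict[str, Optional[str]]]) -> dict:
    """Build a SQuAD v2.0-style dict from one passage and its questions.

    Each item in qa_pairs has "id", "question" and "answer";
    answer=None marks the question as unanswerable.
    """
    qas = []
    for pair in qa_pairs:
        answer = pair["answer"]
        if answer is None:
            answers = []
        else:
            start = context.find(answer)  # character index of the first occurrence
            if start < 0:
                raise ValueError(f"answer {answer!r} not found in context")
            answers = [{"answer_start": start, "text": answer}]
        qas.append({
            "question": pair["question"],
            "id": pair["id"],
            "is_impossible": answer is None,
            "answers": answers,
        })
    return {"data": [{"title": title,
                      "paragraphs": [{"context": context, "qas": qas}]}]}


context = ("The game was played on February 7, 2016, at Levi's Stadium "
           "in Santa Clara, California.")
squad = to_squad(
    "Super_Bowl_50",
    context,
    [
        {"id": "q1", "question": "Where did Super Bowl 50 take place?",
         "answer": "Santa Clara, California"},
        {"id": "q2", "question": "Who performed at halftime?", "answer": None},
    ],
)
print(json.dumps(squad, indent=2))
```

A useful sanity check after conversion is that `context[answer_start:answer_start + len(text)]` equals `text` for every answer.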

Configuration File

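Essentially, a single command is enough to run data preprocessing, training, fine-tuning, evaluation, inference, and export. All configuration happens through YAML spec files. You can use the sample spec files directly or as a reference for creating your own. Through these spec files you can tune many knobs, such as the model, dataset, hyperparameters, and optimizer.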
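As a rough illustration of what tuning these knobs amounts to, the sketch below applies dotted-key overrides to a nested configuration. It uses a plain Python dict in place of a parsed YAML spec so that it runs without extra dependencies. The helper `set_by_path` and the starting values are hypothetical and are not taken from the sample spec files; only the two `bert-base-uncased` keys also appear in this document's training code.

```python
from typing import Any, Dict


def set_by_path(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"]["c"] = value given the dotted key "a.b.c",
    creating intermediate dicts when they are missing."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Stand-in for a spec file that has been parsed from YAML (hypothetical values).
config = {
    "trainer": {"max_epochs": 1},
    "model": {"language_model": {"pretrained_model_name": "bert-large-uncased"}},
}

overrides = {
    "model.language_model.pretrained_model_name": "bert-base-uncased",
    "model.tokenizer.tokenizer_name": "bert-base-uncased",
    "trainer.max_epochs": 2,
}
for key, value in overrides.items():
    set_by_path(config, key, value)

print(config)
```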

Fine-tuning BERT QA on the SQuAD Dataset

To train a QA model on data in the NVIDIA format, we use the following code:

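# Import paths follow the NeMo toolkit layout and are an assumption; adjust them to your installed version.
import pytorch_lightning as pl
from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel

# `config` (the loaded YAML spec) and `WORK_DIR` are assumed to be defined earlier in the notebook.

# set language model and tokenizer to be used
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

# fine-tune on the training set, then evaluate on the test set
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)
trainer.fit(model)
trainer.test(model)

# save the fine-tuned model as a .nemo checkpoint
model.save_to(config.model.nemo_path)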

For more details on these arguments, see Question_Answering.ipynb.

BERT QA Inference

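!!! The evaluation files (for validation and testing) follow the above format, except that they can provide more than one answer to the same question.
!!! The inference file follows the above format, except that it does not require the answers and is_impossible keywords.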
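Since the inference file is the same structure minus the labels, an existing labeled file can be turned into one by dropping those two keys. The sketch below shows one way to do that on an in-memory dict; the function name `to_inference_format` and the sample record are illustrative assumptions.

```python
import copy


def to_inference_format(squad: dict) -> dict:
    """Return a copy of a SQuAD-style dict with "answers" and "is_impossible"
    removed from every question entry; the input is left unchanged."""
    stripped = copy.deepcopy(squad)
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.pop("answers", None)
                qa.pop("is_impossible", None)
    return stripped


# A tiny labeled record in the format described earlier (made-up content).
labeled = {
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "Brazil contains 60% of the rainforest.",
            "qas": [{
                "id": "q1",
                "question": "Which country has the most?",
                "is_impossible": False,
                "answers": [{"answer_start": 0, "text": "Brazil"}],
            }],
        }],
    }]
}

unlabeled = to_inference_format(labeled)
print(unlabeled["data"][0]["paragraphs"][0]["qas"][0])
```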

import json

from nemo.utils.exp_manager import exp_manager  # assumed import path; adjust to your NeMo version

# Load saved model
model = BERTQAModel.restore_from(config.model.nemo_path)

# Run inference on a single device
eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)
config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)

def dump_json(filepath, data):
    """Write data to filepath as JSON."""
    with open(filepath, "w") as f:
        json.dump(data, f)

def create_inference_data_format(context, question):
    """Wrap one context/question pair in the SQuAD-style inference format."""
    squad_data = {"data": [{"title": "inference", "paragraphs": []}], "version": "v2.1"}
    squad_data["data"][0]["paragraphs"].append(
        {
            "context": context,
            "qas": [
                {"id": 0, "question": question},
            ],
        }
    )
    return squad_data

context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."

question = "Which country has the most?"

inference_filepath = "inference.json"

inference_data = create_inference_data_format(context, question)
dump_json(inference_filepath, inference_data)

predictions = model.inference(inference_filepath)
question = predictions[1][0][0]["question"]
answer = predictions[1][0][0]["text"]
probability = predictions[1][0][0]["probability"]

print(f"\n> Question: {question}\n> Answer: {answer}\n Probability: {probability}")

100%|██████████| 1/1 [00:00<00:00, 184.29it/s]
100%|██████████| 1/1 [00:00<00:00, 8112.77it/s]

> Question: Which country has the most?
> Answer: Brazil
 Probability: 0.9688649039578262
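To judge a predicted answer such as the one above against an evaluation file that lists several reference answers, the usual SQuAD-style metrics are exact match and token-level F1, each taken as the best score over the references. The sketch below is a simplified re-implementation under the common normalization (lowercase, strip punctuation and English articles); it is not the toolkit's own scoring code, and the reference answers in the example are made up.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1(prediction: str, reference: str) -> float:
    """Token-level F1 between one prediction and one reference."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the prediction equals any reference after normalization.
    An empty reference list (unanswerable) matches only an empty prediction."""
    if not references:
        return normalize(prediction) == ""
    return any(normalize(prediction) == normalize(r) for r in references)


def best_f1(prediction: str, references: List[str]) -> float:
    """Highest F1 over all references; empty references as in exact_match."""
    if not references:
        return float(normalize(prediction) == "")
    return max(f1(prediction, r) for r in references)


references = ["Brazil", "within Brazil"]  # hypothetical reference answers
print(exact_match("Brazil", references), best_f1("Brazil", references))
print(exact_match("in Brazil", references), best_f1("in Brazil", references))
```

Taking the maximum over references means a prediction is not penalized for matching one annotator's phrasing rather than another's.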