Question Answering BERT

Extractive Question Answering with BERT on SQuAD v2.0 (Stanford Question Answering Dataset)

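The main goal of extractive question answering is to find the most relevant and accurate answer to a given question within a provided text passage. In other words, the model extracts the answer directly from the passage instead of generating a new one.
This yields fast and accurate answers.
Note, however, that extractive question answering is limited to the information contained in the provided passage and cannot produce novel or creative answers. The approach has been applied in customer service chatbots, search engines, voice assistants, and similar systems.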
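
To make "extracting the answer from the passage" concrete, here is a minimal sketch of the usual span-selection step: the model assigns each token a start score and an end score, and the answer is the span with the highest combined score. The function name `best_span`, the `max_answer_len` cap, and the scores are illustrative assumptions, not values or code from this project.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring valid token span."""
    best = (0, 0, float("-inf"))
    for start, s_score in enumerate(start_scores):
        # The end token may not precede the start token or exceed the length cap.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy passage and made-up scores, for illustration only.
tokens = ["the", "game", "was", "played", "in", "santa", "clara"]
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 4.0, 1.0]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.1, 1.5, 5.0]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # santa clara
```

Because the answer is always a slice of the input tokens, the model can never return text that is absent from the passage, which is exactly the limitation described above.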

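Prerequisites

Note: You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

If you work locally, make sure the following software requirements are met: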

python 3.6.9
docker-ce > 19.03.5
docker-API 1.40
nvidia-container-toolkit > 1.3.0-1
nvidia-container-runtime > 3.4.0-1
nvidia-docker2 > 2.5.0-1
nvidia-driver >= 455.23

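SQuAD v2.0 Data Format and Conversion

The SQuAD2.0 dataset combines the 100,000 questions in SQuAD1.1 with more than 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

Here is the sample format of the SQuAD question answering dataset: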

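Each title has one or more paragraph entries, and each paragraph entry consists of a context and a list of question-answer entries (qas).
Each question-answer entry has a question and a globally unique id.
The Boolean flag is_impossible shows whether the question is answerable: if it is, one answer entry contains the answer text span and its starting character index in the context. If it is not, an empty answers list is provided.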
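
The rules above can be checked mechanically. The following is a small sketch, not part of the toolkit: `check_squad` is a hypothetical helper that walks the title / paragraph / qas hierarchy and reports entries that break the invariants just described. It assumes `is_impossible` is a JSON boolean and `answer_start` is a character offset into `context`.

```python
from typing import Dict, List


def check_squad(dataset: Dict) -> List[str]:
    """Return a list of human-readable problems found in a SQuAD-style dict."""
    problems = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                if qa.get("is_impossible", False):
                    # Unanswerable questions must carry an empty answers list.
                    if answers:
                        problems.append(f"{qa['id']}: unanswerable but has answers")
                    continue
                if not answers:
                    problems.append(f"{qa['id']}: answerable but has no answers")
                for ans in answers:
                    # The text must appear in the context at answer_start.
                    start = int(ans["answer_start"])
                    if context[start:start + len(ans["text"])] != ans["text"]:
                        problems.append(f"{qa['id']}: answer_start does not match text")
    return problems


# Tiny made-up dataset, for illustration only.
sample = {
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "The game was played at Santa Clara, California.",
            "qas": [
                {"id": "q1", "question": "Where was the game played?",
                 "is_impossible": False,
                 "answers": [{"answer_start": 23, "text": "Santa Clara, California"}]},
                {"id": "q2", "question": "Who sang at halftime?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }]
}
print(check_squad(sample))  # []
```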

Note: For the QA task, the NVIDIA toolkit accepts data in SQuAD JSON format. If your data is in any other format, be sure to convert it to SQuAD format as shown below.

{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": []
                        }
                    ]
                }
            ]
        }
    ]
}
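
If your data lives in another layout, the conversion can be fairly short. The sketch below assumes a hypothetical flat input, one record per question with `context`, `question`, and `answer` keys (`answer` set to `None` for unanswerable questions). The `to_squad` helper, the generated ids, and the `"version"` value are illustrative assumptions, not part of the toolkit.

```python
import json
from typing import Dict, Iterable, Optional


def to_squad(records: Iterable[Dict[str, Optional[str]]], title: str = "custom") -> Dict:
    """Convert flat (context, question, answer) records into a SQuAD-style dict."""
    paragraphs: Dict[str, Dict] = {}
    for idx, rec in enumerate(records):
        context, answer = rec["context"], rec.get("answer")
        # Questions that share a context are grouped under one paragraph.
        para = paragraphs.setdefault(context, {"context": context, "qas": []})
        if answer is None:
            answers = []
        else:
            # answer_start is the character offset of the answer in the context.
            start = context.find(answer)
            if start < 0:
                raise ValueError(f"answer not found in context: {answer!r}")
            answers = [{"answer_start": start, "text": answer}]
        para["qas"].append({
            "id": f"{title}-{idx}",
            "question": rec["question"],
            "is_impossible": answer is None,
            "answers": answers,
        })
    return {"version": "v2.0",
            "data": [{"title": title, "paragraphs": list(paragraphs.values())}]}


# Made-up records, for illustration only.
rows = [
    {"context": "Brazil holds 60% of the rainforest.",
     "question": "Which country holds the most?", "answer": "Brazil"},
    {"context": "Brazil holds 60% of the rainforest.",
     "question": "How old is the forest?", "answer": None},
]
squad = to_squad(rows)
print(json.dumps(squad, indent=2))
```

Raising an error when the answer text is missing from the context is a deliberate choice here: silently writing a wrong `answer_start` would corrupt the training labels.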

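Configuration File

Essentially, a single command runs data preprocessing, training, fine-tuning, evaluation, inference, and export. All configuration is done through YAML spec files. Sample spec files are already available, which you can use directly or as a reference for creating your own. Through these spec files you can tune many knobs, such as the model, dataset, hyperparameters, and optimizer.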
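
As a rough illustration of what "tuning knobs" means in practice, the sketch below treats a parsed spec file as a nested dictionary and overrides individual values by dotted key. It uses only the standard library. The `set_by_path` helper and the trimmed-down keys (`optim`, `lr`, `max_epochs`) are assumptions made for this example, and the toolkit's own configuration loader is not shown.

```python
from typing import Any, Dict


def set_by_path(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"]["c"] = value given the dotted key "a.b.c"."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(key, {})
    node[leaf] = value


# Hypothetical, heavily trimmed stand-in for a parsed YAML spec file.
config = {
    "model": {
        "language_model": {"pretrained_model_name": "bert-large-uncased"},
        "optim": {"lr": 5e-5},
    },
    "trainer": {"max_epochs": 1},
}

set_by_path(config, "model.language_model.pretrained_model_name", "bert-base-uncased")
set_by_path(config, "trainer.max_epochs", 2)
print(config["model"]["language_model"]["pretrained_model_name"])  # bert-base-uncased
```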

Fine-tuning BERT QA on the SQuAD Dataset

To train a QA model in the NVIDIA format, we use the following commands:

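# import paths as used in recent NeMo 1.x releases (an assumption; adjust for your version)
import pytorch_lightning as pl
from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel

# `config` (the loaded YAML spec) and `WORK_DIR` are assumed to be defined earlier in the notebook

# set language model and tokenizer to be used
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

# build the trainer, then fine-tune and test the model
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)
trainer.fit(model)
trainer.test(model)

# save the fine-tuned model as a .nemo checkpoint
model.save_to(config.model.nemo_path)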

For more details on these parameters, see Question_Answering.ipynb.

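BERT QA Inference

Note: The evaluation files (for validation and testing) follow the format above, except that they can provide more than one answer to the same question. The inference file also follows this format but does not require the answers and is_impossible fields.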
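
Because an evaluation question can carry several reference answers, a prediction is normally scored against each reference and the best score is kept. The sketch below follows the spirit of the standard SQuAD metrics (exact match and token-level F1 after light normalization). It is a simplified illustration with hypothetical helper names, not the toolkit's own evaluation code.

```python
import re
import string
from collections import Counter
from typing import Callable, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty (a correct "no answer") scores 1, otherwise 0.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric: Callable[[str, str], float],
                         prediction: str, golds: List[str]) -> float:
    """Score against every reference answer and keep the best score."""
    # An unanswerable question has no references; compare against the empty string.
    return max(metric(prediction, g) for g in (golds or [""]))


# Made-up references, for illustration only.
golds = ["Santa Clara, California", "Levi's Stadium"]
print(best_over_references(exact_match, "santa clara california", golds))
print(best_over_references(f1, "Santa Clara", golds))
```

With these references, the first call is an exact match after normalization, while the second is a partial overlap with the first reference only.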

import json

# import path as used in recent NeMo 1.x releases (an assumption; adjust for your version)
from nemo.utils.exp_manager import exp_manager

# Load saved model
model = BERTQAModel.restore_from(config.model.nemo_path)

# Run inference on a single device with half precision
eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)
config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)

def dump_json(filepath, data):
    """Write `data` to `filepath` as JSON."""
    with open(filepath, "w") as f:
        json.dump(data, f)

def create_inference_data_format(context, question):
    """Wrap one context/question pair in the SQuAD layout (no answers needed)."""
    squad_data = {"data": [{"title": "inference", "paragraphs": []}], "version": "v2.1"}
    squad_data["data"][0]["paragraphs"].append(
        {
            "context": context,
            "qas": [
                {"id": 0, "question": question},
            ],
        }
    )
    return squad_data

context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."

question = "Which country has the most?"

inference_filepath = "inference.json"

inference_data = create_inference_data_format(context, question)
dump_json(inference_filepath, inference_data)

# Run inference and read back the top prediction for the first question
predictions = model.inference(inference_filepath)
question = predictions[1][0][0]["question"]
answer = predictions[1][0][0]["text"]
probability = predictions[1][0][0]["probability"]

print(f"\n> Question: {question}\n> Answer: {answer}\n Probability: {probability}")

100%|██████████| 1/1 [00:00<00:00, 184.29it/s]
100%|██████████| 1/1 [00:00<00:00, 8112.77it/s]

> Question: Which country has the most?
> Answer: Brazil
 Probability: 0.9688649039578262