Question Answering BERT

Version 1.0.0

Extractive question answering with BERT on SQuAD v2.0 (Stanford Question Answering Dataset)

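The main goal of extractive question answering is to find the most relevant and accurate answer to a given question within a provided passage of text. In other words, the model does not generate a new answer; it extracts the answer directly from the passage.
Because the answer is copied from the passage rather than generated, answers are returned quickly and stay faithful to the source text.
Note, however, that extractive question answering is limited to the information contained in the provided passage and cannot produce novel or creative answers. The approach is used in customer service chatbots, search engines, voice assistants, and similar applications.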
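
To make "extracting a span" concrete: BERT-style extractive models typically score every token as a possible answer start and as a possible answer end, and the answer is the span with the best combined score. The following is a minimal sketch of that selection step only. The function name `best_span`, the `max_len` limit, and the score values are illustrative assumptions, not part of this project's code.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


tokens = ["The", "game", "was", "played", "in", "Santa", "Clara", "."]
# Hypothetical per-token scores, made up for illustration;
# a real model would produce these from the question and passage.
start_scores = [0.1, 0.0, 0.0, 0.2, 0.3, 2.5, 0.4, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.1, 0.2, 0.6, 2.8, 0.1]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # Santa Clara
```

The answer is always a contiguous slice of the passage tokens, which is why an extractive model cannot return text that does not appear in the passage.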

Prerequisites

Note: This notebook can be run locally (if you have all the dependencies and a GPU) or on Google Colab.

If you work locally, check the following software requirements:

python 3.6.9
docker-ce > 19.03.5
docker-API 1.40
nvidia-container-toolkit > 1.3.0-1
nvidia-container-runtime > 3.4.0-1
nvidia-docker2 > 2.5.0-1
nvidia-driver >= 455.23

SQuAD v2.0 data format and conversion

The SQuAD2.0 dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

An example of the SQuAD question answering dataset format is shown below.

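Each title has one or more paragraph entries, each consisting of a context and a list of question-answer entries (qas).
Each question-answer entry has a question and a globally unique id.
It also has a Boolean flag, is_impossible, which indicates whether the question is unanswerable. If the question is answerable, one answer entry contains the text span and its starting character index in the context. If the question is unanswerable, an empty answers list is provided.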

Note: For the QA task, the NVIDIA toolkit accepts data in SQuAD JSON format. If your data is in another format, be sure to convert it to the SQuAD format shown below.

{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": []
                        }
                    ]
                }
            ]
        }
    ]
}
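
If your data is stored as flat (context, question, answer) records, the conversion to this layout can be sketched roughly as follows. This is a minimal illustration, not the toolkit's own converter: the function name `to_squad`, the input record keys, and the generated id scheme are all assumptions made for the example. It also locates the answer with `str.find`, which picks the first occurrence, so a real conversion should carry the true character offset when the answer text appears more than once.

```python
import json
from typing import Dict, List, Optional


def to_squad(records: List[Dict[str, Optional[str]]], title: str = "converted") -> dict:
    """Group flat (context, question, answer) records into SQuAD v2.0-style JSON.

    A record whose "answer" is None is treated as unanswerable.
    """
    paragraphs: Dict[str, dict] = {}
    for i, rec in enumerate(records):
        context = rec["context"]
        # Questions that share a context go into the same paragraph entry.
        para = paragraphs.setdefault(context, {"context": context, "qas": []})
        answer = rec.get("answer")
        answers = []
        if answer is not None:
            start = context.find(answer)  # first occurrence only (see note above)
            if start < 0:
                raise ValueError(f"answer {answer!r} not found in context")
            answers.append({"answer_start": start, "text": answer})
        para["qas"].append({
            "question": rec["question"],
            "is_impossible": answer is None,
            "id": f"{title}-{i}",  # hypothetical id scheme; ids only need to be unique
            "answers": answers,
        })
    return {"data": [{"title": title, "paragraphs": list(paragraphs.values())}]}


context = "The game was played at Levi's Stadium in Santa Clara, California."
records = [
    {"context": context, "question": "Where was the game played?",
     "answer": "Santa Clara, California"},
    {"context": context, "question": "Who sang at halftime?", "answer": None},
]
squad = to_squad(records)
print(json.dumps(squad, indent=2))
```

A quick sanity check after any conversion is that `context[answer_start:answer_start + len(text)]` equals `text` for every answer.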

Configuration files

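Data preprocessing, training, fine-tuning, evaluation, inference, and export each take just one command. All configuration is done through YAML spec files. Sample spec files are already provided; you can use them directly or as a reference for writing your own. Through these spec files you can tune many knobs, such as the model, dataset, hyperparameters, and optimizer.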
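
The idea of starting from a sample spec and changing only a few knobs can be sketched as below. To stay dependency-free, this example uses a plain nested dict in place of a parsed YAML file; the helper `apply_overrides`, the dotted-key convention, and the specific keys `optim.lr` and `max_epochs` are assumptions for illustration and are not taken from the sample spec files.

```python
import copy
from typing import Any, Dict

# Hypothetical defaults standing in for a parsed YAML spec file.
DEFAULT_SPEC: Dict[str, Any] = {
    "model": {
        "language_model": {"pretrained_model_name": "bert-base-uncased"},
        "optim": {"lr": 3e-5},
    },
    "trainer": {"max_epochs": 1, "devices": 1},
}


def apply_overrides(spec: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of spec with dotted-key overrides applied, e.g. "model.optim.lr"."""
    result = copy.deepcopy(spec)  # leave the sample spec untouched
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


spec = apply_overrides(DEFAULT_SPEC, {"model.optim.lr": 1e-5, "trainer.max_epochs": 3})
print(spec["model"]["optim"]["lr"], spec["trainer"]["max_epochs"])
```

Copying before overriding keeps the sample spec reusable as a baseline for other experiments.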

Fine-tuning BERT QA on the SQuAD dataset

To train a QA model in the NVIDIA format, use the following code:

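# `config` and `WORK_DIR` are defined earlier in the notebook.
# Import paths are assumed from the NeMo question answering tutorial; adjust to your version.
import pytorch_lightning as pl
from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel

# set language model and tokenizer to be used
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

# fine-tune, evaluate on the test set, then save the model
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)
trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)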

For details on these arguments, see Question_Answering.ipynb.

BERT QA inference

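Note: Evaluation files (for validation and testing) follow the format above, except that they can provide more than one answer to the same question.
Note: Inference files follow the format above, except that they do not require the answers and is_impossible keywords.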
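
The difference between the two file types can be illustrated with a small sketch that derives an inference file from evaluation-style data by dropping the keys inference does not need. The helper name `to_inference_format` and the sample data are assumptions made for this example.

```python
import copy


def to_inference_format(squad: dict) -> dict:
    """Return a copy of SQuAD-style data without "answers" and "is_impossible"."""
    stripped = copy.deepcopy(squad)  # do not modify the evaluation data
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.pop("answers", None)
                qa.pop("is_impossible", None)
    return stripped


# Evaluation-style entry: one question with two acceptable answers (made-up sample).
eval_data = {
    "data": [{
        "title": "sample",
        "paragraphs": [{
            "context": "The game was played in Santa Clara, California.",
            "qas": [{
                "id": "q1",
                "question": "Where was the game played?",
                "is_impossible": False,
                "answers": [
                    {"answer_start": 23, "text": "Santa Clara"},
                    {"answer_start": 23, "text": "Santa Clara, California"},
                ],
            }],
        }],
    }]
}

inference_data = to_inference_format(eval_data)
print(inference_data["data"][0]["paragraphs"][0]["qas"][0])
```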

# `config` is defined earlier in the notebook.
# Import paths are assumed from the NeMo question answering tutorial; adjust to your version.
import json

import pytorch_lightning as pl
from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel
from nemo.utils.exp_manager import exp_manager

# Load saved model
model = BERTQAModel.restore_from(config.model.nemo_path)

# Run inference on a single device
eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)
config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)


def dump_json(filepath, data):
    """Write data to filepath as JSON."""
    with open(filepath, "w") as f:
        json.dump(data, f)


def create_inference_data_format(context, question):
    """Wrap one context/question pair in the SQuAD inference format."""
    squad_data = {"data": [{"title": "inference", "paragraphs": []}], "version": "v2.1"}
    squad_data["data"][0]["paragraphs"].append(
        {
            "context": context,
            "qas": [
                {"id": 0, "question": question},
            ],
        }
    )
    return squad_data


context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."

question = "Which country has the most?"

inference_filepath = "inference.json"

inference_data = create_inference_data_format(context, question)
dump_json(inference_filepath, inference_data)

# Take the top prediction for the question with id 0
predictions = model.inference(inference_filepath)
question = predictions[1][0][0]["question"]
answer = predictions[1][0][0]["text"]
probability = predictions[1][0][0]["probability"]

print(f"\n> Question: {question}\n> Answer: {answer}\n Probability: {probability}")

100%|██████████| 1/1 [00:00<00:00, 184.29it/s]
100%|██████████| 1/1 [00:00<00:00, 8112.77it/s]

> Question: Which country has the most?
> Answer: Brazil
 Probability: 0.9688649039578262