Question Answering BERT

1.0.0

Extractive Question Answering with BERT on SQuAD v2.0 (Stanford Question Answering Dataset)

The main goal of extractive question answering is to find the most relevant and accurate answer to a given question within a provided passage. In other words, the model extracts the answer directly from the passage instead of generating a new one.
This makes answers fast to produce and keeps them grounded in the source text.
However, extractive question answering is limited to the information contained in the passage and cannot produce novel or creative answers. The approach has been applied in customer service chatbots, search engines, voice assistants, and similar systems.

Table of Contents

Prerequisites
SQuAD v2.0 data format and conversion
Configuration files
Fine-tuning BERT QA on the SQuAD dataset
BERT QA inference

Prerequisites

- You can run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Check the following software requirements if you plan to run locally:

python 3.6.9
docker-ce > 19.03.5
docker-API 1.40
nvidia-container-toolkit > 1.3.0-1
nvidia-container-runtime > 3.4.0-1
nvidia-docker2 > 2.5.0-1
nvidia-driver >= 455.23

SQuAD v2.0 data format and conversion

The SQuAD 2.0 dataset combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

The SQuAD question answering dataset uses the following format:

Each title has one or more paragraph entries, each consisting of a context and question-answer entries (qas).
Each question-answer entry has a question and a globally unique id.
Each entry also has a Boolean flag is_impossible, which is true when the question cannot be answered. If the question is answerable, one answer entry contains the text span and its starting character index in the context. If the question is unanswerable, an empty answers list is provided.
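
The nesting described above (title, paragraphs, context, qas) can be walked with two small helpers. This is a minimal sketch, assuming the file has already been parsed into a Python dictionary. The names `iter_questions` and `find_problems` and the sample data are hypothetical and are not part of the toolkit.

```python
from typing import Dict, Iterator, List


def iter_questions(squad: Dict) -> Iterator[Dict]:
    """Yield one flat record per question in a SQuAD-style dictionary."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "title": article.get("title", ""),
                    "context": paragraph["context"],
                    "id": qa["id"],
                    "question": qa["question"],
                    "is_impossible": qa.get("is_impossible", False),
                    "answers": qa.get("answers", []),
                }


def find_problems(squad: Dict) -> List[str]:
    """Return human-readable messages for records that break the format."""
    problems = []
    seen_ids = set()
    for rec in iter_questions(squad):
        if rec["id"] in seen_ids:
            problems.append(f"duplicate id: {rec['id']}")
        seen_ids.add(rec["id"])
        if rec["is_impossible"] and rec["answers"]:
            problems.append(f"{rec['id']}: unanswerable but has answers")
        if not rec["is_impossible"] and not rec["answers"]:
            problems.append(f"{rec['id']}: answerable but has no answers")
        for ans in rec["answers"]:
            start = ans["answer_start"]
            span = rec["context"][start:start + len(ans["text"])]
            if span != ans["text"]:
                problems.append(f"{rec['id']}: answer_start does not match text")
    return problems


# Tiny made-up sample (not taken from SQuAD).
sample = {
    "data": [{
        "title": "demo",
        "paragraphs": [{
            "context": "The game was played in Santa Clara.",
            "qas": [
                {"id": "q1", "question": "Where was the game played?",
                 "is_impossible": False,
                 "answers": [{"answer_start": 23, "text": "Santa Clara"}]},
                {"id": "q2", "question": "Who sang at halftime?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }]
}

print(find_problems(sample))
```

Checking that `answer_start` really points at `text` is worthwhile after any manual edit or re-encoding of the context, because a shifted character offset silently mislabels the training span.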

- For QA tasks, the NVIDIA toolkit accepts data in SQuAD JSON format. If your data is in another format, convert it to the SQuAD format shown below.

{
    "data" : [
        {
            "title" : "Super_Bowl_50" ,
            "paragraphs" : [
                {
                    "context" : "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50." ,
                    "qas" : [
                        {
                            "question" : "Where did Super Bowl 50 take place?" ,
                            "is_impossible" : false ,
                            "id" : "56be4db0acb8001400a502ee" ,
                            "answers" : [
                                {
                                    "answer_start" : 403 ,
                                    "text" : "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question" : "What was the winning score of the Super Bowl 50?" ,
                            "is_impossible" : true ,
                            "id" : "56be4db0acb8001400a502ez" ,
                            "answers" : [
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
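
If your data starts out as flat rows (for example, exported from a spreadsheet), a conversion along the following lines produces the structure above. This is a sketch under stated assumptions: the row keys `id`, `context`, `question`, and `answer`, the helper name `rows_to_squad`, and the sample rows are all hypothetical, and the answer offset is taken from the first occurrence of the answer text in the context.

```python
import json
from typing import Dict, Iterable


def rows_to_squad(rows: Iterable[Dict], title: str = "custom") -> Dict:
    """Group flat (context, question, answer) rows into SQuAD-style JSON.

    Each row needs "id", "context", "question" and an "answer" string;
    an empty or missing answer marks the question as unanswerable.
    """
    paragraphs: Dict[str, Dict] = {}
    for row in rows:
        context = row["context"]
        # Questions that share a context are grouped under one paragraph.
        paragraph = paragraphs.setdefault(context, {"context": context, "qas": []})
        answer = row.get("answer") or ""
        answers = []
        if answer:
            start = context.find(answer)  # first occurrence only
            if start < 0:
                raise ValueError(f"answer not found in context for id {row['id']}")
            answers.append({"answer_start": start, "text": answer})
        paragraph["qas"].append({
            "question": row["question"],
            "is_impossible": not answers,
            "id": row["id"],
            "answers": answers,
        })
    return {"data": [{"title": title, "paragraphs": list(paragraphs.values())}]}


# Made-up rows for illustration.
rows = [
    {"id": "a1", "context": "Brazil contains 60% of the rainforest.",
     "question": "Which country contains most of the rainforest?", "answer": "Brazil"},
    {"id": "a2", "context": "Brazil contains 60% of the rainforest.",
     "question": "How many rivers are there?", "answer": ""},
]
squad = rows_to_squad(rows)
print(json.dumps(squad, indent=2))
```

Note that `is_impossible` is written as a JSON boolean and `answer_start` as an integer, matching the sample file above. If an answer string occurs more than once in its context, the first-occurrence lookup may pick the wrong position, so such rows need an explicit offset.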

Configuration files

In essence, a single command runs data preprocessing, training, fine-tuning, evaluation, inference, and export. All configuration happens through YAML spec files. Sample spec files are already provided for you to use directly or as a reference when creating your own. Through these spec files you can tune many knobs, such as the model, dataset, hyperparameters, and optimizer.
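
Tuning a knob amounts to changing one nested key in the parsed spec. The sketch below shows the idea on a plain dictionary that stands in for a spec file already loaded from YAML. The helper `set_by_path` and the override values are hypothetical; the key names mirror those used in the code later in this document, and the project itself may load and override its specs with different tooling.

```python
import copy
from typing import Any, Dict


def set_by_path(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c", creating levels as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Stand-in for a spec file that has already been parsed from YAML.
base_config = {
    "trainer": {"devices": 1, "accelerator": "gpu", "precision": 16},
    "model": {
        "language_model": {"pretrained_model_name": "bert-base-uncased"},
        "tokenizer": {"tokenizer_name": "bert-base-uncased"},
    },
}

config = copy.deepcopy(base_config)  # keep the defaults untouched
set_by_path(config, "model.language_model.pretrained_model_name", "bert-large-uncased")
set_by_path(config, "model.tokenizer.tokenizer_name", "bert-large-uncased")
set_by_path(config, "trainer.precision", 32)

print(config["model"]["language_model"]["pretrained_model_name"])
```

Working on a copy keeps the sample spec reusable as a reference, which matches the suggestion above to treat the provided files as templates.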

Fine-tuning BERT QA on the SQuAD dataset

To train the QA model with the NVIDIA toolkit, we use the following code:

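import pytorch_lightning as pl
# Import path assumed from NeMo's question answering tutorial; it may differ across NeMo versions
from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel

# config and WORK_DIR are assumed to be defined earlier in the notebook

# set language model and tokenizer to be used
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

# build the trainer and model, then fine-tune and evaluate on the test set
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)
trainer.fit(model)
trainer.test(model)

# save the fine-tuned model as a .nemo checkpoint
model.save_to(config.model.nemo_path)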

More details about these arguments are available in Question_Answering.ipynb.
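
SQuAD-style evaluation is usually reported as exact match (EM) and token-level F1, and because an evaluation file may list several reference answers per question, each prediction is scored against its best-matching reference. The sketch below follows the normalisation used by the official SQuAD evaluation script (lowercasing, removing punctuation and English articles). It is an illustration only: whether `trainer.test` computes exactly these numbers is not stated in this document, and the function names and sample strings are assumptions.

```python
import re
import string
from collections import Counter
from typing import List

PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> float:
    """1.0 if the prediction equals any reference after normalisation."""
    if not references:  # unanswerable: correct only if the prediction is empty
        return float(normalize(prediction) == "")
    return float(any(normalize(prediction) == normalize(r) for r in references))


def f1_score(prediction: str, references: List[str]) -> float:
    """Best token-overlap F1 between the prediction and any reference."""
    if not references:
        return float(normalize(prediction) == "")
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Made-up prediction strings scored against two reference answers.
references = ["Santa Clara, California", "Levi's Stadium"]
print(exact_match("santa clara california", references))
print(f1_score("Santa Clara", references))
```

For an unanswerable question the reference list is empty, and the prediction counts as correct only when the model also returns an empty answer.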

BERT QA inference

- The evaluation files (for validation and testing) follow the format above, except that they may provide more than one answer for the same question. The inference file also follows the format above, except that it does not require the answers and is_impossible keys.

import json

# Import path assumed from NeMo; it may differ across NeMo versions
from nemo.utils.exp_manager import exp_manager

# Load saved model
model = BERTQAModel.restore_from(config.model.nemo_path)

# Run inference on a single device in 16-bit precision
eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)
config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)

def dump_json(filepath, data):
    """Write data to filepath as JSON."""
    with open(filepath, "w") as f:
        json.dump(data, f)

def create_inference_data_format(context, question):
    """Wrap one context/question pair in the SQuAD-style inference format."""
    squad_data = {"data": [{"title": "inference", "paragraphs": []}], "version": "v2.1"}
    squad_data["data"][0]["paragraphs"].append(
        {
            "context": context,
            "qas": [
                {"id": 0, "question": question},
            ],
        }
    )
    return squad_data

context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."

question = "Which country has the most?"

inference_filepath = "inference.json"

inference_data = create_inference_data_format(context, question)
dump_json(inference_filepath, inference_data)

# Indexing into predictions follows the original notebook
predictions = model.inference(inference_filepath)
question = predictions[1][0][0]["question"]
answer = predictions[1][0][0]["text"]
probability = predictions[1][0][0]["probability"]

print(f"\n> Question: {question}\n> Answer: {answer}\n Probability: {probability}")

100%|██████████| 1/1 [00:00<00:00, 184.29it/s]
100%|██████████| 1/1 [00:00<00:00, 8112.77it/s]

> Question: Which country has the most?
> Answer: Brazil
 Probability: 0.9688649039578262
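
Since an inference file does not need the answers and is_impossible keys, an existing evaluation file can be reduced to the inference format by keeping only id and question in each qas entry. The following is a minimal sketch of that reduction. The helper name `to_inference_format` and the sample data are hypothetical, and the toolkit may well accept the extra keys without this step.

```python
import copy
import json
from typing import Dict


def to_inference_format(squad: Dict) -> Dict:
    """Return a copy of a SQuAD-style dict with only "id" and "question" in each qas entry."""
    stripped = copy.deepcopy(squad)  # leave the evaluation data untouched
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            paragraph["qas"] = [
                {"id": qa["id"], "question": qa["question"]}
                for qa in paragraph["qas"]
            ]
    return stripped


# Made-up evaluation data with two reference answers for one question.
evaluation_data = {
    "data": [{
        "title": "demo",
        "paragraphs": [{
            "context": "The majority of the forest is contained within Brazil.",
            "qas": [{
                "id": "e1",
                "question": "Which country contains most of the forest?",
                "is_impossible": False,
                "answers": [
                    {"answer_start": 47, "text": "Brazil"},
                    {"answer_start": 40, "text": "within Brazil"},
                ],
            }],
        }],
    }]
}

inference_only = to_inference_format(evaluation_data)
print(json.dumps(inference_only, indent=2))
```

The result can be written to disk with the `dump_json` helper shown earlier and then passed to `model.inference`.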