!!! You can run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
If you work locally, make sure the following software requirements are met:
python 3.6.9
docker-ce > 19.03.5
docker-API 1.40
nvidia-container-toolkit > 1.3.0-1
nvidia-container-runtime > 3.4.0-1
nvidia-docker2 > 2.5.0-1
nvidia-driver >= 455.23
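A quick way to compare an installed version against the minimums listed above is to turn each version string into a tuple of integers. This is only a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers, and the "installed" versions in the usage lines are made-up values, not output from a real machine.

```python
import re


def parse_version(text):
    """Turn '19.03.5' or '1.3.0-1' into a tuple of ints, e.g. (19, 3, 5)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(installed, minimum, strict=True):
    """Check installed > minimum (strict) or installed >= minimum (not strict)."""
    have, need = parse_version(installed), parse_version(minimum)
    return have > need if strict else have >= need


# Hypothetical installed versions, checked against the list above.
print(meets_minimum("20.10.7", "19.03.5"))               # docker-ce > 19.03.5
print(meets_minimum("455.23", "455.23", strict=False))   # nvidia-driver >= 455.23
```

Tuple comparison handles multi-digit components correctly (for example, 19.10 is newer than 19.9), which a plain string comparison would get wrong.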
The SQuAD2.0 dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
The SQuAD question answering dataset format looks like this:
Each title has one or more paragraph entries, each consisting of a context and question-answer entries (qas).
Each question-answer entry has a question and a globally unique id.
The Boolean flag is_impossible shows whether a question is answerable or not: if the question is answerable, one answer entry contains the text span and its starting character index in the context. If the question is not answerable, an empty answers list is provided.
!!! For QA tasks, the NVIDIA toolkit accepts data in the SQuAD JSON format. If your data is in another format, make sure to convert it to the SQuAD format shown below.
{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": []
                        }
                    ]
                }
            ]
        }
    ]
}
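If your data starts out as flat (context, question, answer) records, the conversion to this layout can be sketched as follows. This is an illustration under stated assumptions, not part of the NVIDIA toolkit: `to_squad`, the record field names, the generated ids, and the sample records are all hypothetical. The sketch writes `is_impossible` as a JSON boolean and `answer_start` as an integer, as in the published SQuAD 2.0 files, and it takes the first occurrence of the answer text in the context as the start index.

```python
import json


def to_squad(records, title="custom_dataset", version="v2.0"):
    """Group flat (context, question, answer) records into SQuAD-style JSON."""
    paragraphs = {}  # context -> list of qas entries (keeps insertion order)
    for idx, rec in enumerate(records):
        context, answer = rec["context"], rec.get("answer")
        qa = {"id": f"{title}-{idx}", "question": rec["question"]}
        if answer is None:
            # Unanswerable question: flag it and provide an empty answers list.
            qa["is_impossible"] = True
            qa["answers"] = []
        else:
            start = context.find(answer)  # first occurrence only
            if start < 0:
                raise ValueError(f"answer {answer!r} not found in context")
            qa["is_impossible"] = False
            qa["answers"] = [{"text": answer, "answer_start": start}]
        paragraphs.setdefault(context, []).append(qa)
    return {
        "version": version,
        "data": [
            {
                "title": title,
                "paragraphs": [
                    {"context": ctx, "qas": qas} for ctx, qas in paragraphs.items()
                ],
            }
        ],
    }


# Hypothetical input records sharing one context.
shared_context = "The game was played at Levi's Stadium in Santa Clara, California."
records = [
    {"context": shared_context,
     "question": "Where was the game played?",
     "answer": "Santa Clara, California"},
    {"context": shared_context,
     "question": "Who sang the national anthem?",
     "answer": None},
]

squad = to_squad(records)
print(json.dumps(squad, indent=2))
```

When the answer text occurs more than once in a context, the first-occurrence rule may pick the wrong span, so real data should carry explicit offsets where possible.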
It is essentially just one command each to run data preprocessing, training, fine-tuning, evaluation, inference, and export! All configuration happens through YAML spec files. Sample spec files are already available for you to use directly or as a reference for creating your own. Through these spec files, you can tune many knobs such as the model, dataset, hyperparameters, optimizer, etc.
To train a QA model with the NVIDIA toolkit, we use the following commands:
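As a rough illustration of what tuning a knob in a spec amounts to, the sketch below applies dotted-key overrides to a nested dictionary standing in for a parsed YAML spec. The helper `set_by_path`, the keys, and the values are hypothetical; this is not the toolkit's actual override mechanism.

```python
def set_by_path(config, dotted_key, value):
    """Set config['a']['b']['c'] = value for the key 'a.b.c', creating levels as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Hypothetical defaults, standing in for the contents of a YAML spec file.
spec = {
    "model": {
        "language_model": {"pretrained_model_name": "bert-base-uncased"},
        "optim": {"lr": 5e-5},
    },
    "trainer": {"max_epochs": 1},
}

# Override two knobs without touching the rest of the spec.
set_by_path(spec, "model.optim.lr", 3e-5)
set_by_path(spec, "trainer.max_epochs", 2)
print(spec["model"]["optim"]["lr"], spec["trainer"]["max_epochs"])
```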
# Assumes config, WORK_DIR, pl (pytorch_lightning) and BERTQAModel
# are already defined or imported earlier in the notebook.

# set language model and tokenizer to be used
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)

trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)
More details about these arguments are in Question_Answering.ipynb
!!! Evaluation files (for validation and testing) follow the above format, except that they may provide more than one answer to the same question.
!!! Inference files follow the above format, except that they do not require the answers and is_impossible keywords.
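Because an inference file does not need labels, one can be derived from a labeled SQuAD-style file by dropping the `answers` and `is_impossible` keys from each question. A minimal sketch, assuming the data is already loaded as a Python dictionary; `strip_labels` and the sample data are hypothetical:

```python
import copy


def strip_labels(squad_data):
    """Return a copy of SQuAD-style data without `answers` and `is_impossible`."""
    stripped = copy.deepcopy(squad_data)  # leave the labeled data untouched
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.pop("answers", None)
                qa.pop("is_impossible", None)
    return stripped


# Hypothetical labeled sample in the format described above.
labeled = {
    "data": [
        {
            "title": "sample",
            "paragraphs": [
                {
                    "context": "Brazil contains 60% of the rainforest.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Which country contains most of the rainforest?",
                            "is_impossible": False,
                            "answers": [{"text": "Brazil", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ]
}

inference_ready = strip_labels(labeled)
print(inference_ready["data"][0]["paragraphs"][0]["qas"][0])
```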
# Load saved model
model = BERTQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)

config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)
import json


def dump_json(filepath, data):
    # Write the data to disk as JSON
    with open(filepath, "w") as f:
        json.dump(data, f)


def create_inference_data_format(context, question):
    # Wrap a single context/question pair in the SQuAD layout (no answers needed)
    squad_data = {"data": [{"title": "inference", "paragraphs": []}], "version": "v2.1"}
    squad_data["data"][0]["paragraphs"].append(
        {
            "context": context,
            "qas": [
                {"id": 0, "question": question},
            ],
        }
    )
    return squad_data
context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."
question = "Which country has the most?"

inference_filepath = "inference.json"
inference_data = create_inference_data_format(context, question)
dump_json(inference_filepath, inference_data)

predictions = model.inference("inference.json")
question = predictions[1][0][0]["question"]
answer = predictions[1][0][0]["text"]
probability = predictions[1][0][0]["probability"]
print(f"\n> Question: {question}\n> Answer: {answer}\n Probability: {probability}")
100%|██████████| 1/1 [00:00<00:00, 184.29it/s]
100%|██████████| 1/1 [00:00<00:00, 8112.77it/s]

> Question: Which country has the most?
> Answer: Brazil
 Probability: 0.9688649039578262
The code used in this work comes from question-answering-training-final.ipynb. You can find more information here.