Question Answering BERT

1.0.0

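Extractive question answering with BERT on SQuAD v2.0 (the Stanford Question Answering Dataset)

The main goal of extractive question answering is to find the most relevant and accurate answer to a given question within a provided text passage. In other words, the model does not generate a new answer; it extracts the answer directly from the passage as a span of text.
Because the answer is copied from the passage rather than generated, this approach tends to give fast and accurate answers.
Note, however, that extractive question answering is limited to the information contained in the provided text and may not be able to produce novel or creative answers. The approach has been applied in customer service chatbots, search engines, voice assistants, and similar systems.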
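
To make "extracting a span" concrete, here is a minimal sketch of how a span could be chosen once a model has scored every token as a possible answer start and answer end. The function name `best_span`, the `max_answer_len` limit, and all score values are illustrative assumptions; this is not the toolkit's actual post-processing.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_len: int = 30,
) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = (0, 0, float("-inf"))
    for start, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length limit.
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical per-token scores for a tokenized passage (assumed values).
tokens = ["the", "game", "was", "played", "at", "santa", "clara", ",", "california"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.3, 4.0, 0.5, 0.0, 0.2]
end_scores = [0.0, 0.1, 0.0, 0.1, 0.0, 0.3, 1.0, 0.2, 3.5]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start : end + 1])
print(answer)
```

The answer is always a contiguous slice of the passage tokens, which is why an extractive model cannot return text that is absent from the passage.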

Prerequisites

!!! You can run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

To work locally, make sure the following software requirements are met.

python 3.6.9
docker-ce > 19.03.5
docker-API 1.40
nvidia-container-toolkit > 1.3.0-1
nvidia-container-runtime > 3.4.0-1
nvidia-docker2 > 2.5.0-1
nvidia-driver >= 455.23

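SQuAD v2.0: data format and conversion

The SQuAD2.0 dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

The following is an example of the SQuAD question answering dataset format.

Each title has one or more paragraph entries, and each entry consists of a context and question-answer entries (qas).
Each question-answer entry has a question and a globally unique id.
A Boolean flag, is_impossible, indicates whether the question can be answered: if it can, one answers entry contains the text span and its starting character index in the context. If it cannot, an empty answers list is provided.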
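
The rules above can be checked mechanically before training. The following is a small sketch of such a check, assuming the data has already been loaded into a Python dict; `check_squad` is a hypothetical helper written for this illustration and is not part of the NVIDIA toolkit.

```python
from typing import Any, Dict, List


def check_squad(dataset: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a SQuAD-style dict."""
    problems = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                if qa.get("is_impossible", False):
                    # Unanswerable questions must carry an empty answers list.
                    if answers:
                        problems.append(f"{qa['id']}: unanswerable but has answers")
                    continue
                if not answers:
                    problems.append(f"{qa['id']}: answerable but has no answers")
                for ans in answers:
                    # answer_start is an integer character offset into the context.
                    start = ans["answer_start"]
                    if context[start : start + len(ans["text"])] != ans["text"]:
                        problems.append(f"{qa['id']}: answer_start does not match text")
    return problems


# Made-up sample: q1 and q2 are well formed, q3 has a wrong answer_start.
sample = {"data": [{"title": "demo", "paragraphs": [{
    "context": "The game was played at Santa Clara, California.",
    "qas": [
        {"id": "q1", "question": "Where was the game played?", "is_impossible": False,
         "answers": [{"answer_start": 23, "text": "Santa Clara, California"}]},
        {"id": "q2", "question": "Who sang at halftime?", "is_impossible": True,
         "answers": []},
        {"id": "q3", "question": "In which city?", "is_impossible": False,
         "answers": [{"answer_start": 0, "text": "Santa Clara"}]},
    ],
}]}]}

problems = check_squad(sample)
print(problems)
```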

!!! For QA tasks, the NVIDIA toolkit accepts data in SQuAD JSON format. If your data is in a different format, be sure to convert it to the SQuAD format shown below.

{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": []
                        }
                    ]
                }
            ]
        }
    ]
}
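
If your data starts out as flat (context, question, answer) records, the conversion could look roughly like the sketch below. The record layout, the helper name `to_squad`, and the id scheme are assumptions made for illustration; the only requirements taken from the format above are the key names, the integer `answer_start`, and unique ids.

```python
import json
from typing import Any, Dict, List


def to_squad(records: List[Dict[str, Any]], title: str = "converted") -> Dict[str, Any]:
    """Group flat records into SQuAD-style JSON, one paragraph per distinct context."""
    paragraphs: Dict[str, Dict[str, Any]] = {}
    for i, rec in enumerate(records):
        context = rec["context"]
        para = paragraphs.setdefault(context, {"context": context, "qas": []})
        answers = []
        answer = rec.get("answer")
        if answer:
            # Use the first occurrence of the answer text as its character offset.
            start = context.find(answer)
            if start < 0:
                raise ValueError(f"record {i}: answer text not found in context")
            answers.append({"answer_start": start, "text": answer})
        para["qas"].append({
            "id": f"{title}-{i}",  # hypothetical id scheme; ids only need to be unique
            "question": rec["question"],
            "is_impossible": not answers,
            "answers": answers,
        })
    return {"data": [{"title": title, "paragraphs": list(paragraphs.values())}]}


# Made-up records: one answerable question and one without an answer.
ctx = "Levi's Stadium is in Santa Clara, California."
records = [
    {"context": ctx, "question": "Where is Levi's Stadium?", "answer": "Santa Clara, California"},
    {"context": ctx, "question": "Who performed at halftime?", "answer": None},
]
squad = to_squad(records, title="demo")
print(json.dumps(squad, indent=2))
```

Taking the first occurrence of the answer text is a simplification: when the same string appears several times in a context, the correct offset has to come from the original annotation.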

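Configuration file

In essence, there is just one command each for data preprocessing, training, fine-tuning, evaluation, inference, and export! All configuration is done through YAML spec files. Sample spec files already exist that you can use directly or as a reference. Through these spec files you can tune many knobs, such as the model, dataset, hyperparameters, and optimizer.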
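
The spec files themselves are not reproduced here. As a rough illustration of what "tuning a knob" amounts to, the sketch below treats a parsed spec as a nested Python dict and overrides entries by dotted key. `set_by_path` is a hypothetical helper, not a toolkit function, and every value except "bert-base-uncased" is a placeholder chosen for the example.

```python
from typing import Any, Dict


def set_by_path(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config['a']['b']['c'] = value for the key 'a.b.c', creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Stand-in for a parsed YAML spec (placeholder structure and values).
config = {
    "model": {
        "language_model": {"pretrained_model_name": "bert-base-cased"},
        "optim": {"lr": 5e-5},
    },
    "trainer": {"max_epochs": 1},
}

# Override an existing entry and create a new nested one.
set_by_path(config, "model.language_model.pretrained_model_name", "bert-base-uncased")
set_by_path(config, "model.tokenizer.tokenizer_name", "bert-base-uncased")
print(config["model"])
```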

Fine-tune BERT QA on the SQuAD dataset

To train a QA model in the NVIDIA format, use the following commands.

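# Imports assumed from the NeMo question answering tutorial; adjust to your NeMo version.
# `config` and `WORK_DIR` are expected to be defined earlier in the notebook.
import pytorch_lightning as pl
from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel

# set language model and tokenizer to be used
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

# build the trainer and the model, then fine-tune and test
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)
trainer.fit(model)
trainer.test(model)

# save the fine-tuned model to the .nemo path set above
model.save_to(config.model.nemo_path)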

More details about these arguments can be found in Question_Answering.ipynb.

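BERT QA inference

!!! Evaluation files (for validation and testing) follow the format above, except that they may provide more than one answer to the same question.
!!! Inference files follow the format above, except that they do not require the answers and is_impossible keywords.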
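
Since an inference file is the same structure minus the labels, one way to obtain it from an existing labeled file is to drop the two keys from every question entry. The sketch below assumes the data is already loaded as a Python dict; `to_inference_format` is a hypothetical helper, not part of the toolkit.

```python
import copy
from typing import Any, Dict


def to_inference_format(dataset: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a SQuAD-style dict without answers and is_impossible."""
    stripped = copy.deepcopy(dataset)  # leave the labeled data untouched
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.pop("answers", None)
                qa.pop("is_impossible", None)
    return stripped


# Made-up labeled sample with a single answerable question.
labeled = {"data": [{"title": "demo", "paragraphs": [{
    "context": "The majority of the forest is contained within Brazil.",
    "qas": [{
        "id": "q1",
        "question": "Which country contains most of the forest?",
        "is_impossible": False,
        "answers": [{"answer_start": 47, "text": "Brazil"}],
    }],
}]}]}

unlabeled = to_inference_format(labeled)
print(unlabeled["data"][0]["paragraphs"][0]["qas"][0])
```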

import json

from nemo.utils.exp_manager import exp_manager

# Load saved model
model = BERTQAModel.restore_from(config.model.nemo_path)

# Run inference on a single device
eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)
config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)


def dump_json(filepath, data):
    """Write `data` to `filepath` as JSON."""
    with open(filepath, "w") as f:
        json.dump(data, f)


def create_inference_data_format(context, question):
    """Wrap one context/question pair in the SQuAD-style inference format."""
    squad_data = {"data": [{"title": "inference", "paragraphs": []}], "version": "v2.1"}
    squad_data["data"][0]["paragraphs"].append(
        {
            "context": context,
            "qas": [
                {"id": 0, "question": question},
            ],
        }
    )
    return squad_data


context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and Colombia with 10%."

question = "Which country has the most?"

inference_filepath = "inference.json"

inference_data = create_inference_data_format(context, question)
dump_json(inference_filepath, inference_data)

predictions = model.inference(inference_filepath)
question = predictions[1][0][0]["question"]
answer = predictions[1][0][0]["text"]
probability = predictions[1][0][0]["probability"]

print(f"\n> Question: {question}\n> Answer: {answer}\n Probability: {probability}")

100%|██████████| 1/1 [00:00<00:00, 184.29it/s]
100%|██████████| 1/1 [00:00<00:00, 8112.77it/s]

> Question: Which country has the most?
> Answer: Brazil
 Probability: 0.9688649039578262