Grounding_LLMs_with_online_RL

Grounding Large Language Models with Online Reinforcement Learning

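This repository contains the code used for our paper Grounding Large Language Models with Online Reinforcement Learning.

More information is available on our website.

We perform functional grounding of LLMs' knowledge in BabyAI-Text using the GLAM method.

[Figure: main schema]

We release our BabyAI-Text environment along with the code to perform our experiments (both training agents and evaluating their performance). We rely on the Lamorel library to use LLMs.

Our repository is structured as follows: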

Grounding_LLMs_with_online_RL
┣ babyai-text -- our BabyAI-Text environment
┣ experiments -- code for our experiments
┃ ┣ agents -- implementation of all our agents
┃ ┃ ┣ bot -- bot agent leveraging BabyAI's bot
┃ ┃ ┣ random_agent -- agent playing uniformly at random
┃ ┃ ┣ drrn -- DRRN agent (taken from an external source)
┃ ┃ ┣ ppo -- agents using PPO
┃ ┃ ┃ ┣ symbolic_ppo_agent.py -- SymbolicPPO adapted from BabyAI's PPO
┃ ┃ ┃ ┗ llm_ppo_agent.py -- our LLM agent using PPO
┃ ┣ configs -- Lamorel configs for our experiments
┃ ┣ slurm -- utility scripts to launch our experiments on a SLURM cluster
┃ ┣ campaign -- SLURM scripts used to launch our experiments
┃ ┣ train_language_agent.py -- train agents using BabyAI-Text (LLMs and DRRN); contains our implementation of the PPO loss for LLMs as well as additional heads on top of LLMs
┃ ┣ train_symbolic_ppo.py -- train SymbolicPPO on BabyAI (with BabyAI-Text's tasks)
┃ ┣ post-training_tests.py -- generalization tests of trained agents
┃ ┣ test_results.py -- utilities to format results
┃ ┗ clm_behavioral-cloning.py -- code to perform Behavioral Cloning on an LLM using trajectories

Installation steps

Create a conda environment

 conda create -n dlp python=3.10.8; conda activate dlp

Install PyTorch

 conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

Install the packages required by our package

 pip install -r requirements.txt

Install BabyAI-Text: see the installation details in the babyai-text package.

Install Lamorel

 git clone https://github.com/flowersteam/lamorel.git; cd lamorel/lamorel; pip install -e .; cd ../..
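
Once the steps above are done, a quick check that the packages are visible to the active environment can save a failed launch. The sketch below uses only the standard library and does not import the packages themselves. The top-level module names in `REQUIRED_MODULES` are assumptions (in particular `babyai_text` and `lamorel`); adjust them if your installation exposes different names.

```python
from importlib.util import find_spec

# Assumed top-level module names for the packages installed above.
REQUIRED_MODULES = ["torch", "torchvision", "torchaudio", "lamorel", "babyai_text"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print(f"Missing modules: {missing}")
    else:
        print("All modules found.")
```

Run it inside the `dlp` environment; an empty "missing" list only shows that the modules can be located, not that their versions match the ones pinned above.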

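Launch

Please use Lamorel along with our configs. You can find examples of our training scripts in campaign.

Training a Language Model

To train a Language Model on a BabyAI-Text environment, use the train_language_agent.py file. This script (launched with Lamorel) uses the following config entries: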

 rl_script_args:
  seed: 1
  number_envs: 2 # Number of parallel envs to launch (steps will be synchronized, i.e. a step call will return number_envs observations)
  num_steps: 1000 # Total number of training steps
  max_episode_steps: 3 # Maximum number of steps in a single episode
  frames_per_proc: 40 # The number of collected transitions to perform a PPO update will be frames_per_proc*number_envs
  discount: 0.99 # Discount factor used in PPO
  lr: 1e-6 # Learning rate used to finetune the LLM
  beta1: 0.9 # Adam's hyperparameter
  beta2: 0.999 # Adam's hyperparameter
  gae_lambda: 0.99 # PPO's hyperparameter
  entropy_coef: 0.01 # PPO's hyperparameter
  value_loss_coef: 0.5 # PPO's hyperparameter
  max_grad_norm: 0.5 # Maximum grad norm when updating the LLM's parameters
  adam_eps: 1e-5 # Adam's hyperparameter
  clip_eps: 0.2 # Epsilon used in PPO's losses clipping
  epochs: 4 # Number of PPO epochs performed on each set of collected trajectories
  batch_size: 16 # Minibatch size
  action_space: ["turn_left","turn_right","go_forward","pick_up","drop","toggle"] # Possible actions for the agent
  saving_path_logs: ??? # Where to store logs
  name_experiment: 'llm_mtrl' # Useful for logging
  name_model: 'T5small' # Useful for logging
  saving_path_model: ??? # Where to store the finetuned model
  name_environment: 'BabyAI-MixedTestLocal-v0' # BabyAI-Text's environment
  load_embedding: true # Whether trained embedding layers should be loaded (useful when lm_args.pretrained=False). Setting both this and use_action_heads to True (with lm_args.pretrained=False) creates our NPAE agent.
  use_action_heads: false # Whether action heads should be used instead of scoring. Setting both this and load_embedding to True (with lm_args.pretrained=False) creates our NPAE agent.
  template_test: 1 # Which prompt template to use to log the evolution of actions' probabilities (Section C of our paper). Choices are [1, 2].
  nbr_obs: 3 # Number of past observations used in the prompt
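
To make the interaction between `number_envs`, `frames_per_proc`, `batch_size`, `epochs` and `clip_eps` concrete, here is a minimal sketch using the example values above. It assumes the usual PPO scheme in which each epoch sweeps once over all collected transitions in minibatches, and it shows the standard clipped surrogate objective for a single sample. The helper names `updates_per_collection` and `clipped_surrogate` are hypothetical and are not taken from train_language_agent.py.

```python
import math

# Values copied from the rl_script_args example above.
number_envs = 2
frames_per_proc = 40
batch_size = 16
epochs = 4
clip_eps = 0.2


def updates_per_collection(number_envs, frames_per_proc, batch_size, epochs):
    """Return (transitions, minibatches per epoch, total gradient steps).

    Assumes each epoch visits every collected transition once, in
    minibatches of `batch_size` (the last minibatch may be smaller).
    """
    transitions = number_envs * frames_per_proc
    minibatches = math.ceil(transitions / batch_size)
    return transitions, minibatches, minibatches * epochs


def clipped_surrogate(ratio, advantage, clip_eps):
    """Standard PPO clipped objective for one sample (to be maximized).

    `ratio` is new_policy_prob / old_policy_prob for the taken action.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


print(updates_per_collection(number_envs, frames_per_proc, batch_size, epochs))
print(clipped_surrogate(1.5, 1.0, clip_eps))
print(clipped_surrogate(0.5, -1.0, clip_eps))
```

With these values, each PPO update uses 80 transitions, split into 5 minibatches per epoch, for 20 gradient steps in total. The two surrogate calls show the clipping at work: a ratio of 1.5 with a positive advantage is capped at 1.2, and a ratio of 0.5 with a negative advantage is evaluated at 0.8.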

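For config entries related to the Language Model itself, please see Lamorel.

Evaluating performance on test episodes

To evaluate the performance of an agent (e.g. a trained LLM, BabyAI's bot...) on test tasks, use post-training_tests.py and set the following config entries: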

 rl_script_args:
  seed: 1
  number_envs: 2 # Number of parallel envs to launch (steps will be synchronized, i.e. a step call will return number_envs observations)
  max_episode_steps: 3 # Maximum number of steps in a single episode
  action_space: ["turn_left","turn_right","go_forward","pick_up","drop","toggle"] # Possible actions for the agent
  saving_path_logs: ??? # Where to store logs
  name_experiment: 'llm_mtrl' # Useful for logging
  name_model: 'T5small' # Useful for logging
  saving_path_model: ??? # Where to store the finetuned model
  name_environment: 'BabyAI-MixedTestLocal-v0' # BabyAI-Text's environment
  load_embedding: true # Whether trained embedding layers should be loaded (useful when lm_args.pretrained=False). Setting both this and use_action_heads to True (with lm_args.pretrained=False) creates our NPAE agent.
  use_action_heads: false # Whether action heads should be used instead of scoring. Setting both this and load_embedding to True (with lm_args.pretrained=False) creates our NPAE agent.
  nbr_obs: 3 # Number of past observations used in the prompt
  number_episodes: 10 # Number of test episodes
  language: 'english' # Useful to perform the French experiment (Section H4)
  zero_shot: true # Whether the zero-shot LLM (i.e. without finetuning) should be used
  modified_action_space: false # Whether a modified action space (e.g. different from the one seen during training) should be used
  new_action_space: # ["rotate_left","rotate_right","move_ahead","take","release","switch"] # Modified action space
  im_learning: false # Whether an LLM produced with Behavioral Cloning should be used
  im_path: "" # Path to the LLM learned with Behavioral Cloning
  bot: false # Whether the BabyAI's bot agent should be used
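
The `modified_action_space` and `new_action_space` entries work as a pair: the second only matters when the first is true. The sketch below illustrates one plausible way to resolve them into the list of action names shown to the agent. The helper `select_action_space` and its validation rules are hypothetical and are not taken from post-training_tests.py; the config is represented as a plain dict for simplicity.

```python
def select_action_space(args):
    """Return the action names to use at test time (hypothetical helper).

    Falls back to `action_space` unless `modified_action_space` is true,
    in which case `new_action_space` must be set and must contain one
    replacement name per original action.
    """
    original = list(args["action_space"])
    if not args.get("modified_action_space", False):
        return original
    new_space = args.get("new_action_space")
    if not new_space:
        raise ValueError("modified_action_space is true but new_action_space is empty")
    if len(new_space) != len(original):
        raise ValueError("new_action_space must have one name per original action")
    return list(new_space)


config = {
    "action_space": ["turn_left", "turn_right", "go_forward", "pick_up", "drop", "toggle"],
    "modified_action_space": True,
    "new_action_space": ["rotate_left", "rotate_right", "move_ahead", "take", "release", "switch"],
}
print(select_action_space(config))
```

Requiring the same number of names keeps a one-to-one mapping between the renamed actions and the environment's original ones, which is what a test of robustness to action renaming would need.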