Shalev Lifshitz*, Keiran Paster*, Harris Chan†, Jimmy Ba, Sheila McIlraith
Project Page | ArXiv | PDF
Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces a methodology, inspired by unCLIP, for instruction-tuning generative models of behavior without relying on a large dataset of instruction-labeled trajectories. Using this methodology, we create an instruction-tuned Video Pretraining (VPT) model called STEVE-1, which can follow short-horizon open-ended text and visual instructions in Minecraft™. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, reducing the need for costly human text annotations, and all for only $60 of compute. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 sets a new bar for open-ended instruction-following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines and robustly completing 12 of 13 tasks in our early-game evaluation suite. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research.
.
├── README.md
├── steve1
│ ├── All agent, dataset, and training code.
├── run_agent
│ ├── Scripts for running the agent.
├── train
│ ├── Script for training the agent and generating the dataset.
We recommend running on linux using a conda environment, with python 3.10.
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install minedojo git+https://github.com/MineDojo/MineCLIP
pip install git+https://github.com/minerllabs/[email protected]
pip install gym==0.19 gym3 attrs opencv-python
pip install gdown tqdm accelerate==0.18.0 wandb
steve1
locally with: pip install -e .
If you are running on a headless server, you need to install xvfb
and run each python script with xvfb-run
. For example, xvfb-run python script_name.py
.
Also, notice that we use the MineRL environment, not the MineDojo environment. Thus, setting
MINEDOJO_HEADLESS=1
as mentioned in the 'MineDojo Installation' instructions will have no effect.
Run the following command to download the data and weights:
. download_weights.sh
To train STEVE-1 from scratch, please run the following steps:
. train/1_generate_dataset.sh
. train/2_create_sampling.sh
. train/3_train.sh
. train/4_train_prior.sh
We provided two scripts for testing out the agent with different prompts. To test out your own trained agents, please modify the --in_weights
argument in the scripts.
. run_agent/1_gen_paper_videos.sh
to generate the videos used in the paper.. run_agent/2_gen_vid_for_text_prompt.sh
to generate videos for arbitrary text prompts.. run_agent/3_run_interactive_session.sh
to start an interactive session with STEVE-1. This will not work in headless mode.Please cite our paper if you find STEVE-1 useful for your research:
@article{lifshitz2023steve1,
title={STEVE-1: A Generative Model for Text-to-Behavior in Minecraft},
author={Shalev Lifshitz and Keiran Paster and Harris Chan and Jimmy Ba and Sheila McIlraith},
year={2023},
eprint={2306.00937},
archivePrefix={arXiv},
primaryClass={cs.LG}
}