Documentation | TensorDict | Features | Examples, tutorials and demos | Citation | Installation | Asking a question | Contributing
TorchRL is an open-source Reinforcement Learning (RL) library for PyTorch.
Read the full paper for a more curated description of the library.
Check our Getting Started tutorials for quickly ramping up with the basic features of the library!
The TorchRL documentation can be found here. It contains tutorials and the API reference.
TorchRL also provides a RL knowledge base to help you debug your code, or simply learn the basics of RL. Check it out here.
We have some introductory videos for you to get to know the library better, check them out:
TorchRL being domain-agnostic, you can use it across many different fields. Here are a few examples:
TensorDict
Writing simplified and portable RL codebases with TensorDict. RL algorithms are very heterogeneous, and it can be hard to recycle a codebase across settings (e.g. from online to offline, from state-based to pixel-based learning). TorchRL solves this problem through TensorDict, a convenient data structure(1) that can be used to streamline one's RL codebase. With this tool, one can write a complete PPO training script in less than 100 lines of code!
import torch
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn

from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import (
    TensorDictReplayBuffer,
    LazyTensorStorage,
    SamplerWithoutReplacement,
)
from torchrl.envs.libs.gym import GymEnv
from torchrl.modules import ProbabilisticActor, ValueOperator, TanhNormal
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

env = GymEnv("Pendulum-v1")
model = TensorDictModule(
    nn.Sequential(
        nn.Linear(3, 128), nn.Tanh(),
        nn.Linear(128, 128), nn.Tanh(),
        nn.Linear(128, 128), nn.Tanh(),
        nn.Linear(128, 2),
        NormalParamExtractor()
    ),
    in_keys=["observation"],
    out_keys=["loc", "scale"]
)
critic = ValueOperator(
    nn.Sequential(
        nn.Linear(3, 128), nn.Tanh(),
        nn.Linear(128, 128), nn.Tanh(),
        nn.Linear(128, 128), nn.Tanh(),
        nn.Linear(128, 1),
    ),
    in_keys=["observation"],
)
actor = ProbabilisticActor(
    model,
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    distribution_kwargs={"low": -1.0, "high": 1.0},
    return_log_prob=True
)
buffer = TensorDictReplayBuffer(
    storage=LazyTensorStorage(1000),
    sampler=SamplerWithoutReplacement(),
    batch_size=50,
)
collector = SyncDataCollector(
    env,
    actor,
    frames_per_batch=1000,
    total_frames=1_000_000,
)
loss_fn = ClipPPOLoss(actor, critic)
adv_fn = GAE(value_network=critic, average_gae=True, gamma=0.99, lmbda=0.95)
optim = torch.optim.Adam(loss_fn.parameters(), lr=2e-4)

for data in collector:  # collect data
    for epoch in range(10):
        adv_fn(data)  # compute advantage
        buffer.extend(data)
        for sample in buffer:  # consume data
            loss_vals = loss_fn(sample)
            loss_val = sum(
                value for key, value in loss_vals.items() if
                key.startswith("loss")
            )
            loss_val.backward()
            optim.step()
            optim.zero_grad()
    print(f"avg reward: {data['next', 'reward'].mean().item(): 4.4f}")
Here is an example of how the environment API relies on TensorDict to carry data from one function to another during a rollout execution. TensorDict makes it easy to re-use pieces of code across environments, models and algorithms. For instance, here's how to code a rollout in TorchRL:
- obs, done = env.reset()
+ tensordict = env.reset()
policy = SafeModule(
model,
in_keys=["observation_pixels", "observation_vector"],
out_keys=["action"],
)
out = []
for i in range(n_steps):
- action, log_prob = policy(obs)
- next_obs, reward, done, info = env.step(action)
- out.append((obs, next_obs, action, log_prob, reward, done))
- obs = next_obs
+ tensordict = policy(tensordict)
+ tensordict = env.step(tensordict)
+ out.append(tensordict)
+ tensordict = step_mdp(tensordict) # renames next_observation_* keys to observation_*
- obs, next_obs, action, log_prob, reward, done = [torch.stack(vals, 0) for vals in zip(*out)]
+ out = torch.stack(out, 0) # TensorDict supports multiple tensor operations
Using this, TorchRL abstracts away the input / output signatures of the modules, envs, collectors, replay buffers and losses of the library, allowing all primitives to be easily recycled across settings.
Here's another example of an off-policy training loop in TorchRL (assuming that a data collector, a replay buffer, a loss and an optimizer have been instantiated):
- for i, (obs, next_obs, action, hidden_state, reward, done) in enumerate(collector):
+ for i, tensordict in enumerate(collector):
- replay_buffer.add((obs, next_obs, action, log_prob, reward, done))
+ replay_buffer.add(tensordict)
for j in range(num_optim_steps):
- obs, next_obs, action, hidden_state, reward, done = replay_buffer.sample(batch_size)
- loss = loss_fn(obs, next_obs, action, hidden_state, reward, done)
+ tensordict = replay_buffer.sample(batch_size)
+ loss = loss_fn(tensordict)
loss.backward()
optim.step()
optim.zero_grad()
This training loop can be re-used across algorithms as it makes a minimal number of assumptions about the data structure.
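For instance (a sketch only, reusing the hypothetical collector, replay buffer, optimizer and hyperparameters from the loop above, and assuming a DDPG-style actor and Q-value network), switching algorithms mostly amounts to switching the loss module, while the loop body stays untouched:

from torchrl.objectives import DDPGLoss  # could equally be SACLoss, TD3Loss, ...

loss_fn = DDPGLoss(actor_network=actor, value_network=value_net)

for i, tensordict in enumerate(collector):
    replay_buffer.add(tensordict)
    for j in range(num_optim_steps):
        sample = replay_buffer.sample(batch_size)
        loss_vals = loss_fn(sample)
        # TorchRL losses return their terms in a tensordict with "loss_*" keys
        loss = sum(value for key, value in loss_vals.items() if key.startswith("loss"))
        loss.backward()
        optim.step()
        optim.zero_grad()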
TensorDict supports multiple tensor operations on its device and shape (the shape of a TensorDict, or its batch size, is the common arbitrary N first dimensions of all its contained tensors):
# stack and cat
tensordict = torch.stack(list_of_tensordicts, 0)
tensordict = torch.cat(list_of_tensordicts, 0)
# reshape
tensordict = tensordict.view(-1)
tensordict = tensordict.permute(0, 2, 1)
tensordict = tensordict.unsqueeze(-1)
tensordict = tensordict.squeeze(-1)
# indexing
tensordict = tensordict[:2]
tensordict[:, 2] = sub_tensordict
# device and memory location
tensordict.cuda()
tensordict.to("cuda:1")
tensordict.share_memory_()
TensorDict comes with a dedicated tensordict.nn module that contains everything you might need to write your model with it. And it is functorch and torch.compile compatible!
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
+ td_module = SafeModule(transformer_model, in_keys=["src", "tgt"], out_keys=["out"])
src = torch.rand((10, 32, 512))
tgt = torch.rand((20, 32, 512))
+ tensordict = TensorDict({"src": src, "tgt": tgt}, batch_size=[20, 32])
- out = transformer_model(src, tgt)
+ td_module(tensordict)
+ out = tensordict["out"]
The TensorDictSequential class allows branching sequences of nn.Module instances in a highly modular way. For instance, here is an implementation of a transformer using the encoder and decoder blocks:
encoder_module = TransformerEncoder(...)
encoder = TensorDictModule(encoder_module, in_keys=["src", "src_mask"], out_keys=["memory"])
decoder_module = TransformerDecoder(...)
decoder = TensorDictModule(decoder_module, in_keys=["tgt", "memory"], out_keys=["output"])
transformer = TensorDictSequential(encoder, decoder)
assert transformer.in_keys == ["src", "src_mask", "tgt"]
assert transformer.out_keys == ["memory", "output"]
TensorDictSequential allows isolating subgraphs by querying a set of desired input / output keys:
transformer.select_subsequence(out_keys=["memory"])  # returns the encoder
transformer.select_subsequence(in_keys=["tgt", "memory"])  # returns the decoder
Check TensorDict tutorials to learn more!
A common interface for environments which supports common libraries (OpenAI Gym, DeepMind Control, etc.)(1) and state-less execution (e.g. model-based environments). The batched environment containers allow parallel execution(2). A common PyTorch-first class of tensor specification is also provided. TorchRL's environments API is simple but stringent and specific. Check the documentation and tutorial to learn more!
env_make = lambda: GymEnv("Pendulum-v1", from_pixels=True)
env_parallel = ParallelEnv(4, env_make)  # creates 4 envs in parallel
tensordict = env_parallel.rollout(max_steps=20, policy=None)  # random rollout (no policy given)
assert tensordict.shape == [4, 20]  # 4 envs, 20 steps rollout
env_parallel.action_spec.is_in(tensordict["action"])  # spec check returns True
Multiprocess and distributed data collectors(2) that work synchronously or asynchronously. Through the use of TensorDict, TorchRL's training loops are made very similar to regular training loops in supervised learning (although the "dataloader" -- read data collector -- is modified on the fly):
env_make = lambda: GymEnv("Pendulum-v1", from_pixels=True)
collector = MultiaSyncDataCollector(
    [env_make, env_make],
    policy=policy,
    devices=["cuda:0", "cuda:0"],
    total_frames=10000,
    frames_per_batch=50,
    ...
)
for i, tensordict_data in enumerate(collector):
    loss = loss_module(tensordict_data)
    loss.backward()
    optim.step()
    optim.zero_grad()
    collector.update_policy_weights_()
Check our distributed collector examples to learn more about ultra-fast data collection with TorchRL.
Efficient(2) and generic(1) replay buffers with modularized storage:
storage = LazyMemmapStorage(  # memory-mapped (physical) storage
    cfg.buffer_size,
    scratch_dir="/tmp/"
)
buffer = TensorDictPrioritizedReplayBuffer(
    alpha=0.7,
    beta=0.5,
    collate_fn=lambda x: x,
    pin_memory=device != torch.device("cpu"),
    prefetch=10,  # multi-threaded sampling
    storage=storage
)
Replay buffers are also offered as wrappers around common datasets for offline RL:
from torchrl.data.replay_buffers import SamplerWithoutReplacement
from torchrl.data.datasets.d4rl import D4RLExperienceReplay

data = D4RLExperienceReplay(
    "maze2d-open-v0",
    split_trajs=True,
    batch_size=128,
    sampler=SamplerWithoutReplacement(drop_last=True),
)
for sample in data:  # or alternatively sample = data.sample()
    fun(sample)
Cross-library environment transforms(1), executed on device and in a vectorized fashion(2), which process and prepare the data coming out of the environments to be used by the agent:
env_make = lambda: GymEnv("Pendulum-v1", from_pixels=True)
env_base = ParallelEnv(4, env_make, device="cuda:0")  # creates 4 envs in parallel
env = TransformedEnv(
    env_base,
    Compose(
        ToTensorImage(),
        ObservationNorm(loc=0.5, scale=1.0)),  # executes the transforms once and on device
)
tensordict = env.reset()
assert tensordict.device == torch.device("cuda:0")
Other transforms include: reward scaling (RewardScaling), shape operations (concatenation of tensors, unsqueezing, etc.), concatenation of successive operations (CatFrames), resizing (Resize) and many more.
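As an illustration, a pixel-based pipeline could chain several of these transforms. This is a sketch only: the frame size, stacking depth and reward scale below are arbitrary placeholder values.

from torchrl.envs import (
    CatFrames, Compose, GrayScale, Resize, RewardScaling, ToTensorImage, TransformedEnv
)
from torchrl.envs.libs.gym import GymEnv

env = TransformedEnv(
    GymEnv("Pendulum-v1", from_pixels=True),
    Compose(
        ToTensorImage(),                             # HWC uint8 frames -> CHW floats in [0, 1]
        Resize(84, 84),                              # downscale the frames
        GrayScale(),                                 # keep a single channel
        CatFrames(N=4, dim=-3, in_keys=["pixels"]),  # stack the last 4 frames along the channel dim
        RewardScaling(loc=0.0, scale=0.1),           # rescale the reward
    ),
)
tensordict = env.reset()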
Unlike other libraries, the transforms are stacked as a list (and not wrapped in each other), which makes it easy to add and remove them at will:
env.insert_transform(0, NoopResetEnv())  # inserts the NoopResetEnv transform at the index 0
Nevertheless, transforms can access and execute operations on the parent environment:
transform = env.transform[1]  # gathers the second transform of the list
parent_env = transform.parent  # returns the base environment of the second transform, i.e. the base env + the first transform
Various tools for distributed learning (e.g. memory-mapped tensors)(2).
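As a sketch of what this enables (the storage path below is a placeholder, and keyword details may vary across tensordict versions), a tensordict can be dumped to memory-mapped files on disk and lazily reloaded, so that several processes can read the same data without copies:

import torch
from tensordict import TensorDict

data = TensorDict(
    {"observation": torch.randn(1000, 3), "reward": torch.zeros(1000, 1)},
    batch_size=[1000],
)
data.memmap_("/tmp/td_storage")  # write every tensor to memory-mapped files
reloaded = TensorDict.load_memmap("/tmp/td_storage")  # re-attach the data from disk
assert (reloaded["reward"] == 0).all()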
Various architectures and models (e.g. actor-critic)(1):
# Create an nn.Module
common_module = ConvNet(
    bias_last_layer=True,
    depth=None,
    num_cells=[32, 64, 64],
    kernel_sizes=[8, 4, 3],
    strides=[4, 2, 1],
)
# Wrap it in a SafeModule, indicating what key to read in and where to
# write out the output
common_module = SafeModule(
    common_module,
    in_keys=["pixels"],
    out_keys=["hidden"],
)
# Wrap the policy module in NormalParamsWrapper, such that the output
# tensor is split in loc and scale, and scale is mapped onto a positive space
policy_module = SafeModule(
    NormalParamsWrapper(
        MLP(num_cells=[64, 64], out_features=32, activation=nn.ELU)
    ),
    in_keys=["hidden"],
    out_keys=["loc", "scale"],
)
# Use a SafeProbabilisticTensorDictSequential to combine the SafeModule with a
# SafeProbabilisticModule, indicating how to build the
# torch.distribution.Distribution object and what to do with it
policy_module = SafeProbabilisticTensorDictSequential(  # stochastic policy
    policy_module,
    SafeProbabilisticModule(
        in_keys=["loc", "scale"],
        out_keys="action",
        distribution_class=TanhNormal,
    ),
)
value_module = MLP(
    num_cells=[64, 64],
    out_features=1,
    activation=nn.ELU,
)
# Wrap the policy and value function in a common module
actor_value = ActorValueOperator(common_module, policy_module, value_module)
# standalone policy from this
standalone_policy = actor_value.get_policy_operator()
Exploration wrappers and modules to easily swap between exploration and exploitation(1):
policy_explore = EGreedyWrapper(policy)
with set_exploration_type(ExplorationType.RANDOM):
    tensordict = policy_explore(tensordict)  # will use eps-greedy
with set_exploration_type(ExplorationType.DETERMINISTIC):
    tensordict = policy_explore(tensordict)  # will not use eps-greedy
A series of efficient loss modules and highly vectorized functional return and advantage computations:
from torchrl.objectives import DQNLoss

loss_module = DQNLoss(value_network=value_network, gamma=0.99)
tensordict = replay_buffer.sample(batch_size)
loss = loss_module(tensordict)

from torchrl.objectives.value.functional import vec_td_lambda_return_estimate

advantage = vec_td_lambda_return_estimate(gamma, lmbda, next_state_value, reward, done, terminated)
A generic trainer class(1) that executes the aforementioned training loop. Through a hooking mechanism, it also supports any logging or data transformation operation at any given time.
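A minimal sketch of how this could be wired up, reusing the hypothetical collector, loss and optimizer objects from the examples above (the exact constructor arguments of Trainer and the available hook names may differ between releases):

from torchrl.trainers import Trainer

trainer = Trainer(
    collector=collector,
    total_frames=1_000_000,
    frame_skip=1,
    optim_steps_per_batch=10,
    loss_module=loss_fn,
    optimizer=optim,
)

# Hooks can be registered at several points of the loop, e.g. to log custom values
def log_mean_reward(batch):
    return {"reward_mean": batch["next", "reward"].mean().item()}

trainer.register_op("pre_steps_log", log_mean_reward)
trainer.train()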
Various recipes to build models that correspond to the environment being deployed.
If you feel a feature is missing from the library, please submit an issue! If you would like to contribute to new features, check our call for contributions and our contribution page.
A series of State-of-the-Art implementations are provided with an illustrative purpose:
| Algorithm | Compile Support** | Tensordict-free API | Modular Loss | Continuous and Discrete |
| --- | --- | --- | --- | --- |
| DQN | 1.9x | + | NA | + (through ActionDiscretizer transform) |
| DDPG | 1.87x | + | + | - (continuous only) |
| IQL | 3.22x | + | + | + |
| CQL | 2.68x | + | + | + |
| TD3 | 2.27x | + | + | - (continuous only) |
| TD3+BC | untested | + | + | - (continuous only) |
| A2C | 2.67x | + | - | + |
| PPO | 2.42x | + | - | + |
| SAC | 2.62x | + | - | + |
| REDQ | 2.28x | + | - | - (continuous only) |
| Dreamer v1 | untested | + | + (different classes) | - (continuous only) |
| Decision Transformers | untested | + | NA | - (continuous only) |
| CrossQ | untested | + | + | - (continuous only) |
| GAIL | untested | + | NA | + |
| IMPALA | untested | + | - | + |
| IQL (MARL) | untested | + | + | + |
| DDPG (MARL) | untested | + | + | - (continuous only) |
| PPO (MARL) | untested | + | - | + |
| QMIX-VDN (MARL) | untested | + | NA | + |
| SAC (MARL) | untested | + | - | + |
| RLHF | NA | + | NA | NA |
** The number indicates the expected speed-up compared to eager mode when executed on CPU. Numbers may vary depending on architecture and device.
and many more to come!
Code examples displaying toy code snippets and training scripts are also available.
Check the examples directory for more details about handling the various configuration settings.
We also provide tutorials and demos that give a sense of what the library can do.
If you're using TorchRL, please refer to this BibTeX entry to cite this work:
@misc{bou2023torchrl,
title={TorchRL: A data-driven decision-making library for PyTorch},
author={Albert Bou and Matteo Bettini and Sebastian Dittert and Vikash Kumar and Shagun Sodhani and Xiaomeng Yang and Gianni De Fabritiis and Vincent Moens},
year={2023},
eprint={2306.00577},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Create a conda environment where the packages will be installed.
conda create --name torch_rl python=3.9
conda activate torch_rl
PyTorch
Depending on the use of functorch that you want to make, you may want to install the latest (nightly) PyTorch release or the latest stable version of PyTorch. See here for a detailed list of commands, including pip3 or other special installation instructions.
TorchRL
You can install the latest stable release of TorchRL using
pip3 install torchrl
This should work on Linux, Windows 10 and macOS (Intel or Silicon chips). On certain Windows machines (Windows 11), one should install the library locally (see below).
The nightly build can be installed via
pip3 install torchrl-nightly
which we currently only ship for Linux and macOS (Intel) machines. Importantly, the nightly builds require the nightly builds of PyTorch too.
To install extra dependencies, call
pip3 install "torchrl[atari,dm_control,gym_continuous,rendering,tests,utils,marl,open_spiel,checkpointing]"
or a subset of these.
One may also want to install the library locally. Three main reasons can motivate this.
To install the library locally, start by cloning the repo:
git clone https://github.com/pytorch/rl
and don't forget to check out the branch or tag you want to use for the build:
git checkout v0.4.0
Go to the directory where you have cloned the torchrl repository and install it (after installing ninja):
cd /path/to/torchrl/
pip3 install ninja -U
python setup.py develop
One can also build the wheels to distribute to co-workers using
python setup.py bdist_wheel
Your wheel will be stored there (./dist/torchrl<name>.whl) and installable via
pip install torchrl<name>.whl
WARNING: Unfortunately, pip3 install -e . does not work at the moment. Contributions to help fix this are welcome!
On M1 machines, this should work out-of-the-box with the nightly build of PyTorch. If the generation of this artifact on macOS M1 doesn't work correctly, or if the message (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e')) appears during execution, then try
ARCHFLAGS="-arch arm64" python setup.py develop
To run a quick sanity check, leave that directory (e.g. by executing cd ~/) and try to import the library.
python -c "import torchrl"
This should not return any warning or error.
Optional dependencies
The following libraries can be installed depending on the usage one wants to make of TorchRL:
# diverse
pip3 install tqdm tensorboard "hydra-core>=1.1" hydra-submitit-launcher
# rendering
pip3 install moviepy
# deepmind control suite
pip3 install dm_control
# gym, atari games
pip3 install "gym[atari]" "gym[accept-rom-license]" pygame
# tests
pip3 install pytest pyyaml pytest-instafail
# tensorboard
pip3 install tensorboard
# wandb
pip3 install wandb
Troubleshooting
If a ModuleNotFoundError: No module named 'torchrl._torchrl' error occurs (or a warning indicating that the C++ binaries could not be loaded), it means that the C++ extensions were not installed or not found.
One common reason is that you are trying to import torchrl from within the git repo location. The following code snippet should return an error if torchrl has not been installed in develop mode:
cd ~/path/to/rl/repo
python -c 'from torchrl.envs.libs.gym import GymEnv'
If this is the case, consider executing torchrl from another location than the repo root directory.
If the error persists, check the logs after python setup.py develop. A common cause is a g++/C++ version discrepancy and/or a problem with the ninja library.
Environment information can also be collected with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
python collect_env.py
On Apple Silicon machines this should report something like
OS: macOS *** (arm64)
rather than
OS: macOS **** (x86_64)
Versioning issues can cause error messages of the type undefined symbol and such. For these, refer to the versioning issues document for a complete explanation and proposed workarounds.
If you spot a bug in the library, please raise an issue in this repository.
If you have a more generic question regarding RL in PyTorch, post it on the PyTorch forum.
Internal collaborations to TorchRL are welcome! Feel free to fork, submit issues and PRs. You can check out the detailed contribution guide here. As mentioned above, a list of open contributions can be found here.
Contributors are recommended to install pre-commit hooks (using pre-commit install). pre-commit will check for linting-related issues when the code is committed locally. You can disable the check by appending -n to your commit command: git commit -m <commit message> -n
This library is released as a PyTorch beta feature. BC-breaking changes are likely to happen, but they will be introduced with a deprecation warranty after a few release cycles.
TorchRL is licensed under the MIT License. See LICENSE for details.