Implementation of AlphaFold 3 in PyTorch
You can chat with other researchers about this work here
A review of the paper by Sergey
Illustrated guide by Elana P. Simon
Talk by Max Jaderberg
A fork with full Lightning + Hydra support is maintained by Alex in this repository
A visualization of the molecules of life used within the repository can be viewed and interacted with here
Joseph for contributing the relative positional encoding and the smooth LDDT loss!
Felipe for contributing the weighted rigid align, express coordinates in frame, compute alignment error, and centre random augmentation modules!
Alex for fixing various issues in the transcribed algorithms
Heng for pointing out inconsistencies with the paper and requesting solutions
Heng for finding an issue with the molecule atom indices for the distogram loss
Wei Lu for catching a few erroneous hyperparameters
Alex for the PDB dataset preparation script!
Milot for optimizing the PDB dataset clustering script!
Alex for basically writing the entire gargantuan pipeline from parsing the PDB all the way to the molecule and atomic inputs for training
Andrei for working on the weighted PDB dataset sampling!
Jimin for submitting a small fix to an issue with the coordinates being passed into WeightedRigidAlign
@xluo233 for contributing the confidence measures, clash penalty ranking, and sample ranking logic!
sj900 for integrating and testing the WeightedPDBSampler within the PDBDataset, and for adding initial support for MSA and template parsing!
@xluo233 again for contributing the logic for computing the model selection score as well as the unresolved RASA!
Fandi for discovering a few inconsistencies between the elucidated atom diffusion module and the supplementary
Paolo for proposing the PDB neutral stable molecule hypothesis!
Dhuvi for fixing a bug related to metal ion molecule ID assignment for Alphafold3Inputs!
Dhuvi for taking on the logic for translating Alphafold3Input into a BioMolecule and saving it to mmCIF!
Tom (from the Discord channel) for identifying a discrepancy between this codebase's distogram and template unit vector computations and those of OpenFold (and Andrei for helping address the distogram issue)!
Kaihui for identifying a bug in how non-standard atoms within polymer residues are handled!
Andrei for taking care of the gradio frontend interface!
Patrick for jaxtyping, Florian for einx, and of course, Alex for einops
Soumith and the PyTorch organization for giving me the opportunity to open source this work
$ pip install alphafold3-pytorch
import torch
from alphafold3_pytorch import Alphafold3
from alphafold3_pytorch.utils.model_utils import exclusive_cumsum

alphafold3 = Alphafold3(
    dim_atom_inputs = 77,
    dim_template_feats = 108
)

# mock inputs

seq_len = 16

molecule_atom_indices = torch.randint(0, 2, (2, seq_len)).long()
molecule_atom_lens = torch.full((2, seq_len), 2).long()

atom_seq_len = molecule_atom_lens.sum(dim = -1).amax()
atom_offsets = exclusive_cumsum(molecule_atom_lens)

atom_inputs = torch.randn(2, atom_seq_len, 77)
atompair_inputs = torch.randn(2, atom_seq_len, atom_seq_len, 5)

additional_molecule_feats = torch.randint(0, 2, (2, seq_len, 5))
additional_token_feats = torch.randn(2, seq_len, 33)

is_molecule_types = torch.randint(0, 2, (2, seq_len, 5)).bool()
is_molecule_mod = torch.randint(0, 2, (2, seq_len, 4)).bool()
molecule_ids = torch.randint(0, 32, (2, seq_len))

template_feats = torch.randn(2, 2, seq_len, seq_len, 108)
template_mask = torch.ones((2, 2)).bool()

msa = torch.randn(2, 7, seq_len, 32)
msa_mask = torch.ones((2, 7)).bool()

additional_msa_feats = torch.randn(2, 7, seq_len, 2)

# required for training, but omitted on inference

atom_pos = torch.randn(2, atom_seq_len, 3)

distogram_atom_indices = molecule_atom_lens - 1

distance_labels = torch.randint(0, 37, (2, seq_len, seq_len))
resolved_labels = torch.randint(0, 2, (2, atom_seq_len))

# offset indices correctly

distogram_atom_indices += atom_offsets
molecule_atom_indices += atom_offsets

# train

loss = alphafold3(
    num_recycling_steps = 2,
    atom_inputs = atom_inputs,
    atompair_inputs = atompair_inputs,
    molecule_ids = molecule_ids,
    molecule_atom_lens = molecule_atom_lens,
    additional_molecule_feats = additional_molecule_feats,
    additional_msa_feats = additional_msa_feats,
    additional_token_feats = additional_token_feats,
    is_molecule_types = is_molecule_types,
    is_molecule_mod = is_molecule_mod,
    msa = msa,
    msa_mask = msa_mask,
    templates = template_feats,
    template_mask = template_mask,
    atom_pos = atom_pos,
    distogram_atom_indices = distogram_atom_indices,
    molecule_atom_indices = molecule_atom_indices,
    distance_labels = distance_labels,
    resolved_labels = resolved_labels
)

loss.backward()

# after much training ...

sampled_atom_pos = alphafold3(
    num_recycling_steps = 4,
    num_sample_steps = 16,
    atom_inputs = atom_inputs,
    atompair_inputs = atompair_inputs,
    molecule_ids = molecule_ids,
    molecule_atom_lens = molecule_atom_lens,
    additional_molecule_feats = additional_molecule_feats,
    additional_msa_feats = additional_msa_feats,
    additional_token_feats = additional_token_feats,
    is_molecule_types = is_molecule_types,
    is_molecule_mod = is_molecule_mod,
    msa = msa,
    msa_mask = msa_mask,
    templates = template_feats,
    template_mask = template_mask
)
sampled_atom_pos.shape # (2, atom_seq_len, 3)
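The index offsetting above is needed because molecule_atom_indices and distogram_atom_indices are expressed per molecule, while the atom features are laid out as one flat atom sequence per batch entry. A minimal sketch of the idea, assuming exclusive_cumsum behaves as a standard exclusive prefix sum over the last dimension:

import torch

molecule_atom_lens = torch.full((1, 4), 2).long()                   # 4 molecules, 2 atoms each
offsets = molecule_atom_lens.cumsum(dim = -1) - molecule_atom_lens  # exclusive prefix sum
print(offsets)                                                      # tensor([[0, 2, 4, 6]])

# adding these offsets maps a within-molecule atom index to its position
# in the flattened atom sequence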
An example with molecule-level input handling
import torch
from alphafold3_pytorch import Alphafold3, Alphafold3Input

contrived_protein = 'AG'

mock_atompos = [
    torch.randn(5, 3),   # alanine has 5 non-hydrogen atoms
    torch.randn(4, 3)    # glycine has 4 non-hydrogen atoms
]

train_alphafold3_input = Alphafold3Input(
    proteins = [contrived_protein],
    atom_pos = mock_atompos
)

eval_alphafold3_input = Alphafold3Input(
    proteins = [contrived_protein]
)

# training

alphafold3 = Alphafold3(
    dim_atom_inputs = 3,
    dim_atompair_inputs = 5,
    atoms_per_window = 27,
    dim_template_feats = 108,
    num_molecule_mods = 0,
    confidence_head_kwargs = dict(
        pairformer_depth = 1
    ),
    template_embedder_kwargs = dict(
        pairformer_stack_depth = 1
    ),
    msa_module_kwargs = dict(
        depth = 1
    ),
    pairformer_stack = dict(
        depth = 2
    ),
    diffusion_module_kwargs = dict(
        atom_encoder_depth = 1,
        token_transformer_depth = 1,
        atom_decoder_depth = 1,
    )
)

loss = alphafold3.forward_with_alphafold3_inputs([train_alphafold3_input])
loss.backward()
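For repeated optimization steps, this forward call can be wrapped in an ordinary PyTorch training loop. A minimal sketch follows; the optimizer choice and learning rate are illustrative assumptions rather than values taken from this repository:

from torch.optim import Adam

# hypothetical optimizer and learning rate, purely for illustration
optimizer = Adam(alphafold3.parameters(), lr = 1e-4)

for _ in range(10):
    loss = alphafold3.forward_with_alphafold3_inputs([train_alphafold3_input])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()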
# sampling

alphafold3.eval()
sampled_atom_pos = alphafold3.forward_with_alphafold3_inputs(eval_alphafold3_input)

assert sampled_atom_pos.shape == (1, (5 + 4), 3)
To acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric-unit) complexes in the Protein Data Bank (PDB), and then preprocess them with the scripts referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The two sets of Python scripts below (i.e., filter_pdb_{train,val,test}_mmcifs.py and cluster_pdb_{train,val,test}_mmcifs.py) assume you have downloaded the PDB in the mmCIF file format, placing its first-assembly and asymmetric-unit mmCIF files at data/pdb_data/unfiltered_assembly_mmcifs/ and data/pdb_data/unfiltered_asym_mmcifs/, respectively.
For reproducibility, we recommend downloading the PDB using an AWS snapshot (e.g., 20240101). To do so, refer to AWS's documentation for setting up the AWS CLI locally. Alternatively, on the RCSB website, navigate down to "Download Protocols" and follow the download instructions for your location.
For example, one can use the following commands to download the PDB as two collections of mmCIF files:
# For `assembly1` complexes, use the PDB's `20240101` AWS snapshot:
aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs
# Or as a fallback, use rsync:
rsync -rlpt -v -z --delete --port=33444 \
rsync.rcsb.org::ftp_data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs/

# For asymmetric unit complexes, also use the PDB's `20240101` AWS snapshot:
aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs
# Or as a fallback, use rsync:
rsync -rlpt -v -z --delete --port=33444 \
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs/
WARNING: Downloading the PDB can take up to 700GB of space.
NOTE: The PDB hosts all available AWS snapshots here: https://pdbsnapshots.s3.us-west-2.amazonaws.com/index.html.
After downloading, you should have two directories formatted as follows: https://files.rcsb.org/pub/pdb/data/assemblies/mmCIF/divided/ and https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/
00/
01/
02/
..
zz/
For each of these directories, unzip all of their files:
find ./data/pdb_data/unfiltered_assembly_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
find ./data/pdb_data/unfiltered_asym_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
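As an optional sanity check, you can count the decompressed mmCIF files in each directory, for example:

find ./data/pdb_data/unfiltered_assembly_mmcifs/ -type f -name "*.cif" | wc -l
find ./data/pdb_data/unfiltered_asym_mmcifs/ -type f -name "*.cif" | wc -l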
Next, run the following commands
wget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz
wget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/component-models/complete/chem_comp_model.cif.gz
from the project's root directory to download the latest version of the PDB's Chemical Component Dictionary (CCD) and its structural models. Then extract each of these files using the following command:
find data/ccd_data/ -type f -name "*.gz" -exec gzip -d {} \;
Then run the following commands, with pdb_assembly_dir, pdb_asym_dir, ccd_dir, and mmcif_output_dir replaced by the locations of your local copies of the first-assembly PDB, the asymmetric-unit PDB, and the CCD, and by your desired dataset output directory (i.e., ./data/pdb_data/unfiltered_assembly_mmcifs/, ./data/pdb_data/unfiltered_asym_mmcifs/, ./data/ccd_data/, and ./data/pdb_data/{train,val,test}_mmcifs/).
python scripts/filter_pdb_train_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>
python scripts/filter_pdb_val_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>
python scripts/filter_pdb_test_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>
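For instance, with the default directory layout described above, the training-set filtering step could be invoked as follows (the paths are simply the defaults listed above, not required values):

python scripts/filter_pdb_train_mmcifs.py --mmcif_assembly_dir ./data/pdb_data/unfiltered_assembly_mmcifs/ --mmcif_asym_dir ./data/pdb_data/unfiltered_asym_mmcifs/ --ccd_dir ./data/ccd_data/ --output_dir ./data/pdb_data/train_mmcifs/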
See the scripts for more options. Each first-assembly mmCIF that successfully passes all processing steps will be written to a subdirectory within mmcif_output_dir named according to the mmCIF's second and third PDB ID characters (e.g., 5c).
Next, run the following commands, with mmcif_dir and {train,val,test}_clustering_output_dir replaced, respectively, by your local output directory created with the dataset filtering scripts above and by your desired clustering output directories (i.e., ./data/pdb_data/{train,val,test}_mmcifs/ and ./data/pdb_data/data_caches/{train,val,test}_clusterings/):
python scripts/cluster_pdb_train_mmcifs.py --mmcif_dir <mmcif_dir> --output_dir <train_clustering_output_dir> --clustering_filtered_pdb_dataset
python scripts/cluster_pdb_val_mmcifs.py --mmcif_dir <mmcif_dir> --reference_clustering_dir <train_clustering_output_dir> --output_dir <val_clustering_output_dir> --clustering_filtered_pdb_dataset
python scripts/cluster_pdb_test_mmcifs.py --mmcif_dir <mmcif_dir> --reference_1_clustering_dir <train_clustering_output_dir> --reference_2_clustering_dir <val_clustering_output_dir> --output_dir <test_clustering_output_dir> --clustering_filtered_pdb_dataset
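For example, with the default directories above, the training-set clustering step could be invoked as (the paths again are the illustrative defaults):

python scripts/cluster_pdb_train_mmcifs.py --mmcif_dir ./data/pdb_data/train_mmcifs/ --output_dir ./data/pdb_data/data_caches/train_clusterings/ --clustering_filtered_pdb_dataset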
Note: The --clustering_filtered_pdb_dataset flag is recommended when clustering the filtered PDB dataset curated with the scripts above, as it enables faster runtimes in this context (since the filtering leaves each chain's residue IDs 1-based). However, this flag must not be provided when clustering other (i.e., non-PDB) datasets of mmCIF files; otherwise, interface clustering may be performed incorrectly, as those datasets' mmCIF files may not use strict 1-based residue indexing for each chain.
Note: One can instead download the preprocessed (i.e., filtered) mmCIF (train/val/test) files (~25GB, comprising 148k complexes) and the chain/interface clustering (train/val/test) files for the PDB's 20240101 AWS snapshot via a shared OneDrive folder. Each of these tar.gz archives should be decompressed within the data/pdb_data/ directory, e.g., via tar -xzf data_caches.tar.gz -C data/pdb_data/. One can also download and prepare the PDB distillation data using the script scripts/distillation_data_download.sh as a reference. Once downloaded, scripts/reduce_uniprot_predictions_to_pdb.py can be run to filter this dataset down to only those examples associated with at least one PDB entry. Moreover, for convenience, a mapping of UniProt accession IDs to PDB IDs for training on the PDB distillation data has already been downloaded and extracted to data/afdb_data/data_caches/uniprot_to_pdb_id_mapping.dat.
At the root of the project, run
$ sh ./contribute.sh
Then, add your module to alphafold3_pytorch/alphafold3.py, add your tests to tests/test_af3.py, and submit a pull request. You can run the tests locally with
$ pytest tests/
The included Dockerfile contains the required dependencies to run the package and to train/inference using PyTorch with GPUs.
The default base image is pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime, and the latest version of this package is installed from the main GitHub branch.
## Build Docker Container
docker build -t af3 .
Alternatively, use build arguments to rebuild the image with different software versions:
PYTORCH_TAG: Changes the base image and thus builds with different PyTorch, CUDA, and/or cuDNN versions.
GIT_TAG: Changes the tag of this repository to clone and install the package.
For example:
## Use build argument to change versions
docker build --build-arg "PYTORCH_TAG=2.2.1-cuda12.1-cudnn8-devel" --build-arg "GIT_TAG=0.1.15" -t af3 .
Then, run the container with GPUs and mount a local volume (for training) using the following command:
## Run Container
docker run -v .:/data --gpus all -it af3
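A command can also be appended to the run invocation to launch work directly inside the container; the script name below is a placeholder for whatever entry point you mount via the local volume, not a file shipped with this repository:

## Run a (hypothetical) training script mounted into the container
docker run -v .:/data --gpus all -it af3 python /data/train.py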
@article { Abramson2024-fj ,
title = " Accurate structure prediction of biomolecular interactions with
{AlphaFold} 3 " ,
author = " Abramson, Josh and Adler, Jonas and Dunger, Jack and Evans,
Richard and Green, Tim and Pritzel, Alexander and Ronneberger,
Olaf and Willmore, Lindsay and Ballard, Andrew J and Bambrick,
Joshua and Bodenstein, Sebastian W and Evans, David A and Hung,
Chia-Chun and O'Neill, Michael and Reiman, David and
Tunyasuvunakool, Kathryn and Wu, Zachary and {\v Z}emgulyt{\.e},
Akvil{\.e} and Arvaniti, Eirini and Beattie, Charles and
Bertolli, Ottavia and Bridgland, Alex and Cherepanov, Alexey and
Congreve, Miles and Cowen-Rivers, Alexander I and Cowie, Andrew
and Figurnov, Michael and Fuchs, Fabian B and Gladman, Hannah and
Jain, Rishub and Khan, Yousuf A and Low, Caroline M R and Perlin,
Kuba and Potapenko, Anna and Savy, Pascal and Singh, Sukhdeep and
Stecula, Adrian and Thillaisundaram, Ashok and Tong, Catherine
and Yakneen, Sergei and Zhong, Ellen D and Zielinski, Michal and
{\v Z}{\'\i}dek, Augustin and Bapst, Victor and Kohli, Pushmeet
and Jaderberg, Max and Hassabis, Demis and Jumper, John M " ,
journal = " Nature " ,
month = " May " ,
year = 2024
}
@inproceedings { Darcet2023VisionTN ,
title = { Vision Transformers Need Registers } ,
author = { Timoth{\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski } ,
year = { 2023 } ,
url = { https://api.semanticscholar.org/CorpusID:263134283 }
}
@article { Arora2024SimpleLA ,
title = { Simple linear attention language models balance the recall-throughput tradeoff } ,
author = { Simran Arora and Sabri Eyuboglu and Michael Zhang and Aman Timalsina and Silas Alberti and Dylan Zinsley and James Zou and Atri Rudra and Christopher R{\'e} } ,
journal = { ArXiv } ,
year = { 2024 } ,
volume = { abs/2402.18668 } ,
url = { https://api.semanticscholar.org/CorpusID:268063190 }
}
@article { Puny2021FrameAF ,
title = { Frame Averaging for Invariant and Equivariant Network Design } ,
author = { Omri Puny and Matan Atzmon and Heli Ben-Hamu and Edward James Smith and Ishan Misra and Aditya Grover and Yaron Lipman } ,
journal = { ArXiv } ,
year = { 2021 } ,
volume = { abs/2110.03336 } ,
url = { https://api.semanticscholar.org/CorpusID:238419638 }
}
@article { Duval2023FAENetFA ,
title = { FAENet: Frame Averaging Equivariant GNN for Materials Modeling } ,
author = { Alexandre Duval and Victor Schmidt and Alex Hernandez Garcia and Santiago Miret and Fragkiskos D. Malliaros and Yoshua Bengio and David Rolnick } ,
journal = { ArXiv } ,
year = { 2023 } ,
volume = { abs/2305.05577 } ,
url = { https://api.semanticscholar.org/CorpusID:258564608 }
}
@article { Wang2022DeepNetST ,
title = { DeepNet: Scaling Transformers to 1,000 Layers } ,
author = { Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei } ,
journal = { ArXiv } ,
year = { 2022 } ,
volume = { abs/2203.00555 } ,
url = { https://api.semanticscholar.org/CorpusID:247187905 }
}
@inproceedings { Ainslie2023CoLT5FL ,
title = { CoLT5: Faster Long-Range Transformers with Conditional Computation } ,
author = { Joshua Ainslie and Tao Lei and Michiel de Jong and Santiago Onta{\~n}{\'o}n and Siddhartha Brahma and Yury Zemlyanskiy and David Uthus and Mandy Guo and James Lee-Thorp and Yi Tay and Yun-Hsuan Sung and Sumit Sanghai } ,
year = { 2023 }
}
@article { Ash2019OnTD ,
title = { On the Difficulty of Warm-Starting Neural Network Training } ,
author = { Jordan T. Ash and Ryan P. Adams } ,
journal = { ArXiv } ,
year = { 2019 } ,
volume = { abs/1910.08475 } ,
url = { https://api.semanticscholar.org/CorpusID:204788802 }
}
@ARTICLE { Heinzinger2023.07.23.550085 ,
author = { Michael Heinzinger and Konstantin Weissenow and Joaquin Gomez Sanchez and Adrian Henkel and Martin Steinegger and Burkhard Rost } ,
title = { ProstT5: Bilingual Language Model for Protein Sequence and Structure } ,
year = { 2023 } ,
doi = { 10.1101/2023.07.23.550085 } ,
journal = { bioRxiv }
}
@article { Lin2022.07.20.500902 ,
author = { Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Santos Costa, Allan dos and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and Rives, Alexander } ,
title = { Language models of protein sequences at the scale of evolution enable accurate structure prediction } ,
elocation-id = { 2022.07.20.500902 } ,
year = { 2022 } ,
doi = { 10.1101/2022.07.20.500902 } ,
publisher = { Cold Spring Harbor Laboratory } ,
URL = { https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902 } ,
eprint = { https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902.full.pdf } ,
journal = { bioRxiv }
}
@article { Li2024SwitchEA ,
title = { Switch EMA: A Free Lunch for Better Flatness and Sharpness } ,
author = { Siyuan Li and Zicheng Liu and Juanxi Tian and Ge Wang and Zedong Wang and Weiyang Jin and Di Wu and Cheng Tan and Tao Lin and Yang Liu and Baigui Sun and Stan Z. Li } ,
journal = { ArXiv } ,
year = { 2024 } ,
volume = { abs/2402.09240 } ,
url = { https://api.semanticscholar.org/CorpusID:267657558 }
}
@article { Nguyen2023MitigatingOI ,
title = { Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals } ,
author = { Tam Nguyen and Tan M. Nguyen and Richard G. Baraniuk } ,
journal = { ArXiv } ,
year = { 2023 } ,
volume = { abs/2312.00751 } ,
url = { https://api.semanticscholar.org/CorpusID:264300597 }
}
@inproceedings { Zhou2024ValueRL ,
title = { Value Residual Learning For Alleviating Attention Concentration In Transformers } ,
author = { Zhanchao Zhou and Tianyi Wu and Zhiyun Jiang and Zhenzhong Lan } ,
year = { 2024 } ,
url = { https://api.semanticscholar.org/CorpusID:273532030 }
}