Implementation of AlphaFold 3 in PyTorch
You can chat with other researchers about this work here
Review of the paper by Sergey
Illustrated guide by Elana P. Simon
Talk by Max Jaderberg
A fork with full Lightning + Hydra support is maintained by Alex in this repository
A visualization of the molecules of life used in the repository can be viewed and interacted with here
Joseph for contributing the relative positional encoding and the smooth LDDT loss!
Felipe for contributing the weighted rigid align, express coordinates in frame, compute alignment error, and centre random augmentation modules!
Alex for fixing various issues in the transcribed algorithms
Heng for pointing out inconsistencies with the paper and pull-requesting the solutions
Heng for finding an issue with the molecule atom indices for the distogram loss
Wei Lu for catching a few erroneous hyperparameters
Alex for the PDB dataset preparation script!
Milot for optimizing the PDB dataset clustering script!
Alex for basically writing the entire gargantuan flow from parsing the PDB all the way to the molecule and atom inputs for training
Andrei for working on the weighted PDB dataset sampling!
Jimin for submitting a small fix for an issue with the coordinates passed into WeightedRigidAlign
@xluo233 for contributing the confidence measures, clash penalty ranking, and sample ranking logic!
sj900 for integrating and testing the WeightedPDBSampler within the PDBDataset, and for adding initial support for MSA and template parsing!
@xluo233 again for contributing the logic for computing the model selection score as well as the unresolved rasa!
Fandi for discovering a few inconsistencies between the elucidated atom diffusion module and the supplementary
Paolo for proposing the PDB neutral stable molecule hypothesis!
Dhuvi for fixing a bug related to molecule ID assignment for metal ions in Alphafold3Inputs!
Dhuvi for taking on the logic for translating an Alphafold3Input into a BioMolecule for saving to mmCIF!
Tom (from the Discord channel) for identifying a discrepancy between this codebase's distogram and template unit vector computations and those of OpenFold (with Andrei helping to resolve the distogram issue)!
Kaihui for finding a bug in how nonstandard atoms in polymer residues are handled!
Andrei for taking on the gradio frontend interface!
Patrick for jaxtyping, Florian for einx, and of course, Alex for einops
Soumith and the PyTorch organization for giving me the opportunity to open source this work
$ pip install alphafold3-pytorch
import torch
from alphafold3_pytorch import Alphafold3
from alphafold3_pytorch.utils.model_utils import exclusive_cumsum
alphafold3 = Alphafold3(
    dim_atom_inputs = 77,
    dim_template_feats = 108
)
# mock inputs
seq_len = 16
molecule_atom_indices = torch.randint(0, 2, (2, seq_len)).long()
molecule_atom_lens = torch.full((2, seq_len), 2).long()

atom_seq_len = molecule_atom_lens.sum(dim = -1).amax()
atom_offsets = exclusive_cumsum(molecule_atom_lens)
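# atom_offsets holds each molecule's starting index in the flattened atom sequence
# (exclusive cumulative sum of molecule_atom_lens); it is added to the per-molecule atom indices below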
atom_inputs = torch.randn(2, atom_seq_len, 77)
atompair_inputs = torch.randn(2, atom_seq_len, atom_seq_len, 5)

additional_molecule_feats = torch.randint(0, 2, (2, seq_len, 5))
additional_token_feats = torch.randn(2, seq_len, 33)

is_molecule_types = torch.randint(0, 2, (2, seq_len, 5)).bool()
is_molecule_mod = torch.randint(0, 2, (2, seq_len, 4)).bool()
molecule_ids = torch.randint(0, 32, (2, seq_len))

template_feats = torch.randn(2, 2, seq_len, seq_len, 108)
template_mask = torch.ones((2, 2)).bool()

msa = torch.randn(2, 7, seq_len, 32)
msa_mask = torch.ones((2, 7)).bool()

additional_msa_feats = torch.randn(2, 7, seq_len, 2)
# required for training, but omitted on inference
atom_pos = torch.randn(2, atom_seq_len, 3)
distogram_atom_indices = molecule_atom_lens - 1
distance_labels = torch.randint(0, 37, (2, seq_len, seq_len))
resolved_labels = torch.randint(0, 2, (2, atom_seq_len))
# offset indices correctly
distogram_atom_indices += atom_offsets
molecule_atom_indices += atom_offsets
# train
loss = alphafold3(
    num_recycling_steps = 2,
    atom_inputs = atom_inputs,
    atompair_inputs = atompair_inputs,
    molecule_ids = molecule_ids,
    molecule_atom_lens = molecule_atom_lens,
    additional_molecule_feats = additional_molecule_feats,
    additional_msa_feats = additional_msa_feats,
    additional_token_feats = additional_token_feats,
    is_molecule_types = is_molecule_types,
    is_molecule_mod = is_molecule_mod,
    msa = msa,
    msa_mask = msa_mask,
    templates = template_feats,
    template_mask = template_mask,
    atom_pos = atom_pos,
    distogram_atom_indices = distogram_atom_indices,
    molecule_atom_indices = molecule_atom_indices,
    distance_labels = distance_labels,
    resolved_labels = resolved_labels
)

loss.backward()
# after much training ...
sampled_atom_pos = alphafold3(
    num_recycling_steps = 4,
    num_sample_steps = 16,
    atom_inputs = atom_inputs,
    atompair_inputs = atompair_inputs,
    molecule_ids = molecule_ids,
    molecule_atom_lens = molecule_atom_lens,
    additional_molecule_feats = additional_molecule_feats,
    additional_msa_feats = additional_msa_feats,
    additional_token_feats = additional_token_feats,
    is_molecule_types = is_molecule_types,
    is_molecule_mod = is_molecule_mod,
    msa = msa,
    msa_mask = msa_mask,
    templates = template_feats,
    template_mask = template_mask
)
sampled_atom_pos.shape # (2, <atom_seq_len>, 3)
An example with molecule-level input handling
import torch
from alphafold3_pytorch import Alphafold3, Alphafold3Input
contrived_protein = 'AG'
mock_atompos = [
    torch.randn(5, 3),   # alanine has 5 non-hydrogen atoms
    torch.randn(4, 3)    # glycine has 4 non-hydrogen atoms
]

train_alphafold3_input = Alphafold3Input(
    proteins = [contrived_protein],
    atom_pos = mock_atompos
)

eval_alphafold3_input = Alphafold3Input(
    proteins = [contrived_protein]
)
# training
alphafold3 = Alphafold3(
    dim_atom_inputs = 3,
    dim_atompair_inputs = 5,
    atoms_per_window = 27,
    dim_template_feats = 108,
    num_molecule_mods = 0,
    confidence_head_kwargs = dict(
        pairformer_depth = 1
    ),
    template_embedder_kwargs = dict(
        pairformer_stack_depth = 1
    ),
    msa_module_kwargs = dict(
        depth = 1
    ),
    pairformer_stack = dict(
        depth = 2
    ),
    diffusion_module_kwargs = dict(
        atom_encoder_depth = 1,
        token_transformer_depth = 1,
        atom_decoder_depth = 1,
    )
)

loss = alphafold3.forward_with_alphafold3_inputs([train_alphafold3_input])
loss.backward()
# sampling
alphafold3.eval()
sampled_atom_pos = alphafold3.forward_with_alphafold3_inputs(eval_alphafold3_input)
assert sampled_atom_pos.shape == (1, (5 + 4), 3)
To acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric unit) complexes from the Protein Data Bank (PDB), then preprocess them with the scripts referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The two groups of Python scripts below (i.e., filter_pdb_{train,val,test}_mmcifs.py and cluster_pdb_{train,val,test}_mmcifs.py) assume you have downloaded the PDB in the mmCIF file format and placed its first-assembly and asymmetric unit mmCIF files at data/pdb_data/unfiltered_assembly_mmcifs/ and data/pdb_data/unfiltered_asym_mmcifs/, respectively.
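For instance, a minimal sketch (run from the project root) that creates these two expected input directories before syncing the PDB into them:
from pathlib import Path

# expected locations of the downloaded first-assembly and asymmetric unit mmCIF files
for subdir in ("unfiltered_assembly_mmcifs", "unfiltered_asym_mmcifs"):
    Path("data/pdb_data", subdir).mkdir(parents = True, exist_ok = True)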
For reproducibility, we recommend downloading the PDB using an AWS snapshot (e.g., 20240101). To do so, refer to the AWS documentation to set up the AWS CLI locally. Alternatively, on the RCSB website, navigate to "Download Protocols" and follow the download instructions for your location.
For example, the following commands download the PDB as two collections of mmCIF files:
# For `assembly1` complexes, use the PDB's `20240101` AWS snapshot:
aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs
# Or as a fallback, use rsync:
rsync -rlpt -v -z --delete --port=33444 \
  rsync.rcsb.org::ftp_data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs/
# For asymmetric unit complexes, also use the PDB's `20240101` AWS snapshot:
aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs
# Or as a fallback, use rsync:
rsync -rlpt -v -z --delete --port=33444 \
  rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs/
WARNING: Downloading the PDB can take up to 700GB of space.
NOTE: The PDB hosts all available AWS snapshots here: https://pdbsnapshots.s3.us-west-2.amazonaws.com/index.html.
After downloading, you should have two directories formatted like https://files.rcsb.org/pub/pdb/data/assemblies/mmCIF/divided/ and https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/:
00/
01/
02/
..
zz/
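To sanity-check the download, here is an illustrative snippet (assuming the assembly mmCIFs were synced to the default location above) that lists the two-character shard directories:
from pathlib import Path

root = Path("data/pdb_data/unfiltered_assembly_mmcifs")
shards = sorted(p.name for p in root.iterdir() if p.is_dir())
print(len(shards), shards[:3], shards[-1])  # expect shard names such as '00', '01', ..., 'zz'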
Unzip all files in these two directories:
find ./data/pdb_data/unfiltered_assembly_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
find ./data/pdb_data/unfiltered_asym_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
Next, from the project root directory, run the following commands to download the latest version of the PDB's Chemical Component Dictionary (CCD) and its structural models:
wget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz
wget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/component-models/complete/chem_comp_model.cif.gz
Then extract each of these files using the following command:
find data/ccd_data/ -type f -name "*.gz" -exec gzip -d {} \;
Then run the following commands, replacing pdb_assembly_dir, pdb_asym_dir, ccd_dir, and mmcif_output_dir with the locations of your local copies of the first-assembly PDB, the asymmetric unit PDB, and the CCD, and with your desired dataset output directory (i.e., ./data/pdb_data/unfiltered_assembly_mmcifs/, ./data/pdb_data/unfiltered_asym_mmcifs/, ./data/ccd_data/, and ./data/pdb_data/{train,val,test}_mmcifs/).
python scripts/filter_pdb_train_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>
python scripts/filter_pdb_val_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>
python scripts/filter_pdb_test_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>
See the scripts for more options. Each first-assembly mmCIF that successfully passes all processing steps will be written to a subdirectory within mmcif_output_dir named according to the mmCIF's second and third PDB ID characters (e.g., 5c).
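As an illustration only (this helper is not part of the repo's scripts), the subdirectory name can be derived from a PDB ID like so:
def mmcif_output_subdir(pdb_id: str) -> str:
    # the subdirectory is named after the second and third characters of the PDB ID
    return pdb_id[1:3].lower()

assert mmcif_output_subdir("15C8") == "5c"   # a hypothetical PDB ID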
Next, run the following commands, replacing mmcif_dir and {train,val,test}_clustering_output_dir with, respectively, the local output directory created with the dataset filtering scripts above and your desired clustering output directories (i.e., ./data/pdb_data/{train,val,test}_mmcifs/ and ./data/pdb_data/data_caches/{train,val,test}_clusterings/):
python scripts/cluster_pdb_train_mmcifs.py --mmcif_dir <mmcif_dir> --output_dir <train_clustering_output_dir> --clustering_filtered_pdb_dataset
python scripts/cluster_pdb_val_mmcifs.py --mmcif_dir <mmcif_dir> --reference_clustering_dir <train_clustering_output_dir> --output_dir <val_clustering_output_dir> --clustering_filtered_pdb_dataset
python scripts/cluster_pdb_test_mmcifs.py --mmcif_dir <mmcif_dir> --reference_1_clustering_dir <train_clustering_output_dir> --reference_2_clustering_dir <val_clustering_output_dir> --output_dir <test_clustering_output_dir> --clustering_filtered_pdb_dataset
NOTE: The --clustering_filtered_pdb_dataset flag is recommended when clustering the filtered PDB dataset curated with the scripts above, since it enables faster runtimes in this context (because the filtering leaves each chain's residue IDs 1-based). However, this flag must not be provided when clustering other (i.e., non-PDB) datasets of mmCIF files; otherwise, interface clustering may be performed incorrectly, since those datasets' mmCIF files may not use strict 1-based residue indexing for each chain.
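If in doubt about a non-PDB dataset, one quick way to spot-check residue numbering is with Biopython (used here only for illustration; it is not one of the repo's scripts, and the file path is hypothetical):
from Bio.PDB.MMCIFParser import MMCIFParser

parser = MMCIFParser(QUIET = True)
structure = parser.get_structure("example", "path/to/your.cif")  # hypothetical mmCIF path

for chain in structure[0]:
    res_ids = [res.id[1] for res in chain if res.id[0] == " "]  # polymer residues only
    if res_ids:
        print(chain.id, min(res_ids))  # 1 for every chain under strict 1-based numbering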
NOTE: One can instead download preprocessed (i.e., filtered) mmCIF (train/val/test) files (~25GB, comprising 148k complexes) as well as chain/interface clustering (train/val/test) files for the PDB's 20240101 AWS snapshot via a shared OneDrive folder. Each of these tar.gz archives should be decompressed within the data/pdb_data/ directory, e.g., via tar -xzf data_caches.tar.gz -C data/pdb_data/. One can also download and prepare PDB distillation data using the script scripts/distillation_data_download.sh as a reference. Once downloaded, run scripts/reduce_uniprot_predictions_to_pdb.py to filter this dataset down to only the examples associated with at least one PDB entry. Moreover, for convenience, a mapping of UniProt accession IDs to PDB IDs for training on PDB distillation data has already been downloaded and extracted as data/afdb_data/data_caches/uniprot_to_pdb_id_mapping.dat.
At the project root, run
$ sh ./contribute.sh
Then, add your module to alphafold3_pytorch/alphafold3.py, add your tests to tests/test_af3.py, and submit a pull request. You can run the tests locally with
$ pytest tests/
The included Dockerfile contains the required dependencies to run the package and to train/run inference using PyTorch with GPUs.
The default base image is pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime and installs the latest current version of this package from the main GitHub branch.
## Build Docker Container
docker build -t af3 .
Alternatively, use build arguments to rebuild the image with different software versions:
PYTORCH_TAG: changes the base image, and thus builds with different PyTorch, CUDA, and/or cuDNN versions.
GIT_TAG: changes the tag of this repo to clone and install the package.
For example:
## Use build argument to change versions
docker build --build-arg "PYTORCH_TAG=2.2.1-cuda12.1-cudnn8-devel" --build-arg "GIT_TAG=0.1.15" -t af3 .
Then, run the container with GPUs and mount a local volume (for training) with the following command:
## Run Container
docker run -v .:/data --gpus all -it af3
@article { Abramson2024-fj ,
title = " Accurate structure prediction of biomolecular interactions with
{AlphaFold} 3 " ,
author = " Abramson, Josh and Adler, Jonas and Dunger, Jack and Evans,
Richard and Green, Tim and Pritzel, Alexander and Ronneberger,
Olaf and Willmore, Lindsay and Ballard, Andrew J and Bambrick,
Joshua and Bodenstein, Sebastian W and Evans, David A and Hung,
Chia-Chun and O'Neill, Michael and Reiman, David and
Tunyasuvunakool, Kathryn and Wu, Zachary and {\v Z}emgulyt{\.e},
Akvil{\.e} and Arvaniti, Eirini and Beattie, Charles and
Bertolli, Ottavia and Bridgland, Alex and Cherepanov, Alexey and
Congreve, Miles and Cowen-Rivers, Alexander I and Cowie, Andrew
and Figurnov, Michael and Fuchs, Fabian B and Gladman, Hannah and
Jain, Rishub and Khan, Yousuf A and Low, Caroline M R and Perlin,
Kuba and Potapenko, Anna and Savy, Pascal and Singh, Sukhdeep and
Stecula, Adrian and Thillaisundaram, Ashok and Tong, Catherine
and Yakneen, Sergei and Zhong, Ellen D and Zielinski, Michal and
{\v Z}{\'i}dek, Augustin and Bapst, Victor and Kohli, Pushmeet
and Jaderberg, Max and Hassabis, Demis and Jumper, John M " ,
journal = " Nature " ,
month = " May " ,
year = 2024
}
@inproceedings { Darcet2023VisionTN ,
title = { Vision Transformers Need Registers } ,
author = { Timoth{\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski } ,
year = { 2023 } ,
url = { https://api.semanticscholar.org/CorpusID:263134283 }
}
@article { Arora2024SimpleLA ,
title = { Simple linear attention language models balance the recall-throughput tradeoff } ,
author = { Simran Arora and Sabri Eyuboglu and Michael Zhang and Aman Timalsina and Silas Alberti and Dylan Zinsley and James Zou and Atri Rudra and Christopher R{\'e} } ,
journal = { ArXiv } ,
year = { 2024 } ,
volume = { abs/2402.18668 } ,
url = { https://api.semanticscholar.org/CorpusID:268063190 }
}
@article { Puny2021FrameAF ,
title = { Frame Averaging for Invariant and Equivariant Network Design } ,
author = { Omri Puny and Matan Atzmon and Heli Ben-Hamu and Edward James Smith and Ishan Misra and Aditya Grover and Yaron Lipman } ,
journal = { ArXiv } ,
year = { 2021 } ,
volume = { abs/2110.03336 } ,
url = { https://api.semanticscholar.org/CorpusID:238419638 }
}
@article { Duval2023FAENetFA ,
title = { FAENet: Frame Averaging Equivariant GNN for Materials Modeling } ,
author = { Alexandre Duval and Victor Schmidt and Alex Hernandez Garcia and Santiago Miret and Fragkiskos D. Malliaros and Yoshua Bengio and David Rolnick } ,
journal = { ArXiv } ,
year = { 2023 } ,
volume = { abs/2305.05577 } ,
url = { https://api.semanticscholar.org/CorpusID:258564608 }
}
@article { Wang2022DeepNetST ,
title = { DeepNet: Scaling Transformers to 1,000 Layers } ,
author = { Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei } ,
journal = { ArXiv } ,
year = { 2022 } ,
volume = { abs/2203.00555 } ,
url = { https://api.semanticscholar.org/CorpusID:247187905 }
}
@inproceedings { Ainslie2023CoLT5FL ,
title = { CoLT5: Faster Long-Range Transformers with Conditional Computation } ,
author = { Joshua Ainslie and Tao Lei and Michiel de Jong and Santiago Ontan{\'o}n and Siddhartha Brahma and Yury Zemlyanskiy and David Uthus and Mandy Guo and James Lee-Thorp and Yi Tay and Yun-Hsuan Sung and Sumit Sanghai } ,
year = { 2023 }
}
@article { Ash2019OnTD ,
title = { On the Difficulty of Warm-Starting Neural Network Training } ,
author = { Jordan T. Ash and Ryan P. Adams } ,
journal = { ArXiv } ,
year = { 2019 } ,
volume = { abs/1910.08475 } ,
url = { https://api.semanticscholar.org/CorpusID:204788802 }
}
@ARTICLE { Heinzinger2023.07.23.550085 ,
author = { Michael Heinzinger and Konstantin Weissenow and Joaquin Gomez Sanchez and Adrian Henkel and Martin Steinegger and Burkhard Rost } ,
title = { ProstT5: Bilingual Language Model for Protein Sequence and Structure } ,
year = { 2023 } ,
doi = { 10.1101/2023.07.23.550085 } ,
journal = { bioRxiv }
}
@article { Lin2022.07.20.500902 ,
author = { Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Santos Costa, Allan dos and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and Rives, Alexander } ,
title = { Language models of protein sequences at the scale of evolution enable accurate structure prediction } ,
elocation-id = { 2022.07.20.500902 } ,
year = { 2022 } ,
doi = { 10.1101/2022.07.20.500902 } ,
publisher = { Cold Spring Harbor Laboratory } ,
URL = { https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902 } ,
eprint = { https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902.full.pdf } ,
journal = { bioRxiv }
}
@article { Li2024SwitchEA ,
title = { Switch EMA: A Free Lunch for Better Flatness and Sharpness } ,
author = { Siyuan Li and Zicheng Liu and Juanxi Tian and Ge Wang and Zedong Wang and Weiyang Jin and Di Wu and Cheng Tan and Tao Lin and Yang Liu and Baigui Sun and Stan Z. Li } ,
journal = { ArXiv } ,
year = { 2024 } ,
volume = { abs/2402.09240 } ,
url = { https://api.semanticscholar.org/CorpusID:267657558 }
}
@article { Nguyen2023MitigatingOI ,
title = { Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals } ,
author = { Tam Nguyen and Tan M. Nguyen and Richard G. Baraniuk } ,
journal = { ArXiv } ,
year = { 2023 } ,
volume = { abs/2312.00751 } ,
url = { https://api.semanticscholar.org/CorpusID:264300597 }
}
@inproceedings { Zhou2024ValueRL ,
title = { Value Residual Learning For Alleviating Attention Concentration In Transformers } ,
author = { Zhanchao Zhou and Tianyi Wu and Zhiyun Jiang and Zhenzhong Lan } ,
year = { 2024 } ,
url = { https://api.semanticscholar.org/CorpusID:273532030 }
}