Evolutionary Scale Modeling

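April 2023 update: Code for two simultaneous preprints on protein design is now released! Code for "Language models generalize beyond natural proteins" is under examples/lm-design/. Code for "A high-level programming language for generative protein design" is under examples/protein-programming-language/.

This repository contains code and pre-trained weights for Transformer protein language models from the Meta Fundamental AI Research Protein Team (FAIR), including our state-of-the-art ESM-2 and ESMFold, as well as MSA Transformer, ESM-1v for predicting variant effects, and ESM-IF1 for inverse folding. Transformer protein language models were introduced in the 2019 preprint of the paper "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences". ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks. ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end, directly from the sequence of a protein.

In November 2022, we released v0 of the ESM Metagenomic Atlas, an open atlas of 617 million predicted metagenomic protein structures. The Atlas was updated in March 2023 in collaboration with EBI. The new v2023_02 adds another 150 million predicted structures to the Atlas, as well as pre-computed ESM2 embeddings. Bulk download, the blog post, and the resources provided on the Atlas website are documented in this README.

In December 2022, we released two simultaneous preprints on protein design.

"Language models generalize beyond natural proteins" (PAPER, CODE) uses ESM2 to design de novo proteins. The code and data associated with the preprint can be found here.
"A high-level programming language for generative protein design" (PAPER, CODE) uses ESMFold to design proteins according to a high-level programming language.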

Citation

For ESM2, ESMFold and ESM Atlas:

```bibtex
@article{lin2023evolutionary,
  title = {Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author = {Zeming Lin and Halil Akin and Roshan Rao and Brian Hie and Zhongkai Zhu and Wenting Lu and Nikita Smetanin and Robert Verkuil and Ori Kabeli and Yaniv Shmueli and Allan dos Santos Costa and Maryam Fazel-Zarandi and Tom Sercu and Salvatore Candido and Alexander Rives},
  journal = {Science},
  volume = {379},
  number = {6637},
  pages = {1123-1130},
  year = {2023},
  doi = {10.1126/science.ade2574},
  url = {https://www.science.org/doi/abs/10.1126/science.ade2574},
  note = {Earlier versions as preprint: bioRxiv 2022.07.20.500902},
}
```

For transformer protein language models:

 @article { rives2021biological ,
  title = { Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences } ,
  author = { Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and others } ,
  journal = { Proceedings of the National Academy of Sciences } ,
  volume = { 118 } ,
  number = { 15 } ,
  pages = { e2016239118 } ,
  year = { 2021 } ,
  publisher = { National Acad Sciences } ,
  note = { bioRxiv 10.1101/622803 } ,
  doi = { 10.1073/pnas.2016239118 } ,
  url = { https://www.pnas.org/doi/full/10.1073/pnas.2016239118 } ,
}

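Main models you should use
Usage
- Quick start
- Getting started with this repo
- ESMFold structure prediction
- Compute embeddings in bulk from FASTA
- CPU offloading for inference with large models
- Zero-shot variant prediction
- Inverse folding
ESM Metagenomic Atlas
Notebooks
Available models and datasets
- Pre-trained models
- ESM Structural Split Dataset
- Pre-training dataset split
- Comparison to related works
Citations
License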

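What's new

April 2023: Code for the protein design preprints released under examples/lm-design/.
March 2023: We released an update to the ESM Metagenomic Atlas, v2023_02. See the website and bulk download details.
December 2022: The Meta Fundamental AI Research Protein Team (FAIR) released two simultaneous preprints on protein design: "Language models generalize beyond natural proteins" (Verkuil, Kabeli, et al., 2022) and "A high-level programming language for generative protein design" (Hie, Candido, et al., 2022).
November 2022: ESM Metagenomic Atlas, a repository of 600M+ metagenomic structures, released. See the website and bulk download details.
November 2022: ESMFold - new end-to-end structure prediction model released (see Lin et al. 2022).
August 2022: ESM-2 - new SOTA language models released (see Lin et al. 2022).
April 2022: New inverse folding model ESM-IF1 released, trained on CATH and UniRef50 predicted structures.
August 2021: Added flexibility to the tokenizer to allow for spaces and special tokens (like <mask>) in sequences.
July 2021: New pre-trained model ESM-1v released, trained on UniRef90 (see Meier et al. 2021).
July 2021: New MSA Transformer released, with a minor fix in the row positional embeddings (ESM-MSA-1b).
February 2021: MSA Transformer added (see Rao et al. 2021). Example usage in notebook.
December 2020: Self-attention contacts for all pre-trained models (see Rao et al. 2020).
December 2020: Added new pre-trained model ESM-1b (see Rives et al. 2019 Appendix B).
December 2020: ESM Structural Split Dataset (see Rives et al. 2019 Appendix A.10).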

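Main models you should use

Shorthand	`esm.pretrained.`	Dataset	Description
ESM-2	`esm2_t36_3B_UR50D()` `esm2_t48_15B_UR50D()`	UR50 (sample UR90)	SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with Lin et al. 2022 (Aug 2022 update).
ESMFold	`esmfold_v1()`	PDB + UR50	End-to-end single sequence 3D structure predictor (Nov 2022 update).
ESM-MSA-1b	`esm_msa1b_t12_100M_UR50S()`	UR50 + MSA	MSA Transformer language model. Can be used to extract embeddings from an MSA. Enables SOTA inference of structure. Released with Rao et al. 2021 (ICML'21 version, June 2021).
ESM-1v	`esm1v_t33_650M_UR90S_1()` ... `esm1v_t33_650M_UR90S_5()`	UR90	Language model specialized for prediction of variant effects. Enables SOTA zero-shot prediction of the functional effects of sequence variations. Same architecture as ESM-1b, but trained on UniRef90. Released with Meier et al. 2021.
ESM-IF1	`esm_if1_gvp4_t16_142M_UR50()`	CATH + UR50	Inverse folding model. Can be used to design sequences for given structures, or to predict functional effects of sequence variation for given structures. Enables SOTA fixed backbone sequence design. Released with Hsu et al. 2022.

For a complete list of available models, with details and release notes, see Pre-trained models.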

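Usage

Quick start

An easy way to get started is to load ESM or ESMFold through the HuggingFace transformers library, which has simplified the ESMFold dependencies and provides a standardized API and tools to work with state-of-the-art pretrained models.

Alternatively, ColabFold has integrated ESMFold so that you can easily run it directly in the browser on a Google Colab instance.

We also provide an API which you can access through curl or on the ESM Metagenomic Atlas web page.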

 curl -X POST --data "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL" https://api.esmatlas.com/foldSequence/v1/pdb/
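
The same request can be built from Python. This is a minimal sketch using only the standard library: the helper names `build_fold_request` and `fold_sequence` are illustrative and not part of this repository, and the response is assumed to be PDB-formatted text, as the `/pdb/` path of the endpoint suggests.

```python
import urllib.request

ESM_ATLAS_FOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"


def build_fold_request(sequence: str) -> urllib.request.Request:
    """Build the POST request sent by the curl command (raw sequence as body)."""
    return urllib.request.Request(
        ESM_ATLAS_FOLD_URL,
        data=sequence.encode("ascii"),
        method="POST",
    )


def fold_sequence(sequence: str, timeout: float = 300.0) -> str:
    """Send the request and return the body as text (assumed to be PDB format)."""
    with urllib.request.urlopen(build_fold_request(sequence), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network; call fold_sequence() to send it.
request = build_fold_request(
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
)
print(request.get_method(), request.full_url)
```

Calling `fold_sequence(...)` and writing the returned string to a `.pdb` file would mirror what the curl command prints to the terminal.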

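For ESM-MSA-1b, ESM-IF1, or any of the other models, you can use the original implementation from our repo directly via the instructions below.

Getting started with this repo

As a prerequisite, you must have PyTorch installed to use this repository.

You can use this one-liner for installation, using the latest release of esm: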

pip install fair-esm  # latest release, OR:
pip install git+https://github.com/facebookresearch/esm.git  # bleeding edge, current repo main branch

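To use the ESMFold model, make sure you start from an environment with python <= 3.9 and pytorch installed. Then add the [esmfold] option to your pip install, which will install the dependencies for OpenFold automatically. OpenFold installation requires nvcc.

pip install "fair-esm[esmfold]"
# OpenFold and its remaining dependency
pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'

NOTE: If the openfold installation fails, please double check that nvcc is available and that a cuda-compatible version of PyTorch has been installed.

Alternatively, we provide the esmfold conda environment, which can be built via conda env create -f environment.yml.

We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:

import torch
model, alphabet = torch.hub.load("facebookresearch/esm:main", "esm2_t33_650M_UR50D")

After pip install, you can load and use a pretrained model as follows: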

import torch
import esm

# Load ESM-2 model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein2 with mask", "KALTARQQEVFDLIRD<mask>ISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein3", "K A <mask> I S Q"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]

# Generate per-sequence representations via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
sequence_representations = []
for i, tokens_len in enumerate(batch_lens):
    sequence_representations.append(token_representations[i, 1 : tokens_len - 1].mean(0))

# Look at the unsupervised self-attention map contact predictions
import matplotlib.pyplot as plt
for (_, seq), tokens_len, attention_contacts in zip(data, batch_lens, results["contacts"]):
    plt.matshow(attention_contacts[:tokens_len, :tokens_len])
    plt.title(seq)
    plt.show()
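
The averaging step above can be illustrated without loading a model. The sketch below is a toy, pure-Python version of the slice `1 : tokens_len - 1`: it counts the non-padding tokens, drops the first and last of them, and averages the rest. The function name, token ids and vectors are made up for illustration and are not taken from the ESM alphabet.

```python
def mean_pool_residues(token_vectors, token_ids, padding_idx):
    """Average per-token vectors over residue positions only.

    Mirrors token_representations[i, 1 : tokens_len - 1].mean(0): padding is
    excluded via the length, and the first and last real tokens are dropped.
    """
    tokens_len = sum(1 for t in token_ids if t != padding_idx)
    residue_vectors = token_vectors[1 : tokens_len - 1]
    if not residue_vectors:
        raise ValueError("sequence has no residue tokens")
    dim = len(residue_vectors[0])
    return [sum(vec[d] for vec in residue_vectors) / len(residue_vectors) for d in range(dim)]


# Toy row: start marker, two residues, end marker, two padding tokens.
# All ids here are hypothetical.
toy_ids = [0, 5, 6, 2, 1, 1]
toy_vectors = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [0.0, 0.0], [0.0, 0.0]]
pooled = mean_pool_residues(toy_vectors, toy_ids, padding_idx=1)
print(pooled)
```

Only the two residue vectors contribute to the result; the marker and padding vectors are ignored.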

ESMFold structure prediction

After installing with the [esmfold] option, you can use the ESMFold structure prediction model as follows:

import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Optionally, uncomment to set a chunk size for axial attention. This can help reduce memory.
# Lower sizes will have lower memory requirements at the cost of speed.
# model.set_chunk_size(128)

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
# Multimer prediction can be done with chains separated by ':'

with torch.no_grad():
    output = model.infer_pdb(sequence)

with open("result.pdb", "w") as f:
    f.write(output)

import biotite.structure.io as bsio
struct = bsio.load_structure("result.pdb", extra_fields=["b_factor"])
print(struct.b_factor.mean())  # this will be the pLDDT
# 88.3

Besides esm.pretrained.esmfold_v1(), which is the best performing model and the one we recommend using, we also provide esm.pretrained.esmfold_v0(), which was used for the experiments in Lin et al. 2022.

We also provide a command line interface (esm-fold) that efficiently predicts structures in bulk from a FASTA file using ESMFold:

 usage: esm-fold [-h] -i FASTA -o PDB [--num-recycles NUM_RECYCLES]
                [--max-tokens-per-batch MAX_TOKENS_PER_BATCH]
                [--chunk-size CHUNK_SIZE] [--cpu-only] [--cpu-offload]

optional arguments:
  -h, --help            show this help message and exit
  -i FASTA, --fasta FASTA
                        Path to input FASTA file
  -o PDB, --pdb PDB     Path to output PDB directory
  --num-recycles NUM_RECYCLES
                        Number of recycles to run. Defaults to number used in
                        training (4).
  --max-tokens-per-batch MAX_TOKENS_PER_BATCH
                        Maximum number of tokens per gpu forward-pass. This
                        will group shorter sequences together for batched
                        prediction. Lowering this can help with out of memory
                        issues, if these occur on short sequences.
  --chunk-size CHUNK_SIZE
                        Chunks axial attention computation to reduce memory
                        usage from O(L^2) to O(L). Equivalent to running a for
                        loop over chunks of of each dimension. Lower values
                        will result in lower memory usage at the cost of
                        speed. Recommended values: 128, 64, 32. Default: None.
  --cpu-only            CPU only
  --cpu-offload         Enable CPU offloading

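The command will make one prediction for every sequence in the fasta file. Multimers can be predicted and should be entered in the fasta file as a single sequence, with chains separated by a ":" character.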
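
As an illustration of the multimer input format, the sketch below joins chains into a single FASTA record. The helper `multimer_fasta_record` and the record name are hypothetical and not part of esm-fold; the two chains are simply the example sequences used earlier in this README.

```python
def multimer_fasta_record(name, chains):
    """Return one FASTA record whose sequence is the chains joined by ':'."""
    if not chains:
        raise ValueError("at least one chain is required")
    for chain in chains:
        if not chain or ":" in chain:
            raise ValueError("chains must be non-empty and must not contain ':'")
    return f">{name}\n{':'.join(chains)}\n"


chain_a = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
chain_b = "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"
record = multimer_fasta_record("dimer_example", [chain_a, chain_b])
print(record, end="")
```

Writing `record` to a file gives an input in which the two chains are treated as one entry by the command.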

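By default, predictions will be batched together so that shorter sequences are predicted simultaneously. This can be disabled by setting --max-tokens-per-batch=0. Batching can significantly improve prediction speed on shorter sequences.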
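
To make the idea of a per-batch token budget concrete, here is a minimal sketch of greedy length-sorted batching. It is an assumption about how such grouping can work, not the actual esm-fold implementation; the function name `batch_by_token_budget` is hypothetical, and padding is ignored for simplicity.

```python
def batch_by_token_budget(sequences, max_tokens_per_batch):
    """Group (name, sequence) pairs so each batch stays within a token budget.

    Sequences are sorted by length and added greedily. A sequence longer than
    the budget still gets a batch of its own, so a budget of 0 yields one
    sequence per batch.
    """
    batches, current, current_tokens = [], [], 0
    for name, seq in sorted(sequences, key=lambda item: len(item[1])):
        if current and current_tokens + len(seq) > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((name, seq))
        current_tokens += len(seq)
    if current:
        batches.append(current)
    return batches


toy = [("a", "M" * 10), ("b", "M" * 30), ("c", "M" * 20), ("d", "M" * 100)]
for batch in batch_by_token_budget(toy, max_tokens_per_batch=64):
    print([name for name, _ in batch])
```

In this toy run the three short sequences share a batch while the long one is processed alone, which is the behaviour the paragraph above describes for short sequences.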

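The --cpu-offload flag can be useful for making predictions on longer sequences. It will attempt to offload some parameters to the CPU RAM, rather than storing them on the GPU.

Finally, the ablation experiments for LMs of varying sizes (Lin et al. 2022, Table S1) are released as esm.pretrained.esmfold_structure_module_only_*(). We don't recommend using these models for structure prediction.

Compute embeddings in bulk from FASTA

We provide a command line interface (esm-extract) that efficiently extracts embeddings in bulk for a FASTA file from the ESM models: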

 usage: esm-extract [-h] [--toks_per_batch TOKS_PER_BATCH]
                   [--repr_layers REPR_LAYERS [REPR_LAYERS ...]] --include
                   {mean,per_tok,bos,contacts}
                   [{mean,per_tok,bos,contacts} ...]
                   [--truncation_seq_length TRUNCATION_SEQ_LENGTH]
                   model_location fasta_file output_dir

Extract per-token representations and model outputs for sequences in a FASTA
file

positional arguments:
  model_location        PyTorch model file OR name of pretrained model to
                        download (see README for models)
  fasta_file            FASTA file on which to extract representations
  output_dir            output directory for extracted representations

optional arguments:
  -h, --help            show this help message and exit
  --toks_per_batch TOKS_PER_BATCH
                        maximum batch size
  --repr_layers REPR_LAYERS [REPR_LAYERS ...]
                        layers indices from which to extract representations
                        (0 to num_layers, inclusive)
  --include {mean,per_tok,bos,contacts} [{mean,per_tok,bos,contacts} ...]
                        specify which representations to return
  --truncation_seq_length TRUNCATION_SEQ_LENGTH
                        truncate sequences longer than the given value

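The following commands extract the final-layer embedding for a FASTA file from the ESM-2 model:

esm-extract esm2_t33_650M_UR50D examples/data/some_proteins.fasta \
  examples/data/some_proteins_emb_esm2 --repr_layers 0 32 33 --include mean per_tok

python scripts/extract.py esm2_t33_650M_UR50D examples/data/some_proteins.fasta \
  examples/data/some_proteins_emb_esm2 --repr_layers 0 32 33 --include mean per_tok

A cuda device is optional and will be auto-detected.

The directory some_proteins_emb_esm2/ now contains one .pt file per FASTA sequence; use torch.load() to load them. scripts/extract.py has flags that determine what is included in the .pt file:

--repr_layers (default: final only) selects which layers to include embeddings from.
--include specifies what embeddings to save. You can use the following:
- per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
- mean includes the embeddings averaged over the full sequence, per layer.
- bos includes the embeddings from the beginning-of-sequence token. (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision)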

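CPU offloading for inference with large models

If you want to load very large models like 15B and/or run inference on long sequences on your machine, regular GPU inference may lead to OOM errors. We show how to load the model with Fairscale's Fully Sharded Data Parallel (FSDP) and use its CPU offloading feature. This allows inference of large models on a single GPU. Please check out examples/esm2_infer_fairscale_fsdp_cpu_offloading.py for more details.

Zero-shot variant prediction

See "examples/variant-prediction/" for code and pre-trained weights for the ESM-1v models described in Language models enable zero-shot prediction of the effects of mutations on protein function. (Meier et al. 2021).

Note that ESM-2 could be used for variant prediction as well, and is expected to have similar performance to ESM-1v.

Inverse folding

See "examples/inverse_folding/" for a detailed user guide. The ESM-IF1 model is described as GVPTransformer in Learning inverse folding from millions of predicted structures. (Hsu et al. 2022).

We also provide a Colab notebook for the sequence design and sequence scoring functionalities.

The ESM-IF1 inverse folding model is built for predicting protein sequences from their backbone atom coordinates. We provide scripts here 1) to sample sequence designs for a given structure and 2) to score sequences for a given structure.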

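Trained with 12M protein structures predicted by AlphaFold2, the ESM-IF1 model consists of invariant geometric input processing layers followed by a sequence-to-sequence transformer, and achieves 51% native sequence recovery on structurally held-out backbones, with 72% recovery for buried residues. The model is also trained with span masking to tolerate missing backbone coordinates and can therefore predict sequences for partially masked structures.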
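
Native sequence recovery, as used above, is commonly computed as the fraction of positions where the designed sequence matches the native one. The sketch below shows that calculation under this assumption; `sequence_recovery` is an illustrative helper, not a function from this repository, and the sequences are toy values.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Toy example: three of four positions agree.
print(sequence_recovery("MKTV", "MKAV"))
```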

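Sample sequence designs for a given structure

The environment setup is described in this subsection of examples/inverse_folding.

To sample sequences for a given structure in PDB or mmCIF format, use the sample_sequences.py script. The input file can have either .pdb or .cif as its suffix.

For example, to sample 3 sequence designs for the golgi casein kinase structure (PDB 5YH2; PDB Molecule of the Month from January 2022), we can run the following command from the esm root directory:

python examples/inverse_folding/sample_sequences.py examples/inverse_folding/data/5YH2.pdb \
  --chain C --temperature 1 --num-samples 3 --outpath examples/inverse_folding/output/sampled_sequences.fasta

The sampled sequences will be saved in fasta format to the specified output file.

The temperature parameter controls the sharpness of the probability distribution for sequence sampling. Higher sampling temperatures yield more diverse sequences but likely with lower native sequence recovery. The default sampling temperature is 1.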
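
The effect of temperature can be sketched with the standard formulation in which logits are divided by the temperature before a softmax. This is an assumption about the sampling rule, not a copy of sample_sequences.py; the function names and the three-way toy logits are illustrative only.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature (lower temperature gives a sharper distribution)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled distribution."""
    weights = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]
for t in (0.1, 1.0, 10.0):
    print(t, [round(p, 3) for p in temperature_probs(toy_logits, t)])
print(sample_index(toy_logits, 1.0, random.Random(0)))
```

At low temperature almost all probability mass sits on the top-scoring choice, while at high temperature the choices become nearly equally likely, which is why higher temperatures give more diverse samples.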

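Scoring sequences

To score the conditional log-likelihoods for sequences conditioned on a given structure, use the score_log_likelihoods.py script.

For example, to score the sequences in examples/inverse_folding/data/5YH2_mutated_seqs.fasta according to the structure in examples/inverse_folding/data/5YH2.pdb, we can run the following command from the esm root directory:

python examples/inverse_folding/score_log_likelihoods.py examples/inverse_folding/data/5YH2.pdb \
  examples/inverse_folding/data/5YH2_mutated_seqs.fasta --chain C \
  --outpath examples/inverse_folding/output/5YH2_mutated_seqs_scores.csv

The conditional log-likelihoods are saved in csv format at the specified output path. The output values are the log-likelihoods averaged over all amino acids in a sequence.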
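
As a small worked example of the reported quantity, the sketch below averages per-residue log-probabilities over a sequence. The helper name and the probabilities are made up for illustration; in practice the per-residue probabilities come from the model conditioned on the structure.

```python
import math


def average_log_likelihood(per_residue_probs):
    """Mean of log p(residue) over all positions in a sequence."""
    if not per_residue_probs:
        raise ValueError("at least one residue probability is required")
    return sum(math.log(p) for p in per_residue_probs) / len(per_residue_probs)


# Hypothetical probabilities the model assigned to the observed residues.
toy_probs = [0.5, 0.25, 0.5, 0.25]
print(average_log_likelihood(toy_probs))
```

Because the value is averaged per residue, scores of sequences with different lengths remain comparable.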

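For more information, see "./examples/inverse_folding/" for a detailed user guide.

ESM Metagenomic Atlas

Please visit the ESM Metagenomic Atlas website, and see our blog post to learn more.

Bulk download instructions are available in a separate README here.

The Atlas resources include a page to fold a sequence using ESMFold, search over a subset of the ESM Atlas by structure or sequence, and an API to access those resources programmatically.

Foldseek provides search against the Atlas without the length limitation here.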

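Notebooks

Inverse folding - predicting or scoring sequences based on backbone structures

The ESM-IF1 inverse folding model predicts protein sequences from their backbone atom coordinates, and is trained with 12M protein structures predicted by AlphaFold2. This notebook guides you through examples of sampling sequences, calculating conditional log-likelihoods, and extracting the encoder output as a structure representation.

Supervised variant prediction - training a classifier on the embeddings

To help you get started with using the embeddings, this jupyter notebook tutorial shows how to train a supervised variant predictor using embeddings from ESM-1. You can adopt a similar protocol to train a model for any downstream task, even with limited data. First, you can obtain the embeddings for examples/data/P62593.fasta either by downloading the precomputed embeddings as instructed in the notebook or by running the following:

# Obtain the embeddings
python scripts/extract.py esm1v_t33_650M_UR90S_1 examples/data/P62593.fasta \
  examples/data/P62593_emb_esm1v --repr_layers 33 --include mean

Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a Colab notebook.

Note that, alternatively, you can use the newer instructions for zero-shot variant prediction, which predicts mutational effects without any supervised training.

Unsupervised contact prediction

This jupyter notebook tutorial demonstrates contact prediction with both the ESM-2 and MSA Transformer (ESM-MSA-1) models. Contact prediction is based on a logistic regression over the model's attention maps. This methodology is based on our ICLR 2021 paper, Transformer protein language models are unsupervised structure learners. (Rao et al. 2020) The MSA Transformer (ESM-MSA-1) takes a multiple sequence alignment (MSA) as input, and uses the tied row self-attention maps in the same way. See MSA Transformer. (Rao et al. 2021).

To get unsupervised attention-based contacts, call model.predict_contacts(tokens) or model(tokens, return_contacts=True).

ESMStructuralSplitDataset and self-attention contact prediction

This jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset, and computes the self-attention map unsupervised contact predictions using ESM-2.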

Available models and datasets

Pre-trained models

Shorthand	`esm.pretrained.`	#layers	#params	Dataset	Embedding Dim	Model URL (automatically downloaded to `~/.cache/torch/hub/checkpoints`)
ESM-2	`esm2_t48_15B_UR50D`	48	15B	UR50/D 2021_04	5120	https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t48_15B_UR50D.pt
	`esm2_t36_3B_UR50D`	36	3B	UR50/D 2021_04	2560	https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt
	`esm2_t33_650M_UR50D`	33	650M	UR50/D 2021_04	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt
	`esm2_t30_150M_UR50D`	30	150M	UR50/D 2021_04	640	https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t30_150M_UR50D.pt
	`esm2_t12_35M_UR50D`	12	35M	UR50/D 2021_04	480	https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t12_35M_UR50D.pt
	`esm2_t6_8M_UR50D`	6	8M	UR50/D 2021_04	320	https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t6_8M_UR50D.pt
ESMFold	`esmfold_v1`	48 (+36)	690M (+3B)	UR50/D 2021_04	-	https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt
	`esmfold_v0`	48 (+36)	690M (+3B)	UR50/D 2021_04	-	https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v0.pt
	`esmfold_structure_module_only_*`	0 (+various)	various	UR50/D 2021_04	-	https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_struct_module_only_*
ESM-IF1	`esm_if1_gvp4_t16_142M_UR50`	20	124M	CATH 4.3 + predicted structures for UR50	512	https://dl.fbaipublicfiles.com/fair-esm/models/esm_if1_gvp4_t16_142M_UR50.pt
ESM-1v	`esm1v_t33_650M_UR90S_[1-5]`	33	650M	UR90/S 2020_03	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt
ESM-MSA-1b	`esm_msa1b_t12_100M_UR50S`	12	100M	UR50/S + MSA 2018_03	768	https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1b_t12_100M_UR50S.pt
ESM-MSA-1	`esm_msa1_t12_100M_UR50S`	12	100M	UR50/S + MSA 2018_03	768	https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1_t12_100M_UR50S.pt
ESM-1b	`esm1b_t33_650M_UR50S`	33	650M	UR50/S 2018_03	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
ESM-1	`esm1_t34_670M_UR50S`	34	670M	UR50/S 2018_03	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt
	`esm1_t34_670M_UR50D`	34	670M	UR50/D 2018_03	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt
	`esm1_t34_670M_UR100`	34	670M	UR100 2018_03	1280	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt
	`esm1_t12_85M_UR50S`	12	85M	UR50/S 2018_03	768	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt
	`esm1_t6_43M_UR50S`	6	43M	UR50/S 2018_03	768	https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt
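
In the table above, the ESM-2 identifiers appear to encode the layer count, parameter count and dataset (for example, `esm2_t33_650M_UR50D` is listed with 33 layers and 650M parameters). The sketch below parses that pattern. It is inferred from the table rows rather than documented, and it does not hold for every model: the ESM-IF1 name says `t16` and `142M` while the table lists 20 layers and 124M parameters. The helper `parse_esm2_name` is hypothetical.

```python
import re

# Pattern inferred from the ESM-2 rows of the table; treat it as an assumption.
ESM2_NAME = re.compile(r"^esm2_t(?P<layers>\d+)_(?P<params>\d+[MB])_(?P<dataset>UR\d+[A-Z]?)$")


def parse_esm2_name(name):
    """Split an ESM-2 model identifier into its layer count, size and dataset tag."""
    match = ESM2_NAME.match(name)
    if match is None:
        raise ValueError(f"not an ESM-2 model name: {name!r}")
    return {
        "layers": int(match["layers"]),
        "params": match["params"],
        "dataset": match["dataset"],
    }


print(parse_esm2_name("esm2_t33_650M_UR50D"))
```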

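Here is a chronological list of the released models and the papers they were introduced in:

Shorthand	Release Notes
ESM-1	Released with Rives et al. 2019 (Aug 2020 update).
ESM-1b	Released with Rives et al. 2019 (Dec 2020 update). See Appendix B.
ESM-MSA-1	Released with Rao et al. 2021 (Preprint v1).
ESM-MSA-1b	Released with Rao et al. 2021 (ICML'21 version, June 2021).
ESM-1v	Released with Meier et al. 2021.
ESM-IF1	Released with Hsu et al. 2022.
ESM-2	Released with Lin et al. 2022.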

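ESM Structural Split Dataset

This is a five-fold cross-validation dataset of protein domain structures that can be used to measure generalization of representations across different levels of structural dissimilarity. The dataset implements structural holdouts at the family, superfamily, and fold level. The SCOPe database is used to classify domains. Independently for each level of structural holdout, the domains are split into 5 equal sets, i.e. five sets of folds, superfamilies, or families. This ensures that for each of the five partitions, structures having the same classification do not appear in both the train and test sets. For a given classification level, each structure appears in a test set once, so that in the cross-validation experiment each of the structures will be evaluated exactly once.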
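
The holdout scheme described above can be sketched as a group-wise five-fold split: whole classification groups, not individual domains, are assigned to partitions. The round-robin assignment, the function name and the toy domain and superfamily labels below are assumptions for illustration; the dataset's actual partitioning procedure may differ.

```python
def structural_holdout_folds(domain_to_group, n_folds=5):
    """Return (train, test) domain lists per fold, holding out whole groups.

    Groups (e.g. superfamilies) are assigned to folds round-robin, so no group
    is ever split between the train and test sets of the same fold.
    """
    groups = sorted(set(domain_to_group.values()))
    group_to_fold = {group: i % n_folds for i, group in enumerate(groups)}
    folds = []
    for k in range(n_folds):
        test = sorted(d for d, g in domain_to_group.items() if group_to_fold[g] == k)
        train = sorted(d for d, g in domain_to_group.items() if group_to_fold[g] != k)
        folds.append((train, test))
    return folds


# Hypothetical domains labelled with hypothetical superfamily ids.
toy_domains = {
    "d1": "sfA", "d2": "sfA", "d3": "sfB", "d4": "sfC", "d5": "sfC",
    "d6": "sfD", "d7": "sfE", "d8": "sfF", "d9": "sfF", "d10": "sfB",
}
for train, test in structural_holdout_folds(toy_domains):
    print("test:", test)
```

Each domain lands in exactly one test set, and domains sharing a superfamily always move together, matching the two guarantees stated in the paragraph above.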

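The dataset provides 3D coordinates, distance maps, and secondary structure labels. For further details on the construction of the dataset, see Rives et al. 2019 Appendix A.10.

This jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset.

ESMStructuralSplitDataset, upon initializing, will download splits and pkl. We also provide msas for each of the domains. The data can be directly downloaded below.

Name	Description	URL
splits	train/valid splits	https://dl.fbaipublicfiles.com/fair-esm/structural-data/splits.tar.gz
pkl	pkl objects containing sequence, SSP labels, distance map, and 3D coordinates	https://dl.fbaipublicfiles.com/fair-esm/structural-data/pkl.tar.gz
msas	a3m files containing the MSA for each domain	https://dl.fbaipublicfiles.com/fair-esm/structural-data/msas.tar.gz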

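Pre-training dataset split

The split files establishing which UniRef50 clusters were used as the held-out evaluation set for pre-training in Rives et al. 2019 and Rao et al. 2021 can be found here:

UniRef50 IDs of evaluation set: 3.016 M clusters
UniRef100 IDs of evaluation set: 13.745 M proteins, expanding the same UniRef50 clusters.

These files contain only the UniRef50 IDs and UniRef100 IDs corresponding to the UniRef database, 2018-03 release, which is released by the UniProt Consortium under a Creative Commons Attribution (CC BY 4.0) License.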

Comparison to related works

Task	Unsupervised contact prediction			Structure Prediction
Test set	Large valid	CASP14	CAMEO (Apr-Jun 2022)	CASP14	CAMEO (Apr-Jun 2022)
Gremlin (Potts)	39.3
TAPE	11.2
ProtBert-BFD	34.1
Prot-T5-XL-BFD	35.6			46.1	62.6
Prot-T5-XL-Ur50 (3B)	47.9			49.8	69.4
ESM-1	33.7
ESM-1b	41.1	24.4	39	41.6	64.5
ESM-1v	35.3
ESM-MSA-1b	57.4
ESM-2 (8M)	15.9	9.8	15.7	36.7	48.1
ESM-2 (35M)	28.8	16.4	28.4	41.4	56.4
ESM-2 (150M)	42.2	26.8	40.1	49.0	64.9
ESM-2 (700M)	50.1	32.5	47.6	51.3	70.1
ESM-2 (3B)	52.7	34.0	49.9	52.5	71.8
ESM-2 (15B)	54.5	37.0	51.7	55.4	72.1

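Comparison to related protein language models on structure prediction tasks.

All contact numbers are the top-L,LR precision metric, where long range means sequence separation of at least 24 residues.
For unsupervised contact prediction, a sparse linear combination of the attention heads is used to directly predict protein contacts, fitted with logistic regression on 20 structures. For more details on the method, see Rao et al. 2020.
For structure prediction, an AlphaFold2 structure module is trained directly from the frozen language model embeddings. For more details on the method, see Lin et al. 2022.
Direct coupling analysis methods (Gremlin, mfDCA, Psicov) and ESM-MSA-1 use the trRosetta MSAs, while other methods predict from single sequence.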
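
The top-L long-range precision metric named in the notes above can be sketched as follows: rank all residue pairs separated by at least 24 positions by predicted contact probability, keep the L highest (L being the sequence length), and report the fraction that are true contacts. This is a minimal sketch under that reading; the function name is hypothetical, and the toy example uses a smaller separation so that it fits in a few lines.

```python
def top_l_long_range_precision(contact_probs, true_contacts, min_separation=24):
    """Precision of the L highest-scoring pairs (i, j) with j - i >= min_separation.

    contact_probs is an L x L nested list; true_contacts is a set of (i, j), i < j.
    """
    length = len(contact_probs)
    candidates = [
        (contact_probs[i][j], i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    if not candidates:
        raise ValueError("sequence too short for the requested separation")
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[:length]
    hits = sum(1 for _, i, j in top if (i, j) in true_contacts)
    return hits / len(top)


# Toy 8-residue example with a separation of 3 instead of 24 (illustration only).
L = 8
probs = [[0.0] * L for _ in range(L)]
scored = {(0, 3): 0.9, (0, 4): 0.8, (0, 5): 0.7, (1, 4): 0.6,
          (1, 5): 0.5, (2, 5): 0.4, (2, 6): 0.3, (3, 7): 0.25}
for (i, j), p in scored.items():
    probs[i][j] = p
probs[0][1] = 0.99  # short-range pair, excluded by the separation filter
truth = {(0, 1), (0, 3), (0, 4), (0, 5), (1, 4), (1, 5), (2, 5)}
print(top_l_long_range_precision(probs, truth, min_separation=3))
```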

Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

 @article { rives2019biological ,
  author = { Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob } ,
  title = { Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences } ,
  year = { 2019 } ,
  doi = { 10.1101/622803 } ,
  url = { https://www.biorxiv.org/content/10.1101/622803v4 } ,
  journal = { PNAS }
}

For the self-attention contact prediction:

 @article { rao2020transformer ,
  author = { Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander } ,
  title = { Transformer protein language models are unsupervised structure learners } ,
  year = { 2020 } ,
  doi = { 10.1101/2020.12.15.422761 } ,
  url = { https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1 } ,
  journal = { bioRxiv }
}

For the MSA Transformer:

 @article { rao2021msa ,
  author = { Rao, Roshan and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John F. and Abbeel, Pieter and Sercu, Tom and Rives, Alexander } ,
  title = { MSA Transformer } ,
  year = { 2021 } ,
  doi = { 10.1101/2021.02.12.430858 } ,
  url = { https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1 } ,
  journal = { bioRxiv }
}

For variant prediction using ESM-1v:

 @article { meier2021language ,
  author = { Meier, Joshua and Rao, Roshan and Verkuil, Robert and Liu, Jason and Sercu, Tom and Rives, Alexander } ,
  title = { Language models enable zero-shot prediction of the effects of mutations on protein function } ,
  year = { 2021 } ,
  doi = { 10.1101/2021.07.09.450648 } ,
  url = { https://www.biorxiv.org/content/10.1101/2021.07.09.450648v1 } ,
  journal = { bioRxiv }
}

For inverse folding using ESM-IF1:

 @article { hsu2022learning ,
	author = { Hsu, Chloe and Verkuil, Robert and Liu, Jason and Lin, Zeming and Hie, Brian and Sercu, Tom and Lerer, Adam and Rives, Alexander } ,
	title = { Learning inverse folding from millions of predicted structures } ,
	year = { 2022 } ,
	doi = { 10.1101/2022.04.10.487779 } ,
	url = { https://www.biorxiv.org/content/early/2022/04/10/2022.04.10.487779 } ,
	journal = { ICML }
}

For the ESM-2 language model and ESMFold:

 @article { lin2022language ,
  title = { Language models of protein sequences at the scale of evolution enable accurate structure prediction } ,
  author = { Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and others } ,
  journal = { bioRxiv } ,
  year = { 2022 } ,
  publisher = { Cold Spring Harbor Laboratory }
}