Progres - Protein Graph Embedding Search

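This repository contains the method from the pre-print:

Greener JG and Jamali K. bioRxiv (2022) - link

It provides the `progres` Python package, which lets you search structures against pre-embedded structural databases, score pairs of structures, and pre-embed datasets for searching against. A search typically takes 1-2 seconds and is much faster per query when running multiple queries. For the AlphaFold database, the initial data loading takes around a minute, but subsequent searches take about a tenth of a second per query.

Currently SCOPe, CATH, ECOD, the whole PDB, AlphaFold structures for 21 model organisms, and TED domains from the AlphaFold database are provided for searching against. Searching is done by domain, but Chainsaw can be used to automatically split query structures into domains.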

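Installation

Python 3.8 or later is required. The software is OS-independent.
Install PyTorch 1.11 or later, PyTorch Scatter, PyTorch Geometric, FAISS and STRIDE as appropriate for your system. A GPU is not required but may provide a speedup in certain situations. Example commands for Linux (and other operating systems, apart from the STRIDE install):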

conda create -n prog python=3.9
conda activate prog
conda install pytorch=1.11 faiss-cpu -c pytorch
conda install pytorch-scatter pyg -c pyg
conda install kimlab::stride

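Run `pip install progres`, which will also install Biopython, mmtf-python, einops and pydantic if they are not already present.
The first time you search with the software, the trained model and pre-embedded databases (~660 MB) are downloaded from Zenodo to the package directory, which requires an internet connection. This can take a few minutes. You can set the environment variable `PROGRES_DATA_DIR` to change where this data is stored, for example if you cannot write to the package directory. Remember to keep it set the next time you run Progres.
The first time you search against the AlphaFold database TED domains, the pre-embedded database (~33 GB) is downloaded in the same way. This can take a while. Make sure you have enough disk space!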
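A minimal sketch of setting the data directory from Python and checking disk space before the first search. It assumes Progres reads `PROGRES_DATA_DIR` from the environment at the point it first needs the data; the directory path and the `has_free_space` helper are illustrative and not part of Progres.

```python
import os
import shutil
import tempfile

def has_free_space(directory, required_gb):
    """Return True if `directory` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1e9

# Hypothetical location; any writable directory would do
data_dir = os.path.join(tempfile.gettempdir(), "progres_data")
os.makedirs(data_dir, exist_ok=True)

# Needs to be set in every session that runs Progres, otherwise the
# default package directory is used again
os.environ["PROGRES_DATA_DIR"] = data_dir

# ~660 MB for the standard databases, ~33 GB more for afted
print("Room for standard databases:", has_free_space(data_dir, 0.66))
print("Room for afted:", has_free_space(data_dir, 33))
```

Setting the variable in your shell profile instead achieves the same thing for command line use.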

Alternatively, a Docker file is available in the `docker` directory.

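Usage

On Unix systems the executable `progres` is added to the path during installation. On Windows, if the executable cannot be accessed, you can call the `bin/progres` script with python.

Run `progres -h` to see the help text and `progres {mode} -h` to see the help text for each mode. The modes are described below, and further options are outlined in the help text. For example, the `-d` flag sets the device to run on; the default is `cpu` since this is often fastest for searching, but `cuda` will likely be faster when splitting domains with Chainsaw, searching many queries or embedding a dataset. Try both if performance is important.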

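Search a structure against a database

To search a PDB file `query.pdb` (which can be found in the `data` directory) against domains in the SCOPe database and print the output: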

progres search -q query.pdb -t scope95

# QUERY_NUM: 1
# QUERY: query.pdb
# DOMAIN_NUM: 1
# DOMAIN_SIZE: 150 residues (1-150)
# DATABASE: scope95
# PARAMETERS: minsimilarity 0.8, maxhits 100, chainsaw no, faiss no, progres v0.2.7
# HIT_N  DOMAIN   HIT_NRES  SIMILARITY  NOTES
      1  d1a6ja_       150      1.0000  d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr {Escherichia coli [TaxId: 562]}
      2  d2a0ja_       146      0.9988  d.112.1.0 - automated matches {Neisseria meningitidis [TaxId: 122586]}
      3  d3urra1       151      0.9983  d.112.1.0 - automated matches {Burkholderia thailandensis [TaxId: 271848]}
      4  d3lf6a_       154      0.9971  d.112.1.1 - automated matches {Artificial gene [TaxId: 32630]}
      5  d3oxpa1       147      0.9968  d.112.1.0 - automated matches {Yersinia pestis [TaxId: 214092]}
...
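The text output above is straightforward to parse if you want the hits in a script. The sketch below assumes the layout shown (comment lines starting with `#`, then one hit per line with five white-space-separated columns) and handles a single query/domain block; `parse_search_output` is an illustrative helper, not part of Progres.

```python
def parse_search_output(text):
    """Parse `progres search` text output into (metadata, hits)."""
    metadata, hits = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "...":
            continue
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            # Metadata lines look like "KEY: value"; the column header
            # line has no ": " and is skipped
            if ": " in body:
                key, value = body.split(": ", 1)
                metadata[key] = value
            continue
        # The notes column may contain spaces, so split at most 4 times
        fields = line.split(maxsplit=4)
        hits.append({
            "hit_n": int(fields[0]),
            "domain": fields[1],
            "hit_nres": int(fields[2]),
            "similarity": float(fields[3]),
            "notes": fields[4] if len(fields) > 4 else "",
        })
    return metadata, hits

example_output = """\
# QUERY_NUM: 1
# QUERY: query.pdb
# DATABASE: scope95
# HIT_N  DOMAIN   HIT_NRES  SIMILARITY  NOTES
      1  d1a6ja_       150      1.0000  d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr {Escherichia coli [TaxId: 562]}
      2  d2a0ja_       146      0.9988  d.112.1.0 - automated matches {Neisseria meningitidis [TaxId: 122586]}
"""

metadata, hits = parse_search_output(example_output)
print(metadata["DATABASE"], [h["domain"] for h in hits])
```

If you are already working in Python, calling the library directly avoids parsing text altogether.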

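`-q` is the path to the query structure file. Alternatively, `-l` is a text file with one query file path per line, and each result is printed in turn. This is considerably faster for multiple queries since setup only occurs once and multiple workers can be used. Only the first chain in each file is considered.
`-t` is the pre-embedded database to search against. Currently this must be either one of the databases listed below or the file path to a pre-embedded dataset generated with `progres embed`.
`-f` determines the file format of the query structure (`guess`, `pdb`, `mmcif`, `mmtf` or `coords`). By default this is guessed from the file extension, with `pdb` chosen if a guess can't be made. `coords` refers to a text file with the coordinates of one Cα atom per line, separated by white space.
`-s` is the Progres score (0 -> 1) above which hits are returned, default 0.8. As discussed in the paper, 0.8 indicates the same fold.
`-m` is the maximum number of hits to return, default 100.
`-c` indicates that the query structure(s) should be split into domains with Chainsaw and each domain searched separately. If Chainsaw finds no domains, no results are returned. Only the first chain in each file is considered. Running Chainsaw may take a few seconds.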
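When driving the command line from a script, it can help to build the argument list in one place. This sketch uses only the flags described above and assumes `-c` is a switch that takes no value; `build_search_command` is a hypothetical helper.

```python
import shlex

def build_search_command(target_db, query=None, query_list=None,
                         file_format=None, min_similarity=None,
                         max_hits=None, chainsaw=False):
    """Build the argument list for `progres search`."""
    if (query is None) == (query_list is None):
        raise ValueError("Give exactly one of query (-q) or query_list (-l)")
    cmd = ["progres", "search"]
    cmd += ["-q", query] if query is not None else ["-l", query_list]
    cmd += ["-t", target_db]
    if file_format is not None:
        cmd += ["-f", file_format]
    if min_similarity is not None:
        cmd += ["-s", str(min_similarity)]
    if max_hits is not None:
        cmd += ["-m", str(max_hits)]
    if chainsaw:
        cmd.append("-c")
    return cmd

cmd = build_search_command("scope95", query="query.pdb",
                           min_similarity=0.7, max_hits=10)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing a list rather than a single string to `subprocess.run` avoids shell quoting problems with unusual file paths.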

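Other tools for splitting query structures into domains include Merizo and SWORD2. You can also split domains manually using software such as the `pdb_selres` command from pdb-tools.

How to interpret the hit descriptions depends on the database being searched. The domain name often includes a reference to the corresponding PDB file, for example d1a6ja_ refers to PDB ID 1A6J chain A, which can be opened in the RCSB PDB structure view for a quick look. For the AlphaFold database TED domains, files can be downloaded from links such as this, where `AF-A0A6J8EXE6-F1` is the first part of the hit notes and is followed by the residue range of the domain.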
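For the SCOPe databases, the PDB ID and chain can be read straight from the domain name. The sketch below assumes the usual 7-character SCOPe domain ID layout (`d`, four-character PDB ID, chain character, domain character) and that the chain letter is the lowercase form of the PDB chain ID; `decode_scope_domain` is an illustrative helper and does not apply to the other databases.

```python
def decode_scope_domain(sid):
    """Split a SCOPe domain ID such as 'd1a6ja_' into (PDB ID, chain)."""
    if len(sid) != 7 or not sid.startswith("d"):
        raise ValueError(f"Not a 7-character SCOPe domain ID: {sid!r}")
    pdb_id = sid[1:5].upper()
    chain_char = sid[5]
    # '.' marks a domain spanning several chains, '_' a missing chain ID
    chain = None if chain_char in "._" else chain_char.upper()
    return pdb_id, chain

for sid in ["d1a6ja_", "d3urra1"]:
    print(sid, decode_scope_domain(sid))
```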

Available databases

The available pre-embedded databases are:

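| Name | Description | Number of domains | Search time (1 query) | Search time (100 queries) |
| --- | --- | --- | --- | --- |
| `scope95` | ASTRAL set of SCOPe 2.08 domains clustered at 95% seq ID | 35,371 | 1.35 s | 2.81 s |
| `scope40` | ASTRAL set of SCOPe 2.08 domains clustered at 40% seq ID | 15,127 | 1.32 s | 2.36 s |
| `cath40` | S40 non-redundant domains from CATH 23/11/22 | 31,884 | 1.38 s | 2.79 s |
| `ecod70` | F70 representative domains from ECOD develop287 | 71,635 | 1.46 s | 3.82 s |
| `pdb100` | All PDB protein chains as of 8 February 2024, split into domains with Chainsaw | 1,177,152 | 2.90 s | 27.3 s |
| `af21org` | AlphaFold structures for 21 model organisms, split into domains by CATH-Assign | 338,258 | 2.21 s | 11.0 s |
| `afted` | AlphaFold database structures split into domains by TED and clustered at 50% sequence identity | 53,344,209 | 67.7 s | 73.1 s |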

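Search times are for a 150-residue protein (d1a6ja_ in PDB format) on an Intel i9-10980XE CPU with 256 GB RAM and PyTorch 1.11. Times are given for 1 or 100 queries. Note that `afted` uses exhaustive FAISS searching. This doesn't change the hits that are found, but the similarity scores differ by a small amount - see the paper.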
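The two timing columns can be used to separate the one-off setup cost from the cost of each further query. This is a rough worked example that assumes time grows linearly with the number of queries; the figures are copied from the table and the helper name is illustrative.

```python
# (time for 1 query, time for 100 queries) in seconds, from the table
timings = {
    "scope95": (1.35, 2.81),
    "pdb100": (2.90, 27.3),
    "afted": (67.7, 73.1),
}

def marginal_seconds_per_query(t_1, t_100):
    """Extra time per query beyond the first, assuming linear scaling."""
    return (t_100 - t_1) / 99

for name, (t_1, t_100) in timings.items():
    extra = marginal_seconds_per_query(t_1, t_100)
    print(f"{name}: first query {t_1:.2f} s, ~{extra:.3f} s per further query")
```

For `afted` almost all of the time is the initial data loading, which is why batching queries with `-l` pays off most for the largest databases.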

Calculate the score between two structures

To calculate the Progres score between two protein domains:

progres score struc_1.pdb struc_2.pdb

0.7265280485153198

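`-f` and `-g` determine the file format of the first and second structures respectively, as above (`guess`, `pdb`, `mmcif`, `mmtf` or `coords`).

The order of the domains does not affect the score. A score of 0.8 or higher indicates the same fold.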
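To illustrate why the order does not matter and how the 0.8 threshold is applied, here is a toy sketch. It assumes the score is the cosine similarity between two embeddings rescaled from [-1, 1] to [0, 1]; the exact formula is not given here, so treat `progres_like_score` as a stand-in rather than the Progres implementation. The vectors are made-up 4-dimensional examples, not real embeddings.

```python
import math

def progres_like_score(emb_1, emb_2):
    """Cosine similarity rescaled to [0, 1] (assumed form of the score)."""
    dot = sum(a * b for a, b in zip(emb_1, emb_2))
    norm_1 = math.sqrt(sum(a * a for a in emb_1))
    norm_2 = math.sqrt(sum(b * b for b in emb_2))
    return (1 + dot / (norm_1 * norm_2)) / 2

# Toy vectors for illustration only
emb_a = [0.2, -0.5, 0.8, 0.1]
emb_b = [0.3, -0.4, 0.7, 0.0]

score_ab = progres_like_score(emb_a, emb_b)
score_ba = progres_like_score(emb_b, emb_a)

print("symmetric:", score_ab == score_ba)
print("same fold by the 0.8 threshold:", score_ab >= 0.8)
```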

Pre-embed a dataset to search against

To embed a dataset of structures, allowing it to be searched against:

progres embed -l filepaths.txt -o searchdb.pt

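`-l` is a text file with information on one structure per line, each of which becomes one entry in the output. White space should separate the file path to the structure and the domain name; any additional text is optional and is treated as a note for the notes column of the results.
`-o` is the output file path for the PyTorch file containing a dictionary with the embeddings and associated data. It can be read in with `torch.load`.
`-f` determines the file format of each structure, as above (`guess`, `pdb`, `mmcif`, `mmtf` or `coords`).

Again, the structures should correspond to single protein domains. The embeddings are stored as Float16, which has no noticeable effect on search performance.

As an example, you can run the above command from the `data` directory to generate a database with two structures.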
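The list file for `-l` is simple enough to generate from a script. A minimal sketch, assuming the layout described above (file path, domain name, optional note, separated by white space); the file names and the `write_structure_list` helper are hypothetical.

```python
import os
import tempfile

def write_structure_list(entries, out_path):
    """Write (file_path, domain_name, note) tuples, one per line."""
    with open(out_path, "w") as f:
        for file_path, domain_name, note in entries:
            # White space separates the fields, so it cannot appear in them
            if any(c.isspace() for c in file_path + domain_name):
                raise ValueError("Paths and domain names must not contain white space")
            line = f"{file_path} {domain_name}"
            if note:
                line += f" {note}"
            f.write(line + "\n")

# Hypothetical structures; the note is optional
entries = [
    ("structures/dom_1.pdb", "dom_1", "kinase domain, chain A"),
    ("structures/dom_2.pdb", "dom_2", ""),
]
list_path = os.path.join(tempfile.mkdtemp(), "filepaths.txt")
write_structure_list(entries, list_path)

with open(list_path) as f:
    print(f.read())
```

The resulting file can then be passed to `progres embed -l`.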

Python library

`progres` can also be used in Python, allowing it to be integrated into other methods:

import progres as pg
import torch

# Search as above, returns a list where each entry is a dictionary for a query
# A generator is also available as pg.progres_search_generator
results = pg.progres_search(querystructure="query.pdb", targetdb="scope95")
results[0].keys()  # dict_keys(['query_num', 'query', 'query_size', 'database', 'minsimilarity',
                   #            'maxhits', 'domains', 'hits_nres', 'similarities', 'notes'])

# Score as above, returns a float (similarity score 0 to 1)
pg.progres_score("struc_1.pdb", "struc_2.pdb")

# Pre-embed as above, saves a dictionary
pg.progres_embed(structurelist="filepaths.txt", outputfile="searchdb.pt")
torch.load("searchdb.pt").keys()  # dict_keys(['ids', 'embeddings', 'nres', 'notes'])

# Read a structure file into a PyTorch Geometric graph
graph = pg.read_graph("query.pdb")
graph  # Data(x=[150, 67], edge_index=[2, 2758], coords=[150, 3])

# Embed a single structure
embedding = pg.embed_structure("query.pdb")
embedding.shape  # torch.Size([128])

# Load and reuse the model for speed
model = pg.load_trained_model()
embedding = pg.embed_structure("query.pdb", model=model)

# Embed Cα coordinates and search with the embedding
# This is useful for using progres in existing pipelines that give out Cα coordinates
# queryembeddings should have shape (128) or (n, 128)
#   and should be normalised across the 128 dimension
coords = pg.read_coords("query.pdb")
embedding = pg.embed_coords(coords)  # Can take a list of coords or a tensor of shape (nres, 3)
results = pg.progres_search(queryembeddings=embedding, targetdb="scope95")

# Get the similarity score (0 to 1) between two embeddings
# The distance (1 - similarity) is also available as pg.embedding_distance
score = pg.embedding_similarity(embedding, embedding)
score  # tensor(1.) in this case since they are the same embedding

# Get all-v-all similarity scores between 1000 embeddings
embs = torch.nn.functional.normalize(torch.randn(1000, 128), dim=1)
scores = pg.embedding_similarity(embs.unsqueeze(0), embs.unsqueeze(1))
scores.shape  # torch.Size([1000, 1000])
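The result dictionaries returned by `pg.progres_search` can be reshaped into one record per hit for downstream use. This sketch assumes the `domains`, `hits_nres`, `similarities` and `notes` entries are parallel sequences of equal length, which is suggested by the key names but not stated explicitly; it runs on a mock dictionary so it does not need Progres installed, and `hits_to_rows` is an illustrative helper.

```python
def hits_to_rows(result):
    """Turn one query's result dictionary into a list of per-hit dicts."""
    return [
        {"domain": domain, "nres": nres, "similarity": float(sim), "notes": note}
        for domain, nres, sim, note in zip(
            result["domains"],
            result["hits_nres"],
            result["similarities"],
            result["notes"],
        )
    ]

# Mock result with a subset of the keys listed above; real values
# would come from pg.progres_search
mock_result = {
    "query": "query.pdb",
    "database": "scope95",
    "domains": ["d1a6ja_", "d2a0ja_"],
    "hits_nres": [150, 146],
    "similarities": [1.0, 0.9988],
    "notes": ["d.112.1.1 - example note", "d.112.1.0 - example note"],
}

rows = hits_to_rows(mock_result)
for row in rows:
    print(f"{row['domain']}\t{row['nres']}\t{row['similarity']:.4f}")
```

A list of dictionaries like this can be passed directly to tools such as `csv.DictWriter` for export.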