Progres - Protein Graph Embedding Search

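This repository contains the method from the pre-print:

Greener JG and Jamali K. Fast protein structure searching using structure graph embeddings. bioRxiv (2022)

It provides the progres Python package, which lets you search structures against pre-embedded structural databases, score pairs of structures, and pre-embed your own datasets for searching against. Searching typically takes 1-2 s and is much faster per query when several queries are run together. For the AlphaFold database, initial data loading takes around a minute, but subsequent searching takes a tenth of a second per query.

Currently SCOPe, CATH, ECOD, the whole PDB, AlphaFold structures for 21 model organisms, and the AlphaFold database TED domains are provided for searching against. Searching is done by domain, but Chainsaw can be used to automatically split query structures into domains.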

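Installation

Python 3.8 or later is required. The software is OS-independent.
Install PyTorch 1.11 or later, PyTorch Scatter, PyTorch Geometric, FAISS and STRIDE as appropriate for your system. A GPU is not required but may provide a speedup in some situations. Example commands for Linux (and for other operating systems, apart from the STRIDE install):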

conda create -n prog python=3.9
conda activate prog
conda install pytorch=1.11 faiss-cpu -c pytorch
conda install pytorch-scatter pyg -c pyg
conda install kimlab::stride

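Run pip install progres, which will also install Biopython, mmtf-python, einops and pydantic if they are not already present.
The first time you search with the software, the trained model and pre-embedded databases (~660 MB) are downloaded from Zenodo to the package directory, which requires an internet connection. This can take a few minutes. You can set the environment variable PROGRES_DATA_DIR to change where this data is stored, for example if you cannot write to the package directory. Remember to keep it set the next time you run Progres.
The first time you search against the AlphaFold database TED domains, the pre-embedded database (~33 GB) is downloaded in the same way. This can take a while. Make sure you have enough disk space!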
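
When Progres is used from Python rather than the shell, the same variable can be set from within the script. The sketch below is only an illustration: the helper name set_progres_data_dir is hypothetical, and it assumes that setting the variable before the package is imported is sufficient and that creating the directory in advance is harmless.

```python
import os
from pathlib import Path


def set_progres_data_dir(path):
    """Point PROGRES_DATA_DIR at a directory, creating it if needed."""
    data_dir = Path(path).expanduser().resolve()
    # Assumption: creating the directory up front does no harm.
    data_dir.mkdir(parents=True, exist_ok=True)
    os.environ["PROGRES_DATA_DIR"] = str(data_dir)
    return data_dir


# Example call (hypothetical location). Set the variable first, import afterwards:
#   set_progres_data_dir("~/progres_data")
#   import progres as pg
```

The variable set this way only lasts for the current process, so each script that uses Progres would need to make the same call, or the variable can be exported in the shell profile instead.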

Alternatively, a Docker file is available in the docker directory.
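
After a manual installation, a quick check that the prerequisites listed above can be found may save a confusing error later. This is a minimal sketch, not part of Progres: the function name is hypothetical, the module names are the usual import names for these packages, and it assumes the STRIDE executable is called stride and is on the PATH.

```python
import importlib.util
import shutil
import sys

# Import names for the packages listed above (these differ from the install names).
REQUIRED_MODULES = ["torch", "torch_scatter", "torch_geometric", "faiss"]


def check_prerequisites():
    """Return a dict mapping each requirement to True if it was found."""
    found = {"python>=3.8": sys.version_info >= (3, 8)}
    for name in REQUIRED_MODULES:
        # find_spec returns None for a top-level module that is not installed
        found[name] = importlib.util.find_spec(name) is not None
    # Assumption: the STRIDE executable is named "stride" and is on the PATH.
    found["stride"] = shutil.which("stride") is not None
    return found


for requirement, ok in check_prerequisites().items():
    print(f"{requirement:16s} {'found' if ok else 'MISSING'}")
```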

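Usage

On Unix systems, the executable progres is added to the path during installation. On Windows, you can call the bin/progres script with python if you cannot access the executable.

Run progres -h to see the help text and progres {mode} -h to see the help text for each mode. The modes are described below, and further options are outlined in the help text. For example, the -d flag sets the device to run on; this is cpu by default since that is usually fastest for searching, but cuda is likely to be faster when splitting domains with Chainsaw, searching many queries or embedding a dataset. Try both if performance is important.

Searching a structure against a database

To search the PDB file query.pdb (which can be found in the data directory) against domains in the SCOPe database and print the output: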

progres search -q query.pdb -t scope95

# QUERY_NUM: 1
# QUERY: query.pdb
# DOMAIN_NUM: 1
# DOMAIN_SIZE: 150 residues (1-150)
# DATABASE: scope95
# PARAMETERS: minsimilarity 0.8, maxhits 100, chainsaw no, faiss no, progres v0.2.7
# HIT_N  DOMAIN   HIT_NRES  SIMILARITY  NOTES
      1  d1a6ja_       150      1.0000  d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr {Escherichia coli [TaxId: 562]}
      2  d2a0ja_       146      0.9988  d.112.1.0 - automated matches {Neisseria meningitidis [TaxId: 122586]}
      3  d3urra1       151      0.9983  d.112.1.0 - automated matches {Burkholderia thailandensis [TaxId: 271848]}
      4  d3lf6a_       154      0.9971  d.112.1.1 - automated matches {Artificial gene [TaxId: 32630]}
      5  d3oxpa1       147      0.9968  d.112.1.0 - automated matches {Yersinia pestis [TaxId: 214092]}
...
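
For downstream processing, the text output above can be turned into Python objects. The following is a minimal sketch that handles the output for a single query and domain; the function name and dictionary keys are hypothetical, and it assumes the layout is exactly as shown (header lines starting with #, then four whitespace-separated columns followed by free-text notes).

```python
SAMPLE_OUTPUT = """\
# QUERY_NUM: 1
# QUERY: query.pdb
# DOMAIN_NUM: 1
# DOMAIN_SIZE: 150 residues (1-150)
# DATABASE: scope95
# PARAMETERS: minsimilarity 0.8, maxhits 100, chainsaw no, faiss no, progres v0.2.7
# HIT_N  DOMAIN   HIT_NRES  SIMILARITY  NOTES
      1  d1a6ja_       150      1.0000  d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr {Escherichia coli [TaxId: 562]}
      2  d2a0ja_       146      0.9988  d.112.1.0 - automated matches {Neisseria meningitidis [TaxId: 122586]}
"""


def parse_search_output(text):
    """Parse the output for one query into (header dict, list of hit dicts)."""
    header, hits = {}, []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            key, sep, value = stripped[1:].partition(":")
            if sep:  # the column-title line has no colon and is skipped
                header[key.strip()] = value.strip()
            continue
        # Four whitespace-separated columns, then the notes as free text
        hit_n, domain, nres, similarity, *notes = stripped.split(maxsplit=4)
        hits.append({
            "hit_n": int(hit_n),
            "domain": domain,
            "nres": int(nres),
            "similarity": float(similarity),
            "notes": notes[0] if notes else "",
        })
    return header, hits


header, hits = parse_search_output(SAMPLE_OUTPUT)
print(header["DATABASE"], [hit["domain"] for hit in hits])
```

When working in Python already, the pg.progres_search function described in the Python library section returns structured results directly and avoids text parsing altogether.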

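-q is the path to the query structure file. Alternatively, -l is a text file with one query file path per line, and each result is printed in turn. This is considerably faster for multiple queries since setup only occurs once and multiple workers can be used. Only the first chain in each file is considered.
-t is the pre-embedded database to search against. Currently this must be either one of the databases listed below or the file path to a pre-embedded dataset generated with progres embed.
-f determines the file format of the query structure (guess, pdb, mmcif, mmtf or coords). By default this is guessed from the file extension, with pdb chosen if a guess cannot be made. coords refers to a text file with the coordinates of one Cα atom per line, separated by white space.
-s is the Progres score (0 -> 1) above which hits are returned, default 0.8. As discussed in the paper, a score of 0.8 indicates the same fold.
-m is the maximum number of hits to return, default 100.
-c indicates that the query structure should be split into domains with Chainsaw and each domain searched separately. If Chainsaw finds no domains, no results are returned. Only the first chain in each file is considered. Running Chainsaw may take a few seconds.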
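
When searches are launched from a script, it can help to assemble the command from the flags described above in one place. This is a sketch under stated assumptions: the helper build_search_command is hypothetical, it only uses the -q, -l, -t, -s, -m and -c flags listed above, and it assumes -c takes no argument.

```python
import shlex


def build_search_command(query, target="scope95", min_similarity=0.8,
                         max_hits=100, chainsaw=False, query_is_list=False):
    """Assemble the argument list for a `progres search` call."""
    cmd = [
        "progres", "search",
        "-l" if query_is_list else "-q", str(query),  # list file or single query
        "-t", target,
        "-s", str(min_similarity),
        "-m", str(max_hits),
    ]
    if chainsaw:
        cmd.append("-c")  # assumption: -c is a switch without a value
    return cmd


cmd = build_search_command("query.pdb", chainsaw=True)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.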

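Other tools for splitting query structures into domains include Merizo and SWORD2. You can also split domains manually using software such as the pdb_selres command from pdb-tools.

How to interpret the hit descriptions depends on the database being searched. The domain name often includes a reference to the corresponding PDB file; for example, d1a6ja_ refers to PDB ID 1A6J chain A, which can be opened in the RCSB PDB structure view for a quick look. For the AlphaFold database TED domains, files can be downloaded using the entry name; for example, AF-A0A6J8EXE6-F1 is the first part of the hit notes and is followed by the residue range of the domain.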
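
The mapping from a SCOPe domain name to a PDB entry can be automated for the simple case. The helper below is a hypothetical sketch that assumes the usual seven-character SCOPe layout (d, four-character PDB ID, chain character, domain character); domains spanning several chains, where the chain character is a dot, and names with no chain are reported with a chain of None.

```python
def scope_sid_to_pdb(sid):
    """Split a SCOPe-style domain name such as 'd1a6ja_' into (PDB ID, chain)."""
    if len(sid) != 7 or not sid.startswith("d"):
        raise ValueError(f"not a SCOPe-style domain name: {sid!r}")
    pdb_id = sid[1:5].upper()
    chain_char = sid[5]
    # '.' marks a multi-chain domain and '_' a missing chain ID; neither is resolved here
    chain = None if chain_char in "._" else chain_char.upper()
    return pdb_id, chain


print(scope_sid_to_pdb("d1a6ja_"))
```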

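Available databases

The available pre-embedded databases are:

Name	Description	Number of domains	Search time (1 query)	Search time (100 queries)
`scope95`	ASTRAL set of SCOPe 2.08 domains clustered at 95% seq ID	35,371	1.35 s	2.81 s
`scope40`	ASTRAL set of SCOPe 2.08 domains clustered at 40% seq ID	15,127	1.32 s	2.36 s
`cath40`	S40 non-redundant domains from CATH 23/11/22	31,884	1.38 s	2.79 s
`ecod70`	F70 representative domains from ECOD develop287	71,635	1.46 s	3.82 s
`pdb100`	All PDB protein chains as of 8 February 2024, split into domains with Chainsaw	1,177,152	2.90 s	27.3 s
`af21org`	AlphaFold structures for 21 model organisms split into domains by CATH-Assign	338,258	2.21 s	11.0 s
`afted`	AlphaFold database structures split into domains by TED and clustered at 50% sequence identity	53,344,209	67.7 s	73.1 s

Search times are for a 150 residue protein (d1a6ja_ in PDB format) on an Intel i9-10980XE CPU with 256 GB RAM and PyTorch 1.11. Times are given for 1 or 100 queries. Note that afted uses exhaustive FAISS searching. This does not change the hits that are found, but the similarity scores will differ slightly; see the paper.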
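
The two timing columns can be combined to separate the one-off setup cost from the cost of each additional query. The arithmetic below is a rough estimate only: it assumes that total time grows linearly with the number of queries, t(n) = setup + n * per_query, which the table does not state.

```python
# (database, seconds for 1 query, seconds for 100 queries), copied from the table above
TIMINGS = [
    ("scope95", 1.35, 2.81),
    ("scope40", 1.32, 2.36),
    ("cath40", 1.38, 2.79),
    ("ecod70", 1.46, 3.82),
    ("pdb100", 2.90, 27.3),
    ("af21org", 2.21, 11.0),
    ("afted", 67.7, 73.1),
]


def marginal_seconds_per_query(t_1, t_100):
    """Estimate the cost of each extra query under the linear assumption."""
    return (t_100 - t_1) / 99


for name, t_1, t_100 in TIMINGS:
    extra = marginal_seconds_per_query(t_1, t_100)
    setup = t_1 - extra  # what remains of the single-query time
    print(f"{name:8s} setup ~{setup:6.2f} s, each extra query ~{extra:.3f} s")
```

Under this assumption almost all of the single-query time for afted is data loading, which is why batching queries with -l pays off most for the largest databases.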

Calculating the score between two structures

To calculate the Progres score between two protein domains:

progres score struc_1.pdb struc_2.pdb

0.7265280485153198

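-f and -g determine the file format of the first and second structures respectively, as above (guess, pdb, mmcif, mmtf or coords).

The order of the domains does not affect the score. A score of 0.8 or more indicates the same fold.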
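
The coords format accepted by -f and -g is convenient when Cα coordinates come from another program rather than a structure file. A minimal sketch of writing such a file follows; the helper name is hypothetical, and it assumes each line holds the x, y and z values of one Cα atom, here written with three decimal places.

```python
def write_coords_file(coords, path):
    """Write one Cα atom per line as whitespace-separated x, y, z values."""
    with open(path, "w") as handle:
        for x, y, z in coords:
            handle.write(f"{x:.3f} {y:.3f} {z:.3f}\n")


# Three made-up Cα positions, for illustration only
example_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.5, -1.25)]
# write_coords_file(example_coords, "struc_1.txt")
# The file could then be scored with: progres score -f coords struc_1.txt struc_2.pdb
```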

Pre-embedding a dataset to search against

To embed a dataset of structures so that it can be searched against:

progres embed -l filepaths.txt -o searchdb.pt

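-l is a text file with information on one structure per line, each of which becomes one entry in the output. White space should separate the file path to the structure from the domain name, and any additional text is optionally treated as a note for the notes column of the results.
-o is the output file path for the PyTorch file containing a dictionary with the embeddings and associated data. It can be read in with torch.load.
-f determines the file format of each structure as above (guess, pdb, mmcif, mmtf or coords).

Again, the structures should correspond to single protein domains. The embeddings are stored as Float16, which has no noticeable effect on search performance.

As an example, you can run the above command from the data directory to generate a database with two structures.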
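
The list file passed to -l can be generated with a short script when the domains are stored as individual files. This is a sketch only: the helper name, the glob pattern and the choice of the file stem as the domain name are all assumptions, and paths containing white space are rejected because the format is whitespace-separated.

```python
from pathlib import Path


def write_structure_list(structure_dir, list_path, pattern="*.pdb", note=""):
    """Write 'file_path domain_name [note]' lines for progres embed -l."""
    paths = sorted(Path(structure_dir).glob(pattern))
    with open(list_path, "w") as handle:
        for path in paths:
            if any(ch.isspace() for ch in str(path)):
                raise ValueError(f"path contains white space: {path}")
            fields = [str(path), path.stem]  # file stem used as the domain name
            if note:
                fields.append(note)  # free text for the notes column
            handle.write(" ".join(fields) + "\n")
    return len(paths)


# Example call (hypothetical directory):
#   write_structure_list("domains", "filepaths.txt", note="in-house set")
```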

Python library

progres can also be used from Python, allowing it to be integrated into other methods:

import progres as pg

# Search as above, returns a list where each entry is a dictionary for a query
# A generator is also available as pg.progres_search_generator
results = pg.progres_search(querystructure="query.pdb", targetdb="scope95")
results[0].keys() # dict_keys(['query_num', 'query', 'query_size', 'database', 'minsimilarity',
                  #            'maxhits', 'domains', 'hits_nres', 'similarities', 'notes'])

# Score as above, returns a float (similarity score 0 to 1)
pg.progres_score("struc_1.pdb", "struc_2.pdb")

# Pre-embed as above, saves a dictionary
pg.progres_embed(structurelist="filepaths.txt", outputfile="searchdb.pt")
import torch
torch.load("searchdb.pt").keys() # dict_keys(['ids', 'embeddings', 'nres', 'notes'])

# Read a structure file into a PyTorch Geometric graph
graph = pg.read_graph("query.pdb")
graph # Data(x=[150, 67], edge_index=[2, 2758], coords=[150, 3])

# Embed a single structure
embedding = pg.embed_structure("query.pdb")
embedding.shape # torch.Size([128])

# Load and reuse the model for speed
model = pg.load_trained_model()
embedding = pg.embed_structure("query.pdb", model=model)

# Embed Cα coordinates and search with the embedding
# This is useful for using progres in existing pipelines that give out Cα coordinates
# queryembeddings should have shape (128) or (n, 128)
#   and should be normalised across the 128 dimension
coords = pg.read_coords("query.pdb")
embedding = pg.embed_coords(coords) # Can take a list of coords or a tensor of shape (nres, 3)
results = pg.progres_search(queryembeddings=embedding, targetdb="scope95")

# Get the similarity score (0 to 1) between two embeddings
# The distance (1 - similarity) is also available as pg.embedding_distance
score = pg.embedding_similarity(embedding, embedding)
score # tensor(1.) in this case since they are the same embedding

# Get all-v-all similarity scores between 1000 embeddings
embs = torch.nn.functional.normalize(torch.randn(1000, 128), dim=1)
scores = pg.embedding_similarity(embs.unsqueeze(0), embs.unsqueeze(1))
scores.shape # torch.Size([1000, 1000])
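
To make the relationship between normalised embeddings and a 0 to 1 score concrete, here is a dependency-free sketch. It rests on an assumption that is not stated in this README: that the score is the cosine similarity of the two embeddings rescaled from [-1, 1] to [0, 1] as (1 + cos) / 2. The function names are hypothetical and this is not the package's implementation; check the package source before relying on the exact formula.

```python
import math


def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity(emb_1, emb_2):
    """Rescale the dot product of two unit vectors from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(emb_1, emb_2))
    return (1 + dot) / 2  # assumption: score = (1 + cosine similarity) / 2


def all_v_all(embs):
    """Return the full matrix of pairwise scores as nested lists."""
    return [[similarity(a, b) for b in embs] for a in embs]


embs = [normalise(v) for v in ([1.0, 0.0], [0.0, 2.0], [-3.0, 0.0])]
for row in all_v_all(embs):
    print([round(score, 3) for score in row])
```

Under this assumption identical embeddings score 1, which matches the tensor(1.) result shown above, while orthogonal embeddings score 0.5 and opposite embeddings score 0.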