blastn下载blastn源代码下载

blastn

其他源码

1.0.0

下载

如何自动下载blastn数据库、运行blastn并提取blastn命中序列？

作者：Asad Prodhan 博士https://asadprodhan.github.io/

内容

01. Blastn数据库下载和更新[由NCBI工具自动化]

02. Blastn数据库下载和更新[由bash脚本自动执行]

03. Blastn 执行 [用户交互式 bash 脚本 & Nextflow DSL2 脚本]

04.Blastn命中序列提取【用户交互式bash脚本】

05.Blastn常见错误及解决方法

使用 NCBI 提供的脚本下载或更新blastn数据库

使用 NCBI 提供的脚本

为blastn创建conda环境

conda create -n blastn_db

激活blastn环境

conda activate blastn_db

从以下链接复制最新的blast可执行文件的链接

https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

下载可执行文件如下

wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.15.0+-x64-linux.tar.gz

将下载的文件解压如下

tar -zxvf ncbi-blast-2.15.0+-x64-linux.tar.gz

cd ncbi-blast-2.15.0+/bin

将此路径添加到 PATH 环境变量中。请参阅我的教程，了解如何执行此操作：

https://github.com/asadprodhan/About-the-PATH

将ncbi-blast-2.15.0+/bin目录中的update_blastdb.pl复制到要下载blastn数据库的目录

cp ./update_blastdb.pl databaseDirectory

运行脚本如下

run ./update_blastdb.pl --decompress nt

下载将自动开始

下载完成后，确认所有nt文件都已下载。您可以通过交叉检查 https://ftp.ncbi.nlm.nih.gov/blast/db/ 和您的目录之间的 nt 文件编号来做到这一点

如果您的目录中没有所有 nt 文件，那么您将收到“BLAST 数据库错误：无法在引用的别名文件中找到卷或别名文件 nt.xxx”

您可以使用以下 bash 脚本下载丢失的 nt 文件

当所有nt文件下载完毕后，您可以删除md5文件，如下所示：

rm -r *.md5

使用 bash 脚本

准备一个包含所有nt.??.tar.gz 文件列表的metadata.tsv 文件。 nt.??.tar.gz 文件位于

https://ftp.ncbi.nlm.nih.gov/blast/db/

文件metadata.tsv 如下所示：

图 1：Blastn 数据库 nt 文件。

将metadata.tsv文件和以下blastn脚本放在要下载blastn数据库的目录中
检查文件格式如下：

file *

所有文件都必须采用 UNIX 格式，即仅限 ASCII 文本。在 Windows 计算机中编写的文件将具有 Windows 格式，即 ASCII 文本，带有 CRLF 行终止符。通过运行以下命令将这些文件转换为 unix 格式：

dos2unix *

检查文件是否可执行

ls -l

运行以下命令使文件可执行

chmod +x *

自动下载blastn数据库的Bash脚本

下载

#!/bin/bash

#metadata
metadata=./*.tsv
#
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold=$(tput bold)
reset=`tput sgr0` # turns off all atribute
while IFS=, read -r field1   

do  
    echo "${Red}${Bold}Downloading ${reset}: "${field1}"" 
    echo ""
    wget https://ftp.ncbi.nlm.nih.gov/blast/db/"${field1}" 
    echo "${Green}${Bold}Downloaded ${reset}: ${field1}"
    echo ""
    echo "${Green}${Bold}Extracting ${reset}: ${field1}"
    tar -xvzf "${field1}"
    echo "${Green}${Bold}Extracted ${reset}: ${field1}"
    echo ""
    echo "${Green}${Bold}Deleting zipped file ${reset}: ${field1}"
    rm -r "${field1}"
    echo "${Green}${Bold}Deleted ${reset}: ${field1}"
    echo ""

done < ${metadata}

x：告诉 tar 提取文件

v：“v”代表“verbose”，在解压继续时列出所有文件

z：告诉 tar 命令解压/解压缩文件 (gzip)

f：告诉 tar 将分配一个文件来使用

pkill -9 wget # 中止正在运行的 wget 下载

像“ tar.gz”这样的通配符不适用于“tar”。因为带有“ ”的 tar不仅将自身限制为目录中现有的 tar 文件，而且还扩展到虚构的文件名 (!)，例如 abc.tar.gz def.tar.gz ghi.tar .gz 或 1.gz、2.gz 和 3.gz 等。由于这些文件不存在，tar 无法找到它们并产生“在存档中找不到”错误。当您有多个 tar 文件需要解压缩时，以下循环函数可以解决此问题。

for file in *.tar.gz; do tar -xvzf "$file"; done

运行爆炸

用于运行blastn的交互式bash脚本

下载

#!/bin/bash -i
# ask for query file
echo Enter your input file name including extension and hit ENTER
read -e F
# ask for an output directory name
echo Enter an output directory name and hit ENTER
read -e outDir
# ask for the blast database path
echo Enter the path to the blast database and hit ENTER
read -e BlastDB
echo ""
# start monitoring run time
SECONDS=0
# make blast results directory
mkdir ${outDir}
# prepare output file name prefix
baseName=$(basename $F .fasta)
# Run blastn with .asn output
echo blastn in progress...
blastn -db ${BlastDB} -num_alignments 1 -num_threads 16 -outfmt 11 -query $PWD/$F > $PWD/${outDir}/${baseName}.asn
# convert output file from asn to xml format
echo converting output file from asn to xml format
blast_formatter -archive $PWD/${outDir}/${baseName}.asn -outfmt 5 > $PWD/${outDir}/${baseName}.xml
# convert output file from asn to tsv format
echo converting output file from asn to tsv format
blast_formatter -archive $PWD/${outDir}/${baseName}.asn -outfmt 0 > $PWD/${outDir}/${baseName}.tsv
# display the compute time
if (( $SECONDS > 3600 )) ; then
    let "hours=SECONDS/3600"
    let "minutes=(SECONDS%3600)/60"
    let "seconds=(SECONDS%3600)%60"
    echo "Completed in $hours hour(s), $minutes minute(s) and $seconds second(s)"
elif (( $SECONDS > 60 )) ; then
    let "minutes=(SECONDS%3600)/60"
    let "seconds=(SECONDS%3600)%60"
    echo "Completed in $minutes minute(s) and $seconds second(s)"
else
    echo "Completed in $SECONDS seconds"
fi

默认身份百分比为 90%

默认查询覆盖率是 0%

E值越小，匹配越好

参考：https://www.metagenomics.wiki/tools/blast/evalue

位分数越高，序列相似性越好

用于运行blastn的Nextflow脚本

该脚本允许您

修改默认的blastn参数
使用容器在本地和远程计算机上运行blastn（这不需要安装和更新blastn软件。但是，您需要安装Nextflow和Singularity）
自动对多个样品进行喷砂分析

下载

Nextflow main.nf 脚本

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

//data_location
params.in = "$PWD/*.fasta"
params.outdir = './results'
params.db = "./blastn_db"
params.evalue='0.05'
params.identity='90'
params.qcov='90'

// blastn

process blastn {

	errorStrategy 'ignore'
	tag { file }
	publishDir "${params.outdir}/blastn", mode:'copy'

	input:
	path (file) 
	path db 

	output:
	path "${file.simpleName}_blast.xml"
	path "${file.simpleName}_blast.html"
	path "${file.simpleName}_blast_sort_withHeader.tsv"

	script:
	"""
	blastn 
		-query $file -db ${params.db}/nt 
		-outfmt 11 -out ${file.simpleName}_blast.asn 
		-evalue ${params.evalue} 
		-perc_identity ${params.identity} 
		-qcov_hsp_perc ${params.qcov} 
		-num_threads ${task.cpus}

	blast_formatter 
		-archive ${file.simpleName}_blast.asn 
		-outfmt 5 -out ${file.simpleName}_blast.xml

	blast_formatter 
		-archive ${file.simpleName}_blast.asn 
		-html -out ${file.simpleName}_blast.html

	blast_formatter 
		-archive ${file.simpleName}_blast.asn 
		-outfmt "6 qaccver saccver pident length evalue bitscore stitle" -out ${file.simpleName}_blast_unsort.tsv

	sort -k1,1 -k5,5n -k4,4nr -k6,6nr ${file.simpleName}_blast_unsort.tsv > ${file.simpleName}_blast_sort.tsv
	awk 'BEGIN{print "qaccvertsaccvertpidenttlengthtevaluetbitscoretmismatchtgapopentqstarttqendtsstarttsendtstitle"}1' ${file.simpleName}_blast_sort.tsv > ${file.simpleName}_blast_sort_withHeader.tsv

	"""
}

workflow {

	query_ch = Channel.fromPath(params.in)
	db = file( params.db )
	blastn (query_ch, db)
        
}

Nextflow nextflow.config 脚本

下载

resume = true

process {
    withName:'blastn|blastIndex'                 { container = 'quay.io/biocontainers/blast:2.14.1--pl5321h6f7f691_0' }
}

singularity {
 enabled = true
 autoMounts = true
 //runOptions = '-e TERM=xterm-256color'
 envWhitelist = 'TERM'
}

命令

nextflow run main.nf --evalue=0.05 --identity='90' --qcov='0' --db="/path/to/blastn_database"

提取blastn命中的序列

Bash 脚本自动提取blastn命中的序列

下载

#!/bin/bash -i

#
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold=$(tput bold)
reset=`tput sgr0` # turns off all atribute

# ask for blastn output file
echo ""
echo ""
echo "${Red}${Bold}Enter blastn output tsv file and hit ENTER ${reset}" 
echo ""
read -e F
echo ""
# ask for the key word
echo "${Red}${Bold}Enter filter word (CASE-SENSITIVE) and hit ENTER ${reset}" 
echo ""
read -e KeyWord
echo ""

# ask for the blastn query fasta file
echo "${Red}${Bold}Enter blastn query fasta file and hit ENTER ${reset}" 
echo ""
read -e Query
echo ""

# prepare output file name prefix
baseName=$(basename $F .tsv)
echo ""

# filtering the selected blastn hits
echo ""
echo "${Green}${Bold}Filtering the blastn hits containing ${reset}: "${KeyWord}"" 
echo ""
grep ${KeyWord} $F > ${baseName}_${KeyWord}.tsv

# collecting the query IDs from the selected blastn hits

echo "${Green}${Bold}Collecting the query IDs from the selected blastn hits ${reset}: "${KeyWord}"" 
echo ""
awk '{print $1}' ${baseName}_${KeyWord}.tsv > IDs.txt

# extracting the sequences for the selected blastn hits

echo "${Green}${Bold}Extracting the sequences for the selected blastn hits ${reset}: "${KeyWord}"" 
echo ""
bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"n"$seq}' ${Query} > ${baseName}_${KeyWord}.fasta
echo ""

echo "${Green}${Bold}Done ${reset}: "${KeyWord}"" 
echo ""
echo ""

用于提取blastn命中序列的bash脚本是用户交互的。它将请求输入，自动处理它们，并生成一个包含预期 fasta 序列的文件。请参阅下面的屏幕截图：

图 2：blastn_hits_sequences_extraction_auto_AP 脚本的工作原理。