blastn下載blastn原始碼下載

blastn

其他源碼

1.0.0

下載

如何自動下載blastn資料庫、運行blastn並提取blastn命中序列？

作者：Asad Prodhan 博士https://asadprodhan.github.io/

內容

01. Blastn資料庫下載更新[由NCBI工具自動化]

02. Blastn資料庫下載更新[由bash腳本自動執行]

03. Blastn 執行 [使用者互動式 bash 腳本 & Nextflow DSL2 腳本]

04.Blastn命中序列擷取【使用者互動式bash腳本】

05.Blastn常見錯誤及解決方法

使用 NCBI 提供的腳本下載或更新blastn資料庫

使用 NCBI 提供的腳本

為blastn創建conda環境

conda create -n blastn_db

啟動blastn環境

conda activate blastn_db

從以下鏈接複製最新的blast可執行檔的鏈接

https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

下載可執行檔如下

wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.15.0+-x64-linux.tar.gz

將下載的檔案解壓縮如下

tar -zxvf ncbi-blast-2.15.0+-x64-linux.tar.gz

cd ncbi-blast-2.15.0+/bin

將此路徑加入 PATH 環境變數。請參閱我的教程，以了解如何執行此操作：

https://github.com/asadprodhan/About-the-PATH

將ncbi-blast-2.15.0+/bin目錄中的update_blastdb.pl複製到要下載blastn資料庫的目錄

cp ./update_blastdb.pl databaseDirectory

運行腳本如下

run ./update_blastdb.pl --decompress nt

下載將自動開始

下載完成後，確認所有nt檔案都已下載。您可以透過交叉檢查 https://ftp.ncbi.nlm.nih.gov/blast/db/ 和您的目錄之間的 nt 檔案編號來做到這一點

如果您的目錄中沒有所有 nt 文件，那麼您將收到“BLAST 資料庫錯誤：無法在引用的別名檔案中找到磁碟區或別名檔案 nt.xxx”

您可以使用以下 bash 腳本下載遺失的 nt 文件

當所有nt文件下載完畢後，您可以刪除md5文件，如下所示：

rm -r *.md5

使用 bash 腳本

準備一個包含所有nt.??.tar.gz 檔案清單的metadata.tsv 檔案。 nt.??.tar.gz 檔案位於

https://ftp.ncbi.nlm.nih.gov/blast/db/

文件metadata.tsv 如下所示：

圖 1：Blastn 資料庫 nt 檔。

將metadata.tsv檔案和以下blastn腳本放在要下載blastn資料庫的目錄中
檢查文件格式如下：

file *

所有文件都必須採用 UNIX 格式，即僅限 ASCII 文字。在 Windows 電腦中編寫的文件將具有 Windows 格式，即 ASCII 文本，帶有 CRLF 行終止符。透過執行以下命令將這些檔案轉換為 unix 格式：

dos2unix *

檢查文件是否可執行

ls -l

執行以下命令使檔案可執行

chmod +x *

自動下載blastn資料庫的Bash腳本

下載

#!/bin/bash

#metadata
metadata=./*.tsv
#
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold=$(tput bold)
reset=`tput sgr0` # turns off all atribute
while IFS=, read -r field1   

do  
    echo "${Red}${Bold}Downloading ${reset}: "${field1}"" 
    echo ""
    wget https://ftp.ncbi.nlm.nih.gov/blast/db/"${field1}" 
    echo "${Green}${Bold}Downloaded ${reset}: ${field1}"
    echo ""
    echo "${Green}${Bold}Extracting ${reset}: ${field1}"
    tar -xvzf "${field1}"
    echo "${Green}${Bold}Extracted ${reset}: ${field1}"
    echo ""
    echo "${Green}${Bold}Deleting zipped file ${reset}: ${field1}"
    rm -r "${field1}"
    echo "${Green}${Bold}Deleted ${reset}: ${field1}"
    echo ""

done < ${metadata}

x：告訴 tar 提取文件

v：“v”代表“verbose”，在解壓縮繼續時列出所有文件

z：告訴 tar 指令解壓縮/解壓縮檔 (gzip)

f：告訴 tar 將會分配一個檔案來使用

pkill -9 wget # 中止正在執行的 wget 下載

像「 tar.gz」這樣的通配符不適用於「tar」。因為帶有「」的 tar不僅將自身限制為目錄中現有的 tar 文件，而且還擴展到虛構的文件名 (!)，例如 abc.tar.gz def.tar.gz ghi.tar .gz 或1 .gz、2.gz 和3.gz 等。當您有多個 tar 檔案需要解壓縮時，以下循環函數可以解決此問題。

for file in *.tar.gz; do tar -xvzf "$file"; done

運行爆炸

用於運行blastn的互動式bash腳本

下載

#!/bin/bash -i
# ask for query file
echo Enter your input file name including extension and hit ENTER
read -e F
# ask for an output directory name
echo Enter an output directory name and hit ENTER
read -e outDir
# ask for the blast database path
echo Enter the path to the blast database and hit ENTER
read -e BlastDB
echo ""
# start monitoring run time
SECONDS=0
# make blast results directory
mkdir ${outDir}
# prepare output file name prefix
baseName=$(basename $F .fasta)
# Run blastn with .asn output
echo blastn in progress...
blastn -db ${BlastDB} -num_alignments 1 -num_threads 16 -outfmt 11 -query $PWD/$F > $PWD/${outDir}/${baseName}.asn
# convert output file from asn to xml format
echo converting output file from asn to xml format
blast_formatter -archive $PWD/${outDir}/${baseName}.asn -outfmt 5 > $PWD/${outDir}/${baseName}.xml
# convert output file from asn to tsv format
echo converting output file from asn to tsv format
blast_formatter -archive $PWD/${outDir}/${baseName}.asn -outfmt 0 > $PWD/${outDir}/${baseName}.tsv
# display the compute time
if (( $SECONDS > 3600 )) ; then
    let "hours=SECONDS/3600"
    let "minutes=(SECONDS%3600)/60"
    let "seconds=(SECONDS%3600)%60"
    echo "Completed in $hours hour(s), $minutes minute(s) and $seconds second(s)"
elif (( $SECONDS > 60 )) ; then
    let "minutes=(SECONDS%3600)/60"
    let "seconds=(SECONDS%3600)%60"
    echo "Completed in $minutes minute(s) and $seconds second(s)"
else
    echo "Completed in $SECONDS seconds"
fi

預設身份百分比為 90%

預設查詢覆蓋率是 0%

E值越小，配對越好

參考：https://www.metagenomics.wiki/tools/blast/evalue

位分數越高，序列相似性越好

用於運行blastn的Nextflow腳本

該腳本允許您

修改預設的blastn參數
使用容器在本機和遠端電腦上執行blastn（這不需要安裝和更新blastn軟體。但是，您需要安裝Nextflow和Singularity）
自動對多個樣品進行噴砂分析

下載

Nextflow main.nf 腳本

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

//data_location
params.in = "$PWD/*.fasta"
params.outdir = './results'
params.db = "./blastn_db"
params.evalue='0.05'
params.identity='90'
params.qcov='90'

// blastn

process blastn {

	errorStrategy 'ignore'
	tag { file }
	publishDir "${params.outdir}/blastn", mode:'copy'

	input:
	path (file) 
	path db 

	output:
	path "${file.simpleName}_blast.xml"
	path "${file.simpleName}_blast.html"
	path "${file.simpleName}_blast_sort_withHeader.tsv"

	script:
	"""
	blastn 
		-query $file -db ${params.db}/nt 
		-outfmt 11 -out ${file.simpleName}_blast.asn 
		-evalue ${params.evalue} 
		-perc_identity ${params.identity} 
		-qcov_hsp_perc ${params.qcov} 
		-num_threads ${task.cpus}

	blast_formatter 
		-archive ${file.simpleName}_blast.asn 
		-outfmt 5 -out ${file.simpleName}_blast.xml

	blast_formatter 
		-archive ${file.simpleName}_blast.asn 
		-html -out ${file.simpleName}_blast.html

	blast_formatter 
		-archive ${file.simpleName}_blast.asn 
		-outfmt "6 qaccver saccver pident length evalue bitscore stitle" -out ${file.simpleName}_blast_unsort.tsv

	sort -k1,1 -k5,5n -k4,4nr -k6,6nr ${file.simpleName}_blast_unsort.tsv > ${file.simpleName}_blast_sort.tsv
	awk 'BEGIN{print "qaccvertsaccvertpidenttlengthtevaluetbitscoretmismatchtgapopentqstarttqendtsstarttsendtstitle"}1' ${file.simpleName}_blast_sort.tsv > ${file.simpleName}_blast_sort_withHeader.tsv

	"""
}

workflow {

	query_ch = Channel.fromPath(params.in)
	db = file( params.db )
	blastn (query_ch, db)
        
}

Nextflow nextflow.config 腳本

下載

resume = true

process {
    withName:'blastn|blastIndex'                 { container = 'quay.io/biocontainers/blast:2.14.1--pl5321h6f7f691_0' }
}

singularity {
 enabled = true
 autoMounts = true
 //runOptions = '-e TERM=xterm-256color'
 envWhitelist = 'TERM'
}

命令

nextflow run main.nf --evalue=0.05 --identity='90' --qcov='0' --db="/path/to/blastn_database"

提取blastn命中的序列

Bash 腳本自動擷取blastn命中的序列

下載

#!/bin/bash -i

#
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold=$(tput bold)
reset=`tput sgr0` # turns off all atribute

# ask for blastn output file
echo ""
echo ""
echo "${Red}${Bold}Enter blastn output tsv file and hit ENTER ${reset}" 
echo ""
read -e F
echo ""
# ask for the key word
echo "${Red}${Bold}Enter filter word (CASE-SENSITIVE) and hit ENTER ${reset}" 
echo ""
read -e KeyWord
echo ""

# ask for the blastn query fasta file
echo "${Red}${Bold}Enter blastn query fasta file and hit ENTER ${reset}" 
echo ""
read -e Query
echo ""

# prepare output file name prefix
baseName=$(basename $F .tsv)
echo ""

# filtering the selected blastn hits
echo ""
echo "${Green}${Bold}Filtering the blastn hits containing ${reset}: "${KeyWord}"" 
echo ""
grep ${KeyWord} $F > ${baseName}_${KeyWord}.tsv

# collecting the query IDs from the selected blastn hits

echo "${Green}${Bold}Collecting the query IDs from the selected blastn hits ${reset}: "${KeyWord}"" 
echo ""
awk '{print $1}' ${baseName}_${KeyWord}.tsv > IDs.txt

# extracting the sequences for the selected blastn hits

echo "${Green}${Bold}Extracting the sequences for the selected blastn hits ${reset}: "${KeyWord}"" 
echo ""
bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"n"$seq}' ${Query} > ${baseName}_${KeyWord}.fasta
echo ""

echo "${Green}${Bold}Done ${reset}: "${KeyWord}"" 
echo ""
echo ""

用於提取blastn命中序列的bash腳本是用戶互動的。它將請求輸入，自動處理它們，並產生一個包含預期 fasta 序列的檔案。請參閱下面的螢幕截圖：

圖 2：blastn_hits_sequences_extraction_auto_AP 腳本的工作原理。