How to automatically download a blastn database, run blastn, and extract blastn hit sequences

Author: Dr Asad Prodhan https://asadprodhan.github.io/

Contents

01. Download and update the blastn database [automated with NCBI tools]

02. Download and update the blastn database [automated with a bash script]

03. Run blastn [user-interactive bash script & Nextflow DSL2 script]

04. Extract blastn hit sequences [user-interactive bash script]

05. Common blastn errors and solutions

Download or update the blastn database using the NCBI-supplied script

Using the NCBI-supplied script

Create a conda environment for blastn

conda create -n blastn_db

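Activate the blastn environment

conda activate blastn_db

Copy the link to the latest BLAST executable from the following page.

https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Download the executable as follows (the version number in the file name changes with each release, so use the file name currently listed on that page)

wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.15.0+-x64-linux.tar.gz

Extract the downloaded file as follows

tar -zxvf ncbi-blast-2.15.0+-x64-linux.tar.gz

Move to the following directory

cd ncbi-blast-2.15.0+/bin

Add this path to your PATH environment variable. See my tutorial on how to do this.

https://github.com/asadprodhan/About-the-PATH

Copy update_blastdb.pl from the ncbi-blast-2.15.0+/bin directory to the directory where you want to download the blastn database.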

cp ./update_blastdb.pl databaseDirectory

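Run the script from that directory as follows.

./update_blastdb.pl --decompress nt

The download starts automatically.

When the download finishes, check that all the nt files have been downloaded. You can do this by cross-checking the nt file numbers between https://ftp.ncbi.nlm.nih.gov/blast/db/ and your directory.

If your directory does not contain all the nt files, you will get the error "BLAST Database error: Could not find volume or alias file nt.xxx referenced in alias file".

You can download the missing nt files with the bash script in the next section.

Once all the nt files have been downloaded, you can delete the md5 files as follows.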

rm -r *.md5

Using a bash script

Prepare a metadata.tsv file that lists all the nt.??.tar.gz files. The nt.??.tar.gz files are located at

https://ftp.ncbi.nlm.nih.gov/blast/db/

The metadata.tsv file looks like this:

Figure 1: Blastn database nt files.
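
One way to build the metadata.tsv list is to pull the archive names out of the directory listing at the URL above. The sketch below is only an illustration: the function name extract_nt_archives is invented here, it assumes GNU grep, and it assumes the listing mentions each archive by its file name (for example nt.00.tar.gz). The listing in the example is made up so that the block runs without a network connection.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts): read a directory
# listing on standard input and print the unique nt.<number>.tar.gz names.
extract_nt_archives() {
    # \b keeps names such as env_nt.00.tar.gz from matching, because "_" is a
    # word character; names repeated in .md5 entries are collapsed by sort -u.
    grep -oE '\bnt\.[0-9]+\.tar\.gz\b' | sort -u
}

# Example with a small made-up listing instead of the live page
printf '%s\n' \
    '<a href="env_nt.00.tar.gz">env_nt.00.tar.gz</a>' \
    '<a href="nt.00.tar.gz">nt.00.tar.gz</a>' \
    '<a href="nt.00.tar.gz.md5">nt.00.tar.gz.md5</a>' \
    '<a href="nt.01.tar.gz">nt.01.tar.gz</a>' | extract_nt_archives
```

With the live listing, the same function could be fed by a command such as wget -qO- https://ftp.ncbi.nlm.nih.gov/blast/db/ | extract_nt_archives > metadata.tsv, provided the page format matches the assumption above.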

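Put the metadata.tsv file and the following blastn script in the directory where you want to download the blastn database.

Check the file format as follows.

file *

All files must be in Unix format, which the file command reports as "ASCII text". Files written on a Windows computer are in Windows format, which it reports as "ASCII text, with CRLF line terminators". Convert such files to Unix format by running the following command.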

dos2unix *
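
Before converting everything, it can help to see which files actually carry Windows line endings. This is a minimal sketch, assuming bash and grep; list_crlf_files is a name invented for this example and the two files are created only for the demonstration.

```shell
#!/bin/bash
# Hypothetical helper: print the names of the given files that contain a
# carriage return (the "\r" of a Windows CRLF line ending).
list_crlf_files() {
    local f
    for f in "$@"; do
        # grep -q stops at the first match; $'\r' is bash syntax for a carriage return
        if [ -f "$f" ] && grep -q $'\r' -- "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: one Unix file and one Windows file in a temporary directory
demo_dir="$(mktemp -d)"
printf 'nt.00.tar.gz\n' > "${demo_dir}/unix.tsv"
printf 'nt.00.tar.gz\r\n' > "${demo_dir}/windows.tsv"
list_crlf_files "${demo_dir}"/*.tsv    # prints only the path of windows.tsv
rm -r "${demo_dir}"
```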

Check whether the files are executable.

ls -l

Make the files executable by running the following command.

chmod +x *

Bash script to download the blastn database automatically

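#!/bin/bash

# metadata: the .tsv file in this directory that lists the nt.??.tar.gz file names
metadata=(./*.tsv)

# terminal colours
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold="$(tput bold)"
reset="$(tput sgr0)" # turns off all attributes

# read one archive name per line (first tab-separated column)
while IFS=$'\t' read -r field1 _
do
    # skip empty lines
    [ -n "${field1}" ] || continue
    echo "${Red}${Bold}Downloading ${reset}: ${field1}"
    echo ""
    # if the download fails, move on to the next file instead of extracting and deleting a missing archive
    wget "https://ftp.ncbi.nlm.nih.gov/blast/db/${field1}" || continue
    echo "${Green}${Bold}Downloaded ${reset}: ${field1}"
    echo ""
    echo "${Green}${Bold}Extracting ${reset}: ${field1}"
    tar -xvzf "${field1}"
    echo "${Green}${Bold}Extracted ${reset}: ${field1}"
    echo ""
    echo "${Green}${Bold}Deleting zipped file ${reset}: ${field1}"
    rm "${field1}"
    echo "${Green}${Bold}Deleted ${reset}: ${field1}"
    echo ""

done < "${metadata[0]}"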

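x: tells tar to extract files.

v: stands for "verbose" and lists every file as the extraction proceeds.

z: tells tar to decompress the archive with gzip.

f: tells tar that the next argument is the archive file to work on.

pkill -9 wget # stops any running wget download

A wildcard such as '*.tar.gz' does not work with tar when several archives match. The shell expands the wildcard into a list of file names (for example abc.tar.gz def.tar.gz ghi.tar.gz); tar opens the first name as the archive and treats the remaining names as members to extract from it. Because those members do not exist inside the first archive, tar cannot find them and reports a 'Not found in archive' error. The following loop overcomes this problem when you have several tar files to decompress.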

for file in *.tar.gz; do tar -xvzf "$file"; done
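
The same loop can be wrapped in a small function that also copes with a directory containing no archives. This is a sketch under stated assumptions: extract_all_archives is a hypothetical name, and the two tiny archives are built only so that the example is self-contained.

```shell
#!/bin/bash
# Hypothetical helper: extract every .tar.gz archive found in a directory,
# one tar call per archive, into that same directory.
extract_all_archives() {
    local dir="$1" archive
    for archive in "${dir}"/*.tar.gz; do
        # without a match the pattern stays literal, so skip anything that is not a file
        [ -f "$archive" ] || continue
        tar -xzf "$archive" -C "$dir" || return 1
    done
}

# Example: build two small archives, then extract both
demo_dir="$(mktemp -d)"
for name in a b; do
    printf 'volume %s\n' "$name" > "${demo_dir}/${name}.txt"
    tar -czf "${demo_dir}/${name}.tar.gz" -C "$demo_dir" "${name}.txt"
    rm "${demo_dir}/${name}.txt"
done
extract_all_archives "$demo_dir"
ls "$demo_dir"
rm -r "$demo_dir"
```

Calling tar once per archive is what avoids the 'Not found in archive' error described above.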

Run blastn

User-interactive bash script to run blastn

#!/bin/bash -i
# ask for query file
echo "Enter your input file name including extension and hit ENTER"
read -e -r F
# ask for an output directory name
echo "Enter an output directory name and hit ENTER"
read -e -r outDir
# ask for the blast database path
echo "Enter the path to the blast database and hit ENTER"
read -e -r BlastDB
echo ""
# start monitoring run time
SECONDS=0
# make blast results directory (-p: no error if it already exists)
mkdir -p "${outDir}"
# prepare output file name prefix
baseName=$(basename "$F" .fasta)
# Run blastn with .asn output
echo "blastn in progress..."
blastn -db "${BlastDB}" -num_alignments 1 -num_threads 16 -outfmt 11 -query "$PWD/$F" > "$PWD/${outDir}/${baseName}.asn"
# convert output file from asn to xml format
echo "converting output file from asn to xml format"
blast_formatter -archive "$PWD/${outDir}/${baseName}.asn" -outfmt 5 > "$PWD/${outDir}/${baseName}.xml"
# convert output file from asn to tsv format
# (-outfmt 6 is the tabular format; -outfmt 0 would write pairwise text instead)
echo "converting output file from asn to tsv format"
blast_formatter -archive "$PWD/${outDir}/${baseName}.asn" -outfmt 6 > "$PWD/${outDir}/${baseName}.tsv"
# display the compute time (read SECONDS once so that all parts agree)
elapsed=$SECONDS
hours=$(( elapsed / 3600 ))
minutes=$(( (elapsed % 3600) / 60 ))
seconds=$(( elapsed % 60 ))
if (( elapsed >= 3600 )) ; then
    echo "Completed in $hours hour(s), $minutes minute(s) and $seconds second(s)"
elif (( elapsed >= 60 )) ; then
    echo "Completed in $minutes minute(s) and $seconds second(s)"
else
    echo "Completed in $elapsed seconds"
fi

The default percent identity is 90%.

The default query coverage is 0%.

The smaller the E-value, the better the match.

Reference: https://www.metagenomics.wiki/tools/blast/evalue

The higher the bit score, the better the sequence similarity.
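
Thresholds like these can also be applied to a tabular blastn result after the run. The following is a minimal sketch, not the author's method: filter_hits is a hypothetical name, and it assumes a tab-separated table with pident in column 3 and evalue in column 5 (the column order "qaccver saccver pident length evalue bitscore stitle"). The three rows are made up.

```shell
#!/bin/bash
# Hypothetical helper: keep rows with pident >= min_identity and
# evalue <= max_evalue. Assumed columns: 3 = pident, 5 = evalue.
filter_hits() {
    local min_identity="$1" max_evalue="$2"
    # adding 0 forces a numeric comparison, so 1e-120 is read as a number
    awk -F'\t' -v min_id="$min_identity" -v max_e="$max_evalue" \
        '($3 + 0) >= (min_id + 0) && ($5 + 0) <= (max_e + 0)'
}

# Example with three made-up rows: only q1 passes both thresholds
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    q1 subjectA 98.5 500 1e-120 900 'made-up title A' \
    q2 subjectB 85.0 450 1e-30 300 'made-up title B' \
    q3 subjectC 95.0 60 0.5 40 'made-up title C' | filter_hits 90 0.05
```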

Nextflow script to run blastn

This script lets you:

Modify the default blastn parameters
Run blastn on both local and remote computers using containers (this removes the need to install and update the blastn software; however, you must install Nextflow and Singularity)
Automate the blastn analysis for multiple samples

Nextflow main.nf script

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

//data_location
params.in = "$PWD/*.fasta"
params.outdir = './results'
params.db = "./blastn_db"
params.evalue='0.05'
params.identity='90'
params.qcov='90'

// blastn

process blastn {

	errorStrategy 'ignore'
	tag { file }
	publishDir "${params.outdir}/blastn", mode:'copy'

	input:
	path (file) 
	path db 

	output:
	path "${file.simpleName}_blast.xml"
	path "${file.simpleName}_blast.html"
	path "${file.simpleName}_blast_sort_withHeader.tsv"

	script:
	"""
	blastn \
		-query $file -db ${params.db}/nt \
		-outfmt 11 -out ${file.simpleName}_blast.asn \
		-evalue ${params.evalue} \
		-perc_identity ${params.identity} \
		-qcov_hsp_perc ${params.qcov} \
		-num_threads ${task.cpus}

	blast_formatter \
		-archive ${file.simpleName}_blast.asn \
		-outfmt 5 -out ${file.simpleName}_blast.xml

	blast_formatter \
		-archive ${file.simpleName}_blast.asn \
		-html -out ${file.simpleName}_blast.html

	blast_formatter \
		-archive ${file.simpleName}_blast.asn \
		-outfmt "6 qaccver saccver pident length evalue bitscore stitle" -out ${file.simpleName}_blast_unsort.tsv

	# sort by query, then e-value, then alignment length and bit score (both descending)
	sort -k1,1 -k5,5n -k4,4nr -k6,6nr ${file.simpleName}_blast_unsort.tsv > ${file.simpleName}_blast_sort.tsv
	# the header lists the same seven columns as the -outfmt string above
	awk 'BEGIN{print "qaccver\tsaccver\tpident\tlength\tevalue\tbitscore\tstitle"}1' ${file.simpleName}_blast_sort.tsv > ${file.simpleName}_blast_sort_withHeader.tsv

	"""
}

workflow {

	query_ch = Channel.fromPath(params.in)
	db = file( params.db )
	blastn (query_ch, db)
        
}
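
A common follow-up to the sorted table is to keep only the best hit for each query. The sketch below is an assumption-laden illustration rather than part of the pipeline: best_hit_per_query is a hypothetical name, the input is assumed to have the seven columns written by the last blast_formatter call (without the header line), and GNU sort is assumed for the -g option. It uses -g because plain numeric sort (-n) does not interpret exponents such as the one in 9e-50.

```shell
#!/bin/bash
# Hypothetical helper: from a 7-column table (qaccver saccver pident length
# evalue bitscore stitle) keep one row per query: lowest e-value first,
# then highest bit score.
best_hit_per_query() {
    # -g (GNU sort) compares values such as 9e-50 as floating-point numbers;
    # awk then prints only the first row seen for each query ID.
    LC_ALL=C sort -t $'\t' -k1,1 -k5,5g -k6,6gr | awk -F'\t' '!seen[$1]++'
}

# Example with made-up rows: for q1 the subjectB row (e-value 9e-50) is kept
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    q1 subjectA 99.0 100 2e-10 200 'made-up title A' \
    q1 subjectB 99.0 100 9e-50 400 'made-up title B' \
    q2 subjectC 95.0 80 0.001 90 'made-up title C' | best_hit_per_query
```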

Nextflow nextflow.config script

resume = true

process {
    withName:'blastn|blastIndex'                 { container = 'quay.io/biocontainers/blast:2.14.1--pl5321h6f7f691_0' }
}

singularity {
 enabled = true
 autoMounts = true
 //runOptions = '-e TERM=xterm-256color'
 envWhitelist = 'TERM'
}

Command

nextflow run main.nf --evalue=0.05 --identity='90' --qcov='0' --db="/path/to/blastn_database"

Extract the sequences of blastn hits

Bash script to extract the sequences of blastn hits automatically

#!/bin/bash -i

#
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold=$(tput bold)
reset="$(tput sgr0)" # turns off all attributes

# ask for blastn output file
echo ""
echo ""
echo "${Red}${Bold}Enter blastn output tsv file and hit ENTER ${reset}" 
echo ""
read -e F
echo ""
# ask for the key word
echo "${Red}${Bold}Enter filter word (CASE-SENSITIVE) and hit ENTER ${reset}" 
echo ""
read -e KeyWord
echo ""

# ask for the blastn query fasta file
echo "${Red}${Bold}Enter blastn query fasta file and hit ENTER ${reset}" 
echo ""
read -e Query
echo ""

# prepare output file name prefix
baseName=$(basename "$F" .tsv)
echo ""

# filtering the selected blastn hits
echo ""
echo "${Green}${Bold}Filtering the blastn hits containing ${reset}: ${KeyWord}"
echo ""
grep -- "${KeyWord}" "$F" > "${baseName}_${KeyWord}.tsv"

# collecting the query IDs from the selected blastn hits

echo "${Green}${Bold}Collecting the query IDs from the selected blastn hits ${reset}: ${KeyWord}"
echo ""
awk '{print $1}' "${baseName}_${KeyWord}.tsv" > IDs.txt

# extracting the sequences for the selected blastn hits

echo "${Green}${Bold}Extracting the sequences for the selected blastn hits ${reset}: ${KeyWord}"
echo ""
# print each matching record as a ">name" line followed by its sequence on the next line
bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' "${Query}" > "${baseName}_${KeyWord}.fasta"
echo ""

echo "${Green}${Bold}Done ${reset}: ${KeyWord}"
echo ""
echo ""

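The bash script for extracting blastn hit sequences is user-interactive. It asks for the inputs, processes them automatically, and produces a file containing the expected fasta sequences. See the screenshot below.

Figure 2: How the blastn_hits_sequences_extraction_auto_AP script works.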
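
Where bioawk is not installed, the extraction step can be approximated with plain awk. This is a sketch, not the script shown in the figure: extract_fasta_by_ids is a hypothetical name, the ID file is assumed to hold one query ID per line (like IDs.txt above), and a record's ID is taken to be the header text up to the first blank. Unlike the bioawk command, it prints each record as it appears in the file, keeping the full header line and the original line wrapping. The sequences in the example are made up.

```shell
#!/bin/bash
# Hypothetical helper: print the FASTA records whose ID (the header text up to
# the first blank) is listed, one per line, in an ID file.
extract_fasta_by_ids() {
    local id_file="$1" fasta_file="$2"
    awk -v id_file="$id_file" '
        BEGIN { while ((getline id < id_file) > 0) wanted[id] = 1 }
        /^>/  { name = substr($1, 2); keep = (name in wanted) }
        keep  { print }
    ' "$fasta_file"
}

# Example with a made-up query file: only the seq2 record is printed
demo_dir="$(mktemp -d)"
printf 'seq2\n' > "${demo_dir}/IDs.txt"
printf '>seq1 first\nAAAA\n>seq2 second\nCCCC\nGGGG\n>seq3\nTTTT\n' > "${demo_dir}/query.fasta"
extract_fasta_by_ids "${demo_dir}/IDs.txt" "${demo_dir}/query.fasta"
rm -r "$demo_dir"
```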

Common blastn errors and solutions

Q: How do I resolve the blastn database error 'No alias or index file found'?

Figure 3: Blastn database error "No alias or index file found."

Solution

This error can be resolved by adjusting the script as follows.

Add 'nt' to the end of the database path, as in /path/to/the/blastn/db/nt

See the database path in the blastn script above. Likewise, use '/nr' for blastp.

If blastn is the first or only process in the Nextflow script, the process can use the path to the database directly. Otherwise, the database must be supplied as a file; see the reference below. In that case the process input must include path(db) in addition to path(query_sequence). See the blastn script above.

https://stackoverflow.com/questions/75465741/path-not-being-Detected-by-nextflow
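
A quick way to tell a database directory from a database prefix is to look for the files that the error message refers to. The check below is a rough sketch: is_nucleotide_db_prefix is a hypothetical name, and it assumes a nucleotide database whose prefix has either an alias file (.nal) or an index file (.nin) next to it. The demo directory stands in for a real database directory.

```shell
#!/bin/bash
# Hypothetical helper: succeed if PREFIX looks like a nucleotide BLAST database
# prefix, i.e. PREFIX.nal (alias file) or PREFIX.nin (index file) exists.
is_nucleotide_db_prefix() {
    local prefix="$1"
    [ -f "${prefix}.nal" ] || [ -f "${prefix}.nin" ]
}

# Example: the directory alone fails the check, the directory plus "/nt" passes
demo_dir="$(mktemp -d)"
touch "${demo_dir}/nt.nal"
for candidate in "$demo_dir" "${demo_dir}/nt"; do
    if is_nucleotide_db_prefix "$candidate"; then
        echo "usable as -db value: $candidate"
    else
        echo "not a database prefix: $candidate"
    fi
done
rm -r "$demo_dir"
```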

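Q: How do I resolve the blastn database error 'Not a valid version 4 database'?

Figure 4: Blastn database error "Not a valid version 4 database."

Solution

This is a blast version conflict.
Creating the conda environment automatically installs blast v2.6, which cannot use the latest blast nr database.
You need an up-to-date version such as blast v2.15.0 to use the latest blast nr database.
Check which blastn version you are using.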

blastn -version

Update blastn to the latest version.

conda install -c bioconda blast
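
The version check can be scripted so that a pipeline stops early when the installed BLAST is too old. This is a minimal sketch assuming GNU sort (for -V); version_at_least is a hypothetical name, and the threshold 2.15.0 is simply the version named in this section, not a documented minimum.

```shell
#!/bin/bash
# Hypothetical helper: succeed if version $1 is the same as or newer than $2.
version_at_least() {
    local current="$1" minimum="$2"
    # sort -V (GNU sort) orders version strings; the minimum must come first
    [ "$(printf '%s\n%s\n' "$minimum" "$current" | sort -V | head -n 1)" = "$minimum" ]
}

# Example with fixed strings, so that the block runs without BLAST installed
for installed in 2.6.0 2.15.0; do
    if version_at_least "$installed" "2.15.0"; then
        echo "$installed is new enough"
    else
        echo "$installed is too old"
    fi
done
```

With BLAST installed, the first argument would come from the output of blastn -version; how the version number is parsed out of that output is left open here.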

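Q: How do I resolve the blastn database error "Could not find volume or alias file nt.XXX referenced in alias file"?

Figure 5: Blastn database error "Could not find volume or alias file nt.XXX referenced in alias file."

Solution

If the blastn database directory does not contain all the nt files, you will get the error "BLAST Database error: Could not find volume or alias file nt.xxx referenced in alias file".
Cross-check the nt file numbers between https://ftp.ncbi.nlm.nih.gov/blast/db/ and your blastn database directory.
You can download the missing nt files with the bash script above.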
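
The cross-check can be automated against the same list of archive names used for downloading. The sketch below rests on assumptions that should be verified against your own directory: missing_nt_archives is a hypothetical name, the list file holds one archive name per line (such as nt.00.tar.gz), and a volume counts as present when its sequence file (nt.00.nsq for nt.00) exists.

```shell
#!/bin/bash
# Hypothetical helper: print the archive names from LIST_FILE whose volume has
# no .nsq file in DB_DIR.
missing_nt_archives() {
    local list_file="$1" db_dir="$2" archive volume
    while IFS=$'\t' read -r archive _; do
        [ -n "$archive" ] || continue
        volume="${archive%.tar.gz}"    # nt.00.tar.gz -> nt.00
        [ -e "${db_dir}/${volume}.nsq" ] || printf '%s\n' "$archive"
    done < "$list_file"
}

# Example: nt.00 is present in the made-up directory, nt.01 is reported missing
demo_dir="$(mktemp -d)"
printf 'nt.00.tar.gz\nnt.01.tar.gz\n' > "${demo_dir}/metadata.tsv"
touch "${demo_dir}/nt.00.nsq"
missing_nt_archives "${demo_dir}/metadata.tsv" "$demo_dir"
rm -r "$demo_dir"
```

The printed names could be saved to a new .tsv file and passed to the download script so that only the missing volumes are fetched.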

Additional information

Version 1.0.0
Type Other source code
Updated 2025-01-08
Size 50MB
Source GitHub