This downloader quickly retrieves genes with the same name from different species in the NCBI Nucleotide database, given their GenBank accession numbers. Retrieved files are named in the format "species name_GenBank number_gene name_sequence position.fasta".
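As a minimal sketch of this naming scheme, the helper below joins the four fields with underscores and appends the extension. The function name `build_filename`, its parameters, and the sample values are hypothetical and used only for illustration; the downloader's own code may build the name differently.

```python
def build_filename(species, accession, gene, position):
    """Join the four fields with underscores and add the .fasta extension."""
    fields = [species, accession, gene, position]
    # Strip surrounding whitespace so stray spaces do not end up in the name.
    cleaned = [str(field).strip() for field in fields]
    if not all(cleaned):
        raise ValueError("all four fields must be non-empty")
    return "_".join(cleaned) + ".fasta"


# Made-up values, for illustration only.
print(build_filename("Example virus", "XX000000", "hexon", "100..2900"))
# prints: Example virus_XX000000_hexon_100..2900.fasta
```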
The downloaded files can be used to compare the nucleotide sequences of a given gene across species and to draw a phylogenetic tree (other programs are required).
This work aims to establish a large-scale, automated method for downloading specified gene (nucleotide) sequences from the NCBI database, reducing repetitive manual work and improving the efficiency of genetic evolution analysis.
This downloader is written in Python.
Web pages are parsed automatically with selenium and lxml, and resources are downloaded with urllib.
Selenium must be configured before use (it needs a browser and a matching driver).
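The download step with urllib might look roughly like the sketch below. The helper name `download_to_file` and its parameters are assumptions made for illustration, not the downloader's actual code; it only shows the standard-library calls involved in fetching a URL and writing the response to disk.

```python
import os
import urllib.request


def download_to_file(url, save_path, timeout=30):
    """Fetch url with urllib and write the response body to save_path."""
    # Create the target folder if it does not exist yet.
    folder = os.path.dirname(save_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # Read the whole response into memory, then write it out in binary mode.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    with open(save_path, "wb") as handle:
        handle.write(data)
    return len(data)  # number of bytes written
```

Returning the byte count gives the caller a cheap way to notice an empty response before the file is used downstream.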
Modify the save path of downloaded files
Modify savepath_prefix to a customized folder path.
savepath_prefix = 'file save path prefix'
Modify the path of the GenBank table to import
Currently only CSV format is supported.
Modify csv_path to a customized file path.
csv_path = '*.csv'
The CSV file must use exactly these three column titles: serum_type, representative_strain, and GenBank. serum_type is the serotype, representative_strain is the representative strain, and GenBank is the GenBank accession number. The serotype and GenBank number are required; the representative strain is optional.
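The table rules above can be checked before any download starts. The following is a minimal sketch using the standard csv module; the function name `read_genbank_table` and the sample rows are hypothetical, and the downloader itself may read the file differently.

```python
import csv
import io

REQUIRED_COLUMNS = ("serum_type", "representative_strain", "GenBank")


def read_genbank_table(handle):
    """Return the table rows as dicts, enforcing the column rules above."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    rows = []
    for row in reader:
        # Short rows yield None for absent cells, so fall back to "".
        serum_type = (row["serum_type"] or "").strip()
        accession = (row["GenBank"] or "").strip()
        if not serum_type or not accession:
            raise ValueError(
                f"line {reader.line_num}: serum_type and GenBank are required"
            )
        rows.append({
            "serum_type": serum_type,
            # Optional column: an empty string is allowed.
            "representative_strain": (row["representative_strain"] or "").strip(),
            "GenBank": accession,
        })
    return rows


# Made-up sample table, for illustration only.
sample = io.StringIO(
    "serum_type,representative_strain,GenBank\n"
    "1,,XX000001\n"
    "2,Example strain,XX000002\n"
)
for row in read_genbank_table(sample):
    print(row)
```

For a real file, the same function can be called with `open(csv_path, newline="")` in place of the in-memory sample.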
Run downloader.py to start crawling and downloading.
The code currently supports only gene fragment sequences whose product, gene, or note qualifier is one of the following keywords: hexon, hexon protein, fiber, fiber protein, fiber1, fiber1 protein, fiber2, fiber2 protein, as shown in the figure below.
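The keyword filter can be pictured as a lookup against a fixed set. This is only a sketch: representing a feature as a plain dict, matching case-insensitively, and the name `matched_keyword` are all assumptions made for illustration, not the downloader's confirmed behavior.

```python
SUPPORTED_KEYWORDS = {
    "hexon", "hexon protein",
    "fiber", "fiber protein",
    "fiber1", "fiber1 protein",
    "fiber2", "fiber2 protein",
}
# Qualifiers are checked in this order; the order itself is an assumption.
QUALIFIERS = ("product", "gene", "note")


def matched_keyword(feature):
    """Return the first supported keyword found in the feature, or None."""
    for name in QUALIFIERS:
        # Assumption: compare whole values, ignoring case and outer spaces.
        value = feature.get(name, "").strip().lower()
        if value in SUPPORTED_KEYWORDS:
            return value
    return None


print(matched_keyword({"product": "Hexon protein"}))  # prints: hexon protein
print(matched_keyword({"product": "penton base"}))    # prints: None
```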
If you have any questions, please send an email to [email protected]