Please see: https://github.com/DominikBuchner/BOLDigger3
An even better Python program to query .fasta files against the COI database of www.boldsystems.org
DNA metabarcoding datasets often comprise hundreds of Operational Taxonomic Units (OTUs), requiring querying against databases for taxonomic assignment. The Barcode of Life Data system (BOLD) is a widely used database for this purpose among biologists. However, BOLD's online platform limits users to identifying batches of only 50 sequences at a time. Additionally, using BOLD's API does not completely address this issue as it does not provide access to private and early-release data.
BOLDigger2, the successor to BOLDigger, aims to overcome these limitations. As a pure Python program, BOLDigger2 offers:
By leveraging these features, BOLDigger2 streamlines the process of OTU identification, making it more efficient and comprehensive.
BOLDigger2 provides a single function, identify, that automatically performs identification, additional data download, and selection of the top hit. This enables direct implementation into pipelines.
The identify function in BOLDigger2 requires only one argument: the path to the FASTA file to be identified. It saves all results in the same folder.
BOLDigger2 requires Python version 3.10 or higher and can be installed using pip in any command line:
pip install boldigger2
This command will install BOLDigger2 along with all its dependencies.
To run the identify function, use the following command:
boldigger2 identify PATH_TO_FASTA
To automate the identify function in bioinformatic pipelines, the BOLD credentials can also be passed directly as optional arguments:
boldigger2 identify PATH_TO_FASTA -username USERNAME -password PASSWORD
To customize the implemented thresholds for user-specific needs, the thresholds can be passed as an additional (ordered) argument. Up to 5 different thresholds can be passed for the different taxonomic levels (Species, Genus, Family, Order, Class). Thresholds not passed will be replaced by the default values, and BOLDigger2 will report the thresholds it uses.
boldigger2 identify PATH_TO_FASTA -thresholds 99 97
Output:
19:16:16: Default thresholds changed!
19:16:16: Species: 99, Genus: 97, Family: 90, Order: 85, Class: 50
19:16:16: Trying to log in.
BOLD username:
Unless the credentials were passed as arguments, BOLDigger2 will prompt you for your username and password, and then it will perform the identification.
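For pipeline use, the command lines shown above can be assembled programmatically. The following is a minimal sketch, not part of BOLDigger2 itself: the helper names build_identify_command and run_identify are hypothetical, and it only uses the options documented above (-username, -password, -thresholds). Combining credentials and thresholds in one call is an assumption, since the examples above show them separately.

```python
import subprocess


def build_identify_command(fasta_path, username=None, password=None, thresholds=None):
    """Assemble the argument list for `boldigger2 identify` (hypothetical helper)."""
    command = ["boldigger2", "identify", str(fasta_path)]
    # Credentials are only added when both are given; otherwise BOLDigger2 prompts.
    if username is not None and password is not None:
        command += ["-username", username, "-password", password]
    if thresholds:
        # Up to 5 ordered values: Species, Genus, Family, Order, Class.
        if len(thresholds) > 5:
            raise ValueError("at most 5 thresholds can be passed")
        command += ["-thresholds", *[str(value) for value in thresholds]]
    return command


def run_identify(fasta_path, **options):
    """Run the command; check=True raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_identify_command(fasta_path, **options), check=True)
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with paths or passwords that contain spaces.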
When a new version is released, you can update BOLDigger2 by typing:
pip install --upgrade boldigger2
Buchner D, Leese F (2020) BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems. Metabarcoding and Metagenomics 4: e53535. https://doi.org/10.3897/mbmg.4.53535
The BOLDigger2 algorithm operates according to the following flowchart:
Log in to BOLD.
Generate Download Links for Species-Level Barcodes.
Download Top 100 Hits: saved as "top_100_hits_unsorted".
Identify Sequences Without Species-Level Hits.
Generate Download Links for All Records.
Download Top 100 Hits for All Records: saved as "top_100_hits_unsorted".
Sort and Save Top Hits: saved as "top_100_hits_sorted".
Save Additional Data: saved as "top_100_hits_additional_data".
Export Additional Data to Excel.
Calculate and Save Top Hits: saved in Excel (identification_result.xlsx) and Parquet format (identification_result.parquet.snappy) for fast further processing.
Different thresholds (97%: species level, 95%: genus level, 90%: family level, 85%: order level, <85% and >= 50%: class level) for the taxonomic levels are used to find the best fitting hit. After determining the threshold for all hits, the most common hit above the threshold will be selected. Note that for all hits below the threshold, the taxonomic resolution will be adjusted accordingly (e.g. for a 96% hit the species-level information will be discarded, and genus-level information will be used as the lowest taxonomic level).
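The threshold logic described above can be illustrated with a short sketch. This is not BOLDigger2's code: the function names resolve_thresholds and lowest_level are hypothetical, and returning None for similarities below the class threshold is an assumption. The default values and the ordered override behaviour follow the description and the example output above.

```python
LEVELS = ("Species", "Genus", "Family", "Order", "Class")
DEFAULT_THRESHOLDS = (97, 95, 90, 85, 50)


def resolve_thresholds(user_thresholds=()):
    """Fill in defaults for every level the user did not override (ordered)."""
    if len(user_thresholds) > len(LEVELS):
        raise ValueError("at most 5 thresholds can be passed")
    return list(user_thresholds) + list(DEFAULT_THRESHOLDS[len(user_thresholds):])


def lowest_level(similarity, thresholds=DEFAULT_THRESHOLDS):
    """Return the finest taxonomic level a hit with this similarity may keep."""
    for level, threshold in zip(LEVELS, thresholds):
        if similarity >= threshold:
            return level
    return None  # assumption: below the class threshold nothing is retained
```

With the defaults, lowest_level(96) gives "Genus", matching the 96% example above, and resolve_thresholds([99, 97]) gives [99, 97, 90, 85, 50], matching the reported thresholds in the example output.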
The BOLDigger2 algorithm functions as follows:
Identify Maximum Similarity: Find the maximum similarity value among the top 100 hits currently under consideration.
Set Threshold: Set the threshold to the level that corresponds to this maximum similarity and temporarily remove all hits with a similarity below it. For example, if the highest hit has a similarity of 100%, the threshold will be set to 97% (species level), and all hits below 97% will be removed temporarily.
Classification and Sorting: Count all individual classifications and sort them by abundance.
Filter Missing Data: Drop all classifications that contain missing data. For instance, the most common hit may be "Arthropoda --> Insecta" with a similarity of 100% but missing values for Order, Family, Genus, and Species; such a classification is dropped.
Identify Common Hit: Look for the most common hit that has no missing values.
Return Hit: If a hit with no missing values is found, return that hit.
Threshold Adjustment: If no hit without missing values is found, move the threshold to the next higher taxonomic level (for example, from species at 97% to genus at 95%) and repeat the process until a hit is found.
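One possible reading of the steps above, written as a minimal sketch rather than BOLDigger2's actual implementation. The hit layout (a dict with a "Similarity" value and one key per rank), the rank names, and the function name select_top_hit are assumptions made for illustration. The sketch also assumes that, after moving to a higher taxonomic level, only the ranks down to that level are checked for missing values.

```python
from collections import Counter

LEVELS = ("Species", "Genus", "Family", "Order", "Class")
DEFAULT_THRESHOLDS = (97, 95, 90, 85, 50)
# Ranks from broadest to finest; column names are illustrative.
RANKS = ("Phylum", "Class", "Order", "Family", "Genus", "Species")


def select_top_hit(hits, thresholds=DEFAULT_THRESHOLDS):
    """Return (classification dict, level) for the most common complete hit, or None."""
    if not hits:
        return None
    # Step 1: maximum similarity among the hits under consideration.
    max_similarity = max(hit["Similarity"] for hit in hits)
    # Step 2: first level whose threshold the best hit reaches.
    start = next((i for i, t in enumerate(thresholds) if max_similarity >= t), None)
    if start is None:
        return None
    for i in range(start, len(LEVELS)):
        level, threshold = LEVELS[i], thresholds[i]
        kept_ranks = RANKS[: RANKS.index(level) + 1]
        candidates = [hit for hit in hits if hit["Similarity"] >= threshold]
        # Step 3: count classifications and sort them by abundance.
        counts = Counter(tuple(hit.get(rank) for rank in kept_ranks) for hit in candidates)
        # Steps 4-6: skip classifications with missing values, return the most common one.
        for classification, _count in counts.most_common():
            if all(value is not None for value in classification):
                return dict(zip(kept_ranks, classification)), level
        # Step 7: nothing complete at this level, so try the next higher level.
    return None
```

In this sketch, three 100% hits that stop at "Arthropoda --> Insecta" do not win over two complete species-level hits above 97%, because the incomplete classification is skipped before the most common hit is chosen.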
BOLDigger2 employs a flagging system to highlight certain conditions, indicating a degree of uncertainty in the selected hit. Currently, there are five flags implemented, which may be updated as needed:
Reverse BIN Taxonomy: This flag is raised if all of the top 100 hits representing the selected match utilize reverse BIN taxonomy. Reverse BIN taxonomy assigns species names to deposited sequences on BOLD that lack species information, potentially introducing uncertainty.
Differing Taxonomic Information: If there are two or more entries with differing taxonomic information above the selected threshold (e.g., two species above 97%), this flag is triggered, suggesting potential discrepancies.
Private or Early-Release Data: If all of the top 100 hits representing the top hit are private or early-release hits, this flag is raised, indicating limited accessibility to data.
Unique Hit: This flag indicates that the top hit result represents a unique hit among the top 100 hits, potentially requiring further scrutiny.
Multiple BINs: If the selected species-level hit is composed of more than one BIN, this flag is raised, suggesting potential complexities in taxonomic assignment.
Given the presence of these flags, it is advisable to conduct a closer examination of all flagged hits to better understand and address any uncertainties in the selected hit.
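To make the flag conditions concrete, here is a small sketch of how they could be evaluated for a selected hit. It is an interpretation of the descriptions above, not BOLDigger2's code: the function name compute_flags, the numbering 1 to 5 (taken from the order of the list above), and the record keys "classification", "reverse_bin", "status" and "bin" are all hypothetical.

```python
def compute_flags(selected, hits_above_threshold, species_level):
    """Return the flag numbers (1-5) raised for the selected classification.

    selected: classification tuple chosen as the top hit
    hits_above_threshold: list of dicts with the hypothetical keys
        'classification' (tuple), 'reverse_bin' (bool),
        'status' (str) and 'bin' (str or None)
    species_level: True if the selected hit is a species-level hit
    """
    supporting = [h for h in hits_above_threshold if h["classification"] == selected]
    flags = []
    # 1: every hit supporting the selection uses reverse BIN taxonomy.
    if supporting and all(h["reverse_bin"] for h in supporting):
        flags.append(1)
    # 2: two or more differing classifications above the threshold.
    if len({h["classification"] for h in hits_above_threshold}) >= 2:
        flags.append(2)
    # 3: every supporting hit is private or early-release.
    if supporting and all(h["status"] in {"private", "early-release"} for h in supporting):
        flags.append(3)
    # 4: the selection is backed by a single hit only.
    if len(supporting) == 1:
        flags.append(4)
    # 5: a species-level selection spans more than one BIN.
    bins = {h["bin"] for h in supporting if h["bin"]}
    if species_level and len(bins) > 1:
        flags.append(5)
    return flags
```

Returning the flags as a list keeps every raised condition visible, so flagged hits can be filtered for the closer examination recommended above.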