The editor of Downcodes brings you a comprehensive introduction to the NCBI databases. NCBI (National Center for Biotechnology Information) is part of the National Library of Medicine (NLM) at the U.S. National Institutes of Health (NIH). It maintains many important biomedical databases, which provide data resources and analysis tools for biomedical research worldwide. This article takes an in-depth look at eight major NCBI resources: GenBank, PubMed, BLAST, Protein, Nucleotide, Gene, OMIM, and GEO, and introduces their functions and applications in detail.
NCBI hosts many databases and tools, including GenBank, PubMed, BLAST, Protein, Nucleotide, Gene, OMIM, and GEO. Each has its own function, and together they provide data resources and support for biomedical research.
The GenBank database is a large public genetic sequence database that allows users to search, download and analyze genetic sequence data of various organisms. For example, researchers can search for the genetic sequence of a certain species here, perform comparative analysis, and even submit new sequence data.
1. GENBANK
The GenBank database is one of the world's largest public DNA sequence databases and is maintained by the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine at the National Institutes of Health (NIH). It contains sequence data from a wide variety of organisms, and new data are added every day. The main functions of GenBank are the storage, retrieval, and exchange of genetic sequence information. GenBank also exchanges data with the European Nucleotide Archive (ENA) at EMBL-EBI and the DNA Data Bank of Japan (DDBJ) through the International Nucleotide Sequence Database Collaboration (INSDC), which ensures that genetic sequence data are shared globally.
GenBank supports several kinds of search, such as by keyword, species name, or author name. It also provides online submission tools so that researchers can deposit new genetic sequences. After review and quality checks, submissions are released to research institutions and individuals around the world.
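As a rough illustration of searching by keyword, species, and author, the sketch below builds a request URL for the `esearch` endpoint of NCBI's Entrez E-utilities against the `nucleotide` database. The helper name `build_esearch_url` and the example search terms are assumptions made for illustration; the `[Organism]` and `[Author]` suffixes follow Entrez field-tag syntax. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base URL of the Entrez E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keyword=None, organism=None, author=None,
                      db="nucleotide", retmax=20):
    """Combine optional search conditions into one esearch URL (hypothetical helper)."""
    parts = []
    if keyword:
        parts.append(keyword)
    if organism:
        parts.append(f"{organism}[Organism]")  # restrict to a species
    if author:
        parts.append(f"{author}[Author]")      # restrict to a submitting author
    if not parts:
        raise ValueError("at least one search condition is required")
    term = " AND ".join(parts)
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Example terms are illustrative only.
url = build_esearch_url(keyword="BRCA1", organism="Homo sapiens")
print(url)
```

Fetching this URL would return a list of matching record IDs. Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.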
2. PUBMED
PubMed is a free literature retrieval system that mainly indexes journal articles in the biomedical field. Its functions are powerful and diverse: besides traditional searching of citations and abstracts, it links directly to full-text resources, provides literature management tools, and offers programmatic access through the E-utilities API for data mining. For example, researchers can use PubMed to search for the latest research results on a certain disease or gene to gain theoretical and experimental inspiration.
Most PubMed records include the abstract of the publication and clickable links to related information, and many link to free full text (for example, articles in PMC). In addition, PubMed's My NCBI feature lets users save search strategies and results and set up email alerts.
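To show what programmatic access to PubMed search results can look like, the sketch below parses an `esearch` reply in XML form and turns the returned IDs into PubMed record links. The sample XML is hand-written in the shape of an `esearch` response, and its IDs are placeholders rather than real PMIDs; the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an esearch XML reply.
# The IDs are placeholders, not real PubMed identifiers.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>00000001</Id>
    <Id>00000002</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return (total hit count, list of IDs) from an esearch XML reply."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("Count"))
    ids = [el.text for el in root.findall("./IdList/Id")]
    return total, ids


def pubmed_links(ids):
    """Build the web address of each PubMed record from its ID."""
    return [f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/" for pmid in ids]


total, ids = parse_esearch(SAMPLE_XML)
for link in pubmed_links(ids):
    print(link)
```

In real use, `Count` can be much larger than the number of IDs returned, so a script would page through the results rather than assume the first reply is complete.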
3. BLAST
BLAST (Basic Local Alignment Search Tool) is a general sequence alignment tool that finds regions of local similarity between a given sequence and the sequences in a database. It searches sequence collections built from GenBank and other sources, and it provides a variety of alignment programs, such as nucleotide BLAST for comparing nucleotide sequences and protein BLAST for comparing protein sequences. BLAST helps users identify the origin and likely function of sequences, infer evolutionary relationships, and detect homology between sequences.
Using BLAST is simple. Researchers input a query sequence, and BLAST returns a list of similar sequences with related information, such as the degree of similarity to the query and the matching regions. This information is extremely important for discovering new genes, studying gene functions, and conducting phylogenetic studies.
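To make the idea of a local alignment score concrete, here is a minimal sketch of the Smith-Waterman algorithm, the exact dynamic-programming method for local alignment. This is not how BLAST is implemented: BLAST uses a faster heuristic that approximates this kind of score. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Only two rows of the scoring matrix are kept, since each cell depends
    on the previous row and the current row.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences that share the region "ACGT".
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

A higher score means a longer or cleaner shared region. BLAST reports related statistics for each hit so that users can judge whether a match is likely to be meaningful.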
4. PROTEIN
NCBI's Protein database focuses on protein sequences and functions. It collects protein sequence data from sources including GenBank, RefSeq, TPA, and PDB, and provides a variety of search and analysis tools. A distinguishing feature of the Protein database is its detailed annotation of protein sequences, including functional descriptions, structural information, similar sequences, and literature citations.
The Protein database is also tightly integrated with BLAST tools, allowing alignment and analysis of protein sequences. Researchers often use this information to predict a protein's function, explore its association with disease, or design and engineer proteins for bioengineering applications.
5. NUCLEOTIDE
The Nucleotide database is NCBI's collection of nucleotide sequences, drawn from sources such as GenBank, RefSeq, TPA, and PDB. It holds a large number of DNA and RNA sequence records, and its search interface lets users retrieve information by a variety of conditions (such as species, gene name, or sequence ID). The Nucleotide database is widely used in bioinformatics analysis, molecular biology research, and genetic research.
Through the Nucleotide database, researchers can quickly access and download specific genetic sequence information and then carry out gene cloning, sequence comparison, variation analysis, and other work. The strength of this database is that it provides a huge amount of information and is updated frequently. It is also linked to other NCBI databases, giving researchers a one-stop nucleotide information query service.
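As a sketch of what downloading and then working with a sequence record might involve, the code below builds an E-utilities `efetch` URL that requests a record in FASTA format, and parses FASTA text into sequences for a simple follow-up calculation (GC content). The helper names, the example accession, and the sample FASTA text are assumptions for illustration; the sample is hand-written, not a real record, and no network request is made.

```python
from urllib.parse import urlencode

# Base URL of the Entrez E-utilities record retrieval endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide"):
    """Build a URL that asks efetch for one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Map each FASTA header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(chunks) for h, chunks in records.items()}


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Hand-written placeholder record, standing in for downloaded FASTA text.
SAMPLE_FASTA = ">example_1 placeholder record\nACGTAC\nGTTA\n"

print(build_efetch_url("NM_000546"))  # example accession, used only to show the URL shape
for name, seq in parse_fasta(SAMPLE_FASTA).items():
    print(name, len(seq), gc_content(seq))
```

For larger projects, an established library such as Biopython is usually a better choice than a hand-written parser; the version here is only meant to show the structure of the format.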
6. GENE
The Gene database stores information on known and predicted genes. Each gene record brings together information such as nomenclature, reference sequences, genomic location, expression, and function. The Gene database not only provides genetic information for individual species but also links records of the same gene across species, which facilitates comparative genomics research.
One of the core functions of the Gene database is to provide detailed gene annotation, including the gene name, a summary, expression patterns, and related diseases. Users can use it to learn what is known about a specific gene, which is crucial for studying disease mechanisms and discovering drug targets.
7. OMIM
OMIM, Online Mendelian Inheritance in Man, is an online catalog of human genes and genetic disorders. It is authored and edited at Johns Hopkins University and is linked from NCBI's resources. It contains detailed information on human genetic diseases and the genetic variants associated with them. The goal of OMIM is to catalog the phenotypic descriptions and genotypic details of all known Mendelian disorders and to serve as an important resource for studying human genetic pathology.
An OMIM entry usually covers the clinical features, inheritance pattern, and molecular basis of a disease. Through OMIM, researchers can quickly access detailed data on genetic diseases, which is of great help to research on disease mechanisms, genetic counseling, and treatment.
8. GEO
GEO, Gene Expression Omnibus, is a database that stores high-throughput gene expression data, especially microarray and next-generation sequencing data. GEO accepts experimental data submitted by the research community and provides query and download services for these data.
Data in the GEO database can be used for many types of biomedical research, such as comparing gene expression differences between different samples, analyzing the impact of a certain treatment method on gene expression, etc. This database also provides corresponding analysis tools, allowing researchers to analyze and visualize gene expression patterns online.
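A common first step when comparing expression between two groups of samples is the log2 fold change. The sketch below computes it for a few genes. The expression values are made-up toy numbers, not GEO data, and the pseudocount of 1 (added to avoid dividing by zero) is an assumption; real analyses normally add normalization and statistical testing on top of this.

```python
import math
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 of (mean treated + pseudocount) / (mean control + pseudocount)."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Toy expression values per gene and group (illustrative only).
expression = {
    "geneA": {"control": [7, 7, 7], "treated": [31, 31, 31]},
    "geneB": {"control": [15, 15], "treated": [15, 15]},
    "geneC": {"control": [15, 15], "treated": [3, 3]},
}

changes = {
    gene: log2_fold_change(groups["control"], groups["treated"])
    for gene, groups in expression.items()
}

for gene, lfc in changes.items():
    direction = "up" if lfc > 0 else "down" if lfc < 0 else "unchanged"
    print(gene, lfc, direction)
```

A value of 2 means roughly a fourfold increase after treatment, 0 means no change, and negative values mean lower expression in the treated group.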
In general, NCBI brings together a large number of database resources in the biomedical field and provides researchers with powerful data support and research tools. Each database has its own unique functions and uses and plays an integral role in the advancement of biological sciences and medical research.
1. What are the main databases in NCBI (National Center for Biotechnology Information)? What are these databases used for?
NCBI is an important source of life science information. Its major databases and their functions include:
PubMed: A biomedical literature database. Through the related PubMed Central (PMC) archive, users can read many full-text biomedical articles for free.
GenBank: A nucleotide sequence database that gives researchers a global platform to share, search, and access biological sequence information. GenBank holds hundreds of millions of sequence records, together with the protein translations annotated on them.
Sequence Read Archive (SRA): A database that stores large volumes of raw high-throughput sequencing data, such as DNA and RNA sequencing reads, where researchers can find data sets suited to their own research.
Protein Data Bank (PDB): A database of three-dimensional structures of proteins and other biological macromolecules. The PDB is maintained by the Worldwide Protein Data Bank rather than by NCBI, and NCBI imports its entries into its own Structure database. Researchers can use these data to understand the relationship between protein structure and function.
Gene Expression Omnibus (GEO): A gene expression database that stores a large amount of transcriptome and expression profile data. Researchers can use GEO to find gene expression information related to specific biological processes or diseases.
2. Among the databases provided by NCBI, what types of genomic data does NCBI collect? How are these data used in research?
NCBI holds many types of genomic data, mainly in the following categories:
Genome: The genome sequence of an entire organism, including chromosomal and mitochondrial DNA sequences.
EST (Expressed Sequence Tag): Short, single-pass sequence reads from cDNA, which can be used to study gene expression and gene function.
HTG (High Throughput Genomic Sequence): Unfinished genomic sequences from large-scale sequencing centers, released in draft form before finishing and used as a starting framework for the genome sequence.
GSS (Genome Survey Sequence): Short, single-pass reads of random genomic DNA fragments, used to survey a genome.
TSA (Transcriptome Shotgun Assembly): Transcript sequences assembled computationally from ESTs and raw sequencing reads.
WGS (Whole Genome Shotgun): Contigs from whole genome shotgun sequencing projects, typically incomplete assemblies, used for sequencing and annotation of entire genomes.
These genomic data are widely used in research, for example in gene function studies, genome comparison and evolutionary analysis, drug development, and disease diagnosis. Researchers can use them to analyze the structure, function, and regulation of genes, reveal the genetic variation and evolution of organisms, find associations between specific genes and diseases, and support personalized medicine.
3. Which NCBI databases can be used to analyze protein sequences and structures? How do these databases help researchers conduct protein research?
Several databases can be used to analyze protein sequences and structures. Some are hosted by NCBI and others are external resources commonly used alongside it. Important examples include:
UniProt: A comprehensive protein database that provides information on protein sequence, structure, function, and interaction. It is maintained by the UniProt Consortium rather than by NCBI. Researchers can use UniProt to find proteins of interest and understand their basic properties and functions.
Protein Data Bank (PDB): A database that stores a large amount of three-dimensional protein structure data determined by methods such as X-ray crystallography, NMR, and cryo-electron microscopy. It is maintained by the Worldwide Protein Data Bank, and NCBI's Structure database is built from its entries. Researchers can use this structural information to study a protein's conformation, mechanism of action, and interactions with other molecules.
Conserved Domain Database (CDD): An NCBI database that collects conserved functional domains found in known protein sequences and provides domain annotation and classification. Researchers can use CDD to analyze the domain composition and structural features of proteins and so infer their functions and similarities.
Structure-Function Linkage Database (SFLD): An external database that links protein sequence, structure, and function and provides detailed annotation and classification. Researchers can use SFLD to explore the relationship between protein function and structure and deepen their understanding of protein function and evolution.
Through these databases, researchers can obtain a large amount of protein sequence and structure information and carry out sequence comparison, structure prediction, functional annotation, similarity analysis, and other studies, so as to explore the function and regulatory mechanisms of proteins in depth and support research in related fields.
I hope this article by the editor of Downcodes can help you better understand the NCBI database and its application in biomedical research. If you have any questions, please feel free to ask!