There are many ways to run workflows on AWS. Here we list a few possibilities, each of which may suit different research aims. As you walk through the tutorials below, think about how you could run each workflow more efficiently using one of the other methods listed here. If you are unfamiliar with any of the terms or concepts here, please review the AWS Jumpstart page.
screen or as a startup script attached as metadata. See the GWAS tutorial below for more info on how to run a pipeline using EC2.
For many of these tutorials, you will need Short Term Access Keys to create and use resources, particularly whenever a tutorial calls for "access key ID" and "secret key." Use this guide for an explanation of how to obtain and use Short Term Access Keys. If you are an NIH-affiliated researcher (in other words, you don't work at the NIH but have a Cloud Lab account), you will not have access to keys. If there is a tutorial you are unable to complete, reach out to us for help at [email protected].
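As a hedged illustration of the "access key ID" and "secret key" step: the AWS CLI and SDKs read temporary credentials from the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. The sketch below assumes you have exported your Short Term Access Keys that way. The helper name is hypothetical, and it only reports which variables are unset before you start a tutorial.

```python
import os

# Environment variables that the AWS CLI and SDKs read for temporary credentials.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def missing_credentials(env=None):
    """Return the names of credential variables that are unset or empty.

    `env` defaults to the current process environment; pass a dict to check
    some other mapping (hypothetical helper, not part of any AWS library).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_credentials() with no arguments returns an empty list when all three variables are set in your current session.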
Please also note, GPU machines cost more than most CPU machines, so be sure to shut these machines down after use, or apply an EC2 lifecycle configuration. You may also encounter service quotas to protect you from the accidental use of expensive machine types. If that happens, and you still want to use a certain instance type, follow these instructions.
Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Artificial intelligence and machine learning algorithms are being applied to a variety of biomedical research questions, ranging from image classification to genomic variant calling. AWS has a long list of AI/ML tutorials available and we have compiled a list here. Most recent development focuses on generative AI, including use cases such as extracting information from text, transforming speech to text, and generating images from text. SageMaker Studio allows the user to rapidly create, test, and train generative AI models and has ready-to-use models, all contained within JumpStart. These models include foundation models, fine-tunable models, and task-specific solutions.
Clinical informatics, also known as healthcare informatics or medical informatics, is an interdisciplinary field that applies data science to healthcare data to improve patient care, enhance clinical processes, and facilitate medical research. It often involves integrating diverse data types, including electronic health records and demographic or environmental data. AWS offers two on-demand workshops that walk you through AWS HealthLake for Population Health data analysis. The first workshop shows you how to ingest data into HealthLake, query those data using Athena, visualize them using QuickSight, then join FHIR data with environmental data and visualize the combined dataset. The second workshop also ingests data into HealthLake, then visualizes medical device data, uses AI to summarize clinical notes, and transcribes and summarizes clinical audio files.
Next-generation genetic sequence data is housed in the NCBI Sequence Read Archive (SRA). You can access these data using the SRA Toolkit. We walk you through this using this notebook, which also shows how to set up and search Athena tables to generate an accession list. You can also read this guide for more information on available dataset tables. Additional example notebooks can be found at this NCBI repo. In particular, we recommend this notebook (https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb), which goes into more detail on using Athena to access the results of the SRA Taxonomic Analysis Tool. These results often differ from the species name entered by the submitter because of contamination, error, or samples being metagenomic in nature.
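The Athena-to-accession-list step can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code. It assumes the SRA metadata is registered in Athena as a table named sra.metadata with columns acc, organism, and assay_type, so check those names against your own setup. Both function names are hypothetical. The first builds the SQL text to paste into the Athena query editor, and the second writes the returned accessions one per line.

```python
def build_accession_query(organism, assay_type, limit=10, table="sra.metadata"):
    """Build an Athena SQL query that returns run accessions.

    The table and column names are assumptions about how the SRA metadata
    was registered in Athena; adjust them to match your own database.
    """
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    # Double any single quotes so the values stay inside the SQL string literals.
    org = organism.replace("'", "''")
    assay = assay_type.replace("'", "''")
    return (
        f"SELECT acc FROM {table} "
        f"WHERE organism = '{org}' AND assay_type = '{assay}' "
        f"LIMIT {limit}"
    )


def write_accession_list(accessions, path):
    """Write one accession per line to a plain-text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for acc in accessions:
            handle.write(f"{acc}\n")
```

A file in this one-accession-per-line layout can then be handed to the SRA Toolkit, for example with prefetch --option-file followed by the file name.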
Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes.
Medical imaging analysis requires the analysis of large image files and often requires elastic storage and accelerated computing.
RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks.
Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems.
NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. The NCBI team has written a version of BLAST for the cloud called ElasticBLAST, and you can read all about it here. Essentially, ElasticBLAST helps you submit BLAST jobs to AWS Batch and write the results back to S3. Feel free to experiment with the example tutorial in Cloud Shell, or try our notebook version.
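To make the "submit to AWS Batch, write results to S3" flow more concrete, here is a rough sketch that generates an ElasticBLAST configuration file. The section and key names follow the INI layout described in the ElasticBLAST documentation as we understand it, so verify them against the current docs. The function name, default values, and any bucket paths you pass in are placeholders.

```python
import configparser
import io


def build_elastic_blast_config(queries, results, program="blastp",
                               db="refseq_protein", region="us-east-1",
                               num_nodes=1):
    """Return the text of an ElasticBLAST-style INI configuration.

    `queries` and `results` are S3 paths you supply; the defaults for the
    other parameters are illustrative assumptions, not recommendations.
    """
    # interpolation=None keeps any '%' characters in paths from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config["cloud-provider"] = {"aws-region": region}
    config["cluster"] = {"num-nodes": str(num_nodes)}
    config["blast"] = {
        "program": program,
        "db": db,
        "queries": queries,   # S3 path to the query FASTA file
        "results": results,   # S3 path where results are written
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

After saving the returned text to a file such as my_search.ini (a hypothetical name), the search would be started with elastic-blast submit --cfg my_search.ini.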
You can run several protein folding algorithms, including AlphaFold, on AWS. Because the databases are so large, the setup is normally quite difficult, but AWS has created a CloudFormation stack that automates spinning up all the resources necessary for running AlphaFold and other protein folding algorithms. You can read about the AWS resources here, and view the GitHub page here. To get this to work, you will need to modify your security groups following these instructions. You will also likely have to grant additional permissions to the Role that CloudFormation is using. If you get stuck, reach out to [email protected]. You can also run ESMFold using this tutorial.
Long read DNA sequence analysis involves analyzing sequencing reads typically longer than 10 thousand base pairs (bp) in length, compared with short read sequencing where reads are about 150 bp in length. Oxford Nanopore has a fairly complete set of notebook tutorials for handling long read data, covering variant calling, RNA-seq, SARS-CoV-2 analysis, and much more. Access the notebooks here. These notebooks expect that you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server; the rest of the notebook should then run correctly with a few tweaks. If you are just looking to try out notebooks, don't start with these. If you are interested in long read sequence analysis, expect some troubleshooting to adapt them to the Cloud Lab environment. You may even need to rewrite them in a fresh notebook by adapting the commands. Feel free to reach out to our support team for help.
The Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium created a series of Jupyter notebooks that walk you through the ATOM approach to Drug Discovery.
These notebooks were created to run in Google Colab, so if you run them in AWS, you will need to make a few modifications. First, we recommend you use a SageMaker Studio Notebook rather than a User-Managed notebook simply because it will have TensorFlow and other dependencies installed. Be sure to attach a GPU to your instance (T4 is fine). Also, you will need to comment out %tensorflow_version 2.x since that is a Colab-specific command. You will also need to pip install a few packages as needed. If you get errors with deepchem, try running pip install --pre deepchem[tensorflow] and/or pip install --pre deepchem[torch]. Also, some notebooks will require a TensorFlow kernel, while others require PyTorch. You may also run into a Pandas error; if so, reach out to the ATOM GitHub developers for the best solution, or review their issues.
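If you would rather not edit each notebook by hand, the Colab-specific line can be commented out programmatically. The sketch below is one possible approach, not something the ATOM notebooks ship with. It assumes the standard .ipynb JSON layout (a top-level "cells" list whose code cells carry a "source" field), and both function names are hypothetical.

```python
import json


def comment_out_colab_magic(notebook, magic="%tensorflow_version"):
    """Comment out lines starting with a Colab-only magic in code cells.

    `notebook` is a parsed .ipynb document (a dict). It is modified in place
    and the number of changed lines is returned.
    """
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores source as a list of lines or a single string.
        was_string = isinstance(source, str)
        lines = source.splitlines(keepends=True) if was_string else list(source)
        for i, line in enumerate(lines):
            if line.lstrip().startswith(magic):
                lines[i] = "# " + line
                changed += 1
        cell["source"] = "".join(lines) if was_string else lines
    return changed


def patch_notebook_file(path_in, path_out):
    """Read a notebook, comment out the Colab magic, and write a patched copy."""
    with open(path_in, encoding="utf-8") as handle:
        notebook = json.load(handle)
    changed = comment_out_colab_magic(notebook)
    with open(path_out, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)
    return changed
```

Writing to a separate output path keeps the original notebook untouched, which makes it easy to compare the two if something else breaks.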
Cryo-Electron Microscopy (cryoEM) is a powerful imaging technique used in structural biology to visualize the structures of biological macromolecules, such as proteins, nucleic acids, and large molecular complexes, at near-atomic or even atomic resolution. It has revolutionized the field of structural biology by providing detailed three-dimensional structures of biomolecules, which are crucial for understanding their functions.
AWS hosts a large amount of public data that you can integrate into your testing or use in your own research. You can access these datasets at the Registry of Open Data on AWS. There you can click on any of the datasets to view the S3 path to the data, as well as publications that have used those data and tutorials if available. To demonstrate, we can click the gnomAD dataset (https://registry.opendata.aws/broad-gnomad/), then get the S3 path and view the files at the command line.
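Once you have a bucket name from a registry page, you can also browse it from Python without credentials. The sketch below uses only the standard library and the S3 ListObjectsV2 REST call. It assumes the bucket allows anonymous listing and is reachable through the global s3.amazonaws.com endpoint. The function names are hypothetical, and the bucket name is whatever you copy from the registry entry.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used in S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket, prefix=""):
    """Build an unsigned ListObjectsV2 URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": "/"}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (folders, files) found in a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    folders = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    files = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return folders, files


def list_public_prefix(bucket, prefix=""):
    """Fetch and parse the first page of a public listing (needs network access)."""
    with urllib.request.urlopen(listing_url(bucket, prefix)) as response:
        return parse_listing(response.read())
```

This is roughly what aws s3 ls --no-sign-request does for a single prefix. The sketch reads only the first page of results and does not follow continuation tokens.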