This repository falls under the NIH STRIDES Initiative. STRIDES aims to harness the power of the cloud to accelerate biomedical discoveries. To learn more, visit https://cloud.nih.gov.
NIH Cloud Lab’s goal is to make Cloud easy and accessible for you, so that you can spend less time on administrative tasks and focus more on research.
Use this repository to learn about how to use Azure by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you start with the jumpstart section on the Cloud Lab website before returning here.
Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Artificial intelligence and machine learning algorithms are being applied to a variety of biomedical research questions, ranging from image classification to genomic variant calling. Azure offers AI services through Azure AI Studio and Azure Machine Learning.
See our suite of tutorials to learn more about Gen AI on Azure that highlight Azure products such as Azure AI Studio, Azure OpenAI and Azure AI Search and external tools like Langchain. These notebooks walk you through how to deploy, train, and query models, as well as how to implement techniques like Retrieval-Augmented Generation (RAG). If you are interested in configuring a model to work with structured data like csv or json files, we've created tutorials that walk you through how to index your csv using the Azure UI and query your database using a notebook within Azure ML. We also have another tutorial that runs all the necessary steps directly from a notebook.
Azure Health Data Services is a set of services that enables you to store, process, and analyze medical data in Azure. These services are designed to help organizations quickly connect disparate health data sources and formats, such as structured, imaging, and device data, and normalize it to be persisted in the cloud. At its core, Azure Health Data Services possesses the ability to transform and ingest data into FHIR (Fast Healthcare Interoperability Resources) format. This allows you to transform health data from legacy formats, such as HL7v2 or CDA, or from high-frequency IoT data in device proprietary formats to FHIR. This makes it easier to connect data stored in Azure Health Data Services with services across the Azure ecosystem, like Azure Synapse Analytics, and Azure Machine Learning (Azure ML).
Azure Health Data Services includes support for multiple health data standards for the exchange of structured data, and the ability to deploy multiple instances of different service types (FHIR, DICOM, and MedTech) that seamlessly work with one another. Services deployed within a workspace also share a compliance boundary and common configuration settings. The product scales automatically to meet the varying demands of your workloads, so you spend less time managing infrastructure and more time generating insights from health data.
Copying healthcare data stored in Azure FHIR Server to Synapse Analytics allows researchers to leverage a cloud-scale data warehousing and analytics tool to extract insights from their data as well as build scalable research pipelines. For information on how to perform this export and downstream analytics, please visit this repository.
You can also see hands-on examples of using FHIR on Azure, but note that you will need to supply your own VCF files as these are not provided with the tutorial content.
Medical imaging analysis requires the analysis of large image files and often requires elastic storage and accelerated computing. Microsoft Azure offers cloud-based medical imaging analysis capabilities through its Azure Healthcare APIs and Azure Medical Imaging solutions. Azure's DICOM Service allows for the secure storage, management, and processing of medical images in the cloud, using industry standard DICOM (Digital Imaging and Communications in Medicine) format. The DICOM Service provides features like high availability, disaster recovery, and scalable storage options, making it an ideal solution for pipelines that need to store, manage, and analyze large amounts of medical imaging data. In addition, the server integrates with other Azure services like Azure ML, facilitating the use of advanced machine learning algorithms for image analysis tasks such as object detection, segmentation, and classification. Read about how to deploy the service here.
Microsoft has several medical imaging notebooks that showcase different medical imaging use-cases on Azure Machine Learning. These notebooks demonstrate various data science techniques such as manual model development with PyTorch, automated machine learning, and MLOPS-based examples for automating the machine learning lifecycle in medical use cases, including retraining. These notebooks are available here. Make sure you select a kernel that includes Pytorch else the install of dependencies can be challenging. Note also that you need to use a GPU VM for most of the notebook cells, but you can create several compute environments and switch between them as needed. Be sure to shut them off when you are finished.
For Cloud Lab users interested in multi-modal clinical informatics, DICOMcast provides the ability to synchronize data from a DICOM service to a FHIR service, allowing users to integrate clinical and imaging data. DICOMcast expands the use cases for health data by supporting both a streamlined view of longitudinal patient data and the ability to effectively create cohorts for medical studies, analytics, and machine learning. For more information on how to utilize DICOMcast please visit Microsoft’s documentation or the open-source GitHub repository.
For users hoping to train deep learning models on imaging data, InnerEye-DeepLearning (IE-DL) is a toolbox that Microsoft developed for easily training deep learning models on 3D medical images. Simple to run both locally and in the cloud with Azure Machine Learning, it allows users to train and run inference on the following: • Segmentation models • Classification and regression models • Any PyTorch Lightning model, via a bring-your-own-model setup This project exists in a separate GitHub repository.
Microsoft has several genomics-related offerings that will be useful to many Cloud Lab users. For a broad overview, visit the Microsoft Genomics Community site. You can also get an overview of different execution options from this blog, and a detailed analysis for Nextflow with AWS Batch at this blog. We highlight a few key services here:
Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes.
NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information.
RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks.
Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems.
Long read DNA sequence analysis involves analyzing sequencing reads typically longer than 10 thousand base pairs (bp) in length, compared with short read sequencing where reads are about 150 bp in length. Oxford Nanopore has a pretty complete offering of notebook tutorials for handling long read data to do a variety of things including variant calling, RNAseq, Sars-Cov-2 analysis and much more. Access the notebooks here and on GitHub. These notebooks expect you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server and then the rest of the notebook should run correctly, with a few tweaks. Oxford Nanopore also offers a host of Nextflow workflows that will allow you to run a variety of long read pipelines.
These publicly available datasets can save you time on data discovery and preparation by being curated and ready to use in your workflows.