Cascabel is a pipeline designed to run amplicon sequence analysis across single or multiple read libraries. The objective of the pipeline is to create different output files which allow the user to explore data in a simple and meaningful way, as well as to facilitate downstream analysis based on the generated output files.
CASCABEL was designed for short-read, high-throughput sequence data. It covers quality control of the fastq files, assembling paired-end reads into fragments (it can also handle single-end data), splitting the libraries into samples (optional), OTU picking and taxonomy assignment. Besides other output files, it also returns an OTU table.
Our pipeline is implemented with Snakemake as the workflow management engine and allows the analysis to be customized by offering several choices for most of the steps. The pipeline can make use of multiple computing nodes, ranging from a personal computer to a compute server. The analysis and results are fully reproducible and documented in an HTML and an optional PDF report.
Current version: 6.1.0
The easiest and recommended way to install Cascabel is via Conda. The fastest way to obtain Conda is to install Miniconda, a mini version of Anaconda that includes only conda and its dependencies.
To install conda or miniconda please refer to the following tutorial (recommended) or, if you are working on a Linux OS, you can try the following:
Download the installer:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Execute the installation script and follow the instructions:
bash Miniconda3-latest-Linux-x86_64.sh
Unfortunately, Cascabel has many dependencies and the latest Conda releases find conflicts between them; however, with conda v4.6.14 we noticed that the installation runs smoothly. To achieve this, we need to downgrade the conda version with the following command:
conda install conda=4.6.14
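As a quick sanity check (not part of the original instructions), you can confirm that the downgrade took effect before continuing:
conda --version    # expected to report conda 4.6.14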
Once conda is installed, we are ready to clone or download the project.
You can clone the project:
git clone https://github.com/AlejandroAb/CASCABEL.git
Or download it from this repository:
wget https://github.com/AlejandroAb/CASCABEL/archive/master.zip
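If you went for the zip download, the archive has to be unpacked first; a minimal sketch (the resulting directory name CASCABEL-master follows GitHub's default naming for branch archives and is an assumption, not taken from the project documentation):
unzip master.zip    # unpacks into CASCABEL-master/ (assumed name)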
After downloading or cloning the repository, cd into the "CASCABEL" directory and execute the following command to create CASCABEL's environment:
conda env create --name cascabel --file environment.yaml
Now that you have created the environment for cascabel, you can install Snakemake following this on-line help, or execute the following command:
conda install -c bioconda -c conda-forge snakemake
All the dependencies needed by CASCABEL, except for Snakemake and Python, are loaded in one conda environment. In this sense, CASCABEL uses matplotlib for generating some charts, so this library needs to be installed before loading the environment. The recommended way to do this is to follow the installation guide, or you can also try:
pip install matplotlib --user
*Consider the --user flag above if you are doing a local installation or do not have sudo privileges.
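To verify that matplotlib is importable from the Python interpreter you will be using (just a sanity check, not a step required by the pipeline):
python -c "import matplotlib; print(matplotlib.__version__)"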
Once Snakemake and matplotlib are installed, we can activate the new environment:
conda activate cascabel
After activating the environment, it is possible that Snakemake is no longer in your PATH; in that case, just export Snakemake's bin directory, i.e.:
export PATH=$PATH:/path/to/miniconda3/bin
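A quick way to confirm that Snakemake is now resolvable from the activated environment, for example:
which snakemake
snakemake --version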
You only need to perform this one last step if you plan to run Cascabel with the asv workflow.
Some issues have been reported when installing dada2 within conda, so we need one more step in order to install dada2.
Enter the R shell (just type R) and execute the following command:
BiocManager::install("dada2", version = "3.10")
*Please notice that BiocManager should already be installed, so you only need to execute the previous command. You can also find more information in dada2's installation guide.
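If you prefer to double-check the dada2 installation afterwards from the shell, a small sketch (assuming R/Rscript is on your PATH):
Rscript -e 'library(dada2); print(packageVersion("dada2"))'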
We are aware that this is not the easiest installation, therefore we are working on a Singularity container which we hope to have available soon.
Thanks for your understanding!
Required input files:
Main expected output files for downstream analysis
Run Cascabel
All the parameters and behavior of the workflow are specified through the configuration file, therefore the easiest way to have the pipeline running is to fill in some required parameters in that file.
# ------------------------------------------------------------------------------#
# Project Name #
# ------------------------------------------------------------------------------#
# The name of the project for which the pipeline will be executed. This should #
# be the same name used as the first parameter on init_sample.sh script (if #
# used for multiple libraries). #
# ------------------------------------------------------------------------------#
PROJECT: "My_CASCABEL_Project"
# ------------------------------------------------------------------------------#
# LIBRARIES/SAMPLES #
# ------------------------------------------------------------------------------#
# SAMPLES/LIBRARIES you want to include in the analysis. #
# Use the same library names as with the init_sample.sh script. #
# Include each library name surrounded by quotes, and comma separated. #
# i.e LIBRARY: ["LIB_1","LIB_2",..."LIB_N"] #
# LIBRARY_LAYOUT: Configuration of the library; all the libraries/samples #
# must have the same configuration; use: #
# "PE" for paired-end reads [Default]. #
# "SE" for single-end reads. #
# ------------------------------------------------------------------------------#
LIBRARY: ["EXP1"]
LIBRARY_LAYOUT: "PE"
# ------------------------------------------------------------------------------#
# INPUT FILES #
# ------------------------------------------------------------------------------#
# To run Cascabel for multiple libraries you can provide an input file, tab #
# separated with the following columns: #
# - Library: Name of the library (this has to match with the values entered #
# in the LIBRARY variable described above). #
# - Forward reads: Full path to the forward reads. #
# - Reverse reads: Full path to the reverse reads (only for paired-end). #
# - metadata: Full path to the file with the information for #
# demultiplexing the samples (only if needed). #
# The full path of this file should be supplied in the input_files variable, #
# otherwise, you have to enter the FULL PATH for both: the raw reads and the #
# metadata file (barcode mapping file). The metadata file is only needed if #
# you want to perform demultiplexing. #
# If you want to avoid the creation of this file a third solution is available #
# using the script init_sample.sh. More info at the project Wiki: #
# https://github.com/AlejandroAb/CASCABEL/wiki#21-input-files #
# #
# ----------------------------- PARAMS -----------------------------#
# #
# - fw_reads: Full path to the raw reads in forward direction (R1) #
# - rv_reads: Full path to the raw reads in reverse direction (R2) #
# - metadata: Full path to the metadata file with barcodes for each sample #
# to perform library demultiplexing #
# - input_files: Full path to a file with the information for the library(s) #
# #
# ** Please supply only one of the following: #
# - fw_reads, rv_reads and metadata #
# - input_files #
# - or use init_sample.sh script directly #
# ------------------------------------------------------------------------------#
fw_reads : "/full/path/to/forward.reads.fq"
rv_reads : "/full/path/to/reverse.reads.fq"
metadata : "/full/path/to/metadata.barcodes.txt"
# or
input_files : "/full/path/to/input_reference.txt"
# ------------------------------------------------------------------------------#
# ASV_WF: Binned qualities and Big data workflow #
# ------------------------------------------------------------------------------#
# For fastq files with binned qualities (e.g. NovaSeq and NextSeq) the error #
# learning process within dada2 can be affected, and some data scientists #
# suggest that enforcing monotonicity could be beneficial for the analysis. #
# In this section, you can modify key parameters to enforce monotonicity and #
# also go through a big data workflow when the number of reads may exceed the #
# physical memory limit. #
# More on binned qualities: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
# You can also follow this excellent thread about binned qualities and Dada2: https://forum.qiime2.org/t/novaseq-and-dada2-incompatibility/25865/8
# ------------------------------------------------------------------------------#
binned_q_scores : "F" # Binned quality scores. Set this to "T" if you want to enforce monotonicity
big_data_wf : "F" # Set to true when your sequencing run contains more than 10^9 reads (depends on RAM availability!)
# ------------------------------------------------------------------------------#
# RUN #
# ------------------------------------------------------------------------------#
# Name of the RUN - Only use alphanumeric characters and don't use spaces. #
# This parameter helps the user to execute different runs (pipeline executions)#
# with the same input data but with different parameters (ideally). #
# The RUN parameter can be set here or remain empty, in the latter case, the #
# user must assign this value via the command line. #
# i.e: --config RUN=run_name #
# ------------------------------------------------------------------------------#
RUN : "My_First_run"
# ------------------------------------------------------------------------------#
# ANALYSIS TYPE #
# rules: #
# ------------------------------------------------------------------------------#
# Cascabel supports two main types of analysis: #
# 1) Analysis based on traditional OTUs (Operational Taxonomic Units) which #
# are mainly generated by clustering sequences based on a shared #
# similarity threshold. #
# 2) Analysis based on ASV (Amplicon sequence variant). This kind of analysis #
# deals also with the errors on the sequence reads such that true sequence #
# variants can be resolved, down to the level of single-nucleotide #
# differences. #
# #
# ----------------------------- PARAMS -----------------------------#
# #
# - ANALYSIS_TYPE "OTU" or "ASV". Defines the type of analysis #
# ------------------------------------------------------------------------------#
ANALYSIS_TYPE : "OTU"
For more information on how to supply these data, please follow this link for detailed instructions.
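As an illustration of the input_files option described in the fragment above, such a tab-separated file could look like the following sketch (the library name and all paths are made up for this example; the columns follow the order Library / forward reads / reverse reads / metadata, and the metadata column is only needed when demultiplexing is required):
EXP1	/full/path/to/EXP1_R1.fastq	/full/path/to/EXP1_R2.fastq	/full/path/to/metadata.barcodes.txt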
As you can see in the previous fragment of the configuration file (config.yaml), the parameters that CASCABEL needs in order to start are: PROJECT, LIBRARY, RUN, fw_reads, rv_reads and metadata. Once these parameters are entered, please take a couple of minutes to go through the rest of the config file and overwrite settings according to your needs. Most values are pre-configured. The config file explains itself with meaningful headers preceding each rule, stating the aim of that rule and the different parameters the user can choose. It is very important to keep the indentation of the file (do not change the tabs and spaces) as well as the names of the parameters. Once you have valid values for these entries, you are ready to run the pipeline (it is always a good practice to make a "dry run" prior to launching CASCABEL):
Also, please pay attention to the ANALYSIS TYPE section. Cascabel supports two main types of analysis, OTUs (Operational Taxonomic Units) and ASVs (Amplicon Sequence Variants); here you can select the target workflow that Cascabel will execute. For more information, please refer to the Analysis type section.
snakemake --configfile config.yaml
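Following the dry-run advice above, Snakemake's standard --dry-run (-n) flag only lists the jobs that would be executed, which is a cheap way to validate the configuration before the real run:
snakemake --configfile config.yaml --dry-run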
Optionally, you can specify the same parameters* via the --config flag rather than in the config.yaml file:
snakemake --configfile config.yaml --config PROJECT="My_CASCABEL_Project" RUN="My_First_run" fw_reads="/full/path/to/forward.reads.fq" rv_reads="/full/path/to/reverse.reads.fq" metadata="/full/path/to/metadata.barcodes.txt"
*Except for LIBRARY, since it is declared as an array it has to be filled in within the configuration file.
For the complete guide on how to set up and use CASCABEL, please visit the official project wiki.
We supply some "pre-filled" configuration files for the main possible configurations, e.g. double-barcoded and single-barcoded paired-end reads for OTU and ASV analysis. We strongly encourage making informed choices about parameter settings tailored to the individual needs of your experiment and dataset.
In order to test the pipeline, we also suggest running it with CASCABEL's test data.
Example barcode mapping file
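The project's own example file is not reproduced here; as a rough, hypothetical sketch, a QIIME-style mapping file (the sample IDs, barcodes, primer sequence and any columns beyond #SampleID and BarcodeSequence are illustrative assumptions) could look like:
#SampleID	BarcodeSequence	LinkerPrimerSequence	Description
sample1	ACGTACGT	GTGYCAGCMGCCGCGGTAA	sample1
sample2	TGCATGCA	GTGYCAGCMGCCGCGGTAA	sample2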
Cascabel: a scalable and versatile amplicon sequence data analysis pipeline delivering reproducible and documented results. Alejandro Abdala Asbun, Marc A Besseling, Sergio Balzano, Judith van Bleijswijk, Harry Witte, Laura Villanueva, Julia C Engelmann. Front. Genet.; doi: https://doi.org/10.3389/fgene.2020.489357