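Cascabel is a pipeline designed to run amplicon sequence analysis across single or multiple read libraries. It aims to produce output files that let users explore their data in a simple and meaningful way and that facilitate downstream analysis.
CASCABEL was designed for short-read, high-throughput sequence data. It covers quality control of the fastq files, assembly of paired-end reads into fragments (it can also handle single-end data), splitting of libraries into samples (optional), OTU picking, and taxonomy assignment. Among other output files, it returns an OTU table.
The pipeline is implemented with Snakemake as the workflow management engine and lets you customize the analysis by offering several choices for most steps. It can use multiple computing nodes and scales from personal computers to computing servers. The analyses and results are fully reproducible and are documented in an HTML report and an optional PDF report.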
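Current version: 6.1.0
The easiest and recommended way to install Cascabel is via Conda. The fastest way to obtain Conda is to install Miniconda, a mini version of Anaconda that includes only conda and its dependencies.
To install conda or miniconda, see the following tutorial (recommended) or, if you are working on a Linux OS, try the following:
Download the installer: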
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Run the installation script and follow the instructions.
bash Miniconda3-latest-Linux-x86_64.sh
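Unfortunately, Cascabel has many dependencies, and the latest Conda versions find conflicts among them. With conda v4.6.14, however, we found that the installation runs smoothly. To use that version, downgrade conda with the following command: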
conda install conda=4.6.14
Once conda is installed, you can clone or download the project.
To clone the project:
git clone https://github.com/AlejandroAb/CASCABEL.git
Or download it from this repository:
wget https://github.com/AlejandroAb/CASCABEL/archive/master.zip
After downloading or cloning the repository, cd into the "CASCABEL" directory and run the following command to create CASCABEL's environment:
conda env create --name cascabel --file environment.yaml
Now that the cascabel environment exists, install Snakemake by following this online help or by running the following command:
conda install -c bioconda -c conda-forge snakemake
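All the dependencies required by CASCABEL, except Snakemake (and therefore Python), are loaded in a single conda environment. CASCABEL also uses matplotlib to generate some charts, so this library must be installed before loading the environment. The recommended way is to follow the installation guide, or you can try: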
pip install matplotlib --user
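*Consider using the --user flag, as above, if you are doing a local installation or do not have sudo rights.
After installing Snakemake and Matplotlib, activate the new environment: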
conda activate cascabel
After activating the environment, Snakemake may no longer be in your PATH. In that case, add Snakemake's bin directory to it, for example:
export PATH=$PATH:/path/to/miniconda3/bin
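The following step is needed only if you plan to run Cascabel with the ASV workflow.
Some issues have been reported when installing dada2 within conda, so one final step is required to install it.
Enter the R shell (just type R) and run the following command: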
BiocManager::install("dada2", version = "3.10")
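*Note that BiocManager should already be installed, so running the command above is all that is needed. More information is available in dada2's installation guide.
We know this is not the easiest installation, so we are working on a Singularity container, which we hope to have available soon.
Thanks for your understanding!
Required input files:
Main expected output files for downstream analysis
Run Cascabel
All the parameters and behavior of the workflow are specified through the configuration file, so the easiest way to get the pipeline running is to fill in a few required parameters in that file.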
# ------------------------------------------------------------------------------#
# Project Name #
# ------------------------------------------------------------------------------#
# The name of the project for which the pipeline will be executed. This should #
# be the same name used as the first parameter of the init_sample.sh script #
# (if used for multiple libraries). #
# ------------------------------------------------------------------------------#
PROJECT: "My_CASCABEL_Project"
# ------------------------------------------------------------------------------#
# LIBRARIES/SAMPLES #
# ------------------------------------------------------------------------------#
# SAMPLES/LIBRARIES you want to include in the analysis. #
# Use the same library names as with the init_sample.sh script. #
# Include each library name surrounded by quotes, and comma separated. #
# i.e LIBRARY: ["LIB_1","LIB_2",..."LIB_N"] #
# LIBRARY_LAYOUT: Configuration of the library; all the libraries/samples #
# must have the same configuration; use: #
# "PE" for paired-end reads [Default]. #
# "SE" for single-end reads. #
# ------------------------------------------------------------------------------#
LIBRARY: ["EXP1"]
LIBRARY_LAYOUT: "PE"
# ------------------------------------------------------------------------------#
# INPUT FILES #
# ------------------------------------------------------------------------------#
# To run Cascabel for multiple libraries you can provide an input file, tab #
# separated with the following columns: #
# - Library: Name of the library (this has to match the values entered #
# in the LIBRARY variable described above). #
# - Forward reads: Full path to the forward reads. #
# - Reverse reads: Full path to the reverse reads (only for paired-end). #
# - metadata: Full path to the file with the information for #
# demultiplexing the samples (only if needed). #
# The full path of this file should be supplied in the input_files variable, #
# otherwise, you have to enter the FULL PATH for both: the raw reads and the #
# metadata file (barcode mapping file). The metadata file is only needed if #
# you want to perform demultiplexing. #
# If you want to avoid the creation of this file a third solution is available #
# using the script init_sample.sh. More info at the project Wiki: #
# https://github.com/AlejandroAb/CASCABEL/wiki#21-input-files #
# #
# ----------------------------- PARAMS -----------------------------#
# #
# - fw_reads: Full path to the raw reads in forward direction (R1) #
# - rv_reads: Full path to the raw reads in reverse direction (R2) #
# - metadata: Full path to the metadata file with barcodes for each sample #
# to perform library demultiplexing #
# - input_files: Full path to a file with the information for the library(s) #
# #
# ** Please supply only one of the following: #
# - fw_reads, rv_reads and metadata #
# - input_files #
# - or use init_sample.sh script directly #
# ------------------------------------------------------------------------------#
fw_reads: "/full/path/to/forward.reads.fq"
rv_reads: "/full/path/to/reverse.reads.fq"
metadata: "/full/path/to/metadata.barcodes.txt"
# or
input_files: "/full/path/to/input_reference.txt"
# ------------------------------------------------------------------------------#
# ASV_WF: Binned qualities and Big data workflow #
# ------------------------------------------------------------------------------#
# For fastq files with binned qualities (e.g. NovaSeq and NextSeq) the error #
# learning process within dada2 can be affected, and some data scientists #
# suggest that enforcing monotonicity could be beneficial for the analysis. #
# In this section, you can modify key parameters to enforce monotonicity and #
# also go through a big data workflow when the number of reads may exceed the #
# physical memory limit.
# More on binned qualities: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
# You can also follow this excellent thread about binned qualities and Dada2: https://forum.qiime2.org/t/novaseq-and-dada2-incompatibility/25865/8
# ------------------------------------------------------------------------------#
binned_q_scores: "F" # Binned quality scores. Set this to "T" if you want to enforce monotonicity
big_data_wf: "F" # Set this to "T" when your sequencing run contains more than 10^9 reads (depends on RAM availability!)
# ------------------------------------------------------------------------------#
# RUN #
# ------------------------------------------------------------------------------#
# Name of the RUN - Only use alphanumeric characters and don't use spaces. #
# This parameter helps the user to execute different runs (pipeline executions)#
# with the same input data but with different parameters (ideally). #
# The RUN parameter can be set here or remain empty, in the latter case, the #
# user must assign this value via the command line. #
# i.e: --config RUN=run_name #
# ------------------------------------------------------------------------------#
RUN: "My_First_run"
# ------------------------------------------------------------------------------#
# ANALYSIS TYPE #
# rules: #
# ------------------------------------------------------------------------------#
# Cascabel supports two main types of analysis: #
# 1) Analysis based on traditional OTUs (Operational Taxonomic Units), which #
# are mainly generated by clustering sequences based on a shared #
# similarity threshold. #
# 2) Analysis based on ASVs (Amplicon Sequence Variants). This kind of #
# analysis also deals with the errors in the sequence reads, so that true #
# sequence variants can be resolved down to the level of single-nucleotide #
# differences. #
# #
# ----------------------------- PARAMS -----------------------------#
# #
# - ANALYSIS_TYPE "OTU" or "ASV". Defines the type of analysis #
# ------------------------------------------------------------------------------#
ANALYSIS_TYPE: "OTU"
For more information on how to supply this data, follow the link for detailed instructions.
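The tab-separated file expected by input_files can be generated with a short script. The following is a minimal Python sketch, not part of CASCABEL. It assumes the column order given in the INPUT FILES comment block (library, forward reads, reverse reads, metadata), assumes that no header row is expected, and leaves unused columns empty; check the project wiki for the exact format. The helper name write_input_files and the example paths are hypothetical.

```python
import csv
import io

def write_input_files(rows, handle):
    """Write one tab-separated line per library to an open text handle.

    Column order follows the INPUT FILES comment block:
    library, forward reads, reverse reads, metadata.
    Assumption: no header row is written.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([
            row["library"],
            row["fw_reads"],
            row.get("rv_reads", ""),   # empty for single-end libraries
            row.get("metadata", ""),   # empty when no demultiplexing is needed
        ])

# Hypothetical library and placeholder paths, for illustration only.
libraries = [
    {
        "library": "EXP1",
        "fw_reads": "/full/path/to/forward.reads.fq",
        "rv_reads": "/full/path/to/reverse.reads.fq",
        "metadata": "/full/path/to/metadata.barcodes.txt",
    },
]

# Written to an in-memory buffer here; pass an open file to write to disk.
buffer = io.StringIO()
write_input_files(libraries, buffer)
input_table = buffer.getvalue()
print(input_table, end="")
```

The library names in the first column must match the values entered in the LIBRARY variable of the configuration file.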
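As the preceding excerpt of the configuration file (config.yaml) shows, the parameters CASCABEL needs in order to start are PROJECT, LIBRARY, RUN, fw_reads, rv_reads, and metadata. After entering these parameters, take a few minutes to go through the rest of the config file and override settings according to your needs. Most values are already preconfigured. The config file documents itself with a meaningful header before each rule, which explains the aim of the rule and the parameters available to the user. It is very important to keep the indentation of the file (do not change tabs and spaces) as well as the parameter names.
Also note the ANALYSIS TYPE section. Cascabel supports two main types of analysis, OTUs (Operational Taxonomic Units) and ASVs (Amplicon Sequence Variants); this is where you select the workflow Cascabel will execute. For more information, refer to the Analysis type section.
Once these entries have valid values, you are ready to run the pipeline (a dry run before starting CASCABEL is always good practice, for example by adding Snakemake's -n flag):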
snakemake --configfile config.yaml
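Before launching, it can help to confirm that the required parameters are actually filled in. The following is a minimal Python sketch of such a check and is not part of CASCABEL. It covers only the fw_reads, rv_reads and metadata route named above (not the input_files or init_sample.sh alternatives), and the allowed values are taken from the comments in the config excerpt. The function name check_config is hypothetical, and the mapping is assumed to come from parsing config.yaml with a YAML library.

```python
# Parameter names come from the text above; allowed values come from the
# comments in the config.yaml excerpt.
REQUIRED = ("PROJECT", "LIBRARY", "RUN", "fw_reads", "rv_reads", "metadata")
ALLOWED = {
    "LIBRARY_LAYOUT": {"PE", "SE"},
    "ANALYSIS_TYPE": {"OTU", "ASV"},
}

def check_config(config):
    """Return a list of human-readable problems found in a config mapping."""
    problems = []
    for key in REQUIRED:
        value = config.get(key)
        # Missing keys, blank strings and empty lists all count as unset.
        if isinstance(value, str):
            unset = not value.strip()
        elif isinstance(value, list):
            unset = len(value) == 0
        else:
            unset = value is None
        if unset:
            problems.append(f"{key} is not set")
    for key, allowed in ALLOWED.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    return problems

# Values mirroring the excerpt (placeholder paths, for illustration only).
example = {
    "PROJECT": "My_CASCABEL_Project",
    "LIBRARY": ["EXP1"],
    "LIBRARY_LAYOUT": "PE",
    "RUN": "My_First_run",
    "fw_reads": "/full/path/to/forward.reads.fq",
    "rv_reads": "/full/path/to/reverse.reads.fq",
    "metadata": "/full/path/to/metadata.barcodes.txt",
    "ANALYSIS_TYPE": "OTU",
}

problems = check_config(example)
print(problems if problems else "config looks complete")
```

An empty list means the six parameters are present and the two enumerated settings hold one of their documented values; it says nothing about whether the paths exist.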
Optionally, you can pass the same parameters* through the --config flag instead of setting them in the config.yaml file:
snakemake --configfile config.yaml --config PROJECT="My_CASCABEL_Project" RUN="My_First_run" fw_reads="/full/path/to/forward.reads.fq" rv_reads="/full/path/to/reverse.reads.fq" metadata="/full/path/to/metadata.barcodes.txt"
*Except for LIBRARY, which is declared as an array and therefore must be filled in within the configuration file.
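When the same pipeline is launched repeatedly with different RUN names, the command line can be assembled programmatically. The following is a minimal Python sketch, not part of CASCABEL. It builds the argument list for the --configfile and --config flags shown above, optionally adds Snakemake's -n (dry run) flag, and rejects LIBRARY because that parameter must stay in the configuration file. The function name build_snakemake_command is hypothetical and the paths are placeholders.

```python
import shlex

def build_snakemake_command(configfile, overrides, dry_run=False):
    """Assemble a snakemake argument list with --config KEY=VALUE overrides."""
    if "LIBRARY" in overrides:
        # LIBRARY is declared as an array and has to be set in the config file.
        raise ValueError("LIBRARY must be set in the configuration file")
    command = ["snakemake", "--configfile", configfile]
    if dry_run:
        command.append("-n")  # placed before --config, which takes a list of values
    if overrides:
        command.append("--config")
        command.extend(f"{key}={value}" for key, value in overrides.items())
    return command

# Placeholder values taken from the config excerpt, for illustration only.
overrides = {
    "PROJECT": "My_CASCABEL_Project",
    "RUN": "My_First_run",
    "fw_reads": "/full/path/to/forward.reads.fq",
    "rv_reads": "/full/path/to/reverse.reads.fq",
    "metadata": "/full/path/to/metadata.barcodes.txt",
}

command = build_snakemake_command("config.yaml", overrides, dry_run=True)
print(shlex.join(command))  # shell-quoted form, ready to copy and paste
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting problems with paths that contain spaces.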
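For a complete guide on how to set up and use CASCABEL, visit the official project wiki.
We supply some "pre-filled" configuration files for the main possible configurations, such as double- and single-barcoded paired-end reads for OTU and ASV analysis. We strongly advise making informed choices about parameter settings that match the individual needs of the experiment and data set.
To test the pipeline, we also suggest running it with CASCABEL's test data.
Example of a barcode mapping file
Cascabel: A Scalable and Versatile Amplicon Sequence Data Analysis Pipeline Delivering Reproducible and Documented Results. Alejandro Abdala Asbun, Marc A Besseling, Sergio Balzano, Judith van Bleijswijk, Harry Witte, Laura Villanueva, Julia C Engelmann. Front. Genet. doi: https://doi.org/10.3389/fgene.2020.489357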