CASCABEL

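Cascabel is a pipeline designed to run amplicon sequence analysis across single or multiple read libraries. Its goal is to produce a range of output files that let the user explore the data in a simple and meaningful way, and to make downstream analysis based on those files easier.

CASCABEL was designed for short-read, high-throughput sequence data. It covers quality control of the fastq files, assembly of paired-end reads into fragments (single-end data can also be handled), splitting of libraries into samples (optional), OTU picking, and taxonomy assignment. In addition to other output files, it returns an OTU table.

The pipeline is implemented with Snakemake as its workflow management engine and lets you customize the analysis by offering several choices for most steps. It can make use of multiple computing nodes and scales from personal computers to computing servers. Analyses and results are fully reproducible and are documented in an HTML report and an optional PDF report.

Current version: 6.1.0

Installation

The easiest and recommended way to install Cascabel is via Conda. The fastest way to obtain Conda is to install Miniconda, a mini version of Anaconda that includes only conda and its dependencies.

Miniconda

To install conda or miniconda, please see the following tutorial (recommended). Alternatively, if you are working with a Linux OS, you can try the following.

Download the installer: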


wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Execute the installation script and follow the instructions:


bash Miniconda3-latest-Linux-x86_64.sh

Unfortunately, Cascabel has many dependencies, and recent Conda releases report conflicts among them. With conda v4.6.14, however, we found that the installation runs smoothly. To use that version, downgrade conda with the following command:


conda install conda=4.6.14
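
After the downgrade it can be useful to confirm which conda version is actually active. The following is a minimal sketch, not part of Cascabel: the helper name conda_version_ok is hypothetical, and it assumes that `conda --version` prints a line of the form "conda X.Y.Z".

```shell
# Version the text above reports as working for the Cascabel installation.
REQUIRED_CONDA="4.6.14"

# conda_version_ok (hypothetical helper): succeeds when the given
# `conda --version` output, e.g. "conda 4.6.14", matches REQUIRED_CONDA.
conda_version_ok() {
    local installed="${1#conda }"   # strip the leading "conda " prefix
    [ "$installed" = "$REQUIRED_CONDA" ]
}

if command -v conda >/dev/null 2>&1; then
    if conda_version_ok "$(conda --version 2>&1)"; then
        echo "conda $REQUIRED_CONDA detected"
    else
        echo "conda version differs from $REQUIRED_CONDA; consider the downgrade above"
    fi
else
    echo "conda not found on PATH"
fi
```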

Download CASCABEL

Once you have installed conda, you are ready to clone or download the project.

You can clone the project:


git clone https://github.com/AlejandroAb/CASCABEL.git

Or download it from this repository:


wget https://github.com/AlejandroAb/CASCABEL/archive/master.zip

After downloading or cloning the repository, change to the "CASCABEL" directory and execute the following command there to create CASCABEL's environment:


conda env create --name cascabel --file environment.yaml

Snakemake

Now that cascabel's environment has been created, you can install Snakemake by following its online help, or by executing the following command:


conda install -c bioconda -c conda-forge snakemake

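Matplotlib

All the dependencies required by CASCABEL, except Snakemake and Python, are loaded in one conda environment. Because CASCABEL uses matplotlib to generate some charts, this library must be installed before loading the environment. The recommended way to do this is to follow the matplotlib installation guide, but you can also try: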


pip install matplotlib --user

*Consider using the --user flag, as in the command above, if you are doing a local installation or do not have sudo rights.

Activate environment

After installing Snakemake and Matplotlib, you can activate the new environment:


conda activate cascabel

After activating the environment, Snakemake may no longer be on your PATH. In that case, export the directory that contains Snakemake's executable, for example:


export PATH=$PATH:/path/to/miniconda3/bin
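
If you would rather make this change only when it is needed, the check can be scripted. This is a small sketch under assumptions: add_to_path is a hypothetical helper, and /path/to/miniconda3/bin is the same placeholder used above and must be replaced with your actual installation path.

```shell
# add_to_path DIR (hypothetical helper): append DIR to PATH unless it is
# already one of the PATH entries.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Extend PATH only when the snakemake executable cannot be found.
if ! command -v snakemake >/dev/null 2>&1; then
    add_to_path "/path/to/miniconda3/bin"   # placeholder path, adjust to your system
    export PATH
fi
```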

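DADA2

You only need to follow this extra step if you plan to run Cascabel with the ASV workflow.

Some issues have been reported when installing dada2 within conda, so one more final step is needed to install dada2.

Enter the R shell (just type R) and execute the following command: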


BiocManager::install("dada2", version = "3.10")

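*BiocManager should already be installed, so you only need to execute the previous command. For more information, see dada2's installation guide.

Singularity

We are aware that this is not the easiest installation, so we are working on a Singularity container and hope to make it available soon.

Thanks for your understanding!

Getting started

Required input files:

Forward raw reads (fastq or fastq.gz)
Reverse raw reads (fastq or fastq.gz) (only for paired-end layout)
File with barcode information (only for demultiplexing: format)

Main expected output files for downstream analysis

Demultiplexed and trimmed reads
OTU or ASV table
Representative sequences fasta file
Taxonomy assignment of the OTUs
Taxonomy summary
Representative sequence alignment
Phylogenetic tree
CASCABEL report

Run Cascabel

All the parameters and behavior of the workflow are specified in the configuration file, so the easiest way to get the pipeline running is to fill in a few required parameters in that file.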

# ------------------------------------------------------------------------------#
#                             Project Name                                     #
# ------------------------------------------------------------------------------#
# The name of the project for which the pipeline will be executed. This should #
# be the same name used as the first parameter on init_sample.sh script (if    #
# used for multiple libraries)                                                 #
# ------------------------------------------------------------------------------#
PROJECT: "My_CASCABEL_Project"

# ------------------------------------------------------------------------------#
#                            LIBRARIES/SAMPLES                                 #
# ------------------------------------------------------------------------------#
# SAMPLES/LIBRARIES you want to include in the analysis.                       #
# Use the same library names as with the init_sample.sh script.                #
# Include each library name surrounded by quotes, and comma separated.         #
# i.e LIBRARY:  ["LIB_1","LIB_2",..."LIB_N"]                                   #
# LIBRARY_LAYOUT: Configuration of the library; all the libraries/samples      #
#                 must have the same configuration; use:                       #
#                 "PE" for paired-end reads [Default].                         #
#                 "SE" for single-end reads.                                   #
# ------------------------------------------------------------------------------#
LIBRARY: ["EXP1"]
LIBRARY_LAYOUT: "PE"

# ------------------------------------------------------------------------------#
#                             INPUT FILES                                      #
# ------------------------------------------------------------------------------#
# To run Cascabel for multiple libraries you can provide an input file, tab    #
# separated with the following columns:                                        #
# - Library: Name of the library (this has to match with the values entered    #
#            in the LIBRARY variable described above).                         #
# - Forward reads: Full path to the forward reads.                             #
# - Reverse reads: Full path to the reverse reads (only for paired-end).       #
# - metadata:      Full path to the file with the information for              #
#                  demultiplexing the samples (only if needed).                #
# The full path of this file should be supplied in the input_files variable,   #
# otherwise, you have to enter the FULL PATH for both: the raw reads and the   #
# metadata file (barcode mapping file). The metadata file is only needed if    #
# you want to perform demultiplexing.                                          #
# If you want to avoid the creation of this file a third solution is available #
# using the script init_sample.sh. More info at the project Wiki:              #
# https://github.com/AlejandroAb/CASCABEL/wiki#21-input-files                  #
#                                                                              #
# -----------------------------       PARAMS       -----------------------------#
#                                                                              #
# - fw_reads:  Full path to the raw reads in forward direction (R1)            #
# - rv_reads:  Full path to the raw reads in reverse direction (R2)            #
# - metadata:  Full path to the metadata file with barcodes for each sample    #
#              to perform library demultiplexing                               #
# - input_files: Full path to a file with the information for the library(s)   #
#                                                                              #
# ** Please supply only one of the following:                                  #
#     - fw_reads, rv_reads and metadata                                        #
#     - input_files                                                            #
#     - or use init_sample.sh script directly                                  #
# ------------------------------------------------------------------------------#
fw_reads: "/full/path/to/forward.reads.fq"
rv_reads: "/full/path/to/reverse.reads.fq"
metadata: "/full/path/to/metadata.barcodes.txt"
# or
input_files: "/full/path/to/input_reference.txt"

# ------------------------------------------------------------------------------#
#  ASV_WF:             Binned qualities and Big data workflow                  #
# ------------------------------------------------------------------------------#
# For fastq files with binned qualities (e.g. NovaSeq and NextSeq) the error   #
# learning process within dada2 can be affected, and some data scientists      #
# suggest that enforcing monotonicity could be beneficial for the analysis.    #
# In this section, you can modify key parameters to enforce monotonicity and   #
# also go through a big data workflow when the number of reads may exceed the  #
# physical memory limit.
# More on binned qualities: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
# You can also follow this excellent thread about binned qualities and Dada2: https://forum.qiime2.org/t/novaseq-and-dada2-incompatibility/25865/8
# ------------------------------------------------------------------------------#
binned_q_scores: "F" # Binned quality scores. Set this to "T" if you want to enforce monotonicity.
big_data_wf: "F" # Set this to "T" when your sequencing run contains more than 10^9 reads (depends on RAM availability!)


# ------------------------------------------------------------------------------#
#                               RUN                                            #
# ------------------------------------------------------------------------------#
# Name of the RUN - Only use alphanumeric characters and don't use spaces.     #
# This parameter helps the user to execute different runs (pipeline executions)#
# with the same input data but with different parameters (ideally).            #
# The RUN parameter can be set here or remain empty, in the latter case, the   #
# user must assign this value via the command line.                            #
# i.e:  --config RUN=run_name                                                  #
# ------------------------------------------------------------------------------#
RUN: "My_First_run"

# ------------------------------------------------------------------------------#
#                                 ANALYSIS TYPE                                #
# rules:                                                                       #
# ------------------------------------------------------------------------------#
# Cascabel supports two main types of analysis:                                #
#  1) Analysis based on traditional OTUs (Operational Taxonomic Units) which   #
#     are mainly generated by clustering sequences based on a shared           #
#     similarity threshold.                                                    #
#  2) Analysis based on ASV (Amplicon sequence variant). This kind of analysis #
#     also deals with the errors on the sequence reads such that true sequence #
#     variants can be resolved, down to the level of single-nucleotide         #
#     differences.                                                             #
#                                                                              #
# -----------------------------       PARAMS       -----------------------------#
#                                                                              #
# - ANALYSIS_TYPE    "OTU" or "ASV". Defines the type analysis                 #
# ------------------------------------------------------------------------------#
ANALYSIS_TYPE: "OTU"

For more details on how to supply this data, please follow the link for detailed instructions.
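
For multiple libraries, the tab-separated file referenced by input_files can be written with a short script. The sketch below follows the four columns listed in the INPUT FILES comments (library, forward reads, reverse reads, metadata). The library names, all paths, and the absence of a header row are assumptions for illustration, so check the project Wiki for the exact expected layout.

```shell
# Write one tab-separated row per library: library name, forward reads,
# reverse reads, metadata (barcode mapping) file.
# LIB_1/LIB_2 and every path below are placeholders, not real files.
input_file="input_reference.txt"
printf '%s\t%s\t%s\t%s\n' \
    "LIB_1" "/full/path/to/LIB_1_R1.fq.gz" "/full/path/to/LIB_1_R2.fq.gz" "/full/path/to/LIB_1_barcodes.txt" \
    "LIB_2" "/full/path/to/LIB_2_R1.fq.gz" "/full/path/to/LIB_2_R2.fq.gz" "/full/path/to/LIB_2_barcodes.txt" \
    > "$input_file"

# Sanity check: every row should have exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad }' "$input_file" \
    && echo "$input_file: all rows have 4 columns"
```

The library names in the first column must match the entries of the LIBRARY variable in the configuration file.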

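As you can see in the fragment of the configuration file (config.yaml) above, the parameters required to start CASCABEL are PROJECT, LIBRARY, RUN, fw_reads, rv_reads, and metadata. After entering these parameters, take a few minutes to go through the rest of the config file and override settings according to your needs. Most values are already pre-configured. The config file documents itself: a meaningful header before each rule explains the purpose of that rule and the parameters available to the user. It is very important to keep the indentation of the file (do not change the tabs and spaces) as well as the names of the parameters. Once you have valid values for these entries, you are ready to run the pipeline (it is always good practice to make a "dry run" before starting CASCABEL).

Also note the "ANALYSIS TYPE" section. Cascabel supports two main types of analysis, OTUs (Operational Taxonomic Units) and ASVs (Amplicon Sequence Variants); here you can select the target workflow that Cascabel will execute. For more information, please refer to the Analysis type section.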
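
A quick way to catch a misspelled or missing parameter name is to check that the required keys appear in the file. The following is a minimal sketch rather than a Cascabel feature: check_required_params is a hypothetical helper, and it only tests that each key occurs at the start of a line, not that its value is valid.

```shell
# check_required_params FILE (hypothetical helper): report each required
# top-level key that is missing from FILE; returns 1 if any is missing.
check_required_params() {
    local config="$1" missing=0 key
    for key in PROJECT LIBRARY RUN fw_reads rv_reads metadata; do
        # A key counts as present when a line starts with the key name,
        # followed by optional whitespace and a colon.
        if ! grep -Eq "^${key}[[:space:]]*:" "$config"; then
            echo "missing parameter: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Run the check only if a config.yaml exists in the current directory.
if [ -f config.yaml ]; then
    check_required_params config.yaml && echo "all required parameters present"
fi
```

If you supply input_files or set RUN on the command line instead, some of these keys may legitimately be absent or empty, so treat the output as a hint rather than an error.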


snakemake --configfile config.yaml
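
The dry run recommended above can be chained with the real run so that the pipeline starts only when the dry run succeeds. This sketch uses Snakemake's -n (dry-run) flag; the wrapper name cascabel_with_dry_run is hypothetical, and depending on your Snakemake version additional options may be required.

```shell
# cascabel_with_dry_run [CONFIG] (hypothetical wrapper): perform a dry run
# with the given config file (default: config.yaml) and start the real run
# only if the dry run exits successfully.
cascabel_with_dry_run() {
    local config="${1:-config.yaml}"
    snakemake --configfile "$config" -n && snakemake --configfile "$config"
}

# Usage, from the CASCABEL directory with the environment activated:
#   cascabel_with_dry_run config.yaml
```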

Optionally, you can specify the same parameters* with the --config flag rather than in the config.yaml file:


snakemake --configfile config.yaml --config PROJECT="My_CASCABEL_Project" RUN="My_First_run" fw_reads="/full/path/to/forward.reads.fq" rv_reads="/full/path/to/reverse.reads.fq" metadata="/full/path/to/metadata.barcodes.txt"

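*Except for LIBRARY: because it is declared as an array, it must be filled in within the configuration file.

Configure pipeline

For a complete guide on how to set up and use CASCABEL, please visit the official project Wiki.

Configuration file

We supply some "pre-filled" configuration files for the main possible configurations, such as double- and single-barcoded paired-end reads for OTU and ASV analysis. We strongly advise making informed choices about parameter settings that match the individual needs of your experiment and data set.

config.otu.double_bc.yaml. Configuration file for paired-end data, barcodes on both reads, OTU analysis.
config.otu.single_bc.yaml. Configuration file for single-end data, barcodes on one read only, OTU analysis.
config.asv.double_bc.yaml. Configuration file for paired-end data, barcodes on both reads, ASV analysis.
config.asv.single_bc.yaml. Configuration file for single-end data, barcodes on one read only, ASV analysis.
config.otu.double_bc.unpaired.yaml. Configuration file for paired-end data, barcodes on both reads, OTU analysis, unpaired workflow, taxonomy assignment with RDP.
config.asv.double_bc.unpaired.yaml. Configuration file for paired-end data, barcodes on both reads, ASV analysis, unpaired workflow.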
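
The file names above follow a regular pattern, so choosing a starting point can be scripted. The helper below is a hypothetical sketch that covers only the four basic files (not the unpaired variants) and simply prints the matching file name; where the files are located is not assumed.

```shell
# prefilled_config TYPE BARCODES (hypothetical helper): print the name of
# the pre-filled configuration file for an analysis type ("OTU" or "ASV")
# and a barcode layout ("double" or "single").
prefilled_config() {
    local type barcodes
    type=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')   # OTU -> otu, ASV -> asv
    barcodes="$2"
    case "$type.$barcodes" in
        otu.double|otu.single|asv.double|asv.single)
            echo "config.$type.${barcodes}_bc.yaml" ;;
        *)
            echo "unknown combination: $1 $2" >&2
            return 1 ;;
    esac
}

# Usage, assuming the pre-filled files sit in the current directory:
#   cp "$(prefilled_config OTU double)" config.yaml
```

Copying a pre-filled file is only a starting point; the parameters should still be reviewed against your own experiment.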

Test data

To test the pipeline, we also suggest running it with CASCABEL's test data.

Example of a barcode mapping file

Citing

Cascabel: A Scalable and Versatile Amplicon Sequence Data Analysis Pipeline Delivering Reproducible and Documented Results. Alejandro Abdala Asbun, Marc A. Besseling, Sergio Balzano, Judith van Bleijswijk, Harry Witte, Laura Villanueva, Julia C. Engelmann. Front. Genet. doi: https://doi.org/10.3389/fgene.2020.489357
