
NOTE: This is LUMPY 0.2.13 with additional changes so that lumpyexpress works when the main file is a CRAM rather than a BAM. The splitter and discordant files must still be BAM files, because LUMPY itself does not yet support CRAM as input. This requires the hexdump command to be available.

For questions and discussion about LUMPY, please visit the forum at:

https://groups.google.com/forum/#!forum/lumpy-discuss

LUMPY

A probabilistic framework for structural variant discovery.

Ryan M. Layer, Colby Chiang, Aaron R. Quinlan, and Ira M. Hall. 2014. "LUMPY: a Probabilistic Framework for Structural Variant Discovery." Genome Biology 15 (6): R84. doi:10.1186/gb-2014-15-6-r84.

Table of Contents

Quick start
Installation
LUMPY Express usage: Automated breakpoint detection for standard analyses.
LUMPY (traditional) usage: Flexible and customizable breakpoint detection for advanced users.
Example workflows
Test data
Troubleshooting

Quick start

Note that smoove is the recommended way to run lumpy: it collects the best practices of lumpy and associated tools, and it has a shorter runtime and a lower false-positive rate than lumpyexpress, which is described below.

Download and install

 git clone --recursive https://github.com/arq5x/lumpy-sv.git
cd lumpy-sv
make
cp bin/* /usr/local/bin/.

Run LUMPY Express

lumpyexpress \
    -B my.bam \
    -S my.splitters.bam \
    -D my.discordants.bam \
    -o output.vcf

Installation

Requirements

LUMPY
- g++ compiler
- CMake
LUMPY Express (optional)
- Samtools (0.1.18+) (htslib.org/)
- SAMBLASTER (0.1.19+) (GitHub repo)
- Python 2.7 (python.org/) with pysam (0.8.3+) and NumPy (1.8.1+)
- sambamba (GitHub repo)
- gawk (GNU project)

Install

Default method to install:

git clone --recursive git@github.com:arq5x/lumpy-sv.git
cd lumpy-sv
make
cp bin/* /usr/local/bin/.

Installing with a custom zlib (gzopen64 compilation error):

git clone --recursive git@github.com:arq5x/lumpy-sv.git
cd lumpy-sv
export ZLIB_PATH="/usr/lib/x86_64-linux-gnu/"; #when /usr/lib/x86_64-linux-gnu/libz.so
make
cp bin/* /usr/local/bin/.

LUMPY Express usage

Automated breakpoint detection for standard analyses.

 usage:   lumpyexpress [options]

Required arguments

     -B FILE  coordinate-sorted BAM file(s) (comma separated)
     -S FILE  split reads BAM file(s) (comma separated)
     -D FILE  discordant reads BAM file(s) (comma separated)

Optional arguments

 -o STR    output [fullBam.bam.vcf]
-x FILE   BED file to exclude
-P        output probability curves for each variant
-m INT    minimum sample weight for a call [4]
-r FLOAT  trim threshold [0]
-T DIR    temp directory [./output_prefix.XXXXXXXXXXXX]
-k        keep temporary files
-K FILE   path to lumpyexpress.config file
            (default: same directory as lumpyexpress)
-v        verbose
-h        show this message

Configuration

LUMPY Express runs several external programs, whose paths are specified in scripts/lumpyexpress.config. This config file must reside in the same directory as lumpyexpress, or be specified explicitly with the -K flag.

The installation Makefile auto-generates a lumpyexpress.config file and places it in the "bin" directory.

Input

LUMPY Express expects BWA-MEM aligned BAM files as input. It automatically parses sample, library, and read group information using the @RG tags in the BAM header. Each BAM file is expected to contain exactly one sample.

The minimum input is a coordinate-sorted BAM file (-B), from which LUMPY Express extracts splitters and discordants using SAMBLASTER before running LUMPY. Optionally, users may supply coordinate-sorted splitter (-S) and discordant (-D) BAM files, which bypass the SAMBLASTER extraction for faster analysis.
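The one-sample-per-BAM expectation can be checked before a run by looking at the SM fields of the @RG header lines. The following is a minimal sketch, not part of LUMPY: the function name `samples_in_header` and the header text are hypothetical, and the header is assumed to have been obtained separately (for example with `samtools view -H`).

```python
def samples_in_header(header_text):
    """Return the set of SM values found on @RG lines of a SAM header."""
    samples = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # After the record type, header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            tag, _, value = field.partition(":")
            if tag == "SM":
                samples.add(value)
    return samples


# Hypothetical header text with two read groups from the same sample.
header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@RG\tID:rg1\tSM:sample\tLB:lib1\n"
    "@RG\tID:rg2\tSM:sample\tLB:lib2\n"
)
samples = samples_in_header(header)
if len(samples) != 1:
    raise SystemExit("expected exactly one sample, found: %s" % sorted(samples))
print("sample:", next(iter(samples)))
```

A BAM whose read groups carry more than one SM value would fail this check and would need to be split by sample first.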

Output

LUMPY Express produces a VCF file according to VCF spec 4.2.
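Because the output is plain VCF, a record can be inspected with a few lines of code. The sketch below parses one tab-separated data line and pulls out the SVTYPE and END keys, which VCF 4.2 reserves for structural variants. The record itself is invented for illustration, and `parse_sv_record` is a hypothetical helper, not a LUMPY function.

```python
def parse_sv_record(line):
    """Parse one VCF data line into a small dict of structural-variant fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _id, _ref, alt, _qual, _filter, info = cols[:8]
    info_map = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag-style INFO keys (no "=") are stored as True.
        info_map[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "alt": alt,
        "svtype": info_map.get("SVTYPE"),
        "end": int(info_map["END"]) if "END" in info_map else None,
    }


# Invented example record; real output carries more INFO keys.
record = parse_sv_record("1\t1000\t1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=2000;IMPRECISE")
print(record["svtype"], record["chrom"], record["pos"], record["end"])
```

For anything beyond a quick look, a dedicated VCF library is a safer choice than hand parsing.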

LUMPY (traditional) usage

Flexible and customizable breakpoint detection for advanced users.

usage:    lumpy [options]

Options

 -g       Genome file (defines chromosome order)
-e       Show evidence for each call
-w       File read windows size (default 1000000)
-mw      minimum weight across all samples for a call
-msw     minimum per-sample weight for a call
-tt      trim threshold
-x       exclude file bed file
-t       temp file prefix, must be to a writeable directory
-P       output probability curve for each variant
-b       output as BEDPE instead of VCF

-sr      bam_file:<file name>,
         id:<sample name>,
         back_distance:<distance>,
         min_mapping_threshold:<mapping quality>,
         weight:<sample weight>,
         min_clip:<minimum clip length>,
         read_group:<string>

-pe      bam_file:<file name>,
         id:<sample name>,
         histo_file:<file name>,
         mean:<value>,
         stdev:<value>,
         read_length:<length>,
         min_non_overlap:<length>,
         discordant_z:<z value>,
         back_distance:<distance>,
         min_mapping_threshold:<mapping quality>,
         weight:<sample weight>,
         read_group:<string>

-bedpe   bedpe_file:<bedpe file>,
         id:<sample name>,
         weight:<sample weight>
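Each of the -sr, -pe, and -bedpe options takes a single comma-separated string of key:value pairs. When these strings are generated by a script rather than typed, a small helper keeps them readable. This is an illustrative sketch: `format_option` is a hypothetical name, and the values are taken from the examples in this document.

```python
def format_option(fields):
    """Join (key, value) pairs into the key:value,key:value form lumpy expects."""
    return ",".join("{}:{}".format(key, value) for key, value in fields)


pe_option = format_option([
    ("id", "sample"),
    ("bam_file", "sample.discordants.bam"),
    ("histo_file", "sample.lib1.histo"),
    ("mean", 500),
    ("stdev", 50),
    ("read_length", 101),
    ("min_non_overlap", 101),
    ("discordant_z", 5),
    ("back_distance", 10),
    ("weight", 1),
    ("min_mapping_threshold", 20),
])
print("-pe", pe_option)
```

A list of pairs is used instead of a dict so that repeated keys, such as several read_group entries for one library, remain possible.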

Example workflows

Pre-processing

We recommend aligning data with SpeedSeq, which performs BWA-MEM alignment, marks duplicates, and extracts split and discordant read pairs.

speedseq align -R "@RG\tID:id\tSM:sample\tLB:lib" \
    human_g1k_v37.fasta \
    sample.1.fq \
    sample.2.fq

Otherwise, data may be aligned with BWA-MEM.

# Align the data
bwa mem -R "@RG\tID:id\tSM:sample\tLB:lib" human_g1k_v37.fasta sample.1.fq sample.2.fq \
    | samblaster --excludeDups --addMateTags --maxSplitCount 2 --minNonOverlap 20 \
    | samtools view -S -b - \
    > sample.bam

# Extract the discordant paired-end alignments.
samtools view -b -F 1294 sample.bam > sample.discordants.unsorted.bam

# Extract the split-read alignments
samtools view -h sample.bam \
    | scripts/extractSplitReads_BwaMem -i stdin \
    | samtools view -Sb - \
    > sample.splitters.unsorted.bam

# Sort both alignments
samtools sort sample.discordants.unsorted.bam sample.discordants
samtools sort sample.splitters.unsorted.bam sample.splitters
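The `-F 1294` filter in the discordant extraction step is easier to read once the number is broken into its SAM flag bits. The sketch below decodes a flag value using the bit definitions from the SAM specification; the names in `SAM_FLAGS` are informal labels chosen here, and `decode_flag` is a hypothetical helper.

```python
# Flag bits as defined in the SAM specification; labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """Return the labels of all flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if value & bit]


print(decode_flag(1294))
# ['proper_pair', 'unmapped', 'mate_unmapped', 'secondary', 'duplicate']
```

Since `-F` drops any read with at least one of these bits set, the reads that remain are mapped, primary, non-duplicate alignments whose mate is also mapped and which are not flagged as a proper pair.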

Running LUMPY

LUMPY has two distinct execution alternatives. LUMPY Express is a simplified wrapper for standard analyses. LUMPY (traditional) is more customizable and is intended for advanced users and specialized experiments.

LUMPY Express

Run LUMPY Express on a single sample with pre-extracted splitters and discordants

lumpyexpress \
    -B sample.bam \
    -S sample.splitters.bam \
    -D sample.discordants.bam \
    -o sample.vcf

Run LUMPY Express jointly on multiple samples with pre-extracted splitters and discordants

lumpyexpress \
    -B sample1.bam,sample2.bam,sample3.bam \
    -S sample1.splitters.bam,sample2.splitters.bam,sample3.splitters.bam \
    -D sample1.discordants.bam,sample2.discordants.bam,sample3.discordants.bam \
    -o multi_sample.vcf

Run LUMPY Express on a tumor-normal pair

lumpyexpress \
    -B tumor.bam,normal.bam \
    -S tumor.splitters.bam,normal.splitters.bam \
    -D tumor.discordants.bam,normal.discordants.bam \
    -o tumor_normal.vcf
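For joint calling, the three file lists must name the samples in the same order. One way to keep them aligned is to derive all three from a single list of sample names. The sketch below assumes the `<sample>.bam`, `<sample>.splitters.bam`, and `<sample>.discordants.bam` naming used in the examples above; `lumpyexpress_args` is a hypothetical helper that only builds the argument list and does not run anything.

```python
def lumpyexpress_args(samples, output):
    """Build a lumpyexpress argument list with matching, comma-joined inputs."""
    bams = ",".join(name + ".bam" for name in samples)
    splitters = ",".join(name + ".splitters.bam" for name in samples)
    discordants = ",".join(name + ".discordants.bam" for name in samples)
    return ["lumpyexpress", "-B", bams, "-S", splitters, "-D", discordants, "-o", output]


args = lumpyexpress_args(["tumor", "normal"], "tumor_normal.vcf")
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` once lumpyexpress is installed and on the PATH.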

LUMPY (traditional)

First, generate empirical insert size statistics for each library in the BAM file

samtools view -r readgroup1 sample.bam \
    | tail -n+100000 \
    | scripts/pairend_distro.py \
    -r 101 \
    -X 4 \
    -N 10000 \
    -o sample.lib1.histo

The script above (scripts/pairend_distro.py) prints the mean and stdev to the screen. For these examples we assume that the mean is 500 and the stdev is 50.
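To make the two numbers concrete, here is a rough sketch of how a mean and standard deviation could be computed from a handful of insert sizes. It does not reproduce whatever sampling or outlier handling pairend_distro.py applies, and the sizes are invented. The `discordant_cutoff` helper rests on an assumption that is labeled in the code: that `discordant_z` counts standard deviations away from the mean.

```python
import statistics


def insert_size_stats(sizes):
    """Return (mean, population standard deviation) of a list of insert sizes."""
    return statistics.mean(sizes), statistics.pstdev(sizes)


def discordant_cutoff(mean, stdev, z):
    # Assumption: discordant_z is a number of standard deviations above the mean.
    return mean + z * stdev


toy_sizes = [450, 500, 550, 500, 500]  # invented values
mean, stdev = insert_size_stats(toy_sizes)
print("mean:", mean, "stdev:", round(stdev, 2))

# With the values assumed in this document (mean 500, stdev 50) and discordant_z 5:
print("cutoff:", discordant_cutoff(500, 50, 5))
```

Under that assumption, pairs with an insert size beyond the cutoff would be the ones treated as discordant.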

Run LUMPY with paired-end and split reads.

lumpy \
    -mw 4 \
    -tt 0 \
    -pe id:sample,bam_file:sample.discordants.bam,histo_file:sample.lib1.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -sr id:sample,bam_file:sample.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > sample.vcf

Run LUMPY on a BAM file with multiple libraries.

lumpy \
    -mw 4 \
    -tt 0 \
    -pe id:sample,read_group:rg1,bam_file:sample.discordants.bam,histo_file:sample.lib1.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -pe id:sample,read_group:rg2,bam_file:sample.discordants.bam,histo_file:sample.lib2.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -sr id:sample,bam_file:sample.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > sample.vcf

Run LUMPY on multiple samples with multiple libraries.

lumpy \
    -mw 4 \
    -tt 0 \
    -pe id:sample1,bam_file:sample1.discordants.bam,read_group:rg1,read_group:rg2,histo_file:sample1.lib1.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -pe id:sample1,bam_file:sample1.discordants.bam,read_group:rg3,histo_file:sample1.lib2.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -pe id:sample2,bam_file:sample2.discordants.bam,read_group:rg4,histo_file:sample2.lib1.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -sr id:sample1,bam_file:sample1.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    -sr id:sample2,bam_file:sample2.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > multi_sample.vcf

Run LUMPY excluding low-complexity regions.
Heng Li provides a set of low-complexity regions in the supplementary information of his paper "Toward better understanding of artifacts in variant calling from high-coverage samples", available at https://doi.org/10.1093/bioinformatics/btu356.

unzip btu356_Supplementary_Data.zip
unzip btu356-suppl_data.zip
lumpy \
    -mw 4 \
    -tt 0.0 \
    -x btu356_LCR-hs37d5.bed/btu356_LCR-hs37d5.bed \
    -pe bam_file:sample.discordants.bam,histo_file:sample.pe.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,id:sample,min_mapping_threshold:1 \
    -sr bam_file:sample.sr.sort.bam,back_distance:10,weight:1,id:sample,min_mapping_threshold:1 \
    > sample.exclude.vcf

Run LUMPY excluding regions of very high coverage.
We can direct lumpy to ignore certain regions by using the exclude region option. In this example we find regions with very high coverage and then exclude them. First, we use the get_coverages.py script to find the min, max, and mean coverage of the sr and pe BAM files and to create coverage profiles for both files.
```
python ../scripts/get_coverages.py \
    sample.pe.sort.bam \
    sample.sr.sort.bam
# sample.pe.sort.bam.coverage  min:1   max:14  mean(non-zero):2.35557521272
# sample.sr.sort.bam.coverage  min:1   max:7   mean(non-zero):1.08945936729
```
Based on this output, we choose to exclude regions that have more than 10x coverage. To create the exclude file, we use the get_exclude_regions.py script, which writes exclude.bed
```
python ../scripts/get_exclude_regions.py \
    10 \
    exclude.bed \
    sample.pe.sort.bam \
    sample.sr.sort.bam
```
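The idea behind a coverage-based exclude file can be sketched in a few lines: walk a per-base depth profile and merge consecutive positions above the threshold into BED intervals. This is an illustration only and makes no claim about how get_exclude_regions.py works internally. The function `high_coverage_regions`, the chromosome name, and the depth values are all hypothetical.

```python
def high_coverage_regions(chrom, coverage, threshold):
    """Return 0-based, half-open (chrom, start, end) runs where depth > threshold."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > threshold:
            if start is None:
                start = pos  # a new high-coverage run begins here
        elif start is not None:
            regions.append((chrom, start, pos))
            start = None
    if start is not None:
        # The profile ended while still inside a run.
        regions.append((chrom, start, len(coverage)))
    return regions


toy_depths = [2, 2, 12, 14, 2, 1, 11]  # invented per-base depths
for chrom, start, end in high_coverage_regions("chr1", toy_depths, 10):
    print("%s\t%d\t%d" % (chrom, start, end))
```

Each printed line has the three columns of a minimal BED record.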
Now we rerun lumpy with the exclude (-x) option
```
lumpy \
    -mw 4 \
    -tt 0.0 \
    -x exclude.bed \
    -pe bam_file:sample.discordants.bam,histo_file:sample.pe.histo,mean:500,stdev:50,read_length:101,min_non_overlap:101,discordant_z:5,back_distance:10,weight:1,id:sample,min_mapping_threshold:1 \
    -sr bam_file:sample.sr.sort.bam,back_distance:10,weight:1,id:sample,min_mapping_threshold:1 \
    > sample.exclude.vcf
```

Post-processing

SVTyper can call genotypes on LUMPY output VCF files using a Bayesian maximum likelihood algorithm.

svtyper \
    -B sample.bam \
    -S sample.splitters.bam \
    -i sample.vcf \
    > sample.gt.vcf

Test data

The test/test.sh script runs lumpy against several simulated data sets and compares the results to the known correct result. The sample data sets can be found at http://layerlab.org/lumpy/data.tar.gz. This tarball should be extracted into the top-level lumpy directory. The test/test.sh script checks for the existence of this directory before running LUMPY.

Troubleshooting

All of the BAM files that lumpy processes must be position sorted. To check whether your BAM files are sorted correctly, use the check_sorting.py script

python ../scripts/check_sorting.py \
    pe.pos_sorted.bam \
    sr.pos_sorted.bam \
    pe.name_sorted.bam
# pe.pos_sorted.bam
# in order
# sr.pos_sorted.bam
# in order
# pe.name_sorted.bam
# out of order:   chr10   102292476   occurred after   chr10   102292893
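What "out of order" means can be shown with a short sketch that scans (chromosome, position) pairs and reports the first record that breaks coordinate order. It is not the implementation of check_sorting.py. The function `first_out_of_order` is hypothetical, and it only detects positions that go backwards within a chromosome or a chromosome that reappears after another one; it does not compare chromosome order against the BAM header.

```python
def first_out_of_order(records):
    """Return (previous, current) for the first record that breaks coordinate order.

    records is an iterable of (chrom, pos) tuples. Returns None if all are in order.
    """
    seen = set()
    prev = None
    for chrom, pos in records:
        if prev is not None:
            if chrom == prev[0]:
                if pos < prev[1]:
                    return prev, (chrom, pos)
            elif chrom in seen:
                # This chromosome was already visited before switching away.
                return prev, (chrom, pos)
        seen.add(chrom)
        prev = (chrom, pos)
    return None


# Positions taken from the example output above.
problem = first_out_of_order([("chr10", 102292893), ("chr10", 102292476)])
if problem is None:
    print("in order")
else:
    (pc, pp), (cc, cp) = problem
    print("out of order:", cc, cp, "occurred after", pc, pp)
```

A file that fails such a check needs to be re-sorted by coordinate before it is passed to lumpy.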
