该存储库包含用于执行分析和生成相关手稿中描述的图形的所有代码。
所有 jupyter 笔记本最初都在具有 64 GB 内存和 8 个 CPU 核心的 GCP VM 环境中运行。这些笔记本电脑适合在具有 64 GB 内存和 Ryzen 7 系列 CPU 的机器上运行。这些笔记本电脑未经在计算资源较低的环境中进行测试,但预计大多数笔记本电脑只需 8 GB 内存和单个 CPU 即可完全执行。一部分笔记本电脑需要大量内存开销,并且预计无法使用少于 32 GB 的内存来执行。
该存储库中的 shell 脚本最初在具有 32 个 CPU 核心和 128 GB 内存的计算环境中运行。这些脚本已被修改为在单个 CPU 核心上运行(花费大量时间成本),但可以修改为并行运行。
为简单起见,关键输入和中间文件在此 github 存储库中提供或托管在Figshare 上。其中许多文件的生成需要大量的计算。下面是有关如何生成这些输入的完整说明,但强烈建议使用预先生成的输入。
首先,克隆此 github 存储库并从 Figshare 下载输入,如下所示:
#From DepMap figshare
wget -O ./data/21q4_Achilles_guide_map.csv https://figshare.com/ndownloader/files/31315819
wget -O .data/21q4_Achilles_logfold_change.csv https://figshare.com/ndownloader/files/31315903
wget -O ./data/21q4_Achilles_replicate_map.csv https://figshare.com/ndownloader/files/31315876
wget -O ./data/21q4_crispr_gene_effect.csv https://figshare.com/ndownloader/files/31315996
wget -O ./data/22q1_Achilles_gene_effect.csv https://figshare.com/ndownloader/files/34008383
wget -O ./data/22q1_Achilles_guide_map.csv https://figshare.com/ndownloader/files/34008362
wget -O ./data/22q1_Achilles_logfold_change.csv https://figshare.com/ndownloader/files/34008443
wget -O ./data/22q1_Achilles_replicate_map.csv https://figshare.com/ndownloader/files/34008398
wget -O ./data/22q1_CCLE_gene_cn.csv https://figshare.com/ndownloader/files/34008428
wget -O ./data/22q1_crispr_gene_effect.csv https://figshare.com/ndownloader/files/34008491
wget -O ./data/22q1_expression.csv https://figshare.com/ndownloader/files/34008404
wget -O ./data/22q1_sample_info.csv https://figshare.com/ndownloader/files/34008503
wget -O ./data/22q2_crispr_gene_effect.csv https://figshare.com/ndownloader/files/34990036
wget -O ./data/OmicsGuideMutationsBinaryKY.csv https://plus.figshare.com/ndownloader/files/43347465
wget -O ./data/OmicsGuideMutationsBinaryHumagne.csv https://plus.figshare.com/ndownloader/files/43347432
wget -O ./data/OmicsGuideMutationsBinaryAvana.csv https://plus.figshare.com/ndownloader/files/43347378
wget -O ./data/AvanaGuideMap.csv https://plus.figshare.com/ndownloader/files/43346391
wget -O ./data/HumagneGuideMap.csv https://plus.figshare.com/ndownloader/files/43346748
wget -O ./data/KYGuideMap.csv https://plus.figshare.com/ndownloader/files/43346769
wget -O ./data/common_essentials.csv https://plus.figshare.com/ndownloader/files/43346361
#From figshare for this paper
wget -O ./data/final_avana.txt https://figshare.com/ndownloader/files/45747036
wget -O ./data/final_moffat.txt https://figshare.com/ndownloader/files/45747045
wget -O ./data/genetic_map_hg38_withX.txt https://figshare.com/ndownloader/files/45746991
wget -O ./data/final_custom_library.txt https://figshare.com/ndownloader/files/45747039
wget -O ./data/final_dolcetto.txt https://figshare.com/ndownloader/files/45747054
wget -O ./data/gene.block.matrix.txt https://figshare.com/ndownloader/files/45747027
wget -O ./data/avana.filtered.ccle.variant.calls.vcf.gz https://figshare.com/ndownloader/files/45747048
wget -O ./data/final_sanger.txt https://figshare.com/ndownloader/files/45747003
wget -O ./data/final_gecko.txt https://figshare.com/ndownloader/files/45747006
wget -O ./data/final_minlibcas.txt https://figshare.com/ndownloader/files/45747024
wget -O ./data/final_calabrese.txt https://figshare.com/ndownloader/files/45746943
此时,您现在可以运行以下笔记本:
以下笔记本需要额外的输入文件(详细信息如下)才能运行。这些笔记本需要受保护的访问数据作为输入,需要手动下载。
下载或创建输入文件并按指示命名。除非另有说明,所有输入文件必须位于此存储库的 ./data 目录中。一些先前生成的输入文件的访问受到限制,需要用户从相应的研究中申请下载权限。以下代码将生成运行 jupyter 笔记本所需的所有输入。许多处理步骤都是计算密集型的,并且如果可能的话,在 ./data 目录中提供预先计算的输入文件。
21Q4 Achilles Guide Map -> ./data/21q4_Achilles_guide_map.csv
21Q4 Achilles Logfold Change -> .data/21q4_Achilles_logfold_change.csv
21Q4 Achilles Replicate Map -> ./data/21q4_Achilles_replicate_map.csv
21Q4 CRISPR Gene Effect -> ./data/21q4_crispr_gene_effect.csv
22q1 Achilles Gene Effect -> ./data/22q1_Achilles_gene_effect.csv
22q1 Achilles Guide Map -> ./data/22q1_Achilles_guide_map.csv
22q1 Achilles Logfold Change -> ./data/22q1_Achilles_logfold_change.csv
22q1 Achilles Replicate Map -> ./data/22q1_Achilles_replicate_map.csv
22q1 CCLE Gene CN -> ./data/22q1_CCLE_gene_cn.csv
22q1 CRISPR Gene Effect -> ./data/22q1_crispr_gene_effect.csv
22q1 Expression -> ./data/22q1_expression.csv
22q1 Sample Info -> ./data/22q1_sample_info.csv
23q4 Omics Signatures -> ./data/23q4_omics_signatures.csv
22q2 CRISPR Gene Effect -> ./data/22q2_crispr_gene_effect.csv
23q4 OmicsSignatures -> ./data/23q4_omics_signatures.csv
CCLE_SNP.Birdseed.Calls_2013-07-29.tar.gz -> ./snp_array_data
OmicsGuideMutationsBinaryKY -> ./data/OmicsGuideMutationsBinaryKY.csv
OmicsGuideMutationsBinaryHumagne -> ./data/OmicsGuideMutationsBinaryHumagne.csv
OmicsGuideMutationsBinaryAvana -> ./data/OmicsGuideMutationsBinaryAvana.csv
AvanaGuideMap -> ./data/AvanaGuideMap.csv
HumagneGuideMap -> ./data/HumagneGUideMap.csv
KYGUideMap -> ./data/KYGuideMap.csv
AchillesCommonEssentialControls -> ./data/common_essentials.csv
#Get access to the CCLE WGS controlled access data
#Download the files listed in ./ccle_wgs/wgs_file_names.txt
#Store all files in ./ccle_wgs/
#Create num_shared.txt
#Create num_snp6_only.txt
#Create num_wgs_only.txt
#Requires Bcftools (in PATH)
bash compute_intersections.sh
#Download hg38 gtf
cd ./data
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
gzip -d hg38.refGene.gtf.gz
#Download the somatic mutation calls listed in ./tcga_somatic/vcf_sample_sheet.tsv
#Store all files in ./tcga_somatic/
#Filter the files to only include variants that map to avana guides
cd ./code
bash create_avana_filtered_tcga_somatic.sh
#TCGA germline variant calls are included as part of the following manuscript:
#Pathogenic germline variants in 10,389 adult cancers
#PMID: 29625052
#Download the variant call file (name = PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.vcf.gz). Store in ./tcga_germline/
#Extract the sample names
#Requires Bcftools (in PATH)
cd ./code
bash extract_tcga_germline_sample_names.sh
#Filter the germline variant calls to include only variants in avana guides
#Requires Bcftools (in PATH)
cd ./code
bash extract_tcga_germline_variants_in_avana_guides.sh
#Requires Python-3.6 (must be in PATH)
#Requires R-4.0 (must be in PATH)
#Requires Bcftools (must be in PATH)
#Requires Tabix (must be in PATH)
#Download the SNP6 annotation file
cd ./snp_array_data
wget https://software.broadinstitute.org/cancer/cga/sites/default/files/data/tools/contest/GenomeWideSNP_6.na30.annot.hg19.csv.pickle.gz
#Download the hg19 fasta
cd ./snp_array_data
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gzip -d human_g1k_v37.fasta.gz
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
#Process the vcf files. This script will take a long time to run, but can easily be modified to process vcf files in parallel
cd ./snp_array_data
bash process_ccle_genotyping.sh
############
#Imputation#
############
#Step 1) Perform genotype phasing and imputation using the Michigan Imputation Server
#Step 2) Download and unzip the phased/imputed vcf files. The zipped file is password protected, follow unpacking instructions.
#Step 3) Add the 'pe_' prefix to the phased/imputed vcf files.
#Cat the phased/imputed vcfs, pass filter, then zip and index
cd ./snp_array_data
bcftools concat -o ccle_snp6_phased_imputed.vcf pe_*
bcftools view -f PASS ccle_snp6_phased_imputed.vcf > ccle_all_called.vcf
bgzip ccle_all_called.vcf
tabix -p vcf ccle_all_called.vcf.gz
cp ccle_all_called.vcf.gz ../data
cp ccle_all_called.vcf.gz.tbi ../data
#Extract the sample headers, which is a useful input for many scripts
cd ./snp_array_data
bcftools query -l ccle_all_called.vcf.gz > ../data/ccle.vcf.sample.names.txt
#Filter the CCLE variant calls (ccle_all_called.vcf.gz) to retain only SNPS in Avana guides
#This script will create 'snps.in.all.avana.guides.vcf.gz'
#Requires Bcftools (in PATH)
cd ./code
bash create_snps_in_all_avana_guides_vcf.sh
#Requires RFMixv2 (https://github.com/slowkoni/rfmix)
#Requres Samtools-v1.9 (in PATH)
#Requires Bcftools (in PATH)
#Requires R-4.0 (in PATH)
#Download the RFMixv2 reference panel
cd ./data
for i in {1..22};
do
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr${i}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr${i}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz.tbi
done
#Download the genetic map (https://alkesgroup.broadinstitute.org/Eagle/#x1-250005.1.2)
#Download genetic_map_hg38_withX.txt (retain this name) and put in ./data
#Format the genetic map for our analysis
cd ./code
Rscript format_genetic_map.R
#Run RFMix
cd ./code
bash run_rfmix.sh
#Create depmap_cell_lineage.csv
#Requires R-4.0 (or higher)
cd ./code
Rscript create_depmap_cell_lineage_file.R
#Create gene.block.matrix.txt
#Requires R-4.0 (or higher)
cd ./code
Rscript create_gene_block_matrix.R
#Create ancestry_top_snp_df.txt and merged.pvals.txt
#Requires R-4.0 (or higher)
#Requires PLINK 2.0 (in PATH)
cd ./code
Rscript create_ancestry_top_snp_df.txt
#Create merged_frequency_dataset.txt and ccle.ancestry.snps.vcf.gz
#Requires R-4.0 (or higher)
#Requires Bcftools (in PATH)
cd ./code
Rscript create_merged_frequency_dataset.R
#Create collapsed.ancestry.information.txt
#Requires R-4.0 (or higher)
cd ./code
Rscript create_collapsed_ancestry_information.R
#Create snv_position_single_guide_finaldf.txt
#Requires R-4.0 (or higher)
cd ./code
Rscript create_snv_position_single_guide_finaldf.R
#Create chronos_22q1_ancestry_associated_dependency_pvals.txt
#Create chronos_22q2_ancestry_associated_dependency_pvals.txt
#Requires R-4.0 (or higher)
cd ./code
Rscript create_chronos_22q1_22q2_ancestry_associated_dependency_pvals.R
#Create ccle_snp6_ancestry_calls.txt
#Requires R-4.0 (or higher)
cd ./code
Rscript create_ccle_snp6_ancestry_calls.R
#Create affected_guides_per_gnomad_sample.txt
#Require R-4.0 (or higher)
cd ./code
Rscript create_affected_guides_per_gnomad_sample.R
#Create top_snp_for_extraction.txt
#Requires R-4.0 (or higher)
cd ./code
Rscript create_top_snp_for_extraction.R
#Create top.snp.fdr.txt
#Create merged.pvals.txt
#Requires R-4.0 (or higher)
cd ./code
Rscript create_top_snp_fdr.R
Mismatch Tab from Supplemental Table 19 -> ./data/Doench_Data.txt
Download and unpack Cosmic_CancerGeneCensus_Tsv_v97_GRCh38.tar -> ./data/cosmic_genes.csv
#Download the hgdp+1kg subset info file
gsutil cp https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_sample_meta.tsv.bgz ./data
#Download the individualized hgdb+1kg vcf and index files
for i in {1..22};
do
gsutil cp https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.hgdp_tgp.chr${i}.vcf.bgz ./data
gsutil cp https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.hgdp_tgp.chr${i}.vcf.bgz.tbi ./data
done
#Download the aggregated vcf and index files
for i in {1..22};
do
gsutil cp https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr${i}.vcf.bgz ./data
gsutil cp https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr${i}.vcf.bgz.tbi ./data
done
figure_1d.ipynb -> ./data/lm_ancestry_associated_dependency_pvals.txt
本节描述运行每个笔记本的预期输出文件。除非另有说明,所有输出文件都写入 ./output。
figure_1b.ipynb -> fibure_1b_data.txt
figure_1b.ipynb -> figure_1b.pdf
figure_1b.ipynb -> collapsed_snp6_ancestry_calls.txt
figure_1c.ipynb -> figure_1c_data.txt
figure_1c.ipynb -> figure_1c.pdf
figure_1d.ipynb -> ./data/lm_ancestry_Associated_dependency_pvals.txt
figure_1d.ipynb -> figure_1d.pdf
figure_1d.ipynb -> figure_1d_figure_df.txt
figure_1e.ipynb -> figure_1e_data.txt
figure_1e.ipynb -> figure_1e.pdf
figure_2a_2b.ipynb -> figure_2a.pdf
figure_2a_2b.ipynb -> figure_2b.pdf
figure_2c.ipynb -> figure_2c_distance_to_tss_df.txt
figure_2c.ipynb -> figure_2c.pdf
figure_2d.ipynb -> figure_2d.pdf
figure_2d.ipynb -> figure_2d_eqtl_summary_df.txt
figure_2e.ipynb -> figure_2e.pdf
figure_3a.ipynb -> figure_3a_compiled_difference.txt
figure_3a.ipynb -> figure_3a.pdf
figure_3b.ipynb -> avana_filtering_bed_file.bed
figure_3b.ipynb -> figure_3b.pdf
figure_3b.ipynb -> figure_3b_distribution_table.txt
figure_3c.ipynb -> figure_3c.pdf
figure_3c.ipynb -> figure_3c_avana_affected_rate.txt
figure_3d.ipynb -> figure_3d.pdf
figure_3d.ipynb -> figure_3d.pdf
figure_3d.ipynb -> figure_3d_germline_somatic.txt
figure_3e_supplementalfigure_8.ipynb -> snv_position_single_guide_finaldf.txt
figure_3e_supplementalfigure_8.ipynb -> figure_3e.pdf
figure_3e_supplementalfigure_8.ipynb -> figure3_plotting_df.txt
figure_3e_supplementalfigure_8.ipynb -> supplemental_figure_8.pdf
figure_3e_supplementalfigure_8.ipynb -> supplemental_figure_8_ours_doench_merged.txt
figure_4a_4b.ipynb -> figure_4a_affected_guides_df.txt
figure_4a_4b.ipynb -> figure_4a_top.pdf
figure_4a_4b.ipynb -> figure_4a_bottom.pdf
figure_4a_4b.ipynb -> figure_4a_complete_counts.txt
figure_4a_4b.ipynb -> figure_4b.pdf
figure_4a_4b.ipynb -> figure_4b_complete_counts.txt
figure_4c_4d.ipynb -> figure_4c.pdf
figure_4c_4d.ipynb -> figure_4c_affected_genes_per_person.txt
figure_4c_4d.ipynb -> figure_4d.pdf
figure_4c_4d.ipynb -> figure_4d_cosmic_matrix.txt
figure_4e.ipynb -> figure_4e.pdf
figure_4e.ipynb -> figure_4e_scatterplot_df.txt
supplemental_figure_1.ipynb -> supplemental_fig1_achilles_only_ancestry_pvals.txt
supplemental_figure_1.ipynb -> supplemental_fig1.pdf
supplemental_figure_1.ipynb -> supplemental_figure1_data_table.txt
supplemental_figure_2.ipynb -> supplemental_figure_2.pdf
supplemental_figure_2.ipynb -> supplemental_figure_2_merged_cn_snp.txt
supplemental_figure_3.ipynb -> supplemental_figure_3_left.pdf
supplemental_figure_3.ipynb -> supplemental_figure_3_right.pdf
supplemental_figure_3.ipynb -> supplemental_figure_3_df.txt
supplemental_figure_4.ipynb -> supplemental_figure_4.pdf
supplemental_figure_4.ipynb -> supplemental_figure_4_differential_df.txt
supplemental_figure_5_6.ipynb -> supplemental_figure_5_6_a.pdf
supplemental_figure_5_6.ipynb -> supplemental_figure_5_6_b.pdf
supplemental_figure_5_6.ipynb -> supplemental_figure_5_6_c.pdf
supplemental_figure_5_6.ipynb -> supplemental_figure_5_6_df.txt
supplemental_figure_7.ipynb -> supplemental_figure_7_snp_in_guide_df.txt
supplemental_figure_7.ipynb -> supplemental_figure_7_df.txt
supplemental_figure_7.ipynb -> supplemental_figure_7.pdf
supplemental_figure_11.ipynb -> supplemental_figure_11.pdf
supplemental_figure_11.ipynb -> supplemental_figure_11.df.txt