mt_helpers
mt_helpers
We’re just sourcing the functions straight from Github
## For TCGA-related functions
source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.tcga.R")
## For oncoplot and other MAF plotting functions
source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.oncoplot.R")
## For mutational signatures functions
source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.mutsig.R")
get_tcga_data
will fetch MAF files from TCGA data using TCGABiolinks
. It just packages the jobs of fetching a MAF file, it’s associated clinical data and returns a maftools
MAF object.
We’ll use data from the TCGA-ACC project.
tcga_maf <- get_tcga_data(tcga_dataset = "ACC", variant_caller = "mutect2")
print(tcga_maf)
## -Reading
## -Validating
## -Silent variants: 3997
## -Summarizing
## --Possible FLAGS among top ten genes:
## MUC16
## TTN
## HMCN1
## -Processing clinical data
## -Finished in 2.148s elapsed (5.499s cpu)
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build GRCh38 NA NA
## 2: Center BCM NA NA
## 3: Samples 92 NA NA
## 4: nGenes 4752 NA NA
## 5: Frame_Shift_Del 335 3.641 1
## 6: Frame_Shift_Ins 122 1.326 0
## 7: In_Frame_Del 45 0.489 0
## 8: In_Frame_Ins 28 0.304 0
## 9: Missense_Mutation 5580 60.652 18
## 10: Nonsense_Mutation 467 5.076 1
## 11: Nonstop_Mutation 7 0.076 0
## 12: Splice_Site 159 1.728 1
## 13: Translation_Start_Site 7 0.076 0
## 14: total 6750 73.370 23
The MAF file will be saved as ./data/TCGA_ACC/mutect2/TCGA-ACC.mutect2.maf
filter_maf_chunked
is a function to read and filter a MAF file in chunks. This kinda helps for huge MAFs (say > 500k variant or a couple of Gbs), but ultimately you still gotta hold ’em in memory because this is R :(
, so downstream work may still be clunky/might not work.
It uses another function, filter_maf_tbl
, which is meant to work on data.table
tibbles, so you can repurpose that for use with maftools
objects (i.e. the maf@data
or maf@maf.silent
slots).
One key feature of this function is that it will also remove top 50 exome FLAG genes by default.
source("../helper_functions.oncoplot.R")
# tcga_maf.filtered <- filter_maf_chunked("data/TCGA_ACC/mutect2/TCGA_ACC.mutect2.maf")
tcga_maf.filtered <- filter_maf_chunked(tcga_maf)
## Loading required package: tibble
## Couldn't find these columns; skipping filtering for these: tumor_freq,norm_freq,gnomAD_AF,AF
print(tcga_maf.filtered)
## -Validating
## -Silent variants: 3740
## -Summarizing
## -Processing clinical data
## -Finished in 1.534s elapsed (3.533s cpu)
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build GRCh38 NA NA
## 2: Center BCM NA NA
## 3: Samples 92 NA NA
## 4: nGenes 4552 NA NA
## 5: Frame_Shift_Del 313 3.402 1.0
## 6: Frame_Shift_Ins 113 1.228 0.0
## 7: In_Frame_Del 45 0.489 0.0
## 8: In_Frame_Ins 20 0.217 0.0
## 9: Missense_Mutation 5151 55.989 17.5
## 10: Nonsense_Mutation 437 4.750 1.0
## 11: Nonstop_Mutation 7 0.076 0.0
## 12: Splice_Site 151 1.641 0.0
## 13: Translation_Start_Site 7 0.076 0.0
## 14: total 6244 67.870 22.5
plot_silent_nonsilent
returns a ggplot2
plot, so we can easily make it interactive with plotly::ggplotly
plotly::ggplotly(plot_silent_nonsilent(tcga_maf.filtered))
make_burden_plot
returns a ggplot2
plot, so we can easily make it interactive with plotly::ggplotly
This is more useful for larger cohorts
plotly::ggplotly(make_burden_plot(tcga_maf.filtered, plotType = "Dotplot"))
This also adds Variant Classification information, but mostly useful for smaller cohorts
plotly::ggplotly(make_burden_plot(tcga_maf.filtered, plotType = "Barplot"))
make_oncoplot
is my own implementation of the oncoPrint
function from ComplexHeatmap
(heavily relying on ideas from maftools
’s oncoplot function)
myhm <- make_oncoplot(tcga_maf.filtered)
draw(myhm)
tcga_clinical_colors
will make some reasonable colors for commonly reported clinical features in TCGA datasets. This can then be input into make_oncoplot
to draw sample annotations.
Note that this clin_data
argument must be a data.table
object (suitable for use with the clinical.data
slot of a MAF
object) and must contain sample IDs in a column named ‘Tumor_Sample_Barcode’.
source("../helper_functions.tcga.R")
## maftools converts all columns to factor when storing the clinical data
## So let's turn age back to numeric
sample_annotation_data <- tcga_maf.filtered@clinical.data
sample_annotation_data$age_at_diagnosis <- as.numeric(as.character(sample_annotation_data$age_at_diagnosis))
## This function returns nice colors for selected clinical data columns
sample_annotation_colors <- tcga_clinical_colors(sample_annotation_data)
anno_columns <- c(names(sample_annotation_colors),"Tumor_Sample_Barcode")
## Remve clinical data columns that don't have custom colors
sample_annotation_data <- sample_annotation_data[,..anno_columns]
## Plot
source("../helper_functions.oncoplot.R")
# myhm <- make_oncoplot(tcga_maf.filtered,
make_oncoplot(tcga_maf.filtered,
savename = "annotated_oncoplot.pdf",
clin_data = sample_annotation_data,
clin_data_colors = sample_annotation_colors
)
# draw(myhm)
knitr::include_graphics(file.path(getwd(),"annotated_oncoplot.pdf"))
make_oncoplot2
is a more experimental version of make_oncoplot
. You can do things like plot selected genes and show pathogenicity annotations from ClinVar (if available in the MAF).
## Select just the genes that have pathogenic or uncertain mentions in the ClinVar significance annotation
mygenes <- unique(tcga_maf.filtered@data$Hugo_Symbol[grepl("pathogenic|uncertain",tcga_maf.filtered@data$CLIN_SIG, ignore.case = T)])
myhm <- make_oncoplot2(tcga_maf.filtered,cohort_freq_thresh = NULL, genes_to_plot = mygenes, use_clinvar_anno = T)
draw(myhm)
The genes_to_plot
argument will also accept a data.frame
with columns named “Reason
” and “Hugo_Symbol
” to define custom sets of genes to display. If cohort_freq_thresh
is set to a fraction, it will also add frequently mutated genes.
Note that the genes are plotted as given, so a gene can appear several times if it is in multiple “Reason” sets.
## Adapted from: http://yulab-smu.top/clusterProfiler-book/chapter7.html
library(msigdbr)
library(stringr)
## Get human gene sets from msigdb
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>%
dplyr::select(gs_name, gene_symbol, gs_subcat)
## Select just ones matching "JAK" or "AKT"
pathwaysdf <- data.frame(m_t2g)
mypaths <- pathwaysdf[grepl("JAK|AKT",pathwaysdf$gs_name, ignore.case = T),]
## Set up a data frame with columns "Reason" and "Hugo_Symbol"
## The "Reason" value is used to label the plot, so here I'm replacing _ with spaces and adding text wrapping with stringr
genes_df <- data.frame(Reason=stringr::str_wrap(gsub("_"," ",gsub("HALLMARK_","",mypaths$gs_name)), width=10),
Hugo_Symbol=mypaths$gene_symbol,
stringsAsFactors = F)
## Sizing can get tricky with larger plots; it will be adjusted automatically for the number of samples and genes if saving to pdf
make_oncoplot2(tcga_maf.filtered,cohort_freq_thresh = 0.05, genes_to_plot = genes_df, use_clinvar_anno = T, savename="customonco.pdf")
## This is just for adding the plot to this markdown document
knitr::include_graphics(file.path(getwd(),"customonco.pdf"))
Here we’re only talking about single-base substitution (SBS) signatures, focusing on the ones categorized by COSMIC v3.1.
I’m storing the manually curated etiology data in my Github repo.
# Download signatures data from my repo
local_signatures_file="../cosmic/COSMIC_Mutational_Signatures_v3.1.xlsx"
local_etiology_file="../cosmic/COSMIC_signature_etiology.xlsx"
if (!file.exists(local_signatures_file)) {
if (!dir.exists(dirname(local_signatures_file))) { dir.create(dirname(local_signatures_file), recursive = T)}
download.file("https://github.com/mtandon09/mt_helpers/blob/main/cosmic/COSMIC_Mutational_Signatures_v3.1.xlsx?raw=true",destfile = local_signatures_file)
}
if (!file.exists(local_signatures_file)) {
if (!dir.exists(dirname(local_signatures_file))) { dir.create(dirname(local_signatures_file), recursive = T)}
download.file("https://github.com/mtandon09/mt_helpers/blob/main/cosmic/COSMIC_signature_etiology.xlsx?raw=true",destfile = local_etiology_file)
}
Just plotting the first 5 samples here.
source("../helper_functions.mutsig.R")
# source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.mutsig.R")
### Sample signatures
plotN=5
make_signature_plot(make_tnm(tcga_maf.filtered)[,1:5])
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in 20.879 % of samples (APOBEC enrichment score > 2 ; 19 of 91 samples)
## -Creating mutation matrix
## --matrix of dimension 92x96
The cosine similarity of each individual signature is computed against each COSMIC signature using the cosin_sim_matrix
function from MutationalPatterns
.
The row annotations are colored by categories according to the Proposed Etiology
section of each COSMIC signatures page (e.g. for SBS1).
I tried several (relatively naïve) ways of clustering/concept-mapping the text with no luck, so instead I used those results to aid manual curation. My proposed solution, used here, can be found in this repo as an Excel file.
THIS IS EXTREMELY EXPERIMENTAL!. If you find discrepancies or bad categorizations in these annotations, please open an issue here!
### COSMIC signatures
source("../helper_functions.mutsig.R")
## maftools converts all columns to factor when storing the clinical data
## So let's turn age back to numeric
sample_annotation_data <- tcga_maf.filtered@clinical.data
sample_annotation_data$age_at_diagnosis <- as.numeric(as.character(sample_annotation_data$age_at_diagnosis))
## This function returns nice colors for selected clinical data columns
sample_annotation_colors <- tcga_clinical_colors(sample_annotation_data)
anno_columns <- c(names(sample_annotation_colors),"Tumor_Sample_Barcode")
## Remve clinical data columns that don't have custom colors
sample_annotation_data <- sample_annotation_data[,..anno_columns]
make_mut_signature_heatmap(tcga_maf.filtered,signatures_file = local_signatures_file, etio_data_xlsx = local_etiology_file,
savename="cosmic.pdf",
clin_data = sample_annotation_data, clin_data_colors = sample_annotation_colors)
## This is just for adding the plot to this markdown document
knitr::include_graphics(file.path(getwd(),"cosmic.pdf"))
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in 20.879 % of samples (APOBEC enrichment score > 2 ; 19 of 91 samples)
## -Creating mutation matrix
## --matrix of dimension 92x96
make_overlap_plot
can plot a square matrix heatmap of pair-wise overlaps, or show the same information in a ribbon plot (which is not usually helpful, but it looks pretty sometimes). Note that the summarize_by
can be set to ‘gene’ to count overlaps by gene (one or more mutations), or set to anything else to count by variant position and protein change.
This is also coded very naively so it’s very slow for large cohorts.
Also both plotting methods (using pheatmap
or circlize
) do not lend well to passing objects, so the plots are printed to PDF.
source("../helper_functions.oncoplot.R")
make_overlap_plot(tcga_maf.filtered, plotType = c("heatmap","ribbon"), savename = file.path(getwd(),"overlap.pdf"),savewidth = 12, saveheight = 12)
## This is just for adding the plot to this markdown document
knitr::include_graphics(file.path(getwd(),"overlap.pdf"))
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] ggsci_2.9 MutationalPatterns_1.12.0
## [3] NMF_0.23.0 Biobase_2.46.0
## [5] cluster_2.1.1 rngtools_1.5
## [7] pkgmaker_0.32.2 registry_0.5-1
## [9] BSgenome.Hsapiens.UCSC.hg38_1.4.1 BSgenome_1.54.0
## [11] rtracklayer_1.46.0 Biostrings_2.54.0
## [13] XVector_0.26.0 GenomicRanges_1.38.0
## [15] GenomeInfoDb_1.22.1 IRanges_2.20.2
## [17] S4Vectors_0.24.4 BiocGenerics_0.32.0
## [19] stringr_1.4.0 msigdbr_7.2.1
## [21] circlize_0.4.12 RColorBrewer_1.1-2
## [23] ggbeeswarm_0.6.0 tibble_3.0.6
## [25] TCGAbiolinks_2.14.1 dplyr_1.0.4
## [27] plotly_4.9.3 ggplot2_3.3.3
## [29] ComplexHeatmap_2.2.0 openxlsx_4.2.3
## [31] maftools_2.2.10
##
## loaded via a namespace (and not attached):
## [1] R.utils_2.10.1 tidyselect_1.1.0
## [3] RSQLite_2.2.3 AnnotationDbi_1.48.0
## [5] htmlwidgets_1.5.3 BiocParallel_1.20.1
## [7] DESeq_1.38.0 munsell_0.5.0
## [9] codetools_0.2-18 withr_2.4.1
## [11] colorspace_2.0-0 highr_0.8
## [13] knitr_1.31 ggsignif_0.6.0
## [15] labeling_0.4.2 GenomeInfoDbData_1.2.2
## [17] hwriter_1.3.2 KMsurv_0.1-5
## [19] parsetools_0.1.3 farver_2.0.3
## [21] bit64_4.0.5 downloader_0.4
## [23] vctrs_0.3.6 generics_0.1.0
## [25] xfun_0.21 ggthemes_4.2.4
## [27] BiocFileCache_1.10.2 EDASeq_2.20.0
## [29] R6_2.5.0 doParallel_1.0.16
## [31] clue_0.3-58 locfit_1.5-9.4
## [33] bitops_1.0-6 cachem_1.0.4
## [35] DelayedArray_0.12.3 assertthat_0.2.1
## [37] scales_1.1.1 beeswarm_0.2.3
## [39] gtable_0.3.0 sva_3.34.0
## [41] rlang_0.4.10 pkgcond_0.1.0
## [43] genefilter_1.68.0 GlobalOptions_0.1.2
## [45] splines_3.6.2 rstatix_0.7.0
## [47] lazyeval_0.2.2 wordcloud_2.6
## [49] selectr_0.4-2 broom_0.7.4
## [51] yaml_2.2.1 reshape2_1.4.4
## [53] abind_1.4-5 GenomicFeatures_1.38.2
## [55] crosstalk_1.1.1 backports_1.2.1
## [57] purrrogress_0.1.1 tools_3.6.2
## [59] gridBase_0.4-7 ellipsis_0.3.1
## [61] ggdendro_0.1.22 Rcpp_1.0.6
## [63] plyr_1.8.6 progress_1.2.2
## [65] zlibbioc_1.32.0 purrr_0.3.4
## [67] RCurl_1.98-1.2 prettyunits_1.1.1
## [69] ggpubr_0.4.0 openssl_1.4.3
## [71] GetoptLong_1.0.5 cowplot_1.1.1
## [73] zoo_1.8-8 SummarizedExperiment_1.16.1
## [75] haven_2.3.1 ggrepel_0.9.1
## [77] magrittr_2.0.1 data.table_1.13.6
## [79] survminer_0.4.8 matrixStats_0.58.0
## [81] aroma.light_3.16.0 hms_1.0.0
## [83] evaluate_0.14 xtable_1.8-4
## [85] XML_3.99-0.3 rio_0.5.16
## [87] jpeg_0.1-8.1 readxl_1.3.1
## [89] gridExtra_2.3 shape_1.4.5
## [91] testthat_3.0.2 compiler_3.6.2
## [93] biomaRt_2.42.1 crayon_1.4.1
## [95] R.oo_1.24.0 htmltools_0.5.1.1
## [97] mgcv_1.8-34 tidyr_1.1.2
## [99] geneplotter_1.64.0 postlogic_0.1.0.1
## [101] DBI_1.1.1 dbplyr_2.1.0
## [103] MASS_7.3-53.1 rappdirs_0.3.3
## [105] ShortRead_1.44.3 Matrix_1.3-2
## [107] car_3.0-10 readr_1.4.0
## [109] R.methodsS3_1.8.1 forcats_0.5.1
## [111] pkgconfig_2.0.3 km.ci_0.5-2
## [113] GenomicAlignments_1.22.1 foreign_0.8-76
## [115] testextra_0.1.0.1 xml2_1.3.2
## [117] foreach_1.5.1 annotate_1.64.0
## [119] vipor_0.4.5 rvest_0.3.6
## [121] VariantAnnotation_1.32.0 digest_0.6.27
## [123] pracma_2.3.3 rmarkdown_2.6
## [125] cellranger_1.1.0 survMisc_0.5.5
## [127] edgeR_3.28.1 curl_4.3
## [129] Rsamtools_2.2.3 rjson_0.2.20
## [131] lifecycle_1.0.0 nlme_3.1-152
## [133] jsonlite_1.7.2 carData_3.0-4
## [135] viridisLite_0.3.0 askpass_1.1
## [137] limma_3.42.2 pillar_1.4.7
## [139] lattice_0.20-41 fastmap_1.1.0
## [141] httr_1.4.2 survival_3.2-7
## [143] glue_1.4.2 zip_2.1.1
## [145] png_0.1-7 iterators_1.0.13
## [147] bit_4.0.4 stringi_1.5.3
## [149] blob_1.2.1 latticeExtra_0.6-29
## [151] memoise_2.0.0