Example exploration


Source mt_helpers

We’re just sourcing the functions straight from Github

## For TCGA-related functions
source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.tcga.R")
## For oncoplot and other MAF plotting functions
source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.oncoplot.R")
## For mutational signatures functions
source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.mutsig.R")

Get some MAF data

get_tcga_data will fetch MAF files from TCGA data using TCGABiolinks. It just packages the jobs of fetching a MAF file, it’s associated clinical data and returns a maftools MAF object.

We’ll use data from the TCGA-ACC project.

tcga_maf <- get_tcga_data(tcga_dataset = "ACC", variant_caller = "mutect2")
print(tcga_maf)
## -Reading
## -Validating
## -Silent variants: 3997 
## -Summarizing
## --Possible FLAGS among top ten genes:
##   MUC16
##   TTN
##   HMCN1
## -Processing clinical data
## -Finished in 2.148s elapsed (5.499s cpu) 
## An object of class  MAF 
##                         ID summary   Mean Median
##  1:             NCBI_Build  GRCh38     NA     NA
##  2:                 Center     BCM     NA     NA
##  3:                Samples      92     NA     NA
##  4:                 nGenes    4752     NA     NA
##  5:        Frame_Shift_Del     335  3.641      1
##  6:        Frame_Shift_Ins     122  1.326      0
##  7:           In_Frame_Del      45  0.489      0
##  8:           In_Frame_Ins      28  0.304      0
##  9:      Missense_Mutation    5580 60.652     18
## 10:      Nonsense_Mutation     467  5.076      1
## 11:       Nonstop_Mutation       7  0.076      0
## 12:            Splice_Site     159  1.728      1
## 13: Translation_Start_Site       7  0.076      0
## 14:                  total    6750 73.370     23

The MAF file will be saved as ./data/TCGA_ACC/mutect2/TCGA-ACC.mutect2.maf


Filter MAF data

filter_maf_chunked is a function to read and filter a MAF file in chunks. This kinda helps for huge MAFs (say > 500k variant or a couple of Gbs), but ultimately you still gotta hold ’em in memory because this is R :(, so downstream work may still be clunky/might not work.

It uses another function, filter_maf_tbl, which is meant to work on data.table tibbles, so you can repurpose that for use with maftools objects (i.e. the maf@data or maf@maf.silent slots).

One key feature of this function is that it will also remove top 50 exome FLAG genes by default.

source("../helper_functions.oncoplot.R")
# tcga_maf.filtered <- filter_maf_chunked("data/TCGA_ACC/mutect2/TCGA_ACC.mutect2.maf")
tcga_maf.filtered <- filter_maf_chunked(tcga_maf)
## Loading required package: tibble
## Couldn't find these columns; skipping filtering for these: tumor_freq,norm_freq,gnomAD_AF,AF
print(tcga_maf.filtered)
## -Validating
## -Silent variants: 3740 
## -Summarizing
## -Processing clinical data
## -Finished in 1.534s elapsed (3.533s cpu) 
## An object of class  MAF 
##                         ID summary   Mean Median
##  1:             NCBI_Build  GRCh38     NA     NA
##  2:                 Center     BCM     NA     NA
##  3:                Samples      92     NA     NA
##  4:                 nGenes    4552     NA     NA
##  5:        Frame_Shift_Del     313  3.402    1.0
##  6:        Frame_Shift_Ins     113  1.228    0.0
##  7:           In_Frame_Del      45  0.489    0.0
##  8:           In_Frame_Ins      20  0.217    0.0
##  9:      Missense_Mutation    5151 55.989   17.5
## 10:      Nonsense_Mutation     437  4.750    1.0
## 11:       Nonstop_Mutation       7  0.076    0.0
## 12:            Splice_Site     151  1.641    0.0
## 13: Translation_Start_Site       7  0.076    0.0
## 14:                  total    6244 67.870   22.5

Count silent vs. non-silent mutations

plot_silent_nonsilent returns a ggplot2 plot, so we can easily make it interactive with plotly::ggplotly

Silent/Non-Silent

plotly::ggplotly(plot_silent_nonsilent(tcga_maf.filtered))

Mutation Burden

Plot mutation burden

make_burden_plot returns a ggplot2 plot, so we can easily make it interactive with plotly::ggplotly

Dot plot

This is more useful for larger cohorts

plotly::ggplotly(make_burden_plot(tcga_maf.filtered, plotType = "Dotplot"))

Bar plot

This also adds Variant Classification information, but mostly useful for smaller cohorts

plotly::ggplotly(make_burden_plot(tcga_maf.filtered, plotType = "Barplot"))

Oncoplot

Basic oncoplot

make_oncoplot is my own implementation of the oncoPrint function from ComplexHeatmap (heavily relying on ideas from maftools’s oncoplot function)

myhm <- make_oncoplot(tcga_maf.filtered)

draw(myhm)

Sample annotations

tcga_clinical_colors will make some reasonable colors for commonly reported clinical features in TCGA datasets. This can then be input into make_oncoplot to draw sample annotations.

Note that this clin_data argument must be a data.table object (suitable for use with the clinical.data slot of a MAF object) and must contain sample IDs in a column named ‘Tumor_Sample_Barcode’.

source("../helper_functions.tcga.R")
## maftools converts all columns to factor when storing the clinical data
## So let's turn age back to numeric
sample_annotation_data <- tcga_maf.filtered@clinical.data
sample_annotation_data$age_at_diagnosis <- as.numeric(as.character(sample_annotation_data$age_at_diagnosis))

## This function returns nice colors for selected clinical data columns
sample_annotation_colors <- tcga_clinical_colors(sample_annotation_data)
anno_columns <- c(names(sample_annotation_colors),"Tumor_Sample_Barcode")

## Remve clinical data columns that don't have custom colors
sample_annotation_data <- sample_annotation_data[,..anno_columns]

## Plot
source("../helper_functions.oncoplot.R")
# myhm <- make_oncoplot(tcga_maf.filtered,
make_oncoplot(tcga_maf.filtered,
              savename = "annotated_oncoplot.pdf",
              clin_data = sample_annotation_data,
              clin_data_colors = sample_annotation_colors
              )
# draw(myhm)
knitr::include_graphics(file.path(getwd(),"annotated_oncoplot.pdf"))

ClinVar annotations

make_oncoplot2 is a more experimental version of make_oncoplot. You can do things like plot selected genes and show pathogenicity annotations from ClinVar (if available in the MAF).

## Select just the genes that have pathogenic or uncertain mentions in the ClinVar significance annotation
mygenes <- unique(tcga_maf.filtered@data$Hugo_Symbol[grepl("pathogenic|uncertain",tcga_maf.filtered@data$CLIN_SIG, ignore.case = T)])

myhm <- make_oncoplot2(tcga_maf.filtered,cohort_freq_thresh = NULL, genes_to_plot = mygenes, use_clinvar_anno = T)

draw(myhm)

Sets of selected genes

The genes_to_plot argument will also accept a data.frame with columns named “Reason” and “Hugo_Symbol” to define custom sets of genes to display. If cohort_freq_thresh is set to a fraction, it will also add frequently mutated genes.

Note that the genes are plotted as given, so a gene can appear several times if it is in multiple “Reason” sets.

## Adapted from: http://yulab-smu.top/clusterProfiler-book/chapter7.html
library(msigdbr)
library(stringr)
## Get human gene sets from msigdb
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>% 
  dplyr::select(gs_name, gene_symbol, gs_subcat)

## Select just ones matching "JAK" or "AKT"
pathwaysdf <- data.frame(m_t2g)
mypaths <- pathwaysdf[grepl("JAK|AKT",pathwaysdf$gs_name, ignore.case = T),]

## Set up a data frame with columns "Reason" and "Hugo_Symbol"
## The "Reason" value is used to label the plot, so here I'm replacing _ with spaces and adding text wrapping with stringr
genes_df <- data.frame(Reason=stringr::str_wrap(gsub("_"," ",gsub("HALLMARK_","",mypaths$gs_name)), width=10),
                       Hugo_Symbol=mypaths$gene_symbol, 
                       stringsAsFactors = F)

## Sizing can get tricky with larger plots; it will be adjusted automatically for the number of samples and genes if saving to pdf
make_oncoplot2(tcga_maf.filtered,cohort_freq_thresh = 0.05, genes_to_plot = genes_df, use_clinvar_anno = T, savename="customonco.pdf")

## This is just for adding the plot to this markdown document
knitr::include_graphics(file.path(getwd(),"customonco.pdf"))

Mutational Signatures

Here we’re only talking about single-base substitution (SBS) signatures, focusing on the ones categorized by COSMIC v3.1.

Download COSMIC data and etiologies

I’m storing the manually curated etiology data in my Github repo.

# Download signatures data from my repo
local_signatures_file="../cosmic/COSMIC_Mutational_Signatures_v3.1.xlsx"
local_etiology_file="../cosmic/COSMIC_signature_etiology.xlsx"

if (!file.exists(local_signatures_file)) {
  if (!dir.exists(dirname(local_signatures_file))) { dir.create(dirname(local_signatures_file), recursive = T)}
  download.file("https://github.com/mtandon09/mt_helpers/blob/main/cosmic/COSMIC_Mutational_Signatures_v3.1.xlsx?raw=true",destfile = local_signatures_file)
}

if (!file.exists(local_signatures_file)) {
  if (!dir.exists(dirname(local_signatures_file))) { dir.create(dirname(local_signatures_file), recursive = T)}
  download.file("https://github.com/mtandon09/mt_helpers/blob/main/cosmic/COSMIC_signature_etiology.xlsx?raw=true",destfile = local_etiology_file)
}

Inspect individual signatures

Just plotting the first 5 samples here.

source("../helper_functions.mutsig.R")
# source("https://raw.githubusercontent.com/mtandon09/mt_helpers/main/helper_functions.mutsig.R")
### Sample signatures
plotN=5
make_signature_plot(make_tnm(tcga_maf.filtered)[,1:5])

## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in  20.879 % of samples (APOBEC enrichment score > 2 ;  19  of  91  samples)
## -Creating mutation matrix
## --matrix of dimension 92x96

Plot similarity to COSMIC signatures

The cosine similarity of each individual signature is computed against each COSMIC signature using the cosin_sim_matrix function from MutationalPatterns.

The row annotations are colored by categories according to the Proposed Etiology section of each COSMIC signatures page (e.g. for SBS1).

I tried several (relatively naïve) ways of clustering/concept-mapping the text with no luck, so instead I used those results to aid manual curation. My proposed solution, used here, can be found in this repo as an Excel file.

THIS IS EXTREMELY EXPERIMENTAL!. If you find discrepancies or bad categorizations in these annotations, please open an issue here!

### COSMIC signatures
source("../helper_functions.mutsig.R")
## maftools converts all columns to factor when storing the clinical data
## So let's turn age back to numeric
sample_annotation_data <- tcga_maf.filtered@clinical.data
sample_annotation_data$age_at_diagnosis <- as.numeric(as.character(sample_annotation_data$age_at_diagnosis))

## This function returns nice colors for selected clinical data columns
sample_annotation_colors <- tcga_clinical_colors(sample_annotation_data)
anno_columns <- c(names(sample_annotation_colors),"Tumor_Sample_Barcode")

## Remve clinical data columns that don't have custom colors
sample_annotation_data <- sample_annotation_data[,..anno_columns]

make_mut_signature_heatmap(tcga_maf.filtered,signatures_file = local_signatures_file, etio_data_xlsx = local_etiology_file, 
                                    savename="cosmic.pdf",
                                    clin_data = sample_annotation_data, clin_data_colors = sample_annotation_colors)

## This is just for adding the plot to this markdown document
knitr::include_graphics(file.path(getwd(),"cosmic.pdf"))
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in  20.879 % of samples (APOBEC enrichment score > 2 ;  19  of  91  samples)
## -Creating mutation matrix
## --matrix of dimension 92x96

Co-occurence and overlaps

Visualize overlap between samples

make_overlap_plot can plot a square matrix heatmap of pair-wise overlaps, or show the same information in a ribbon plot (which is not usually helpful, but it looks pretty sometimes). Note that the summarize_by can be set to ‘gene’ to count overlaps by gene (one or more mutations), or set to anything else to count by variant position and protein change.

This is also coded very naively so it’s very slow for large cohorts.

Also both plotting methods (using pheatmap or circlize) do not lend well to passing objects, so the plots are printed to PDF.

source("../helper_functions.oncoplot.R")
make_overlap_plot(tcga_maf.filtered, plotType = c("heatmap","ribbon"), savename = file.path(getwd(),"overlap.pdf"),savewidth = 12, saveheight = 12)

## This is just for adding the plot to this markdown document
knitr::include_graphics(file.path(getwd(),"overlap.pdf"))

R Session Info

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] ggsci_2.9                         MutationalPatterns_1.12.0        
##  [3] NMF_0.23.0                        Biobase_2.46.0                   
##  [5] cluster_2.1.1                     rngtools_1.5                     
##  [7] pkgmaker_0.32.2                   registry_0.5-1                   
##  [9] BSgenome.Hsapiens.UCSC.hg38_1.4.1 BSgenome_1.54.0                  
## [11] rtracklayer_1.46.0                Biostrings_2.54.0                
## [13] XVector_0.26.0                    GenomicRanges_1.38.0             
## [15] GenomeInfoDb_1.22.1               IRanges_2.20.2                   
## [17] S4Vectors_0.24.4                  BiocGenerics_0.32.0              
## [19] stringr_1.4.0                     msigdbr_7.2.1                    
## [21] circlize_0.4.12                   RColorBrewer_1.1-2               
## [23] ggbeeswarm_0.6.0                  tibble_3.0.6                     
## [25] TCGAbiolinks_2.14.1               dplyr_1.0.4                      
## [27] plotly_4.9.3                      ggplot2_3.3.3                    
## [29] ComplexHeatmap_2.2.0              openxlsx_4.2.3                   
## [31] maftools_2.2.10                  
## 
## loaded via a namespace (and not attached):
##   [1] R.utils_2.10.1              tidyselect_1.1.0           
##   [3] RSQLite_2.2.3               AnnotationDbi_1.48.0       
##   [5] htmlwidgets_1.5.3           BiocParallel_1.20.1        
##   [7] DESeq_1.38.0                munsell_0.5.0              
##   [9] codetools_0.2-18            withr_2.4.1                
##  [11] colorspace_2.0-0            highr_0.8                  
##  [13] knitr_1.31                  ggsignif_0.6.0             
##  [15] labeling_0.4.2              GenomeInfoDbData_1.2.2     
##  [17] hwriter_1.3.2               KMsurv_0.1-5               
##  [19] parsetools_0.1.3            farver_2.0.3               
##  [21] bit64_4.0.5                 downloader_0.4             
##  [23] vctrs_0.3.6                 generics_0.1.0             
##  [25] xfun_0.21                   ggthemes_4.2.4             
##  [27] BiocFileCache_1.10.2        EDASeq_2.20.0              
##  [29] R6_2.5.0                    doParallel_1.0.16          
##  [31] clue_0.3-58                 locfit_1.5-9.4             
##  [33] bitops_1.0-6                cachem_1.0.4               
##  [35] DelayedArray_0.12.3         assertthat_0.2.1           
##  [37] scales_1.1.1                beeswarm_0.2.3             
##  [39] gtable_0.3.0                sva_3.34.0                 
##  [41] rlang_0.4.10                pkgcond_0.1.0              
##  [43] genefilter_1.68.0           GlobalOptions_0.1.2        
##  [45] splines_3.6.2               rstatix_0.7.0              
##  [47] lazyeval_0.2.2              wordcloud_2.6              
##  [49] selectr_0.4-2               broom_0.7.4                
##  [51] yaml_2.2.1                  reshape2_1.4.4             
##  [53] abind_1.4-5                 GenomicFeatures_1.38.2     
##  [55] crosstalk_1.1.1             backports_1.2.1            
##  [57] purrrogress_0.1.1           tools_3.6.2                
##  [59] gridBase_0.4-7              ellipsis_0.3.1             
##  [61] ggdendro_0.1.22             Rcpp_1.0.6                 
##  [63] plyr_1.8.6                  progress_1.2.2             
##  [65] zlibbioc_1.32.0             purrr_0.3.4                
##  [67] RCurl_1.98-1.2              prettyunits_1.1.1          
##  [69] ggpubr_0.4.0                openssl_1.4.3              
##  [71] GetoptLong_1.0.5            cowplot_1.1.1              
##  [73] zoo_1.8-8                   SummarizedExperiment_1.16.1
##  [75] haven_2.3.1                 ggrepel_0.9.1              
##  [77] magrittr_2.0.1              data.table_1.13.6          
##  [79] survminer_0.4.8             matrixStats_0.58.0         
##  [81] aroma.light_3.16.0          hms_1.0.0                  
##  [83] evaluate_0.14               xtable_1.8-4               
##  [85] XML_3.99-0.3                rio_0.5.16                 
##  [87] jpeg_0.1-8.1                readxl_1.3.1               
##  [89] gridExtra_2.3               shape_1.4.5                
##  [91] testthat_3.0.2              compiler_3.6.2             
##  [93] biomaRt_2.42.1              crayon_1.4.1               
##  [95] R.oo_1.24.0                 htmltools_0.5.1.1          
##  [97] mgcv_1.8-34                 tidyr_1.1.2                
##  [99] geneplotter_1.64.0          postlogic_0.1.0.1          
## [101] DBI_1.1.1                   dbplyr_2.1.0               
## [103] MASS_7.3-53.1               rappdirs_0.3.3             
## [105] ShortRead_1.44.3            Matrix_1.3-2               
## [107] car_3.0-10                  readr_1.4.0                
## [109] R.methodsS3_1.8.1           forcats_0.5.1              
## [111] pkgconfig_2.0.3             km.ci_0.5-2                
## [113] GenomicAlignments_1.22.1    foreign_0.8-76             
## [115] testextra_0.1.0.1           xml2_1.3.2                 
## [117] foreach_1.5.1               annotate_1.64.0            
## [119] vipor_0.4.5                 rvest_0.3.6                
## [121] VariantAnnotation_1.32.0    digest_0.6.27              
## [123] pracma_2.3.3                rmarkdown_2.6              
## [125] cellranger_1.1.0            survMisc_0.5.5             
## [127] edgeR_3.28.1                curl_4.3                   
## [129] Rsamtools_2.2.3             rjson_0.2.20               
## [131] lifecycle_1.0.0             nlme_3.1-152               
## [133] jsonlite_1.7.2              carData_3.0-4              
## [135] viridisLite_0.3.0           askpass_1.1                
## [137] limma_3.42.2                pillar_1.4.7               
## [139] lattice_0.20-41             fastmap_1.1.0              
## [141] httr_1.4.2                  survival_3.2-7             
## [143] glue_1.4.2                  zip_2.1.1                  
## [145] png_0.1-7                   iterators_1.0.13           
## [147] bit_4.0.4                   stringi_1.5.3              
## [149] blob_1.2.1                  latticeExtra_0.6-29        
## [151] memoise_2.0.0