In this tutorial, we will perform DNA sequence motif analysis in Signac. We will explore two complementary options for performing motif analysis: one by finding overrepresented motifs in a set of differentially accessible peaks, one method performing differential motif activity analysis between groups of cells.
In this demonstration we use data from the adult mouse brain. See our vignette for the code used to generate this object, and links to the raw data. First, load the required packages and the pre-computed Seurat object:
library(Signac) library(Seurat) library(JASPAR2018) library(TFBSTools) library(BSgenome.Mmusculus.UCSC.mm10) library(patchwork) set.seed(1234)
mouse_brain <- readRDS("../vignette_data/adult_mouse_brain.rds") mouse_brain
## An object of class Seurat
## 178654 features across 3522 samples within 2 assays
## Active assay: peaks (157203 features, 157203 variable features)
## 1 other assay present: RNA
## 2 dimensional reductions calculated: lsi, umap
To facilitate motif analysis in Signac, we have create the Motif
class to store all the required information, including a list of position weight matrices (PWMs) or position frequency matrices (PFMs) and a motif occurrence matrix. Here, we construct a Motif
object and add it to our mouse brain dataset. A motif object can be added to any Seurat assay using the AddMotifObject
function:
# Get a list of motif position frequency matrices from the JASPAR database pfm <- getMatrixSet( x = JASPAR2018, opts = list(species = 9606, all_versions = FALSE) ) # Scan the DNA sequence of each peak for the presence of each motif motif.matrix <- CreateMotifMatrix( features = StringToGRanges(rownames(mouse_brain), sep = c(":", "-")), pwm = pfm, genome = 'mm10', sep = c(":", "-"), use.counts = FALSE ) # Create a new Mofif object to store the results motif <- CreateMotifObject( data = motif.matrix, pwm = pfm ) # Add the Motif object to the assay mouse_brain[['peaks']] <- AddMotifObject( object = mouse_brain[['peaks']], motif.object = motif )
In order to test for overrepresented motifs, we also need to compute some sequence characteristics of the peaks, such as GC content, sequence length, and dinucleotide frequency. The RegionStats
function computes this information for us and stores the results in the feature metadata in the Seurat object.
mouse_brain <- RegionStats( object = mouse_brain, genome = BSgenome.Mmusculus.UCSC.mm10, sep = c(":", "-") )
To identify potentially important cell-type-specific regulatory sequences, we can search for DNA motifs that are overrepresented in a set of peaks that are differentially accessible between cell types.
Here, we find differentially accessible peaks between Pvalb and Sst inhibitory interneurons. We then perform a hypergeometric test to test the probability of observing the motif at the given frequency by chance, comparing with a background set of peaks matched for GC content.
da_peaks <- FindMarkers( object = mouse_brain, ident.1 = 'Pvalb', ident.2 = 'Sst', only.pos = TRUE, test.use = 'LR', latent.vars = 'nCount_peaks' ) # Test the differentially accessible peaks for overrepresented motifs enriched.motifs <- FindMotifs( object = mouse_brain, features = head(rownames(da_peaks), 1000) ) # sort by p-value and fold change enriched.motifs <- enriched.motifs[order(enriched.motifs[, 7], -enriched.motifs[, 6]), ]
head(enriched.motifs)
## motif observed background percent.observed percent.background
## MA0497.1 MA0497.1 445 8914 44.5 22.2850
## MA0773.1 MA0773.1 311 5315 31.1 13.2875
## MA0052.3 MA0052.3 326 5802 32.6 14.5050
## MA0660.1 MA0660.1 264 4319 26.4 10.7975
## MA1151.1 MA1151.1 221 3304 22.1 8.2600
## MA1100.1 MA1100.1 443 10054 44.3 25.1350
## fold.enrichment pvalue motif.name
## MA0497.1 1.996859 3.118991e-56 MEF2C
## MA0773.1 2.340546 1.039226e-49 MEF2D
## MA0052.3 2.247501 1.238076e-48 MEF2A
## MA0660.1 2.445010 1.459268e-44 MEF2B
## MA1151.1 2.675545 1.240018e-42 RORC
## MA1100.1 1.762483 1.118402e-40 ASCL1
We can also plot the position weight matrices for the motifs, so we can visualize the different motif sequences.
We and others have previously shown that Mef-family motifs, particularly Mef2c, are enriched in Pvalb-specific peaks in scATAC-seq data (https://doi.org/10.1016/j.cell.2019.05.031; https://doi.org/10.1101/615179), and further shown that Mef2c is required for the development of Pvalb interneurons (https://www.nature.com/articles/nature25999). Here our results are consistent with these findings, and we observe a strong enrichment of Mef-family motifs in the top results from FindMotifs
.
We can also compute a per-cell motif activity score by running chromVAR. This allows us to visualize motif activities per cell, and also provides an alternative method of identifying differentially-active motifs between cell types.
ChromVAR identifies motifs associated with variability in chromatin accessibility between cells. See the chromVAR paper for a complete description of the method.
mouse_brain <- RunChromVAR( object = mouse_brain, genome = BSgenome.Mmusculus.UCSC.mm10 ) DefaultAssay(mouse_brain) <- 'chromvar' # look at the activity of Mef2c, the top hit from the overrepresentation testing p2 <- FeaturePlot( object = mouse_brain, features = rownames(enriched.motifs)[[1]], min.cutoff = 'q10', max.cutoff = 'q90', pt.size = 0.1 ) p1 + p2
We can also directly test for differential activity scores between cell types. This tends to give similar results as performing an enrichment test on differentially accessible peaks between the cell types (shown above)
differential.activity <- FindMarkers( object = mouse_brain, ident.1 = 'Pvalb', ident.2 = 'Sst', only.pos = TRUE, test.use = 'LR', latent.vars = 'nCount_peaks' ) MotifPlot( object = mouse_brain, motifs = head(rownames(differential.activity)), assay = 'peaks' )
Session Info
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] patchwork_1.0.1 BSgenome.Mmusculus.UCSC.mm10_1.4.0
## [3] BSgenome_1.56.0 rtracklayer_1.48.0
## [5] Biostrings_2.56.0 XVector_0.28.0
## [7] GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
## [9] IRanges_2.22.2 S4Vectors_0.26.1
## [11] BiocGenerics_0.34.0 TFBSTools_1.26.0
## [13] JASPAR2018_1.1.1 Seurat_3.1.5.9009
## [15] Signac_0.2.5
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.8 chromVAR_1.10.0
## [3] plyr_1.8.6 igraph_1.2.5
## [5] lazyeval_0.2.2 splines_4.0.1
## [7] BiocParallel_1.22.0 listenv_0.8.0
## [9] ggplot2_3.3.2 digest_0.6.25
## [11] htmltools_0.5.0 GO.db_3.11.4
## [13] magrittr_1.5 memoise_1.1.0
## [15] cluster_2.1.0 ROCR_1.0-11
## [17] globals_0.12.5 readr_1.3.1
## [19] annotate_1.66.0 ggfittext_0.9.0
## [21] matrixStats_0.56.0 R.utils_2.9.2
## [23] pkgdown_1.5.1 colorspace_1.4-1
## [25] blob_1.2.1 rappdirs_0.3.1
## [27] ggrepel_0.8.2 xfun_0.14
## [29] dplyr_1.0.0 crayon_1.3.4
## [31] RCurl_1.98-1.2 jsonlite_1.7.0
## [33] TFMPvalue_0.0.8 survival_3.2-3
## [35] zoo_1.8-8 ape_5.4
## [37] glue_1.4.1 gtable_0.3.0
## [39] zlibbioc_1.34.0 leiden_0.3.3
## [41] DelayedArray_0.14.0 future.apply_1.6.0
## [43] scales_1.1.1 DBI_1.1.0
## [45] miniUI_0.1.1.1 Rcpp_1.0.5
## [47] viridisLite_0.3.0 xtable_1.8-4
## [49] reticulate_1.16 bit_1.1-15.2
## [51] rsvd_1.0.3 DT_0.13
## [53] htmlwidgets_1.5.1 httr_1.4.1
## [55] RColorBrewer_1.1-2 nabor_0.5.0
## [57] ellipsis_0.3.1 ica_1.0-2
## [59] farver_2.0.3 pkgconfig_2.0.3
## [61] XML_3.99-0.3 R.methodsS3_1.8.0
## [63] ggseqlogo_0.1 uwot_0.1.8
## [65] later_1.1.0.1 labeling_0.3
## [67] tidyselect_1.1.0 rlang_0.4.6
## [69] reshape2_1.4.4 AnnotationDbi_1.50.0
## [71] munsell_0.5.0 tools_4.0.1
## [73] DirichletMultinomial_1.30.0 generics_0.0.2
## [75] RSQLite_2.2.0 ggridges_0.5.2
## [77] fastmap_1.0.1 evaluate_0.14
## [79] stringr_1.4.0 yaml_2.2.1
## [81] knitr_1.28 bit64_0.9-7
## [83] fs_1.4.1 fitdistrplus_1.1-1
## [85] caTools_1.18.0 purrr_0.3.4
## [87] RANN_2.6.1 KEGGREST_1.28.0
## [89] pbapply_1.4-2 future_1.17.0
## [91] nlme_3.1-148 mime_0.9
## [93] R.oo_1.23.0 poweRlaw_0.70.6
## [95] pracma_2.2.9 compiler_4.0.1
## [97] plotly_4.9.2.1 png_0.1-7
## [99] tibble_3.0.2 stringi_1.4.6
## [101] desc_1.2.0 lattice_0.20-41
## [103] CNEr_1.24.0 Matrix_1.2-18
## [105] vctrs_0.3.1 pillar_1.4.4
## [107] lifecycle_0.2.0 lmtest_0.9-37
## [109] RcppAnnoy_0.0.16 data.table_1.12.8
## [111] cowplot_1.0.0 bitops_1.0-6
## [113] irlba_2.3.3 httpuv_1.5.4
## [115] gggenes_0.4.0 R6_2.4.1
## [117] promises_1.1.1 KernSmooth_2.23-17
## [119] gridExtra_2.3 codetools_0.2-16
## [121] MASS_7.3-51.6 gtools_3.8.2
## [123] assertthat_0.2.1 seqLogo_1.54.3
## [125] SummarizedExperiment_1.18.1 rprojroot_1.3-2
## [127] GenomicAlignments_1.24.0 sctransform_0.2.1
## [129] Rsamtools_2.4.0 GenomeInfoDbData_1.2.3
## [131] hms_0.5.3 motifmatchr_1.10.0
## [133] grid_4.0.1 tidyr_1.1.0
## [135] rmarkdown_2.2 Rtsne_0.15
## [137] shiny_1.4.0.2 Biobase_2.48.0