vignettes/motif_vignette.Rmd
motif_vignette.Rmd
In this tutorial, we will perform DNA sequence motif analysis in Signac. We will explore two complementary options for performing motif analysis: one by finding overrepresented motifs in a set of differentially accessible peaks, one method performing differential motif activity analysis between groups of cells.
In this demonstration we use data from the adult mouse brain. See our vignette for the code used to generate this object, and links to the raw data. First, load the required packages and the pre-computed Seurat object:
library(Signac)
library(Seurat)
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Mmusculus.UCSC.mm10)
library(patchwork)
mouse_brain <- readRDS("adult_mouse_brain.rds")
mouse_brain
## An object of class Seurat
## 179011 features across 3512 samples within 2 assays
## Active assay: peaks (157203 features, 157203 variable features)
## 2 layers present: counts, data
## 1 other assay present: RNA
## 2 dimensional reductions calculated: lsi, umap
To add the DNA sequence motif information required for motif
analyses, we can run the AddMotifs()
function:
# Get a list of motif position frequency matrices from the JASPAR database
pfm <- getMatrixSet(
x = JASPAR2020,
opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE)
)
# add motif information
mouse_brain <- AddMotifs(
object = mouse_brain,
genome = BSgenome.Mmusculus.UCSC.mm10,
pfm = pfm
)
To facilitate motif analysis in Signac, we have created the
Motif
class to store all the required information,
including a list of position weight matrices (PWMs) or position
frequency matrices (PFMs) and a motif occurrence matrix. Here, the
AddMotifs()
function constructs a Motif
object
and adds it to our mouse brain dataset, along with other information
such as the base composition of each peak. A motif object can be added
to any Seurat assay using the SetAssayData()
function. See
the object interaction vignette for
more information.
To identify potentially important cell-type-specific regulatory sequences, we can search for DNA motifs that are overrepresented in a set of peaks that are differentially accessible between cell types.
Here, we find differentially accessible peaks between Pvalb and Sst
inhibitory interneurons. For sparse data (such as scATAC-seq), we find
it is often necessary to lower the min.pct
threshold in
FindMarkers()
from the default (0.1, which was designed for
scRNA-seq data).
We then perform a hypergeometric test to test the probability of observing the motif at the given frequency by chance, comparing with a background set of peaks matched for GC content.
da_peaks <- FindMarkers(
object = mouse_brain,
ident.1 = 'Pvalb',
ident.2 = 'Sst',
only.pos = TRUE,
test.use = 'LR',
min.pct = 0.05,
latent.vars = 'nCount_peaks'
)
# get top differentially accessible peaks
top.da.peak <- rownames(da_peaks[da_peaks$p_val < 0.005 & da_peaks$pct.1 > 0.2, ])
Matching the set of background peaks is essential when finding enriched DNA sequence motifs. By default, we choose a set of peaks matched for GC content, but it can be sometimes be beneficial to further restrict the background peaks to those that are accessible in the groups of cells compared when finding differentially accessible peaks.
The AccessiblePeaks()
function can be used to find a set
of peaks that are open in a subset of cells. We can use this function to
first restrict the set of possible background peaks to those peaks that
were open in the set of cells compared in FindMarkers()
,
and then create a GC-content-matched set of peaks from this larger set
using MatchRegionStats()
.
# find peaks open in Pvalb or Sst cells
open.peaks <- AccessiblePeaks(mouse_brain, idents = c("Pvalb", "Sst"))
# match the overall GC content in the peak set
meta.feature <- GetAssayData(mouse_brain, assay = "peaks", layer = "meta.features")
peaks.matched <- MatchRegionStats(
meta.feature = meta.feature[open.peaks, ],
query.feature = meta.feature[top.da.peak, ],
n = 50000
)
## Matching GC.percent distribution
peaks.matched
can then be used as the background peak
set by setting background=peaks.matched
in
FindMotifs()
.
# test enrichment
enriched.motifs <- FindMotifs(
object = mouse_brain,
features = top.da.peak
)
## Selecting background regions to match input sequence characteristics
## Matching GC.percent distribution
## Testing motif enrichment in 1131 regions
motif | observed | background | percent.observed | percent.background | fold.enrichment | pvalue | motif.name | p.adjust | |
---|---|---|---|---|---|---|---|---|---|
MA0497.1 | MA0497.1 | 556 | 8806 | 49.16004 | 22.0150 | 2.233024 | 0 | MEF2C | 0 |
MA0052.4 | MA0052.4 | 533 | 8477 | 47.12644 | 21.1925 | 2.223732 | 0 | MEF2A | 0 |
MA0773.1 | MA0773.1 | 398 | 5191 | 35.19010 | 12.9775 | 2.711624 | 0 | MEF2D | 0 |
MA0660.1 | MA0660.1 | 345 | 4246 | 30.50398 | 10.6150 | 2.873667 | 0 | MEF2B | 0 |
MA0592.3 | MA0592.3 | 350 | 4470 | 30.94607 | 11.1750 | 2.769223 | 0 | ESRRA | 0 |
MA1151.1 | MA1151.1 | 286 | 3343 | 25.28736 | 8.3575 | 3.025708 | 0 | RORC | 0 |
We can also plot the position weight matrices for the motifs, so we can visualize the different motif sequences.
Mef-family motifs, particularly Mef2c, are enriched in
Pvalb-specific peaks in scATAC-seq data (https://doi.org/10.1016/j.cell.2019.05.031; https://doi.org/10.1101/615179), and Mef2c is
required for the development of Pvalb interneurons (https://www.nature.com/articles/nature25999). Here our
results are consistent with these findings, and we observe a strong
enrichment of Mef-family motifs in the top results from
FindMotifs()
.
We can also compute a per-cell motif activity score by running chromVAR. This allows us to visualize motif activities per cell, and also provides an alternative method of identifying differentially-active motifs between cell types.
ChromVAR identifies motifs associated with variability in chromatin accessibility between cells. See the chromVAR paper for a complete description of the method.
mouse_brain <- RunChromVAR(
object = mouse_brain,
genome = BSgenome.Mmusculus.UCSC.mm10
)
DefaultAssay(mouse_brain) <- 'chromvar'
# look at the activity of Mef2c
p2 <- FeaturePlot(
object = mouse_brain,
features = "MA0497.1",
min.cutoff = 'q10',
max.cutoff = 'q90',
pt.size = 0.1
)
p1 + p2
We can also directly test for differential activity scores between cell types. This tends to give similar results as performing an enrichment test on differentially accessible peaks between the cell types (shown above).
When performing differential testing on the chromVAR z-score, we can
set mean.fxn=rowMeans
and fc.name="avg_diff"
in the FindMarkers()
function so that the fold-change
calculation computes the average difference in z-score between the
groups.
differential.activity <- FindMarkers(
object = mouse_brain,
ident.1 = 'Pvalb',
ident.2 = 'Sst',
only.pos = TRUE,
mean.fxn = rowMeans,
fc.name = "avg_diff"
)
MotifPlot(
object = mouse_brain,
motifs = head(rownames(differential.activity)),
assay = 'peaks'
)
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Singapore
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.2.0 BSgenome.Mmusculus.UCSC.mm10_1.4.3
## [3] BSgenome_1.70.2 rtracklayer_1.62.0
## [5] BiocIO_1.12.0 Biostrings_2.70.3
## [7] XVector_0.42.0 GenomicRanges_1.54.1
## [9] GenomeInfoDb_1.38.8 IRanges_2.36.0
## [11] S4Vectors_0.40.2 BiocGenerics_0.48.1
## [13] TFBSTools_1.40.0 JASPAR2020_0.99.10
## [15] Seurat_5.1.0 SeuratObject_5.0.2
## [17] sp_2.1-4 Signac_1.14.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.3.1
## [3] later_1.3.2 bitops_1.0-8
## [5] tibble_3.2.1 R.oo_1.26.0
## [7] polyclip_1.10-6 XML_3.99-0.17
## [9] DirichletMultinomial_1.44.0 fastDummies_1.7.3
## [11] lifecycle_1.0.4 globals_0.16.3
## [13] lattice_0.22-6 MASS_7.3-60.0.1
## [15] magrittr_2.0.3 limma_3.58.1
## [17] plotly_4.10.4 sass_0.4.9
## [19] rmarkdown_2.28 jquerylib_0.1.4
## [21] yaml_2.3.10 httpuv_1.6.15
## [23] sctransform_0.4.1 spam_2.10-0
## [25] spatstat.sparse_3.0-3 reticulate_1.37.0
## [27] cowplot_1.1.3 pbapply_1.7-2
## [29] DBI_1.2.3 CNEr_1.38.0
## [31] RColorBrewer_1.1-3 abind_1.4-5
## [33] zlibbioc_1.48.2 Rtsne_0.17
## [35] presto_1.0.0 purrr_1.0.2
## [37] R.utils_2.12.3 RCurl_1.98-1.16
## [39] pracma_2.4.4 GenomeInfoDbData_1.2.11
## [41] ggrepel_0.9.5 irlba_2.3.5.1
## [43] spatstat.utils_3.0-4 listenv_0.9.1
## [45] seqLogo_1.68.0 goftest_1.2-3
## [47] RSpectra_0.16-2 spatstat.random_3.2-3
## [49] annotate_1.80.0 fitdistrplus_1.1-11
## [51] parallelly_1.38.0 pkgdown_2.0.9
## [53] leiden_0.4.3.1 codetools_0.2-19
## [55] DelayedArray_0.28.0 RcppRoll_0.3.1
## [57] DT_0.33 tidyselect_1.2.1
## [59] ggseqlogo_0.2 farver_2.1.2
## [61] spatstat.explore_3.2-7 matrixStats_1.3.0
## [63] GenomicAlignments_1.38.2 jsonlite_1.8.8
## [65] progressr_0.14.0 motifmatchr_1.24.0
## [67] ggridges_0.5.6 survival_3.6-4
## [69] systemfonts_1.1.0 tools_4.3.1
## [71] ragg_1.3.2 ica_1.0-3
## [73] TFMPvalue_0.0.9 Rcpp_1.0.13
## [75] glue_1.7.0 gridExtra_2.3
## [77] SparseArray_1.2.4 xfun_0.47
## [79] MatrixGenerics_1.14.0 dplyr_1.1.4
## [81] withr_3.0.1 fastmap_1.2.0
## [83] fansi_1.0.6 caTools_1.18.2
## [85] digest_0.6.37 R6_2.5.1
## [87] mime_0.12 nabor_0.5.0
## [89] textshaping_0.4.0 colorspace_2.1-1
## [91] scattermore_1.2 GO.db_3.18.0
## [93] tensor_1.5 gtools_3.9.5
## [95] poweRlaw_0.80.0 spatstat.data_3.0-4
## [97] chromVAR_1.24.0 RSQLite_2.3.7
## [99] R.methodsS3_1.8.2 utf8_1.2.4
## [101] tidyr_1.3.1 generics_0.1.3
## [103] data.table_1.15.4 httr_1.4.7
## [105] htmlwidgets_1.6.4 S4Arrays_1.2.1
## [107] uwot_0.1.16 pkgconfig_2.0.3
## [109] gtable_0.3.5 blob_1.2.4
## [111] lmtest_0.9-40 htmltools_0.5.8.1
## [113] dotCall64_1.1-1 scales_1.3.0
## [115] Biobase_2.62.0 png_0.1-8
## [117] knitr_1.48 tzdb_0.4.0
## [119] reshape2_1.4.4 rjson_0.2.21
## [121] nlme_3.1-164 zoo_1.8-12
## [123] cachem_1.1.0 stringr_1.5.1
## [125] KernSmooth_2.23-24 parallel_4.3.1
## [127] miniUI_0.1.1.1 AnnotationDbi_1.64.1
## [129] restfulr_0.0.15 desc_1.4.3
## [131] pillar_1.9.0 grid_4.3.1
## [133] vctrs_0.6.5 RANN_2.6.1
## [135] promises_1.3.0 xtable_1.8-4
## [137] cluster_2.1.6 evaluate_0.24.0
## [139] readr_2.1.5 cli_3.6.3
## [141] compiler_4.3.1 Rsamtools_2.18.0
## [143] rlang_1.1.4 crayon_1.5.3
## [145] future.apply_1.11.2 labeling_0.4.3
## [147] plyr_1.8.9 fs_1.6.4
## [149] stringi_1.8.4 deldir_2.0-4
## [151] viridisLite_0.4.2 BiocParallel_1.36.0
## [153] munsell_0.5.1 lazyeval_0.2.2
## [155] spatstat.geom_3.2-9 Matrix_1.6-5
## [157] RcppHNSW_0.6.0 hms_1.1.3
## [159] bit64_4.0.5 future_1.34.0
## [161] ggplot2_3.5.1 statmod_1.5.0
## [163] KEGGREST_1.42.0 shiny_1.9.1
## [165] highr_0.11 SummarizedExperiment_1.32.0
## [167] ROCR_1.0-11 igraph_2.0.3
## [169] memoise_2.0.1 bslib_0.8.0
## [171] fastmatch_1.1-4 bit_4.0.5