vignettes/mouse_brain_vignette.Rmd
mouse_brain_vignette.Rmd
For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of adult mouse brain cells provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website.
To download the required data, run the following lines in a shell:
wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5
wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_singlecell.csv
wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz
wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz.tbi
This vignette echoes the commands run in the introductory Signac vignette on human PBMC. We provide the same analysis in a different system to demonstrate performance and applicability to other tissue types, and to provide an example from another species.
First load in Signac, Seurat, and some other packages we will be using for analyzing mouse data.
## Warning: package 'Seurat' was built under R version 4.3.3
## Warning: package 'SeuratObject' was built under R version 4.3.3
library(EnsDb.Mmusculus.v79)
## Warning: package 'S4Vectors' was built under R version 4.3.2
## Warning: package 'GenomeInfoDb' was built under R version 4.3.3
## Warning: package 'GenomicFeatures' was built under R version 4.3.3
counts <- Read10X_h5("atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "atac_v1_adult_brain_fresh_5k_singlecell.csv",
header = TRUE,
row.names = 1
)
brain_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
genome = "mm10",
fragments = 'atac_v1_adult_brain_fresh_5k_fragments.tsv.gz',
min.cells = 1
)
## Computing hash
brain <- CreateSeuratObject(
counts = brain_assay,
assay = 'peaks',
project = 'ATAC',
meta.data = metadata
)
We can also add gene annotations to the brain
object for
the mouse genome. This will allow downstream functions to pull the gene
annotation information directly from the object.
# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "mm10"
# add the gene information to the object
Annotation(brain) <- annotations
Next we compute some useful per-cell QC metrics.
brain <- NucleosomeSignal(object = brain)
We can look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells which are outliers for the mononucleosomal/ nucleosome-free ratio have different banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.
brain$nucleosome_group <- ifelse(brain$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = brain, group.by = 'nucleosome_group', region = 'chr1-1-10000000')
The enrichment of Tn5 integration events at transcriptional start
sites (TSSs) can also be an important quality control metric to assess
the targeting of Tn5 in ATAC-seq experiments. The ENCODE consortium
defined a TSS enrichment score as the number of Tn5 integration site
around the TSS normalized to the number of Tn5 integration sites in
flanking regions. See the ENCODE documentation for more information
about the TSS enrichment score (https://www.encodeproject.org/data-standards/terms/). We
can calculate the TSS enrichment score for each cell using the
TSSEnrichment()
function in Signac.
brain <- TSSEnrichment(brain)
brain$pct_reads_in_peaks <- brain$peak_region_fragments / brain$passed_filters * 100
brain$blacklist_ratio <- brain$blacklist_region_fragments / brain$peak_region_fragments
VlnPlot(
object = brain,
features = c('pct_reads_in_peaks', 'peak_region_fragments',
'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),
pt.size = 0.1,
ncol = 5
)
We remove cells that are outliers for these QC metrics.
brain <- subset(
x = brain,
subset = peak_region_fragments > 3000 &
peak_region_fragments < 100000 &
pct_reads_in_peaks > 40 &
blacklist_ratio < 0.025 &
nucleosome_signal < 4 &
TSS.enrichment > 2
)
brain
## An object of class Seurat
## 157203 features across 3512 samples within 1 assay
## Active assay: peaks (157203 features, 0 variable features)
## 2 layers present: counts, data
brain <- RunTFIDF(brain)
brain <- FindTopFeatures(brain, min.cutoff = 'q0')
brain <- RunSVD(object = brain)
The first LSI component often captures sequencing depth (technical
variation) rather than biological variation. If this is the case, the
component should be removed from downstream analysis. We can assess the
correlation between each LSI component and sequencing depth using the
DepthCor()
function:
DepthCor(brain)
Here we see there is a very strong correlation between the first LSI component and the total number of counts for the cell, so we will perform downstream steps without this component.
Now that the cells are embedded in a low-dimensional space, we can
use methods commonly applied for the analysis of scRNA-seq data to
perform graph-based clustering, and non-linear dimension reduction for
visualization. The functions RunUMAP()
,
FindNeighbors()
, and FindClusters()
all come
from the Seurat package.
brain <- RunUMAP(
object = brain,
reduction = 'lsi',
dims = 2:30
)
brain <- FindNeighbors(
object = brain,
reduction = 'lsi',
dims = 2:30
)
brain <- FindClusters(
object = brain,
algorithm = 3,
resolution = 1.2,
verbose = FALSE
)
DimPlot(object = brain, label = TRUE) + NoLegend()
# compute gene activities
gene.activities <- GeneActivity(brain)
# add the gene activity matrix to the Seurat object as a new assay
brain[['RNA']] <- CreateAssayObject(counts = gene.activities)
brain <- NormalizeData(
object = brain,
assay = 'RNA',
normalization.method = 'LogNormalize',
scale.factor = median(brain$nCount_RNA)
)
DefaultAssay(brain) <- 'RNA'
FeaturePlot(
object = brain,
features = c('Sst','Pvalb',"Gad2","Neurod6","Rorb","Syt6"),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3
)
To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (the adult mouse brain). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here.
You can download the raw data for this experiment from the Allen Institute website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.
# Load the pre-processed scRNA-seq data
allen_rna <- readRDS("allen_brain.rds")
allen_rna <- UpdateSeuratObject(allen_rna)
allen_rna <- FindVariableFeatures(
object = allen_rna,
nfeatures = 5000
)
transfer.anchors <- FindTransferAnchors(
reference = allen_rna,
query = brain,
reduction = 'cca',
dims = 1:30
)
predicted.labels <- TransferData(
anchorset = transfer.anchors,
refdata = allen_rna$subclass,
weight.reduction = brain[['lsi']],
dims = 2:30
)
brain <- AddMetaData(object = brain, metadata = predicted.labels)
plot1 <- DimPlot(allen_rna, group.by = 'subclass', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
plot2 <- DimPlot(brain, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
plot1 + plot2
You can see that the RNA-based classifications are consistent with the UMAP visualization, computed only on the ATAC-seq data.
Here, we find differentially accessible regions between excitatory neurons in different layers of the cortex.
#switch back to working with peaks instead of gene activities
DefaultAssay(brain) <- 'peaks'
Idents(brain) <- "predicted.id"
da_peaks <- FindMarkers(
object = brain,
ident.1 = c("L2/3 IT"),
ident.2 = c("L4", "L5 IT", "L6 IT"),
test.use = 'wilcox',
min.pct = 0.1
)
head(da_peaks)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## chr4-86523678-86525285 2.421492e-81 3.691294 0.426 0.037 3.806657e-76
## chr2-118700082-118704897 2.118661e-79 2.092487 0.648 0.182 3.330599e-74
## chr3-137056475-137058371 9.591220e-71 2.237328 0.536 0.113 1.507769e-65
## chr15-87605281-87607659 9.057748e-70 2.450827 0.499 0.097 1.423905e-64
## chr10-107751762-107753240 4.487524e-67 1.801355 0.632 0.192 7.054523e-62
## chr19-28380552-28383664 6.361896e-65 2.414502 0.476 0.092 1.000109e-59
plot1 <- VlnPlot(
object = brain,
features = rownames(da_peaks)[1],
pt.size = 0.1,
idents = c("L4","L5 IT","L2/3 IT")
)
plot2 <- FeaturePlot(
object = brain,
features = rownames(da_peaks)[1],
pt.size = 0.1,
max.cutoff = 'q95'
)
plot1 | plot2
open_l23 <- rownames(da_peaks[da_peaks$avg_log2FC > 3, ])
open_l456 <- rownames(da_peaks[da_peaks$avg_log2FC < 3, ])
closest_l23 <- ClosestFeature(brain, open_l23)
closest_l456 <- ClosestFeature(brain, open_l456)
head(closest_l23)
## tx_id gene_name gene_id
## ENSMUST00000151481 ENSMUST00000151481 Fam154a ENSMUSG00000028492
## ENSMUST00000131864 ENSMUST00000131864 Gm12796 ENSMUSG00000085721
## ENSMUSE00001329193 ENSMUST00000185379 Gm29414 ENSMUSG00000099392
## ENSMUST00000161356 ENSMUST00000161356 Reln ENSMUSG00000042453
## ENSMUST00000139527 ENSMUST00000139527 Yipf1 ENSMUSG00000057375
## ENSMUSE00000514286 ENSMUST00000077353 Hmbs ENSMUSG00000032126
## gene_biotype type closest_region
## ENSMUST00000151481 protein_coding gap chr4-86487920-86538964
## ENSMUST00000131864 lincRNA gap chr4-101292521-101318425
## ENSMUSE00001329193 lincRNA exon chr1-25026581-25026779
## ENSMUST00000161356 protein_coding gap chr5-21891568-21895988
## ENSMUST00000139527 protein_coding cds chr4-107345009-107345191
## ENSMUSE00000514286 protein_coding exon chr9-44344010-44344228
## query_region distance
## ENSMUST00000151481 chr4-86523678-86525285 0
## ENSMUST00000131864 chr4-101303935-101305131 0
## ENSMUSE00001329193 chr1-25008426-25009334 17246
## ENSMUST00000161356 chr5-21894051-21894682 0
## ENSMUST00000139527 chr4-107344435-107345145 0
## ENSMUSE00000514286 chr9-44345250-44346015 1021
head(closest_l456)
## tx_id gene_name gene_id
## ENSMUST00000104937 ENSMUST00000104937 Ankrd63 ENSMUSG00000078137
## ENSMUST00000070198 ENSMUST00000070198 Ppp3ca ENSMUSG00000028161
## ENSMUSE00000647021 ENSMUST00000068088 Fam19a5 ENSMUSG00000054863
## ENSMUST00000165341 ENSMUST00000165341 Otogl ENSMUSG00000091455
## ENSMUST00000162022 ENSMUST00000162022 Glis3 ENSMUSG00000052942
## ENSMUST00000179893 ENSMUST00000179893 Ryr1 ENSMUSG00000030592
## gene_biotype type closest_region
## ENSMUST00000104937 protein_coding cds chr2-118702266-118703438
## ENSMUST00000070198 protein_coding utr chr3-136935226-136937727
## ENSMUSE00000647021 protein_coding exon chr15-87625230-87625486
## ENSMUST00000165341 protein_coding utr chr10-107762223-107762309
## ENSMUST00000162022 protein_coding gap chr19-28357953-28530872
## ENSMUST00000179893 protein_coding cds chr7-29073019-29073139
## query_region distance
## ENSMUST00000104937 chr2-118700082-118704897 0
## ENSMUST00000070198 chr3-137056475-137058371 118747
## ENSMUSE00000647021 chr15-87605281-87607659 17570
## ENSMUST00000165341 chr10-107751762-107753240 8982
## ENSMUST00000162022 chr19-28380552-28383664 0
## ENSMUST00000179893 chr7-29070299-29073172 0
We can also create coverage plots grouped by cluster, cell type, or
any other metadata stored in the object for any genomic region using the
CoveragePlot()
function. These represent pseudo-bulk
accessibility tracks, where signal from all cells within a group have
been averaged together to visualize the DNA accessibility in a
region.
# show cell types with at least 50 cells
idents.plot <- names(which(table(Idents(brain)) > 50))
brain <- SortIdents(brain)
CoveragePlot(
object = brain,
region = c("Neurod6", "Gad2"),
idents = idents.plot,
extend.upstream = 1000,
extend.downstream = 1000,
ncol = 1
)
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Singapore
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.2.0 ggplot2_3.5.1
## [3] EnsDb.Mmusculus.v79_2.99.0 ensembldb_2.26.0
## [5] AnnotationFilter_1.26.0 GenomicFeatures_1.54.4
## [7] AnnotationDbi_1.64.1 Biobase_2.62.0
## [9] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
## [11] IRanges_2.36.0 S4Vectors_0.40.2
## [13] BiocGenerics_0.48.1 Seurat_5.1.0
## [15] SeuratObject_5.0.2 sp_2.1-4
## [17] Signac_1.14.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.4 ProtGenerics_1.34.0
## [3] matrixStats_1.3.0 spatstat.sparse_3.0-3
## [5] bitops_1.0-8 httr_1.4.7
## [7] RColorBrewer_1.1-3 tools_4.3.1
## [9] sctransform_0.4.1 backports_1.5.0
## [11] utf8_1.2.4 R6_2.5.1
## [13] lazyeval_0.2.2 uwot_0.1.16
## [15] withr_3.0.1 prettyunits_1.2.0
## [17] gridExtra_2.3 progressr_0.14.0
## [19] cli_3.6.3 textshaping_0.4.0
## [21] spatstat.explore_3.2-7 fastDummies_1.7.3
## [23] labeling_0.4.3 sass_0.4.9
## [25] spatstat.data_3.0-4 ggridges_0.5.6
## [27] pbapply_1.7-2 pkgdown_2.0.9
## [29] Rsamtools_2.18.0 systemfonts_1.1.0
## [31] foreign_0.8-86 dichromat_2.0-0.1
## [33] parallelly_1.38.0 limma_3.58.1
## [35] BSgenome_1.70.2 rstudioapi_0.16.0
## [37] RSQLite_2.3.7 generics_0.1.3
## [39] BiocIO_1.12.0 ica_1.0-3
## [41] spatstat.random_3.2-3 dplyr_1.1.4
## [43] Matrix_1.6-5 ggbeeswarm_0.7.2
## [45] fansi_1.0.6 abind_1.4-5
## [47] lifecycle_1.0.4 yaml_2.3.10
## [49] SummarizedExperiment_1.32.0 SparseArray_1.2.4
## [51] BiocFileCache_2.10.2 Rtsne_0.17
## [53] grid_4.3.1 blob_1.2.4
## [55] promises_1.3.0 crayon_1.5.3
## [57] miniUI_0.1.1.1 lattice_0.22-6
## [59] cowplot_1.1.3 KEGGREST_1.42.0
## [61] pillar_1.9.0 knitr_1.48
## [63] rjson_0.2.21 future.apply_1.11.2
## [65] codetools_0.2-19 fastmatch_1.1-4
## [67] leiden_0.4.3.1 glue_1.7.0
## [69] data.table_1.15.4 vctrs_0.6.5
## [71] png_0.1-8 spam_2.10-0
## [73] gtable_0.3.5 cachem_1.1.0
## [75] xfun_0.47 S4Arrays_1.2.1
## [77] mime_0.12 survival_3.6-4
## [79] RcppRoll_0.3.1 statmod_1.5.0
## [81] fitdistrplus_1.1-11 ROCR_1.0-11
## [83] nlme_3.1-164 bit64_4.0.5
## [85] progress_1.2.3 filelock_1.0.3
## [87] RcppAnnoy_0.0.22 bslib_0.8.0
## [89] irlba_2.3.5.1 vipor_0.4.7
## [91] KernSmooth_2.23-24 rpart_4.1.23
## [93] colorspace_2.1-1 DBI_1.2.3
## [95] Hmisc_5.1-3 nnet_7.3-19
## [97] ggrastr_1.0.2 tidyselect_1.2.1
## [99] bit_4.0.5 compiler_4.3.1
## [101] curl_5.2.1 htmlTable_2.4.3
## [103] hdf5r_1.3.10 xml2_1.3.6
## [105] desc_1.4.3 DelayedArray_0.28.0
## [107] plotly_4.10.4 rtracklayer_1.62.0
## [109] checkmate_2.3.2 scales_1.3.0
## [111] lmtest_0.9-40 rappdirs_0.3.3
## [113] stringr_1.5.1 digest_0.6.37
## [115] goftest_1.2-3 presto_1.0.0
## [117] spatstat.utils_3.0-4 rmarkdown_2.28
## [119] XVector_0.42.0 htmltools_0.5.8.1
## [121] pkgconfig_2.0.3 base64enc_0.1-3
## [123] MatrixGenerics_1.14.0 highr_0.11
## [125] dbplyr_2.5.0 fastmap_1.2.0
## [127] rlang_1.1.4 htmlwidgets_1.6.4
## [129] shiny_1.9.1 farver_2.1.2
## [131] jquerylib_0.1.4 zoo_1.8-12
## [133] jsonlite_1.8.8 BiocParallel_1.36.0
## [135] VariantAnnotation_1.48.1 RCurl_1.98-1.16
## [137] magrittr_2.0.3 Formula_1.2-5
## [139] GenomeInfoDbData_1.2.11 dotCall64_1.1-1
## [141] munsell_0.5.1 Rcpp_1.0.13
## [143] reticulate_1.37.0 stringi_1.8.4
## [145] zlibbioc_1.48.2 MASS_7.3-60.0.1
## [147] plyr_1.8.9 parallel_4.3.1
## [149] listenv_0.9.1 ggrepel_0.9.5
## [151] deldir_2.0-4 Biostrings_2.70.3
## [153] splines_4.3.1 tensor_1.5
## [155] hms_1.1.3 igraph_2.0.3
## [157] spatstat.geom_3.2-9 RcppHNSW_0.6.0
## [159] reshape2_1.4.4 biomaRt_2.58.2
## [161] XML_3.99-0.17 evaluate_0.24.0
## [163] biovizBase_1.50.0 httpuv_1.6.15
## [165] RANN_2.6.1 tidyr_1.3.1
## [167] purrr_1.0.2 polyclip_1.10-6
## [169] future_1.34.0 scattermore_1.2
## [171] xtable_1.8-4 restfulr_0.0.15
## [173] RSpectra_0.16-2 later_1.3.2
## [175] viridisLite_0.4.2 ragg_1.3.2
## [177] tibble_3.2.1 beeswarm_0.4.0
## [179] memoise_2.0.1 GenomicAlignments_1.38.2
## [181] cluster_2.1.6 globals_0.16.3