For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of adult mouse brain cells provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website.

View data download code

To download the required data, run the following lines in a shell:

wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5
wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_singlecell.csv
wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz
wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_adult_brain_fresh_5k/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz.tbi

This vignette echoes the commands run in the introductory Signac vignette on human PBMC. We provide the same analysis in a different system to demonstrate performance and applicability to other tissue types, and to provide an example from another species.

First load in Signac, Seurat, and some other packages we will be using for analyzing mouse data.

Pre-processing workflow

counts <- Read10X_h5("../vignette_data/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "../vignette_data/atac_v1_adult_brain_fresh_5k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

brain_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  genome = "mm10",
  fragments = '../vignette_data/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz',
  min.cells = 1
)
## Computing hash
brain <- CreateSeuratObject(
  counts = brain_assay,
  assay = 'peaks',
  project = 'ATAC',
  meta.data = metadata
)
## Warning in CreateSeuratObject.Assay(counts = brain_assay, assay = "peaks", :
## Some cells in meta.data not present in provided counts matrix.
## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from peaks to peaks_

We can also add gene annotations to the brain object for the mouse genome. This will allow downstream functions to pull the gene annotation information directly from the object.

# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)

# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "mm10"

# add the gene information to the object
Annotation(brain) <- annotations

Computing QC Metrics

Next we compute some useful per-cell QC metrics.

brain <- NucleosomeSignal(object = brain)

We can look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells which are outliers for the mononucleosomal/ nucleosome-free ratio have different banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.

brain$nucleosome_group <- ifelse(brain$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = brain, group.by = 'nucleosome_group', region = 'chr1-1-10000000')

The enrichment of Tn5 integration events at transcriptional start sites (TSSs) can also be an important quality control metric to assess the targeting of Tn5 in ATAC-seq experiments. The ENCODE consortium defined a TSS enrichment score as the number of Tn5 integration site around the TSS normalized to the number of Tn5 integration sites in flanking regions. See the ENCODE documentation for more information about the TSS enrichment score (https://www.encodeproject.org/data-standards/terms/). We can calculate the TSS enrichment score for each cell using the TSSEnrichment() function in Signac.

brain <- TSSEnrichment(brain, fast = FALSE)
brain$high.tss <- ifelse(brain$TSS.enrichment > 2, 'High', 'Low')
TSSPlot(brain, group.by = 'high.tss') + NoLegend()

brain$pct_reads_in_peaks <- brain$peak_region_fragments / brain$passed_filters * 100
brain$blacklist_ratio <- brain$blacklist_region_fragments / brain$peak_region_fragments

VlnPlot(
  object = brain,
  features = c('pct_reads_in_peaks', 'peak_region_fragments',
               'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),
  pt.size = 0.1,
  ncol = 5
)

We remove cells that are outliers for these QC metrics.

brain <- subset(
  x = brain,
  subset = peak_region_fragments > 3000 &
    peak_region_fragments < 100000 &
    pct_reads_in_peaks > 40 &
    blacklist_ratio < 0.025 &
    nucleosome_signal < 4 &
    TSS.enrichment > 2
)
brain
## An object of class Seurat 
## 157203 features across 3512 samples within 1 assay 
## Active assay: peaks (157203 features, 0 variable features)

Normalization and linear dimensional reduction

brain <- RunTFIDF(brain)
brain <- FindTopFeatures(brain, min.cutoff = 'q0')
brain <- RunSVD(object = brain)

The first LSI component often captures sequencing depth (technical variation) rather than biological variation. If this is the case, the component should be removed from downstream analysis. We can assess the correlation between each LSI component and sequencing depth using the DepthCor() function:

DepthCor(brain)

Here we see there is a very strong correlation between the first LSI component and the total number of counts for the cell, so we will perform downstream steps without this component.

Non-linear dimension reduction and clustering

Now that the cells are embedded in a low-dimensional space, we can use methods commonly applied for the analysis of scRNA-seq data to perform graph-based clustering, and non-linear dimension reduction for visualization. The functions RunUMAP(), FindNeighbors(), and FindClusters() all come from the Seurat package.

brain <- RunUMAP(
  object = brain,
  reduction = 'lsi',
  dims = 2:30
)
brain <- FindNeighbors(
  object = brain,
  reduction = 'lsi',
  dims = 2:30
)
brain <- FindClusters(
  object = brain,
  algorithm = 3,
  resolution = 1.2,
  verbose = FALSE
)

DimPlot(object = brain, label = TRUE) + NoLegend()

Create a gene activity matrix

# compute gene activities
gene.activities <- GeneActivity(brain)

# add the gene activity matrix to the Seurat object as a new assay
brain[['RNA']] <- CreateAssayObject(counts = gene.activities)
brain <- NormalizeData(
  object = brain,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(brain$nCount_RNA)
)
DefaultAssay(brain) <- 'RNA'
FeaturePlot(
  object = brain,
  features = c('Sst','Pvalb',"Gad2","Neurod6","Rorb","Syt6"),
  pt.size = 0.1,
  max.cutoff = 'q95',
  ncol = 3
)

Integrating with scRNA-seq data

To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (the adult mouse brain). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here.

You can download the raw data for this experiment from the Allen Institute website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.

# Load the pre-processed scRNA-seq data
allen_rna <- readRDS("../vignette_data/allen_brain.rds")
allen_rna <- FindVariableFeatures(
  object = allen_rna,
  nfeatures = 5000
)

transfer.anchors <- FindTransferAnchors(
  reference = allen_rna,
  query = brain,
  reduction = 'cca',
  dims = 1:40
)

predicted.labels <- TransferData(
  anchorset = transfer.anchors,
  refdata = allen_rna$subclass,
  weight.reduction = brain[['lsi']],
  dims = 2:30
)

brain <- AddMetaData(object = brain, metadata = predicted.labels)
plot1 <- DimPlot(allen_rna, group.by = 'subclass', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
plot2 <- DimPlot(brain, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
plot1 + plot2

Why did we change default parameters?

We changed default parameters for FindIntegrationAnchors() and FindVariableFeatures() (including more features and dimensions). You can run the analysis both ways, and observe very similar results. However, when using default parameters we mislabel cluster 11 cells as Vip-interneurons, when they are in fact a Meis2 expressing CGE-derived interneuron population recently described by us and others. The reason is that this subset is exceptionally rare in the scRNA-seq data (0.3%), and so the genes define this subset (for example, Meis2) were too lowly expressed to be selected in the initial set of variable features. We therefore need more genes and dimensions to facilitate cross-modality mapping. Interestingly, this subset is 10-fold more abundant in the scATAC-seq data compared to the scRNA-seq data (see this paper for possible explanations.)

You can see that the RNA-based classifications are entirely consistent with the UMAP visualization, computed only on the ATAC-seq data. We can now easily annotate our scATAC-seq derived clusters (alternately, we could use the RNA classifications themselves). We note three small clusters (13, 20, 21) which represent subdivisions of the scRNA-seq labels. Try transferring the cluster label (which shows finer distinctions) from the allen scRNA-seq dataset, to annotate them!

# replace each label with its most likely prediction
for(i in levels(brain)) {
  cells_to_reid <- WhichCells(brain, idents = i)
  newid <- names(which.max(table(brain$predicted.id[cells_to_reid])))
  Idents(brain, cells = cells_to_reid) <- newid
}

Find differentially accessible peaks between clusters

Here, we find differentially accessible regions between excitatory neurons in different layers of the cortex.

#switch back to working with peaks instead of gene activities
DefaultAssay(brain) <- 'peaks'

da_peaks <- FindMarkers(
  object = brain,
  ident.1 = c("L2/3 IT"), 
  ident.2 = c("L4", "L5 IT", "L6 IT"),
  test.use = 'LR',
  latent.vars = 'nCount_peaks'
)

head(da_peaks)
##                                 p_val avg_log2FC pct.1 pct.2    p_val_adj
## chr4-86523678-86525285   1.874566e-67   3.762540 0.401 0.036 2.946874e-62
## chr15-87605281-87607659  2.026033e-59   2.602410 0.485 0.090 3.184984e-54
## chr2-118700082-118704897 2.671349e-57   2.045740 0.622 0.180 4.199441e-52
## chr4-101303935-101305131 4.756647e-54   3.736644 0.340 0.025 7.477592e-49
## chr13-69329933-69331707  4.711909e-53  -2.250185 0.143 0.445 7.407262e-48
## chr3-137056475-137058371 6.221641e-53   2.383564 0.523 0.103 9.780607e-48
plot1 <- VlnPlot(
  object = brain,
  features = rownames(da_peaks)[1],
  pt.size = 0.1,
  idents = c("L4","L5 IT","L2/3 IT")
)
plot2 <- FeaturePlot(
  object = brain,
  features = rownames(da_peaks)[1],
  pt.size = 0.1,
  max.cutoff = 'q95'
)
plot1 | plot2

open_l23 <- rownames(da_peaks[da_peaks$avg_log2FC > 3, ])
open_l456 <- rownames(da_peaks[da_peaks$avg_log2FC < 3, ])
closest_l23 <- ClosestFeature(brain, open_l23)
closest_l456 <- ClosestFeature(brain, open_l456)
head(closest_l23)
##                                 tx_id gene_name            gene_id
## ENSMUST00000151481 ENSMUST00000151481   Fam154a ENSMUSG00000028492
## ENSMUST00000131864 ENSMUST00000131864   Gm12796 ENSMUSG00000085721
## ENSMUST00000161356 ENSMUST00000161356      Reln ENSMUSG00000042453
## ENSMUSE00000847481 ENSMUST00000158346   Gm23764 ENSMUSG00000088971
## ENSMUSE00001329193 ENSMUST00000185379   Gm29414 ENSMUSG00000099392
## ENSMUST00000139527 ENSMUST00000139527     Yipf1 ENSMUSG00000057375
##                      gene_biotype type            closest_region
## ENSMUST00000151481 protein_coding  gap    chr4-86487920-86538964
## ENSMUST00000131864        lincRNA  gap  chr4-101292521-101318425
## ENSMUST00000161356 protein_coding  gap    chr5-21891568-21895988
## ENSMUSE00000847481           rRNA exon chr14-102806099-102806146
## ENSMUSE00001329193        lincRNA exon    chr1-25026581-25026779
## ENSMUST00000139527 protein_coding  cds  chr4-107345009-107345191
##                                 query_region distance
## ENSMUST00000151481    chr4-86523678-86525285        0
## ENSMUST00000131864  chr4-101303935-101305131        0
## ENSMUST00000161356    chr5-21894051-21894682        0
## ENSMUSE00000847481 chr14-102878306-102879118    72159
## ENSMUSE00001329193    chr1-25008426-25009334    17246
## ENSMUST00000139527  chr4-107344435-107345145        0
head(closest_l456)
##                                 tx_id gene_name            gene_id
## ENSMUSE00000647021 ENSMUST00000068088   Fam19a5 ENSMUSG00000054863
## ENSMUST00000104937 ENSMUST00000104937   Ankrd63 ENSMUSG00000078137
## ENSMUST00000044081 ENSMUST00000044081     Papd7 ENSMUSG00000034575
## ENSMUST00000070198 ENSMUST00000070198    Ppp3ca ENSMUSG00000028161
## ENSMUST00000165341 ENSMUST00000165341     Otogl ENSMUSG00000091455
## ENSMUST00000179893 ENSMUST00000179893      Ryr1 ENSMUSG00000030592
##                      gene_biotype type            closest_region
## ENSMUSE00000647021 protein_coding exon   chr15-87625230-87625486
## ENSMUST00000104937 protein_coding  cds  chr2-118702266-118703438
## ENSMUST00000044081 protein_coding  utr   chr13-69497959-69499915
## ENSMUST00000070198 protein_coding  utr  chr3-136935226-136937727
## ENSMUST00000165341 protein_coding  utr chr10-107762223-107762309
## ENSMUST00000179893 protein_coding  cds    chr7-29073019-29073139
##                                 query_region distance
## ENSMUSE00000647021   chr15-87605281-87607659    17570
## ENSMUST00000104937  chr2-118700082-118704897        0
## ENSMUST00000044081   chr13-69329933-69331707   166251
## ENSMUST00000070198  chr3-137056475-137058371   118747
## ENSMUST00000165341 chr10-107751762-107753240     8982
## ENSMUST00000179893    chr7-29070299-29073172        0

Plotting genomic regions

We can also create coverage plots grouped by cluster, cell type, or any other metadata stored in the object for any genomic region using the CoveragePlot() function. These represent pseudo-bulk accessibility tracks, where signal from all cells within a group have been averaged together to visualize the DNA accessibility in a region.

# set plotting order
levels(brain) <- c("L2/3 IT","L4","L5 IT","L5 PT","L6 CT", "L6 IT","NP","Sst","Pvalb","Vip","Lamp5","Meis2","Oligo","Astro","Endo","VLMC","Macrophage")

CoveragePlot(
  object = brain,
  region = c("Neurod6", "Gad2"),
  extend.upstream = 1000,
  extend.downstream = 1000,
  ncol = 1
)

Session Info
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.0
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Singapore
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.1.3            ggplot2_3.4.4             
##  [3] EnsDb.Mmusculus.v79_2.99.0 ensembldb_2.24.1          
##  [5] AnnotationFilter_1.24.0    GenomicFeatures_1.52.2    
##  [7] AnnotationDbi_1.62.2       Biobase_2.60.0            
##  [9] GenomicRanges_1.52.0       GenomeInfoDb_1.36.3       
## [11] IRanges_2.34.1             S4Vectors_0.38.2          
## [13] BiocGenerics_0.46.0        SeuratObject_4.1.4        
## [15] Seurat_4.4.0               Signac_1.11.0             
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.21            splines_4.3.1              
##   [3] later_1.3.1                 BiocIO_1.10.0              
##   [5] bitops_1.0-7                filelock_1.0.2             
##   [7] tibble_3.2.1                polyclip_1.10-6            
##   [9] rpart_4.1.19                XML_3.99-0.14              
##  [11] lifecycle_1.0.3             rprojroot_2.0.3            
##  [13] hdf5r_1.3.8                 globals_0.16.2             
##  [15] lattice_0.21-8              MASS_7.3-60                
##  [17] backports_1.4.1             magrittr_2.0.3             
##  [19] Hmisc_5.1-1                 plotly_4.10.2              
##  [21] sass_0.4.7                  rmarkdown_2.25             
##  [23] jquerylib_0.1.4             yaml_2.3.7                 
##  [25] httpuv_1.6.11               sctransform_0.4.0          
##  [27] spatstat.sparse_3.0-2       sp_2.1-0                   
##  [29] reticulate_1.32.0           cowplot_1.1.1              
##  [31] pbapply_1.7-2               DBI_1.1.3                  
##  [33] RColorBrewer_1.1-3          abind_1.4-5                
##  [35] zlibbioc_1.46.0             Rtsne_0.16                 
##  [37] purrr_1.0.2                 biovizBase_1.48.0          
##  [39] RCurl_1.98-1.12             nnet_7.3-19                
##  [41] VariantAnnotation_1.46.0    rappdirs_0.3.3             
##  [43] GenomeInfoDbData_1.2.10     ggrepel_0.9.3              
##  [45] irlba_2.3.5.1               listenv_0.9.0              
##  [47] spatstat.utils_3.0-3        goftest_1.2-3              
##  [49] spatstat.random_3.1-6       fitdistrplus_1.1-11        
##  [51] parallelly_1.36.0           pkgdown_2.0.7              
##  [53] leiden_0.4.3                codetools_0.2-19           
##  [55] DelayedArray_0.26.7         RcppRoll_0.3.0             
##  [57] xml2_1.3.5                  tidyselect_1.2.0           
##  [59] farver_2.1.1                base64enc_0.1-3            
##  [61] matrixStats_1.0.0           BiocFileCache_2.8.0        
##  [63] spatstat.explore_3.2-3      GenomicAlignments_1.36.0   
##  [65] jsonlite_1.8.7              Formula_1.2-5              
##  [67] ellipsis_0.3.2              progressr_0.14.0           
##  [69] ggridges_0.5.4              survival_3.5-5             
##  [71] systemfonts_1.0.5           tools_4.3.1                
##  [73] progress_1.2.2              ragg_1.2.6                 
##  [75] ica_1.0-3                   Rcpp_1.0.11                
##  [77] glue_1.6.2                  gridExtra_2.3              
##  [79] xfun_0.40                   MatrixGenerics_1.12.3      
##  [81] dplyr_1.1.3                 withr_2.5.1                
##  [83] fastmap_1.1.1               fansi_1.0.5                
##  [85] digest_0.6.33               R6_2.5.1                   
##  [87] mime_0.12                   textshaping_0.3.7          
##  [89] colorspace_2.1-0            scattermore_1.2            
##  [91] tensor_1.5                  dichromat_2.0-0.1          
##  [93] spatstat.data_3.0-1         biomaRt_2.56.1             
##  [95] RSQLite_2.3.1               utf8_1.2.3                 
##  [97] tidyr_1.3.0                 generics_0.1.3             
##  [99] data.table_1.14.8           rtracklayer_1.60.1         
## [101] prettyunits_1.2.0           httr_1.4.7                 
## [103] htmlwidgets_1.6.2           S4Arrays_1.0.6             
## [105] uwot_0.1.16                 pkgconfig_2.0.3            
## [107] gtable_0.3.4                blob_1.2.4                 
## [109] lmtest_0.9-40               XVector_0.40.0             
## [111] htmltools_0.5.6.1           ProtGenerics_1.32.0        
## [113] scales_1.2.1                png_0.1-8                  
## [115] knitr_1.44                  rstudioapi_0.15.0          
## [117] reshape2_1.4.4              rjson_0.2.21               
## [119] checkmate_2.2.0             nlme_3.1-162               
## [121] curl_5.1.0                  cachem_1.0.8               
## [123] zoo_1.8-12                  stringr_1.5.0              
## [125] KernSmooth_2.23-21          vipor_0.4.5                
## [127] parallel_4.3.1              miniUI_0.1.1.1             
## [129] foreign_0.8-84              ggrastr_1.0.2              
## [131] restfulr_0.0.15             desc_1.4.2                 
## [133] pillar_1.9.0                grid_4.3.1                 
## [135] vctrs_0.6.3                 RANN_2.6.1                 
## [137] promises_1.2.1              dbplyr_2.3.4               
## [139] xtable_1.8-4                cluster_2.1.4              
## [141] beeswarm_0.4.0              htmlTable_2.4.1            
## [143] evaluate_0.22               cli_3.6.1                  
## [145] compiler_4.3.1              Rsamtools_2.16.0           
## [147] rlang_1.1.1                 crayon_1.5.2               
## [149] future.apply_1.11.0         labeling_0.4.3             
## [151] ggbeeswarm_0.7.2            plyr_1.8.9                 
## [153] fs_1.6.3                    stringi_1.7.12             
## [155] deldir_1.0-9                viridisLite_0.4.2          
## [157] BiocParallel_1.34.2         munsell_0.5.0              
## [159] Biostrings_2.68.1           lazyeval_0.2.2             
## [161] spatstat.geom_3.2-5         Matrix_1.6-1.1             
## [163] BSgenome_1.68.0             hms_1.1.3                  
## [165] bit64_4.0.5                 future_1.33.0              
## [167] KEGGREST_1.40.1             shiny_1.7.5                
## [169] SummarizedExperiment_1.30.2 ROCR_1.0-11                
## [171] igraph_1.5.1                memoise_2.0.1              
## [173] bslib_0.5.1                 fastmatch_1.1-4            
## [175] bit_4.0.5