Analyzing adult mouse brain scATAC-seq

For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of adult mouse brain cells provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website:

The Raw data
The Metadata
The fragments file
The fragments file index

This vignette echoes the commands run in the introductory Signac vignette on human PBMC. We provide a secondary analysis in a second system to demonstrate performance on a second system, and to provide a non-human example.

First load in Signac, Seurat, and some other packages we will be using for analyzing human data.

library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Mmusculus.v79)
library(ggplot2)
library(patchwork)
set.seed(1234)

Pre-processing workflow

counts <- Read10X_h5("/home/stuartt/github/chrom/vignette_data/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "/home/stuartt/github/chrom/vignette_data/atac_v1_adult_brain_fresh_5k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

brain <- CreateSeuratObject(
  counts = counts,
  assay = 'peaks',
  project = 'ATAC',
  min.cells = 1,
  meta.data = metadata
)

## Warning in CreateSeuratObject.default(counts = counts, assay = "peaks", : Some
## cells in meta.data not present in provided counts matrix

fragment.path <- '/home/stuartt/github/chrom/vignette_data/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz'

brain <- SetFragments(
  object = brain,
  file = fragment.path
)

Optional step: Filtering the fragment file

To make downstream steps that use this file faster, we can filter the fragments file to contain only reads from cells that we retain in the analysis. This is optional, but slow, and only needs to be performed once. Running this command writes a new file to disk and indexes the file so it is ready to be used by Signac.

fragment_file_filtered <- "/home/stuartt/github/chrom/vignette_data/atac_v1_adult_brain_fresh_5k_fragments_filtered.tsv"

FilterFragments(
  fragment.path = fragment.path,
  cells = colnames(brain),
  output.path = fragment_file_filtered
)

brain <- SetFragments(object = brain, file = paste0(fragment_file_filtered, '.bgz'))

Computing QC Metrics

brain <- NucleosomeSignal(object = brain)

brain$pct_reads_in_peaks <- brain$peak_region_fragments / brain$passed_filters * 100
brain$blacklist_ratio <- brain$blacklist_region_fragments / brain$peak_region_fragments

p1 <- VlnPlot(brain, c('pct_reads_in_peaks', 'peak_region_fragments'), pt.size = 0.1)
p2 <- VlnPlot(brain, c('blacklist_ratio', 'nucleosome_signal'), pt.size = 0.1) & scale_y_log10()

p1 | p2

We can also look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells which are outliers for the mononucleosomal/ nucleosome-free ratio (based off the plots above) have different banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.

brain$nucleosome_group <- ifelse(brain$nucleosome_signal > 10, 'NS > 10', 'NS < 10')
FragmentHistogram(object = brain, group.by = 'nucleosome_group', region = 'chr1-1-10000000')

The enrichment of Tn5 integration events at transcriptional start sites (TSSs) can also be an important quality control metric to assess the targeting of Tn5 in ATAC-seq experiments. The ENCODE consortium defined a TSS enrichment score as the number of Tn5 integration site around the TSS normalized to the number of Tn5 integration sites in flanking regions. See the ENCODE documentation for more information about the TSS enrichment score (https://www.encodeproject.org/data-standards/terms/). We can calculate the TSS enrichment score for each cell using the TSSEnrichment function in Signac.

# create granges object with TSS positions
gene.ranges <- genes(EnsDb.Mmusculus.v79)
seqlevelsStyle(gene.ranges) <- 'UCSC'
gene.ranges <- gene.ranges[gene.ranges$gene_biotype == 'protein_coding', ]
gene.ranges <- keepStandardChromosomes(gene.ranges, pruning.mode = 'coarse')

tss.ranges <- resize(gene.ranges, width = 1, fix = "start")
seqlevelsStyle(tss.ranges) <- 'UCSC'
tss.ranges <- keepStandardChromosomes(tss.ranges, pruning.mode = 'coarse')

# to save time use the first 2000 TSSs
brain <- TSSEnrichment(object = brain, tss.positions = tss.ranges[1:2000])

brain$high.tss <- ifelse(brain$TSS.enrichment > 2, 'High', 'Low')
TSSPlot(brain, group.by = 'high.tss') + ggtitle("TSS enrichment score") + NoLegend()

We remove cells that are outliers for these QC metrics.

brain <- subset(
  x = brain,
  subset = peak_region_fragments > 3000 &
    peak_region_fragments < 100000 &
    pct_reads_in_peaks > 15 &
    blacklist_ratio < 0.025 &
    nucleosome_signal < 10 &
    TSS.enrichment > 2
)
brain

## An object of class Seurat 
## 157203 features across 3522 samples within 1 assay 
## Active assay: peaks (157203 features, 0 variable features)

Normalization and linear dimensional reduction

brain <- RunTFIDF(brain)
brain <- FindTopFeatures(brain, min.cutoff = 'q0')
brain <- RunSVD(
  object = brain,
  assay = 'peaks',
  reduction.key = 'LSI_',
  reduction.name = 'lsi'
)

The first LSI component often captures sequencing depth (technical variation) rather than biological variation. If this is the case, the component should be removed from downstream analysis. We can assess the correlation between each LSI component and sequencing depth using the DepthCor function:

DepthCor(brain)

Here we see there is a very strong correlation between the first LSI component and the total number of counts for the cell, so we will perform downstream steps without this component.

Non-linear dimension reduction and clustering

Now that the cells are embedded in a low-dimensional space, we can use methods commonly applied for the analysis of scRNA-seq data to perform graph-based clustering, and non-linear dimension reduction for visualization. The functions RunUMAP, FindNeighbors, and FindClusters all come from the Seurat package.

brain <- RunUMAP(
  object = brain,
  reduction = 'lsi',
  dims = 2:30
)
brain <- FindNeighbors(
  object = brain,
  reduction = 'lsi',
  dims = 2:30
)
brain <- FindClusters(
  object = brain,
  algorithm = 3,
  resolution = 1.2,
  verbose = FALSE
)

DimPlot(object = brain, label = TRUE) + NoLegend()

Create a gene activity matrix

# Extend coordinates upstream to include the promoter
genebodyandpromoter.coords <- Extend(x = gene.ranges, upstream = 2000, downstream = 0)

# build a gene by cell matrix
gene.activities <- FeatureMatrix(
  fragments = fragment.path,
  features = genebodyandpromoter.coords,
  cells = colnames(brain),
  chunk = 10
)

# convert rownames from chromsomal coordinates into gene names
gene.key <- genebodyandpromoter.coords$gene_name
names(gene.key) <- GRangesToString(grange = genebodyandpromoter.coords)
rownames(gene.activities) <- make.unique(gene.key[rownames(gene.activities)])
gene.activities <- gene.activities[rownames(gene.activities)!="",]

#Add the gene activity matrix to the Seurat object as a new assay, and normalize it
brain[['RNA']] <- CreateAssayObject(counts = gene.activities)
brain <- NormalizeData(
  object = brain,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(brain$nCount_RNA)
)

DefaultAssay(brain) <- 'RNA'
FeaturePlot(
  object = brain,
  features = c('Sst','Pvalb',"Gad2","Neurod6","Rorb","Syt6"),
  pt.size = 0.1,
  max.cutoff = 'q95',
  ncol = 3
)

Integrating with scRNA-seq data

To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (the adult mouse brain). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here.

You can download the raw data for this experiment from the Allen Institute website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.

# Load the pre-processed scRNA-seq data
allen_rna <- readRDS("/home/stuartt/github/chrom/vignette_data/allen_brain.rds")
allen_rna <- FindVariableFeatures(
  object = allen_rna,
  nfeatures = 5000
)

transfer.anchors <- FindTransferAnchors(
  reference = allen_rna,
  query = brain,
  reduction = 'cca',
  dims = 1:40
)

predicted.labels <- TransferData(
  anchorset = transfer.anchors,
  refdata = allen_rna$subclass,
  weight.reduction = brain[['lsi']],
  dims = 2:30
)

brain <- AddMetaData(object = brain, metadata = predicted.labels)

plot1 <- DimPlot(allen_rna, group.by = 'subclass', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
plot2 <- DimPlot(brain, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
plot1 + plot2

Why did we change default parameters?

We changed default parameters for FindIntegrationAnchors and FindVariableFeatures (including more features and dimensions). You can run the analysis both ways, and observe very similar results. However, when using default parameters we mislabel cluster 11 cells as Vip-interneurons, when they are in fact a Meis2 expressing CGE-derived interneuron population recently described by us and others. The reason is that this subset is exceptionally rare in the scRNA-seq data (0.3%), and so the genes define this subset (for example, Meis2) were too lowly expressed to be selected in the initial set of variable features. We therefore need more genes and dimensions to facilitate cross-modality mapping. Interestingly, this subset is 10-fold more abundant in the scATAC-seq data compared to the scRNA-seq data (see this paper for possible explanations.)

You can see that the RNA-based classifications are entirely consistent with the UMAP visualization, computed only on the ATAC-seq data. We can now easily annotate our scATAC-seq derived clusters (alternately, we could use the RNA classifications themselves). We note three small clusters (13, 20, 21) which represent subdivisions of the scRNA-seq labels. Try transferring the ‘cluster’ label (which shows finer distinctions) from the allen scRNA-seq dataset, to annotate them!

# replace each label with its most likely prediction
for(i in levels(brain)) {
  cells_to_reid <- WhichCells(brain, idents = i)
  newid <- names(sort(table(brain$predicted.id[cells_to_reid]),decreasing=TRUE))[1]
  Idents(brain, cells = cells_to_reid) <- newid
}

Find differentially accessible peaks between clusters

Here, we find DA regions between excitatory neurons in different layers of the cortex.

#switch back to working with peaks instead of gene activities
DefaultAssay(brain) <- 'peaks'

da_peaks <- FindMarkers(
  object = brain,
  ident.1 = c("L2/3 IT"),
  ident.2 = c("L4", "L5 IT", "L6 IT"),
  min.pct = 0.4,
  test.use = 'LR',
  latent.vars = 'peak_region_fragments'
)

head(da_peaks)

##                                  p_val  avg_logFC pct.1 pct.2    p_val_adj
## chr4:86523678-86525285    3.771920e-67  0.5606889 0.401 0.036 5.929572e-62
## chr15:87605281-87607659   9.282503e-56  0.4797662 0.482 0.090 1.459237e-50
## chr2:118700082-118704897  2.367780e-55  0.3012936 0.621 0.182 3.722221e-50
## chr13:69329933-69331707   9.556422e-52 -0.4200059 0.143 0.442 1.502298e-46
## chr3:137056475-137058371  5.044049e-51  0.3496141 0.520 0.106 7.929396e-46
## chr10:107751762-107753240 7.203232e-51  0.3923405 0.609 0.186 1.132370e-45

plot1 <- VlnPlot(
  object = brain,
  features = rownames(da_peaks)[1],
  pt.size = 0.1,
  idents = c("L4","L5 IT","L2/3 IT")
)
plot2 <- FeaturePlot(
  object = brain,
  features = rownames(da_peaks)[1],
  pt.size = 0.1,
  max.cutoff = 'q95'
)
plot1 | plot2

open_l23 <- rownames(da_peaks[da_peaks$avg_logFC > 0.25, ])
open_l456 <- rownames(da_peaks[da_peaks$avg_logFC < -0.25, ])
closest_l23 <- ClosestFeature(regions = open_l23, annotation = gene.ranges, sep = c(':', '-'))
closest_l456 <- ClosestFeature(regions = open_l456, annotation = gene.ranges, sep = c(':', '-'))
head(closest_l23)

##                               gene_id gene_name   gene_biotype seq_coord_system
## ENSMUSG00000028492 ENSMUSG00000028492   Fam154a protein_coding       chromosome
## ENSMUSG00000054863 ENSMUSG00000054863   Fam19a5 protein_coding       chromosome
## ENSMUSG00000078137 ENSMUSG00000078137   Ankrd63 protein_coding       chromosome
## ENSMUSG00000028161 ENSMUSG00000028161    Ppp3ca protein_coding       chromosome
## ENSMUSG00000091455 ENSMUSG00000091455     Otogl protein_coding       chromosome
## ENSMUSG00000030592 ENSMUSG00000030592      Ryr1 protein_coding       chromosome
##                     symbol entrezid            closest_region
## ENSMUSG00000028492 Fam154a    75811    chr4:86444641-86558328
## ENSMUSG00000054863 Fam19a5   106014   chr15:87625230-87759336
## ENSMUSG00000078137 Ankrd63   383787  chr2:118699103-118703963
## ENSMUSG00000028161  Ppp3ca    19055  chr3:136670124-136937727
## ENSMUSG00000091455   Otogl   628870 chr10:107762223-107912134
## ENSMUSG00000030592    Ryr1    20190    chr7:29003340-29125151
##                                 query_region distance
## ENSMUSG00000028492    chr4:86523678-86525285        0
## ENSMUSG00000054863   chr15:87605281-87607659    17570
## ENSMUSG00000078137  chr2:118700082-118704897        0
## ENSMUSG00000028161  chr3:137056475-137058371   118747
## ENSMUSG00000091455 chr10:107751762-107753240     8982
## ENSMUSG00000030592    chr7:29070299-29073172        0

head(closest_l456)

##                               gene_id gene_name   gene_biotype seq_coord_system
## ENSMUSG00000034575 ENSMUSG00000034575     Papd7 protein_coding       chromosome
## ENSMUSG00000046321 ENSMUSG00000046321    Hs3st2 protein_coding       chromosome
## ENSMUSG00000027827 ENSMUSG00000027827    Kcnab1 protein_coding       chromosome
## ENSMUSG00000056222 ENSMUSG00000056222    Spock1 protein_coding       chromosome
## ENSMUSG00000032517 ENSMUSG00000032517      Mobp protein_coding       chromosome
## ENSMUSG00000031871 ENSMUSG00000031871      Cdh5 protein_coding       chromosome
##                    symbol entrezid           closest_region
## ENSMUSG00000034575  Papd7   210106  chr13:69497959-69533839
## ENSMUSG00000046321 Hs3st2   195646 chr7:121392296-121501768
## ENSMUSG00000027827 Kcnab1    16497   chr3:65109384-65378223
## ENSMUSG00000056222 Spock1    20745  chr13:57421195-57908314
## ENSMUSG00000032517   Mobp    17433 chr9:120149742-120181484
## ENSMUSG00000031871   Cdh5    12562 chr8:104101625-104144502
##                                query_region distance
## ENSMUSG00000034575  chr13:69329933-69331707   166251
## ENSMUSG00000046321 chr7:121391215-121395519        0
## ENSMUSG00000027827   chr3:65262154-65263636        0
## ENSMUSG00000056222  chr13:57636759-57639538        0
## ENSMUSG00000032517 chr9:120174034-120175458        0
## ENSMUSG00000031871 chr8:104126858-104127953        0

We can also create coverage plots grouped by cluster around any genomic region using the CoveragePlot function. These represent ‘pseudo-bulk’ accessibility tracks, where all cells within a cluster have been averaged together, in order to visualize a more robust chromatin landscape.

# set plotting order
levels(brain) <- c("L2/3 IT","L4","L5 IT","L5 PT","L6 CT", "L6 IT","NP","Sst","Pvalb","Vip","Lamp5","Meis2","Oligo","Astro","Endo","VLMC","Macrophage")

region1 <- rownames(da_peaks)[1]
region2 <- GRangesToString(subset(gene.ranges, symbol=="Gad2"))

CoveragePlot(
  object = brain,
  region = c(region1, region2),
  sep = c(":", "-"),
  annotation = gene.ranges,
  peaks = StringToGRanges(regions = rownames(brain), sep = c(":", "-")),
  extend.upstream = 5000,
  extend.downstream = 5000,
  ncol = 1
)

Session Info

sessionInfo()

## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] patchwork_1.0.1            ggplot2_3.3.2             
##  [3] EnsDb.Mmusculus.v79_2.99.0 ensembldb_2.12.1          
##  [5] AnnotationFilter_1.12.0    GenomicFeatures_1.40.0    
##  [7] AnnotationDbi_1.50.0       Biobase_2.48.0            
##  [9] GenomicRanges_1.40.0       GenomeInfoDb_1.24.0       
## [11] IRanges_2.22.2             S4Vectors_0.26.1          
## [13] BiocGenerics_0.34.0        Seurat_3.1.5.9009         
## [15] Signac_0.2.5              
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.8             BiocFileCache_1.12.0       
##   [3] plyr_1.8.6                  igraph_1.2.5               
##   [5] lazyeval_0.2.2              splines_4.0.1              
##   [7] BiocParallel_1.22.0         listenv_0.8.0              
##   [9] digest_0.6.25               htmltools_0.5.0            
##  [11] magrittr_1.5                memoise_1.1.0              
##  [13] cluster_2.1.0               ROCR_1.0-11                
##  [15] globals_0.12.5              Biostrings_2.56.0          
##  [17] ggfittext_0.9.0             matrixStats_0.56.0         
##  [19] askpass_1.1                 pkgdown_1.5.1              
##  [21] prettyunits_1.1.1           colorspace_1.4-1           
##  [23] blob_1.2.1                  rappdirs_0.3.1             
##  [25] ggrepel_0.8.2               xfun_0.14                  
##  [27] dplyr_1.0.0                 crayon_1.3.4               
##  [29] RCurl_1.98-1.2              jsonlite_1.7.0             
##  [31] survival_3.2-3              zoo_1.8-8                  
##  [33] ape_5.4                     glue_1.4.1                 
##  [35] gtable_0.3.0                zlibbioc_1.34.0            
##  [37] XVector_0.28.0              leiden_0.3.3               
##  [39] DelayedArray_0.14.0         future.apply_1.6.0         
##  [41] scales_1.1.1                DBI_1.1.0                  
##  [43] Rcpp_1.0.5                  viridisLite_0.3.0          
##  [45] progress_1.2.2              reticulate_1.16            
##  [47] bit_1.1-15.2                rsvd_1.0.3                 
##  [49] htmlwidgets_1.5.1           httr_1.4.1                 
##  [51] RColorBrewer_1.1-2          ellipsis_0.3.1             
##  [53] ica_1.0-2                   farver_2.0.3               
##  [55] pkgconfig_2.0.3             XML_3.99-0.3               
##  [57] ggseqlogo_0.1               uwot_0.1.8                 
##  [59] dbplyr_1.4.4                tidyselect_1.1.0           
##  [61] labeling_0.3                rlang_0.4.6                
##  [63] reshape2_1.4.4              munsell_0.5.0              
##  [65] tools_4.0.1                 generics_0.0.2             
##  [67] RSQLite_2.2.0               ggridges_0.5.2             
##  [69] evaluate_0.14               stringr_1.4.0              
##  [71] yaml_2.2.1                  knitr_1.28                 
##  [73] bit64_0.9-7                 fs_1.4.1                   
##  [75] fitdistrplus_1.1-1          purrr_0.3.4                
##  [77] RANN_2.6.1                  pbapply_1.4-2              
##  [79] future_1.17.0               nlme_3.1-148               
##  [81] biomaRt_2.44.0              hdf5r_1.3.2                
##  [83] compiler_4.0.1              plotly_4.9.2.1             
##  [85] curl_4.3                    png_0.1-7                  
##  [87] tibble_3.0.1                stringi_1.4.6              
##  [89] desc_1.2.0                  lattice_0.20-41            
##  [91] ProtGenerics_1.20.0         Matrix_1.2-18              
##  [93] vctrs_0.3.1                 pillar_1.4.4               
##  [95] lifecycle_0.2.0             lmtest_0.9-37              
##  [97] RcppAnnoy_0.0.16            data.table_1.12.8          
##  [99] cowplot_1.0.0               bitops_1.0-6               
## [101] irlba_2.3.3                 gggenes_0.4.0              
## [103] rtracklayer_1.48.0          R6_2.4.1                   
## [105] KernSmooth_2.23-17          gridExtra_2.3              
## [107] codetools_0.2-16            MASS_7.3-51.6              
## [109] assertthat_0.2.1            SummarizedExperiment_1.18.1
## [111] openssl_1.4.2               rprojroot_1.3-2            
## [113] withr_2.2.0                 GenomicAlignments_1.24.0   
## [115] sctransform_0.2.1           Rsamtools_2.4.0            
## [117] GenomeInfoDbData_1.2.3      hms_0.5.3                  
## [119] grid_4.0.1                  tidyr_1.1.0                
## [121] rmarkdown_2.2               Rtsne_0.15

Compiled: July 06, 2020