For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of human peripheral blood mononuclear cells (PBMCs) provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website:

View data download code

To download all the required files, you can run the following lines in a shell:


First load in Signac, Seurat, and some other packages we will be using for analyzing human data.

if (!requireNamespace("EnsDb.Hsapiens.v75", quietly = TRUE))
## Warning: package 'S4Vectors' was built under R version 4.3.2
## Warning: package 'GenomeInfoDb' was built under R version 4.3.3
## Warning: package 'GenomicFeatures' was built under R version 4.3.2

Pre-processing workflow

When pre-processing chromatin data, Signac uses information from two related input files, both of which can be created using CellRanger:

  • Peak/Cell matrix. This is analogous to the gene expression count matrix used to analyze single-cell RNA-seq. However, instead of genes, each row of the matrix represents a region of the genome (a peak), that is predicted to represent a region of open chromatin. Each value in the matrix represents the number of Tn5 integration sites for each single barcode (i.e. a cell) that map within each peak. You can find more detail on the 10X Website.

  • Fragment file. This represents a full list of all unique fragments across all single cells. It is a substantially larger file, is slower to work with, and is stored on-disk (instead of in memory). However, the advantage of retaining this file is that it contains all fragments associated with each single cell, as opposed to only fragments that map to peaks. More information about the fragment file can be found on the 10x Genomics website or on the sinto website.

We start by creating a Seurat object using the peak/cell matrix and cell metadata generated by cellranger-atac, and store the path to the fragment file on disk in the Seurat object:

counts <- Read10X_h5(filename = "../vignette_data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "../vignette_data/atac_v1_pbmc_10k_singlecell.csv",
  header = TRUE,
  row.names = 1

chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  fragments = '../vignette_data/atac_v1_pbmc_10k_fragments.tsv.gz',
  min.cells = 10,
  min.features = 200

pbmc <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks", = metadata
## An object of class Seurat 
## 87561 features across 8728 samples within 1 assay 
## Active assay: peaks (87561 features, 0 variable features)
##  2 layers present: counts, data
What if I don’t have an H5 file?

If you do not have the .h5 file, you can still run an analysis. You may have data that is formatted as three files, a counts file (.mtx), a cell barcodes file, and a peaks file. If this is the case you can load the data using the Matrix::readMM() function:

counts <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx")
barcodes <- readLines("filtered_peak_bc_matrix/barcodes.tsv")
peaks <- read.table("filtered_peak_bc_matrix/peaks.bed", sep="\t")
peaknames <- paste(peaks$V1, peaks$V2, peaks$V3, sep="-")

colnames(counts) <- barcodes
rownames(counts) <- peaknames

Alternatively, you might only have a fragment file. In this case you can create a count matrix using the FeatureMatrix() function:

fragpath <- './data/atac_v1_pbmc_10k_fragments.tsv.gz'

# Define cells
# If you already have a list of cell barcodes to use you can skip this step
total_counts <- CountFragments(fragpath)
cutoff <- 1000 # Change this number depending on your dataset!
barcodes <- total_counts[total_counts$frequency_count > cutoff, ]$CB

# Create a fragment object
frags <- CreateFragmentObject(path = fragpath, cells = barcodes)

# First call peaks on the dataset
# If you already have a set of peaks you can skip this step
peaks <- CallPeaks(frags)

# Quantify fragments in each peak
counts <- FeatureMatrix(fragments = frags, features = peaks, cells = barcodes)

The ATAC-seq data is stored using a custom assay, the ChromatinAssay. This enables some specialized functions for analysing genomic single-cell assays such as scATAC-seq. By printing the assay we can see some of the additional information that can be contained in the ChromatinAssay, including motif information, gene annotations, and genome information.

## ChromatinAssay data with 87561 features for 8728 cells
## Variable features: 0 
## Genome: 
## Annotation present: FALSE 
## Motifs present: FALSE 
## Fragment files: 1

For example, we can call granges on a Seurat object with a ChromatinAssay set as the active assay (or on a ChromatinAssay) to see the genomic ranges associated with each feature in the object. See the object interaction vignette for more information about the ChromatinAssay class.

## GRanges object with 87561 ranges and 0 metadata columns:
##           seqnames            ranges strand
##              <Rle>         <IRanges>  <Rle>
##       [1]     chr1     565107-565550      *
##       [2]     chr1     569174-569639      *
##       [3]     chr1     713460-714823      *
##       [4]     chr1     752422-753038      *
##       [5]     chr1     762106-763359      *
##       ...      ...               ...    ...
##   [87557]     chrY 58993392-58993760      *
##   [87558]     chrY 58994571-58994823      *
##   [87559]     chrY 58996352-58997331      *
##   [87560]     chrY 59001782-59002175      *
##   [87561]     chrY 59017143-59017246      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

We can also add gene annotations to the pbmc object for the human genome. This will allow downstream functions to pull the gene annotation information directly from the object.

# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)

# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg19"
# add the gene information to the object
Annotation(pbmc) <- annotations

Computing QC Metrics

We can now compute some QC metrics for the scATAC-seq experiment. We currently suggest the following metrics below to assess data quality. As with scRNA-seq, the expected range of values for these parameters will vary depending on your biological system, cell viability, and other factors.

  • Nucleosome banding pattern: The histogram of DNA fragment sizes (determined from the paired-end sequencing reads) should exhibit a strong nucleosome banding pattern corresponding to the length of DNA wrapped around a single nucleosome. We calculate this per single cell, and quantify the approximate ratio of mononucleosomal to nucleosome-free fragments (stored as nucleosome_signal)

  • Transcriptional start site (TSS) enrichment score. The ENCODE project has defined an ATAC-seq targeting score based on the ratio of fragments centered at the TSS to fragments in TSS-flanking regions (see Poor ATAC-seq experiments typically will have a low TSS enrichment score. We can compute this metric for each cell with the TSSEnrichment() function, and the results are stored in metadata under the column name TSS.enrichment.

  • Total number of fragments in peaks: A measure of cellular sequencing depth / complexity. Cells with very few reads may need to be excluded due to low sequencing depth. Cells with extremely high levels may represent doublets, nuclei clumps, or other artefacts.

  • Fraction of fragments in peaks: Represents the fraction of all fragments that fall within ATAC-seq peaks. Cells with low values (i.e. <15-20%) often represent low-quality cells or technical artifacts that should be removed. Note that this value can be sensitive to the set of peaks used.

  • Ratio reads in genomic blacklist regions The ENCODE project has provided a list of blacklist regions, representing reads which are often associated with artefactual signal. Cells with a high proportion of reads mapping to these areas (compared to reads mapping to peaks) often represent technical artifacts and should be removed. ENCODE blacklist regions for human (hg19 and GRCh38), mouse (mm10), Drosophila (dm3), and C. elegans (ce10) are included in the Signac package.

Note that the last three metrics can be obtained from the output of CellRanger (which is stored in the object metadata), but can also be calculated for non-10x datasets using Signac (more information at the end of this document).

# compute nucleosome signal score per cell
pbmc <- NucleosomeSignal(object = pbmc)

# compute TSS enrichment score per cell
pbmc <- TSSEnrichment(object = pbmc, fast = FALSE)

# add blacklist ratio and fraction of reads in peaks
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments

The relationship between variables stored in the object metadata can be visualized using the DensityScatter() function. This can also be used to quickly find suitable cutoff values for different QC metrics by setting quantiles=TRUE:

DensityScatter(pbmc, x = 'nCount_peaks', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)

We can inspect the TSS enrichment scores by grouping the cells based on the score and plotting the accessibility signal over all TSS sites. Setting the fast=TRUE option in TSSEnrichment() will only compute the TSS enrichment score without storing the entire cell by position matrix of Tn5 insertion frequency for each cell, and can save memory. However, setting fast=TRUE will not allow downstream plotting of the TSS enrichment signal for different groups of cells using the TSSPlot() function, shown here:

pbmc$high.tss <- ifelse(pbmc$TSS.enrichment > 3, 'High', 'Low')
TSSPlot(pbmc, = 'high.tss') + NoLegend()

We can also look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells that are outliers for the mononucleosomal / nucleosome-free ratio (based on the plots above) have different nucleosomal banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.

pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = pbmc, = 'nucleosome_group')

We can plot the distribution of each QC metric separately using a violin plot:

  object = pbmc,
  features = c('nCount_peaks', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal', 'pct_reads_in_peaks'),
  pt.size = 0.1,
  ncol = 5

Finally we remove cells that are outliers for these QC metrics.

pbmc <- subset(
  x = pbmc,
  subset = nCount_peaks > 3000 &
    nCount_peaks < 30000 &
    pct_reads_in_peaks > 15 &
    blacklist_ratio < 0.05 &
    nucleosome_signal < 4 &
    TSS.enrichment > 3
## An object of class Seurat 
## 87561 features across 7307 samples within 1 assay 
## Active assay: peaks (87561 features, 0 variable features)
##  2 layers present: counts, data

Normalization and linear dimensional reduction

  • Normalization: Signac performs term frequency-inverse document frequency (TF-IDF) normalization. This is a two-step normalization procedure, that both normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks.

  • Feature selection: The low dynamic range of scATAC-seq data makes it challenging to perform variable feature selection, as we do for scRNA-seq. Instead, we can choose to use only the top n% of features (peaks) for dimensional reduction, or remove features present in less than n cells with the FindTopFeatures() function. Here we will use all features, though we have seen very similar results when using only a subset of features (try setting min.cutoff to ‘q75’ to use the top 25% all peaks), with faster runtimes. Features used for dimensional reduction are automatically set as VariableFeatures() for the Seurat object by this function.

  • Dimension reduction: We next run singular value decomposition (SVD) on the TD-IDF matrix, using the features (peaks) selected above. This returns a reduced dimension representation of the object (for users who are more familiar with scRNA-seq, you can think of this as analogous to the output of PCA).

The combined steps of TF-IDF followed by SVD are known as latent semantic indexing (LSI), and were first introduced for the analysis of scATAC-seq data by Cusanovich et al. 2015.

pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)

The first LSI component often captures sequencing depth (technical variation) rather than biological variation. If this is the case, the component should be removed from downstream analysis. We can assess the correlation between each LSI component and sequencing depth using the DepthCor() function:


Here we see there is a very strong correlation between the first LSI component and the total number of counts for the cell. We will perform downstream steps without this component as we don’t want to group cells together based on their total sequencing depth, but rather by their patterns of accessibility at cell-type-specific peaks.

Non-linear dimension reduction and clustering

Now that the cells are embedded in a low-dimensional space we can use methods commonly applied for the analysis of scRNA-seq data to perform graph-based clustering and non-linear dimension reduction for visualization. The functions RunUMAP(), FindNeighbors(), and FindClusters() all come from the Seurat package.

pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3)
DimPlot(object = pbmc, label = TRUE) + NoLegend()

Create a gene activity matrix

The UMAP visualization reveals the presence of multiple cell groups in human blood. If you are familiar with scRNA-seq analyses of PBMC, you may recognize the presence of certain myeloid and lymphoid populations in the scATAC-seq data. However, annotating and interpreting clusters is more challenging in scATAC-seq data as much less is known about the functional roles of noncoding genomic regions than is known about protein coding regions (genes).

We can try to quantify the activity of each gene in the genome by assessing the chromatin accessibility associated with the gene, and create a new gene activity assay derived from the scATAC-seq data. Here we will use a simple approach of summing the fragments intersecting the gene body and promoter region (we also recommend exploring the Cicero tool, which can accomplish a similar goal, and we provide a vignette showing how to run Cicero within a Signac workflow here).

To create a gene activity matrix, we extract gene coordinates and extend them to include the 2 kb upstream region (as promoter accessibility is often correlated with gene expression). We then count the number of fragments for each cell that map to each of these regions, using the using the FeatureMatrix() function. These steps are automatically performed by the GeneActivity() function:

gene.activities <- GeneActivity(pbmc)
# add the gene activity matrix to the Seurat object as a new assay and normalize it
pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(
  object = pbmc,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(pbmc$nCount_RNA)

Now we can visualize the activities of canonical marker genes to help interpret our ATAC-seq clusters. Note that the activities will be much noisier than scRNA-seq measurements. This is because they represent measurements from sparse chromatin data, and because they assume a general correspondence between gene body/promoter accessibility and gene expression which may not always be the case. Nonetheless, we can begin to discern populations of monocytes, B, T, and NK cells based on these gene activity profiles. However, further subdivision of these cell types is challenging based on supervised analysis alone.

DefaultAssay(pbmc) <- 'RNA'

  object = pbmc,
  features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
  pt.size = 0.1,
  max.cutoff = 'q95',
  ncol = 3

Integrating with scRNA-seq data

To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (human PBMC). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here. We aim to identify shared correlation patterns in the gene activity matrix and scRNA-seq dataset to identify matched biological states across the two modalities. This procedure returns a classification score for each cell for each scRNA-seq-defined cluster label.

Here we load a pre-processed scRNA-seq dataset for human PBMCs, also provided by 10x Genomics. You can download the raw data for this experiment from the 10x website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.

# Load the pre-processed scRNA-seq data for PBMCs
pbmc_rna <- readRDS("../vignette_data/pbmc_10k_v3.rds")
pbmc_rna <- UpdateSeuratObject(pbmc_rna)
transfer.anchors <- FindTransferAnchors(
  reference = pbmc_rna,
  query = pbmc,
  reduction = 'cca'
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 17505 anchors
predicted.labels <- TransferData(
  anchorset = transfer.anchors,
  refdata = pbmc_rna$celltype,
  weight.reduction = pbmc[['lsi']],
  dims = 2:30
## Finding integration vectors
## Finding integration vector weights
## Predicting cell labels
pbmc <- AddMetaData(object = pbmc, metadata = predicted.labels)
plot1 <- DimPlot(
  object = pbmc_rna, = 'celltype',
  label = TRUE,
  repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')

plot2 <- DimPlot(
  object = pbmc, = '',
  label = TRUE,
  repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')

plot1 + plot2

You can see that the scRNA-based classifications are consistent with the UMAP visualization that was computed using the scATAC-seq data. Notice that we do not predict any platelet cells in the scATAC-seq dataset, which is expected as platelets are not nucleated and would not be detected by scATAC-seq.

To combine our scATAC-seq clustering and label tranfer results we can reassign cluster names to the most common predicted label for that clusters. Alternatively, we could proceed by use the per-cell predicted labels themselves.

# replace each label with its most likely prediction
for(i in levels(pbmc)) {
  cells_to_reid <- WhichCells(pbmc, idents = i)
  newid <- names(which.max(table(pbmc$[cells_to_reid])))
  Idents(pbmc, cells = cells_to_reid) <- newid

Find differentially accessible peaks between cell types

To find differentially accessible regions between clusters of cells, we can perform a differential accessibility (DA) test. A simple approach is to perform a Wilcoxon rank sum test, and the presto package has implemented an extremely fast Wilcoxon test that can be run on a Seurat object.

Another approach for differential testing is to utilize logistic regression for, as suggested by Ntranos et al. 2018 for scRNA-seq data, and add the total number of fragments as a latent variable to mitigate the effect of differential sequencing depth on the result. Here we will focus on comparing Naive CD4 cells and CD14 monocytes, but any groups of cells can be compared using these methods. We can also visualize these marker peaks on a violin plot, feature plot, dot plot, heat map, or any visualization tool in Seurat.

# change back to working with peaks instead of gene activities
DefaultAssay(pbmc) <- 'peaks'

da_peaks <- FindMarkers(
  object = pbmc,
  ident.1 = "CD4 Naive",
  ident.2 = "CD14+ Monocytes",
  test.use = 'LR',
  latent.vars = 'nCount_peaks'
##                                  p_val avg_log2FC pct.1 pct.2     p_val_adj
## chr14-99721608-99741934  1.414340e-280   5.571161 0.868 0.022 1.238410e-275
## chr14-99695477-99720910  5.529507e-222   5.100310 0.797 0.021 4.841691e-217
## chr17-80084198-80086094  8.962660e-221   7.032939 0.668 0.005 7.847795e-216
## chr7-142501666-142511108 6.506936e-212   4.757277 0.754 0.029 5.697538e-207
## chr2-113581628-113594911 5.746054e-188  -5.051770 0.035 0.663 5.031302e-183
## chr6-44025105-44028184   2.328105e-179  -4.394199 0.046 0.616 2.038512e-174
plot1 <- VlnPlot(
  object = pbmc,
  features = rownames(da_peaks)[1],
  pt.size = 0.1,
  idents = c("CD4 Naive","CD14+ Monocytes")
plot2 <- FeaturePlot(
  object = pbmc,
  features = rownames(da_peaks)[1],
  pt.size = 0.1

plot1 | plot2

Another way to find DA regions between two groups of cells is to look at the fold change accessibility between two groups of cells. This can be much faster than running more sophisticated DA tests, but is not able to account for latent variables such as differences in the total sequencing depth between cells, and does not perform any statistical test. However, this can still be a useful way to quickly explore data, and can be performed using the FoldChange() function in Seurat.

fc <- FoldChange(pbmc, ident.1 = "CD4 Naive", ident.2 = "CD14+ Monocytes")
# order by fold change
fc <- fc[order(fc$avg_log2FC, decreasing = TRUE), ]
##                          avg_log2FC pct.1 pct.2
## chr6-28416849-28417227     11.45411 0.067     0
## chr7-110665002-110665493   11.41080 0.073     0
## chr8-19317420-19317942     11.22146 0.061     0
## chr1-172836553-172836955   11.20312 0.058     0
## chr2-191380525-191380926   11.15811 0.056     0
## chr8-90255778-90256179     11.03921 0.050     0

Peak coordinates can be difficult to interpret alone. We can find the closest gene to each of these peaks using the ClosestFeature() function. If you explore the gene lists, you will see that peaks open in Naive T cells are close to genes such as BCL11B and GATA3 (key regulators of T cell differentiation ), while peaks open in monocytes are close to genes such as CEBPB (a key regulator of monocyte differentiation). We could follow up this result further by doing gene ontology enrichment analysis on the gene sets returned by ClosestFeature(), and there are many R packages that can do this (see the GOstats package for example).

open_cd4naive <- rownames(da_peaks[da_peaks$avg_log2FC > 3, ])
open_cd14mono <- rownames(da_peaks[da_peaks$avg_log2FC < -3, ])

closest_genes_cd4naive <- ClosestFeature(pbmc, regions = open_cd4naive)
closest_genes_cd14mono <- ClosestFeature(pbmc, regions = open_cd14mono)
##                           tx_id gene_name         gene_id   gene_biotype type
## ENST00000443726 ENST00000443726    BCL11B ENSG00000127152 protein_coding  cds
## ENST00000357195 ENST00000357195    BCL11B ENSG00000127152 protein_coding  cds
## ENST00000583593 ENST00000583593    CCDC57 ENSG00000176155 protein_coding  cds
## ENSE00002456092 ENST00000463701     PRSS1 ENSG00000204983 protein_coding exon
## ENST00000546420 ENST00000546420    CCDC64 ENSG00000135127 protein_coding  cds
## ENST00000455990 ENST00000455990     HOOK1 ENSG00000134709 protein_coding  cds
##                            closest_region              query_region distance
## ENST00000443726   chr14-99737498-99737555   chr14-99721608-99741934        0
## ENST00000357195   chr14-99697682-99697894   chr14-99695477-99720910        0
## ENST00000583593   chr17-80085568-80085694   chr17-80084198-80086094        0
## ENSE00002456092  chr7-142460719-142460923  chr7-142501666-142511108    40742
## ENST00000546420 chr12-120427684-120428101 chr12-120426014-120428613        0
## ENST00000455990    chr1-60280790-60280852    chr1-60279767-60281364        0
##                           tx_id     gene_name         gene_id   gene_biotype
## ENST00000432018 ENST00000432018          IL1B ENSG00000125538 protein_coding
## ENSE00001638912 ENST00000455005 RP5-1120P11.3 ENSG00000231881        lincRNA
## ENST00000445003 ENST00000445003 RP11-290F20.3 ENSG00000224397        lincRNA
## ENST00000568649 ENST00000568649         PPCDC ENSG00000138621 protein_coding
## ENST00000409245 ENST00000409245         TTC7A ENSG00000068724 protein_coding
## ENST00000484822 ENST00000484822          RXRA ENSG00000186350 protein_coding
##                 type           closest_region             query_region distance
## ENST00000432018  cds chr2-113593760-113593806 chr2-113581628-113594911        0
## ENSE00001638912 exon   chr6-44041650-44042535   chr6-44025105-44028184    13465
## ENST00000445003  gap  chr20-48884201-48894027  chr20-48889794-48893313        0
## ENST00000568649  cds  chr15-75335782-75335877  chr15-75334903-75336779        0
## ENST00000409245  cds   chr2-47300841-47301062   chr2-47297968-47301173        0
## ENST00000484822  gap chr9-137211331-137293477 chr9-137263243-137268534        0

Plotting genomic regions

We can plot the frequency of Tn5 integration across regions of the genome for cells grouped by cluster, cell type, or any other metadata stored in the object for any genomic region using the CoveragePlot() function. These represent pseudo-bulk accessibility tracks, where signal from all cells within a group have been averaged together to visualize the DNA accessibility in a region (thanks to Andrew Hill for giving the inspiration for this function in his excellent blog post). Alongside these accessibility tracks we can visualize other important information including gene annotation, peak coordinates, and genomic links (if they’re stored in the object). See the visualization vignette for more information.

# set plotting order
levels(pbmc) <- c("CD4 Naive","CD4 Memory","CD8 Naive","CD8 effector","Double negative T cell","NK dim", "NK bright", "pre-B cell",'B cell progenitor',"pDC","CD14+ Monocytes",'CD16+ Monocytes')

  object = pbmc,
  region = rownames(da_peaks)[1],
  extend.upstream = 40000,
  extend.downstream = 20000

We can also create an interactive version of these plots using the CoverageBrowser() function. Here is a recorded demonstration showing how we can use CoverageBrowser() to browser the genome and adjust plotting parameters interactively. The “Save plot” button in CoverageBrowser() will add the current plot to a list of ggplot objects that is returned when the browser session is ended by pressing the “Done” button, allowing interesting views to be saved during an interactive session.

Working with datasets that were not quantified using CellRanger

The CellRanger software from 10x Genomics generates several useful QC metrics per-cell, as well as a peak/cell matrix and an indexed fragments file. In the above vignette, we utilize the CellRanger outputs, but provide alternative functions in Signac for many of the same purposes here.

Generating a peak/cell or bin/cell matrix

The FeatureMatrix function can be used to generate a count matrix containing any set of genomic ranges in its rows. These regions could be a set of peaks, or bins that span the entire genome.

# not run
# peak_ranges should be a set of genomic ranges spanning the set of peaks to be quantified per cell
peak_matrix <- FeatureMatrix(
  fragments = Fragments(pbmc),
  features = peak_ranges

For convenience, we also include a GenomeBinMatrix() function that will generate a set of genomic ranges spanning the entire genome for you, and run FeatureMatrix() internally to produce a genome bin/cell matrix.

# not run
bin_matrix <- GenomeBinMatrix(
  fragments = Fragments(pbmc),
  genome = seqlengths(pbmc),
  binsize = 5000

Counting fraction of reads in peaks

The function FRiP() will count the fraction of reads in peaks for each cell, given a peak/cell assay and a bin/cell assay. Note that this can be run on a subset of the genome, so that a bin/cell assay does not need to be computed for the whole genome. This will return a Seurat object will metadata added corresponding to the fraction of reads in peaks for each cell.

# not run
total_fragments <- CountFragments("'../vignette_data/atac_v1_pbmc_10k_fragments.tsv.gz'")
rownames(total_fragments) <- total_fragments$CB
pbmc$fragments <- total_fragments[colnames(pbmc), "frequency_count"]

pbmc <- FRiP(
  object = pbmc,
  assay = 'peaks',
  total.fragments = 'fragments'

Counting fragments in genome blacklist regions

The ratio of reads in genomic blacklist regions, that are known to artifactually accumulate reads in genome sequencing assays, can be diagnostic of low-quality cells. We provide blacklist region coordinates for several genomes (hg19, hg38, mm9, mm10, ce10, ce11, dm3, dm6) in the Signac package for convenience. These regions were provided by the ENCODE consortium, and we encourage users to cite their paper if you use the regions in your analysis. The FractionCountsInRegion() function can be used to calculate the fraction of all counts within a given set of regions per cell. We can use this function and the blacklist regions to find the fraction of blacklist counts per cell.

# not run
pbmc$blacklist_fraction <- FractionCountsInRegion(
  object = pbmc, 
  assay = 'peaks',
  regions = blacklist_hg19
Session Info
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## locale:
## [1] C/UTF-8/C/C/C/C
## time zone: Asia/Singapore
## tzcode source: internal
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## other attached packages:
##  [1] patchwork_1.2.0           ggplot2_3.5.0            
##  [3] EnsDb.Hsapiens.v75_2.99.0 ensembldb_2.26.0         
##  [5] AnnotationFilter_1.26.0   GenomicFeatures_1.54.3   
##  [7] AnnotationDbi_1.64.1      Biobase_2.62.0           
##  [9] GenomicRanges_1.54.1      GenomeInfoDb_1.38.7      
## [11] IRanges_2.36.0            S4Vectors_0.40.2         
## [13] BiocGenerics_0.48.1       Seurat_5.0.3             
## [15] SeuratObject_5.0.1        sp_2.1-3                 
## [17] Signac_1.13.0            
## loaded via a namespace (and not attached):
##   [1] fs_1.6.3                    ProtGenerics_1.34.0        
##   [3] matrixStats_1.2.0           spatstat.sparse_3.0-3      
##   [5] bitops_1.0-7                httr_1.4.7                 
##   [7] RColorBrewer_1.1-3          tools_4.3.1                
##   [9] sctransform_0.4.1           backports_1.4.1            
##  [11] utf8_1.2.4                  R6_2.5.1                   
##  [13] lazyeval_0.2.2              uwot_0.1.16                
##  [15] withr_3.0.0                 prettyunits_1.2.0          
##  [17] gridExtra_2.3               progressr_0.14.0           
##  [19] cli_3.6.2                   textshaping_0.3.7          
##  [21] spatstat.explore_3.2-7      fastDummies_1.7.3          
##  [23] labeling_0.4.3              sass_0.4.9                 
##  [25] spatstat.data_3.0-4         ggridges_0.5.6             
##  [27] pbapply_1.7-2               pkgdown_2.0.7              
##  [29] Rsamtools_2.18.0            systemfonts_1.0.6          
##  [31] foreign_0.8-86              dichromat_2.0-0.1          
##  [33] parallelly_1.37.1           BSgenome_1.70.2            
##  [35] rstudioapi_0.16.0           RSQLite_2.3.5              
##  [37] generics_0.1.3              BiocIO_1.12.0              
##  [39] ica_1.0-3                   spatstat.random_3.2-3      
##  [41] dplyr_1.1.4                 Matrix_1.6-5               
##  [43] ggbeeswarm_0.7.2            fansi_1.0.6                
##  [45] abind_1.4-5                 lifecycle_1.0.4            
##  [47] yaml_2.3.8                  SummarizedExperiment_1.32.0
##  [49] SparseArray_1.2.4           BiocFileCache_2.10.1       
##  [51] Rtsne_0.17                  grid_4.3.1                 
##  [53] blob_1.2.4                  promises_1.2.1             
##  [55] crayon_1.5.2                miniUI_0.1.1.1             
##  [57] lattice_0.22-6              cowplot_1.1.3              
##  [59] KEGGREST_1.42.0             pillar_1.9.0               
##  [61] knitr_1.45                  rjson_0.2.21               
##  [63] future.apply_1.11.2         codetools_0.2-19           
##  [65] fastmatch_1.1-4             leiden_0.4.3.1             
##  [67] glue_1.7.0                  data.table_1.15.4          
##  [69] vctrs_0.6.5                 png_0.1-8                  
##  [71] spam_2.10-0                 gtable_0.3.4               
##  [73] cachem_1.0.8                xfun_0.43                  
##  [75] S4Arrays_1.2.1              mime_0.12                  
##  [77] survival_3.5-8              RcppRoll_0.3.0             
##  [79] fitdistrplus_1.1-11         ROCR_1.0-11                
##  [81] nlme_3.1-164                bit64_4.0.5                
##  [83] progress_1.2.3              filelock_1.0.3             
##  [85] RcppAnnoy_0.0.22            bslib_0.6.2                
##  [87] irlba_2.3.5.1               vipor_0.4.7                
##  [89] KernSmooth_2.23-22          rpart_4.1.23               
##  [91] colorspace_2.1-0            DBI_1.2.2                  
##  [93] Hmisc_5.1-2                 nnet_7.3-19                
##  [95] ggrastr_1.0.2               tidyselect_1.2.1           
##  [97] bit_4.0.5                   compiler_4.3.1             
##  [99] curl_5.2.1                  htmlTable_2.4.2            
## [101] hdf5r_1.3.10                xml2_1.3.6                 
## [103] desc_1.4.3                  DelayedArray_0.28.0        
## [105] plotly_4.10.4               rtracklayer_1.62.0         
## [107] checkmate_2.3.1             scales_1.3.0               
## [109] lmtest_0.9-40               rappdirs_0.3.3             
## [111] stringr_1.5.1               digest_0.6.35              
## [113] goftest_1.2-3               spatstat.utils_3.0-4       
## [115] rmarkdown_2.26              XVector_0.42.0             
## [117] htmltools_0.5.8             pkgconfig_2.0.3            
## [119] base64enc_0.1-3             MatrixGenerics_1.14.0      
## [121] highr_0.10                  dbplyr_2.5.0               
## [123] fastmap_1.1.1               rlang_1.1.3                
## [125] htmlwidgets_1.6.4           shiny_1.8.1                
## [127] farver_2.1.1                jquerylib_0.1.4            
## [129] zoo_1.8-12                  jsonlite_1.8.8             
## [131] BiocParallel_1.36.0         VariantAnnotation_1.48.1   
## [133] RCurl_1.98-1.14             magrittr_2.0.3             
## [135] Formula_1.2-5               GenomeInfoDbData_1.2.11    
## [137] dotCall64_1.1-1             munsell_0.5.0              
## [139] Rcpp_1.0.12                 reticulate_1.35.0          
## [141] stringi_1.8.3               zlibbioc_1.48.0            
## [143] MASS_7.3-60.0.1             plyr_1.8.9                 
## [145] parallel_4.3.1              listenv_0.9.1              
## [147] ggrepel_0.9.5               deldir_2.0-4               
## [149] Biostrings_2.70.2           splines_4.3.1              
## [151] tensor_1.5                  hms_1.1.3                  
## [153] igraph_2.0.3                spatstat.geom_3.2-9        
## [155] RcppHNSW_0.6.0              reshape2_1.4.4             
## [157] biomaRt_2.58.2              XML_3.99-0.16.1            
## [159] evaluate_0.23               biovizBase_1.50.0          
## [161] httpuv_1.6.15               RANN_2.6.1                 
## [163] tidyr_1.3.1                 purrr_1.0.2                
## [165] polyclip_1.10-6             future_1.33.2              
## [167] scattermore_1.2             xtable_1.8-4               
## [169] restfulr_0.0.15             RSpectra_0.16-1            
## [171] later_1.3.2                 viridisLite_0.4.2          
## [173] ragg_1.3.0                  tibble_3.2.1               
## [175] beeswarm_0.4.0              memoise_2.0.1              
## [177] GenomicAlignments_1.38.2    cluster_2.1.6              
## [179] globals_0.16.3