In this vignette we demonstrate how to merge multiple Seurat objects containing single-cell chromatin data. To demonstrate, we will use four scATAC-seq PBMC datasets provided by 10x Genomics:
To download the required data, run the following lines in a shell:
# 500 cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_filtered_peak_bc_matrix.h5
# 1k cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_filtered_peak_bc_matrix.h5
# 5k cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_filtered_peak_bc_matrix.h5
# 10k cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_filtered_peak_bc_matrix.h5
When merging multiple single-cell chromatin datasets, it’s important to be aware that if peak calling was performed on each dataset independently, the peaks are unlikely to be exactly the same. We therefore need to create a common set of peaks across all the datasets to be merged.
To create a unified set of peaks we can use functions from the GenomicRanges
package. The reduce
function from GenomicRanges will merge
all intersecting peaks. Another option is to use the
disjoin
function, that will create distinct non-overlapping
sets of peaks. Here is a visual example to illustrate the difference
between reduce
and disjoin
:
gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = c(20, 70, 300), end = c(120, 200, 400)))
gr
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 20-120 *
## [2] chr1 70-200 *
## [3] chr1 300-400 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If the peaks were identified independently in each experiment then they will likely not overlap perfectly. We can merge peaks from all the datasets to create a common peak set, and quantify this peak set in each experiment prior to merging the objects.
First we’ll load the peak coordinates for each experiment and convert
them to genomic ranges, the use the GenomicRanges::reduce
function to create a common set of peaks to quantify in each
dataset.
library(Signac)
library(Seurat)
library(GenomicRanges)
library(future)
plan("multicore", workers = 4)
options(future.globals.maxSize = 50000 * 1024^2) # for 50 Gb RAM
# read in peak sets
peaks.500 <- read.table(
file = "pbmc500/atac_pbmc_500_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
peaks.1k <- read.table(
file = "pbmc1k/atac_pbmc_1k_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
peaks.5k <- read.table(
file = "pbmc5k/atac_pbmc_5k_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
peaks.10k <- read.table(
file = "pbmc10k/atac_pbmc_10k_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
# convert to genomic ranges
gr.500 <- makeGRangesFromDataFrame(peaks.500)
gr.1k <- makeGRangesFromDataFrame(peaks.1k)
gr.5k <- makeGRangesFromDataFrame(peaks.5k)
gr.10k <- makeGRangesFromDataFrame(peaks.10k)
# Create a unified set of peaks to quantify in each dataset
combined.peaks <- reduce(x = c(gr.500, gr.1k, gr.5k, gr.10k))
# Filter out bad peaks based on length
peakwidths <- width(combined.peaks)
combined.peaks <- combined.peaks[peakwidths < 10000 & peakwidths > 20]
combined.peaks
## GRanges object with 89951 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 565153-565499 *
## [2] chr1 569185-569620 *
## [3] chr1 713551-714783 *
## [4] chr1 752418-753020 *
## [5] chr1 762249-763345 *
## ... ... ... ...
## [89947] chrY 23422151-23422632 *
## [89948] chrY 23583994-23584463 *
## [89949] chrY 23602466-23602779 *
## [89950] chrY 28816593-28817710 *
## [89951] chrY 58855911-58856251 *
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
To quantify our combined set of peaks we’ll need to create a Fragment object for each experiment. The Fragment class is a specialized class defined in Signac to hold all the information related to a single fragment file.
First we’ll load the cell metadata for each experiment so that we
know what cell barcodes are contained in each file, then we can create
Fragment objects using the CreateFragmentObject
function.
The CreateFragmentObject
function performs some checks to
ensure that the file is present on disk and that it is compressed and
indexed, computes the MD5 sum for the file and the tabix index so that
we can tell if the file is modified at any point, and checks that the
expected cells are present in the file.
# load metadata
md.500 <- read.table(
file = "pbmc500/atac_pbmc_500_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ] # remove the first row
md.1k <- read.table(
file = "pbmc1k/atac_pbmc_1k_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ]
md.5k <- read.table(
file = "pbmc5k/atac_pbmc_5k_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ]
md.10k <- read.table(
file = "pbmc10k/atac_pbmc_10k_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ]
# perform an initial filtering of low count cells
md.500 <- md.500[md.500$passed_filters > 500, ]
md.1k <- md.1k[md.1k$passed_filters > 500, ]
md.5k <- md.5k[md.5k$passed_filters > 500, ]
md.10k <- md.10k[md.10k$passed_filters > 1000, ] # sequenced deeper so set higher cutoff
# create fragment objects
frags.500 <- CreateFragmentObject(
path = "pbmc500/atac_pbmc_500_nextgem_fragments.tsv.gz",
cells = rownames(md.500)
)
## Computing hash
frags.1k <- CreateFragmentObject(
path = "pbmc1k/atac_pbmc_1k_nextgem_fragments.tsv.gz",
cells = rownames(md.1k)
)
## Computing hash
frags.5k <- CreateFragmentObject(
path = "pbmc5k/atac_pbmc_5k_nextgem_fragments.tsv.gz",
cells = rownames(md.5k)
)
## Computing hash
frags.10k <- CreateFragmentObject(
path = "pbmc10k/atac_pbmc_10k_nextgem_fragments.tsv.gz",
cells = rownames(md.10k)
)
## Computing hash
We can now create a matrix of peaks x cell for each sample using the
FeatureMatrix
function. This function is parallelized using
the future
package. See the parallelization
vignette for more information about using future
.
pbmc500.counts <- FeatureMatrix(
fragments = frags.500,
features = combined.peaks,
cells = rownames(md.500)
)
pbmc1k.counts <- FeatureMatrix(
fragments = frags.1k,
features = combined.peaks,
cells = rownames(md.1k)
)
pbmc5k.counts <- FeatureMatrix(
fragments = frags.5k,
features = combined.peaks,
cells = rownames(md.5k)
)
pbmc10k.counts <- FeatureMatrix(
fragments = frags.10k,
features = combined.peaks,
cells = rownames(md.10k)
)
We will now use the quantified matrices to create a Seurat object for each dataset, storing the Fragment object for each dataset in the assay.
pbmc500_assay <- CreateChromatinAssay(pbmc500.counts, fragments = frags.500)
pbmc500 <- CreateSeuratObject(pbmc500_assay, assay = "ATAC", meta.data=md.500)
pbmc1k_assay <- CreateChromatinAssay(pbmc1k.counts, fragments = frags.1k)
pbmc1k <- CreateSeuratObject(pbmc1k_assay, assay = "ATAC", meta.data=md.1k)
pbmc5k_assay <- CreateChromatinAssay(pbmc5k.counts, fragments = frags.5k)
pbmc5k <- CreateSeuratObject(pbmc5k_assay, assay = "ATAC", meta.data=md.5k)
pbmc10k_assay <- CreateChromatinAssay(pbmc10k.counts, fragments = frags.10k)
pbmc10k <- CreateSeuratObject(pbmc10k_assay, assay = "ATAC", meta.data=md.10k)
Now that the objects each contain an assay with the same set of
features, we can use the standard merge
function to merge
the objects. This will also merge all the fragment objects so that we
retain the fragment information for each cell in the final merged
object.
# add information to identify dataset of origin
pbmc500$dataset <- 'pbmc500'
pbmc1k$dataset <- 'pbmc1k'
pbmc5k$dataset <- 'pbmc5k'
pbmc10k$dataset <- 'pbmc10k'
# merge all datasets, adding a cell ID to make sure cell names are unique
combined <- merge(
x = pbmc500,
y = list(pbmc1k, pbmc5k, pbmc10k),
add.cell.ids = c("500", "1k", "5k", "10k")
)
combined[["ATAC"]]
## ChromatinAssay data with 89951 features for 21688 cells
## Variable features: 0
## Genome:
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 4
combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 20)
combined <- RunSVD(combined)
combined <- RunUMAP(combined, dims = 2:50, reduction = 'lsi')
DimPlot(combined, group.by = 'dataset', pt.size = 0.1)
The merged object contains all four fragment objects, and contains an internal mapping of cell names in the object to the cell names in each fragment file so that we can retrieve information from the files without having to change the cell names in each fragment file. We can check that functions that pull data from the fragment files work as expected on the merged object by plotting a region of the genome:
CoveragePlot(
object = combined,
group.by = 'dataset',
region = "chr14-99700000-99760000"
)
The above approach requires that we have access to a fragment file for each dataset. In some cases we may not have this data (although we can create a fragment file from the BAM file using sinto). In these cases, we can still create a merged object, with the caveat that the resulting merged count matrix may not be as accurate.
The merge
function defined in Signac for
ChromatinAssay
objects will consider overlapping peaks as
equivalent, and adjust the genomic ranges spanned by the peak so that
the features in each object being merged become equivalent. Note
that this can result in inaccuracies in the count matrix, as some peaks
will be extended to cover regions that were not originally
quantified. This is the best that can be done without
re-quantification, and we recommend always following the procedure
outlined above for object merging whenever possible.
Here we demonstrate merging the same four PBMC datasets without creating a common feature set:
# load the count matrix for each object that was generated by cellranger
counts.500 <- Read10X_h5("pbmc500/atac_pbmc_500_nextgem_filtered_peak_bc_matrix.h5")
counts.1k <- Read10X_h5("pbmc1k/atac_pbmc_1k_nextgem_filtered_peak_bc_matrix.h5")
counts.5k <- Read10X_h5("pbmc5k/atac_pbmc_5k_nextgem_filtered_peak_bc_matrix.h5")
counts.10k <- Read10X_h5("pbmc10k/atac_pbmc_10k_nextgem_filtered_peak_bc_matrix.h5")
# create objects
pbmc500_assay <- CreateChromatinAssay(counts = counts.500, sep = c(":", "-"), min.features = 500)
pbmc500 <- CreateSeuratObject(pbmc500_assay, assay = "peaks")
pbmc1k_assay <- CreateChromatinAssay(counts = counts.1k, sep = c(":", "-"), min.features = 500)
pbmc1k <- CreateSeuratObject(pbmc1k_assay, assay = "peaks")
pbmc5k_assay <- CreateChromatinAssay(counts = counts.5k, sep = c(":", "-"), min.features = 500)
pbmc5k <- CreateSeuratObject(pbmc5k_assay, assay = "peaks")
pbmc10k_assay <- CreateChromatinAssay(counts = counts.10k, sep = c(":", "-"), min.features = 1000)
pbmc10k <- CreateSeuratObject(pbmc10k_assay, assay = "peaks")
# add information to identify dataset of origin
pbmc500$dataset <- 'pbmc500'
pbmc1k$dataset <- 'pbmc1k'
pbmc5k$dataset <- 'pbmc5k'
pbmc10k$dataset <- 'pbmc10k'
# merge
combined <- merge(
x = pbmc500,
y = list(pbmc1k, pbmc5k, pbmc10k),
add.cell.ids = c("500", "1k", "5k", "10k")
)
# process
combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 20)
combined <- RunSVD(combined)
combined <- RunUMAP(combined, dims = 2:50, reduction = 'lsi')
DimPlot(combined, group.by = 'dataset', pt.size = 0.1)
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Singapore
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] future_1.34.0 Seurat_5.1.0 SeuratObject_5.0.2
## [4] sp_2.1-4 Signac_1.14.0 patchwork_1.2.0
## [7] ggbio_1.50.0 ggplot2_3.5.1 GenomicRanges_1.54.1
## [10] GenomeInfoDb_1.38.8 IRanges_2.36.0 S4Vectors_0.40.2
## [13] BiocGenerics_0.48.1
##
## loaded via a namespace (and not attached):
## [1] spatstat.sparse_3.0-3 fs_1.6.4
## [3] ProtGenerics_1.34.0 matrixStats_1.3.0
## [5] bitops_1.0-8 httr_1.4.7
## [7] RColorBrewer_1.1-3 tools_4.3.1
## [9] sctransform_0.4.1 backports_1.5.0
## [11] utf8_1.2.4 R6_2.5.1
## [13] uwot_0.1.16 lazyeval_0.2.2
## [15] withr_3.0.1 prettyunits_1.2.0
## [17] GGally_2.2.1 gridExtra_2.3
## [19] progressr_0.14.0 cli_3.6.3
## [21] Biobase_2.62.0 textshaping_0.4.0
## [23] spatstat.explore_3.2-7 fastDummies_1.7.3
## [25] labeling_0.4.3 sass_0.4.9
## [27] spatstat.data_3.0-4 ggridges_0.5.6
## [29] pbapply_1.7-2 pkgdown_2.0.9
## [31] Rsamtools_2.18.0 systemfonts_1.1.0
## [33] foreign_0.8-86 dichromat_2.0-0.1
## [35] parallelly_1.38.0 BSgenome_1.70.2
## [37] rstudioapi_0.16.0 RSQLite_2.3.7
## [39] generics_0.1.3 BiocIO_1.12.0
## [41] spatstat.random_3.2-3 ica_1.0-3
## [43] dplyr_1.1.4 Matrix_1.6-5
## [45] fansi_1.0.6 abind_1.4-5
## [47] lifecycle_1.0.4 yaml_2.3.10
## [49] SummarizedExperiment_1.32.0 SparseArray_1.2.4
## [51] BiocFileCache_2.10.2 Rtsne_0.17
## [53] grid_4.3.1 blob_1.2.4
## [55] promises_1.3.0 crayon_1.5.3
## [57] miniUI_0.1.1.1 lattice_0.22-6
## [59] cowplot_1.1.3 GenomicFeatures_1.54.4
## [61] KEGGREST_1.42.0 pillar_1.9.0
## [63] knitr_1.48 rjson_0.2.21
## [65] future.apply_1.11.2 codetools_0.2-19
## [67] fastmatch_1.1-4 leiden_0.4.3.1
## [69] glue_1.7.0 data.table_1.15.4
## [71] vctrs_0.6.5 png_0.1-8
## [73] spam_2.10-0 gtable_0.3.5
## [75] cachem_1.1.0 xfun_0.47
## [77] S4Arrays_1.2.1 mime_0.12
## [79] survival_3.6-4 RcppRoll_0.3.1
## [81] fitdistrplus_1.1-11 ROCR_1.0-11
## [83] nlme_3.1-164 bit64_4.0.5
## [85] progress_1.2.3 filelock_1.0.3
## [87] RcppAnnoy_0.0.22 bslib_0.8.0
## [89] irlba_2.3.5.1 KernSmooth_2.23-24
## [91] rpart_4.1.23 colorspace_2.1-1
## [93] DBI_1.2.3 Hmisc_5.1-3
## [95] nnet_7.3-19 tidyselect_1.2.1
## [97] bit_4.0.5 compiler_4.3.1
## [99] curl_5.2.1 graph_1.80.0
## [101] htmlTable_2.4.3 hdf5r_1.3.10
## [103] xml2_1.3.6 desc_1.4.3
## [105] DelayedArray_0.28.0 plotly_4.10.4
## [107] rtracklayer_1.62.0 checkmate_2.3.2
## [109] scales_1.3.0 lmtest_0.9-40
## [111] RBGL_1.78.0 rappdirs_0.3.3
## [113] goftest_1.2-3 stringr_1.5.1
## [115] digest_0.6.37 spatstat.utils_3.0-4
## [117] rmarkdown_2.28 XVector_0.42.0
## [119] htmltools_0.5.8.1 pkgconfig_2.0.3
## [121] base64enc_0.1-3 MatrixGenerics_1.14.0
## [123] highr_0.11 dbplyr_2.5.0
## [125] fastmap_1.2.0 ensembldb_2.26.0
## [127] rlang_1.1.4 htmlwidgets_1.6.4
## [129] shiny_1.9.1 farver_2.1.2
## [131] jquerylib_0.1.4 zoo_1.8-12
## [133] jsonlite_1.8.8 BiocParallel_1.36.0
## [135] VariantAnnotation_1.48.1 RCurl_1.98-1.16
## [137] magrittr_2.0.3 Formula_1.2-5
## [139] GenomeInfoDbData_1.2.11 dotCall64_1.1-1
## [141] munsell_0.5.1 Rcpp_1.0.13
## [143] reticulate_1.37.0 stringi_1.8.4
## [145] zlibbioc_1.48.2 MASS_7.3-60.0.1
## [147] plyr_1.8.9 ggstats_0.6.0
## [149] parallel_4.3.1 listenv_0.9.1
## [151] ggrepel_0.9.5 deldir_2.0-4
## [153] Biostrings_2.70.3 splines_4.3.1
## [155] tensor_1.5 hms_1.1.3
## [157] igraph_2.0.3 spatstat.geom_3.2-9
## [159] RcppHNSW_0.6.0 reshape2_1.4.4
## [161] biomaRt_2.58.2 XML_3.99-0.17
## [163] evaluate_0.24.0 biovizBase_1.50.0
## [165] BiocManager_1.30.23 httpuv_1.6.15
## [167] polyclip_1.10-6 RANN_2.6.1
## [169] tidyr_1.3.1 purrr_1.0.2
## [171] scattermore_1.2 xtable_1.8-4
## [173] restfulr_0.0.15 AnnotationFilter_1.26.0
## [175] RSpectra_0.16-2 later_1.3.2
## [177] viridisLite_0.4.2 ragg_1.3.2
## [179] OrganismDbi_1.44.0 tibble_3.2.1
## [181] memoise_2.0.1 AnnotationDbi_1.64.1
## [183] GenomicAlignments_1.38.2 cluster_2.1.6
## [185] globals_0.16.3