In this vignette we demonstrate how to merge multiple Seurat objects containing single-cell chromatin data. To demonstrate, we will use four scATAC-seq PBMC datasets provided by 10x Genomics:
When merging multiple single-cell chromatin datasets, it’s important to be aware that if peak calling was performed on each dataset independently, the peaks are unlikely to be exactly the same. We therefore need to create a common set of peaks across all the datasets to be merged.
To create a unified set of peaks we can use functions from the GenomicRanges package. The reduce
function from GenomicRanges will merge all intersecting peaks. Another option is to use the disjoin
function, that will create distinct non-overlapping sets of peaks. Here is a visual example to illustrate the difference between reduce
and disjoin
:
gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = c(20, 70, 300), end = c(120, 200, 400)))
gr
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 20-120 *
## [2] chr1 70-200 *
## [3] chr1 300-400 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If the peaks were identified independently in each experiment then they will likely not overlap perfectly. We can merge peaks from all the datasets to create a common peak set, and quantify this peak set in each experiment prior to merging the objects.
First we’ll load the peak coordinates for each experiment and convert them to genomic ranges, the use the GenomicRanges::reduce
function to create a common set of peaks to quantify in each dataset.
library(Signac)
library(Seurat)
library(GenomicRanges)
library(future)
plan("multiprocess", workers = 4)
options(future.globals.maxSize = 50000 * 1024^2) # for 50 Gb RAM
# read in peak sets
peaks.500 <- read.table(
file = "/home/stuartt/data/pbmc500/atac_pbmc_500_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
peaks.1k <- read.table(
file = "/home/stuartt/data/pbmc1k/atac_pbmc_1k_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
peaks.5k <- read.table(
file = "/home/stuartt/data/pbmc5k/atac_pbmc_5k_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
peaks.10k <- read.table(
file = "/home/stuartt/data/pbmc10k/atac_pbmc_10k_nextgem_peaks.bed",
col.names = c("chr", "start", "end")
)
# convert to genomic ranges
gr.500 <- makeGRangesFromDataFrame(peaks.500)
gr.1k <- makeGRangesFromDataFrame(peaks.1k)
gr.5k <- makeGRangesFromDataFrame(peaks.5k)
gr.10k <- makeGRangesFromDataFrame(peaks.10k)
# Create a unified set of peaks to quantify in each dataset
combined.peaks <- reduce(x = c(gr.500, gr.1k, gr.5k, gr.10k))
# Filter out bad peaks based on length
peakwidths <- width(combined.peaks)
combined.peaks <- combined.peaks[peakwidths < 10000 & peakwidths > 20]
combined.peaks
## GRanges object with 89951 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 565153-565499 *
## [2] chr1 569185-569620 *
## [3] chr1 713551-714783 *
## [4] chr1 752418-753020 *
## [5] chr1 762249-763345 *
## ... ... ... ...
## [89947] chrY 23422151-23422632 *
## [89948] chrY 23583994-23584463 *
## [89949] chrY 23602466-23602779 *
## [89950] chrY 28816593-28817710 *
## [89951] chrY 58855911-58856251 *
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
To quantify our combined set of peaks we’ll need to create a Fragment object for each experiment. The Fragment class is a specialized class defined in Signac to hold all the information related to a single fragment file.
First we’ll load the cell metadata for each experiment so that we know what cell barcodes are contained in each file, then we can create Fragment objects using the CreateFragmentObject
function. The CreateFragmentObject
function performs some checks to ensure that the file is present on disk and that it is compressed and indexed, computes the MD5 sum for the file and the tabix index so that we can tell if the file is modified at any point, and checks that the expected cells are present in the file.
# load metadata
md.500 <- read.table(
file = "/home/stuartt/data/pbmc500/atac_pbmc_500_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ] # remove the first row
md.1k <- read.table(
file = "/home/stuartt/data/pbmc1k/atac_pbmc_1k_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ]
md.5k <- read.table(
file = "/home/stuartt/data/pbmc5k/atac_pbmc_5k_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ]
md.10k <- read.table(
file = "/home/stuartt/data/pbmc10k/atac_pbmc_10k_nextgem_singlecell.csv",
stringsAsFactors = FALSE,
sep = ",",
header = TRUE,
row.names = 1
)[-1, ]
# perform an initial filtering of low count cells
md.500 <- md.500[md.500$passed_filters > 500, ]
md.1k <- md.1k[md.1k$passed_filters > 500, ]
md.5k <- md.5k[md.5k$passed_filters > 500, ]
md.10k <- md.10k[md.10k$passed_filters > 1000, ] # sequenced deeper so set higher cutoff
# create fragment objects
frags.500 <- CreateFragmentObject(
path = "/home/stuartt/data/pbmc500/atac_pbmc_500_nextgem_fragments.tsv.gz",
cells = rownames(md.500)
)
## Computing hash
frags.1k <- CreateFragmentObject(
path = "/home/stuartt/data/pbmc1k/atac_pbmc_1k_nextgem_fragments.tsv.gz",
cells = rownames(md.1k)
)
## Computing hash
frags.5k <- CreateFragmentObject(
path = "/home/stuartt/data/pbmc5k/atac_pbmc_5k_nextgem_fragments.tsv.gz",
cells = rownames(md.5k)
)
## Computing hash
frags.10k <- CreateFragmentObject(
path = "/home/stuartt/data/pbmc10k/atac_pbmc_10k_nextgem_fragments.tsv.gz",
cells = rownames(md.10k)
)
## Computing hash
We can now create a matrix of peaks x cell for each sample using the FeatureMatrix
function. This function is parallelized using the future
package. See the parallelization vignette for more information about using future
.
pbmc500.counts <- FeatureMatrix(
fragments = frags.500,
features = combined.peaks,
cells = rownames(md.500)
)
pbmc1k.counts <- FeatureMatrix(
fragments = frags.1k,
features = combined.peaks,
cells = rownames(md.1k)
)
pbmc5k.counts <- FeatureMatrix(
fragments = frags.5k,
features = combined.peaks,
cells = rownames(md.5k)
)
pbmc10k.counts <- FeatureMatrix(
fragments = frags.10k,
features = combined.peaks,
cells = rownames(md.10k)
)
We will now use the quantified matrices to create a Seurat object for each dataset, storing the Fragment object for each dataset in the assay.
pbmc500_assay <- CreateChromatinAssay(pbmc500.counts, fragments = frags.500)
pbmc500 <- CreateSeuratObject(pbmc500_assay, assay = "ATAC")
pbmc1k_assay <- CreateChromatinAssay(pbmc1k.counts, fragments = frags.1k)
pbmc1k <- CreateSeuratObject(pbmc1k_assay, assay = "ATAC")
pbmc5k_assay <- CreateChromatinAssay(pbmc5k.counts, fragments = frags.5k)
pbmc5k <- CreateSeuratObject(pbmc5k_assay, assay = "ATAC")
pbmc10k_assay <- CreateChromatinAssay(pbmc10k.counts, fragments = frags.10k)
pbmc10k <- CreateSeuratObject(pbmc10k_assay, assay = "ATAC")
Now that the objects each contain an assay with the same set of features, we can use the standard merge
function to merge the objects. This will also merge all the fragment objects so that we retain the fragment information for each cell in the final merged object.
# add information to identify dataset of origin
pbmc500$dataset <- 'pbmc500'
pbmc1k$dataset <- 'pbmc1k'
pbmc5k$dataset <- 'pbmc5k'
pbmc10k$dataset <- 'pbmc10k'
# merge all datasets, adding a cell ID to make sure cell names are unique
combined <- merge(
x = pbmc500,
y = list(pbmc1k, pbmc5k, pbmc10k),
add.cell.ids = c("500", "1k", "5k", "10k")
)
combined[["ATAC"]]
## ChromatinAssay data with 89951 features for 21688 cells
## Variable features: 0
## Genome:
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 4
combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 20)
combined <- RunSVD(combined)
combined <- RunUMAP(combined, dims = 2:50, reduction = 'lsi')
DimPlot(combined, group.by = 'dataset', pt.size = 0.1)
The merged object contains all four fragment objects, and contains an internal mapping of cell names in the object to the cell names in each fragment file so that we can retrieve information from the files without having to change the cell names in each fragment file. We can check that functions that pull data from the fragment files work as expected on the merged object by plotting a region of the genome:
CoveragePlot(
object = combined,
group.by = 'dataset',
region = "chr14-99700000-99760000"
)
The above approach requires that we have access to a fragment file for each dataset. In some cases we may not have this data (although we can create a fragment file from the BAM file using sinto). In these cases, we can still create a merged object, with the caveat that the resulting merged count matrix may not be as accurate.
The merge
function defined in Signac for ChromatinAssay
objects will consider overlapping peaks as equivalent, and adjust the genomic ranges spanned by the peak so that the features in each object being merged become equivalent. Note that this can result in inaccuracies in the count matrix, as some peaks will be extended to cover regions that were not originally quantified. This is the best that can be done without re-quantification, and we recommend always following the procedure outlined above for object merging whenever possible.
Here we demonstrate merging the same four PBMC datasets without creating a common feature set:
# load the count matrix for each object that was generated by cellranger
counts.500 <- Read10X_h5("/home/stuartt/data/pbmc500/atac_pbmc_500_nextgem_filtered_peak_bc_matrix.h5")
counts.1k <- Read10X_h5("/home/stuartt/data/pbmc1k/atac_pbmc_1k_nextgem_filtered_peak_bc_matrix.h5")
counts.5k <- Read10X_h5("/home/stuartt/data/pbmc5k/atac_pbmc_5k_nextgem_filtered_peak_bc_matrix.h5")
counts.10k <- Read10X_h5("/home/stuartt/data/pbmc10k/atac_pbmc_10k_nextgem_filtered_peak_bc_matrix.h5")
# create objects
pbmc500_assay <- CreateChromatinAssay(counts = counts.500, sep = c(":", "-"), min.features = 500)
pbmc500 <- CreateSeuratObject(pbmc500_assay, assay = "peaks")
pbmc1k_assay <- CreateChromatinAssay(counts = counts.1k, sep = c(":", "-"), min.features = 500)
pbmc1k <- CreateSeuratObject(pbmc1k_assay, assay = "peaks")
pbmc5k_assay <- CreateChromatinAssay(counts = counts.5k, sep = c(":", "-"), min.features = 500)
pbmc5k <- CreateSeuratObject(pbmc5k_assay, assay = "peaks")
pbmc10k_assay <- CreateChromatinAssay(counts = counts.10k, sep = c(":", "-"), min.features = 1000)
pbmc10k <- CreateSeuratObject(pbmc10k_assay, assay = "peaks")
# add information to identify dataset of origin
pbmc500$dataset <- 'pbmc500'
pbmc1k$dataset <- 'pbmc1k'
pbmc5k$dataset <- 'pbmc5k'
pbmc10k$dataset <- 'pbmc10k'
# merge
combined <- merge(
x = pbmc500,
y = list(pbmc1k, pbmc5k, pbmc10k),
add.cell.ids = c("500", "1k", "5k", "10k")
)
# process
combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 20)
combined <- RunSVD(combined)
combined <- RunUMAP(combined, dims = 2:50, reduction = 'lsi')
DimPlot(combined, group.by = 'dataset', pt.size = 0.1)
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] future_1.21.0 SeuratObject_4.0.0 Seurat_4.0.1.9005
## [4] Signac_1.2.0 patchwork_1.1.1 ggbio_1.38.0
## [7] ggplot2_3.3.3 GenomicRanges_1.42.0 GenomeInfoDb_1.26.5
## [10] IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.1 reticulate_1.19
## [3] tidyselect_1.1.0 RSQLite_2.2.7
## [5] AnnotationDbi_1.52.0 htmlwidgets_1.5.3
## [7] docopt_0.7.1 grid_4.0.1
## [9] BiocParallel_1.24.1 Rtsne_0.15
## [11] munsell_0.5.0 codetools_0.2-18
## [13] ragg_1.1.2 ica_1.0-2
## [15] miniUI_0.1.1.1 withr_2.4.2
## [17] colorspace_2.0-0 Biobase_2.50.0
## [19] OrganismDbi_1.32.0 highr_0.9
## [21] knitr_1.33 rstudioapi_0.13
## [23] ROCR_1.0-11 tensor_1.5
## [25] listenv_0.8.0 MatrixGenerics_1.2.1
## [27] labeling_0.4.2 slam_0.1-48
## [29] GenomeInfoDbData_1.2.4 polyclip_1.10-0
## [31] bit64_4.0.5 farver_2.1.0
## [33] rprojroot_2.0.2 parallelly_1.24.0
## [35] vctrs_0.3.7 generics_0.1.0
## [37] xfun_0.22 biovizBase_1.38.0
## [39] BiocFileCache_1.14.0 lsa_0.73.2
## [41] ggseqlogo_0.1 R6_2.5.0
## [43] hdf5r_1.3.3 AnnotationFilter_1.14.0
## [45] spatstat.utils_2.1-0 bitops_1.0-7
## [47] cachem_1.0.4 reshape_0.8.8
## [49] DelayedArray_0.16.3 assertthat_0.2.1
## [51] promises_1.2.0.1 scales_1.1.1
## [53] nnet_7.3-15 gtable_0.3.0
## [55] globals_0.14.0 goftest_1.2-2
## [57] ensembldb_2.14.0 rlang_0.4.10
## [59] systemfonts_1.0.1 RcppRoll_0.3.0
## [61] splines_4.0.1 rtracklayer_1.50.0
## [63] lazyeval_0.2.2 dichromat_2.0-0
## [65] spatstat.geom_2.1-0 checkmate_2.0.0
## [67] abind_1.4-5 BiocManager_1.30.12
## [69] yaml_2.2.1 reshape2_1.4.4
## [71] GenomicFeatures_1.42.3 backports_1.2.1
## [73] httpuv_1.6.0 Hmisc_4.5-0
## [75] RBGL_1.66.0 tools_4.0.1
## [77] spatstat.core_2.1-2 ellipsis_0.3.1
## [79] jquerylib_0.1.4 RColorBrewer_1.1-2
## [81] ggridges_0.5.3 Rcpp_1.0.6
## [83] plyr_1.8.6 base64enc_0.1-3
## [85] progress_1.2.2 zlibbioc_1.36.0
## [87] purrr_0.3.4 RCurl_1.98-1.3
## [89] prettyunits_1.1.1 deldir_0.2-10
## [91] rpart_4.1-15 openssl_1.4.3
## [93] pbapply_1.4-3 cowplot_1.1.1
## [95] zoo_1.8-9 SummarizedExperiment_1.20.0
## [97] ggrepel_0.9.1 cluster_2.1.2
## [99] fs_1.5.0 magrittr_2.0.1
## [101] RSpectra_0.16-0 scattermore_0.7
## [103] data.table_1.14.0 lmtest_0.9-38
## [105] RANN_2.6.1 SnowballC_0.7.0
## [107] ProtGenerics_1.22.0 fitdistrplus_1.1-3
## [109] matrixStats_0.58.0 hms_1.0.0
## [111] mime_0.10 evaluate_0.14
## [113] xtable_1.8-4 XML_3.99-0.6
## [115] jpeg_0.1-8.1 sparsesvd_0.2
## [117] gridExtra_2.3 compiler_4.0.1
## [119] biomaRt_2.46.3 tibble_3.1.1
## [121] KernSmooth_2.23-18 crayon_1.4.1
## [123] htmltools_0.5.1.1 mgcv_1.8-33
## [125] later_1.2.0 Formula_1.2-4
## [127] tidyr_1.1.3 DBI_1.1.1
## [129] tweenr_1.0.2 dbplyr_2.1.1
## [131] MASS_7.3-53.1 rappdirs_0.3.3
## [133] Matrix_1.3-2 igraph_1.2.6
## [135] pkgconfig_2.0.3 pkgdown_1.6.1
## [137] GenomicAlignments_1.26.0 foreign_0.8-81
## [139] spatstat.sparse_2.0-0 plotly_4.9.3
## [141] xml2_1.3.2 bslib_0.2.4
## [143] XVector_0.30.0 stringr_1.4.0
## [145] VariantAnnotation_1.36.0 digest_0.6.27
## [147] sctransform_0.3.2 RcppAnnoy_0.0.18
## [149] graph_1.68.0 spatstat.data_2.1-0
## [151] Biostrings_2.58.0 fastmatch_1.1-0
## [153] rmarkdown_2.7 leiden_0.3.7
## [155] htmlTable_2.1.0 uwot_0.1.10
## [157] curl_4.3 shiny_1.6.0
## [159] Rsamtools_2.6.0 nlme_3.1-152
## [161] lifecycle_1.0.0 jsonlite_1.7.2
## [163] viridisLite_0.4.0 desc_1.3.0
## [165] askpass_1.1 BSgenome_1.58.0
## [167] fansi_0.4.2 pillar_1.6.0
## [169] lattice_0.20-41 GGally_2.1.1
## [171] fastmap_1.1.0 httr_1.4.2
## [173] survival_3.2-11 glue_1.4.2
## [175] qlcMatrix_0.9.7 png_0.1-7
## [177] bit_4.0.4 ggforce_0.3.3
## [179] stringi_1.5.3 sass_0.3.1
## [181] blob_1.2.1 textshaping_0.3.3
## [183] latticeExtra_0.6-29 memoise_2.0.0
## [185] dplyr_1.0.5 irlba_2.3.3
## [187] future.apply_1.7.0