Merging objects

In this vignette we demonstrate how to merge multiple Seurat objects containing single-cell chromatin data. To demonstrate, we will use four scATAC-seq PBMC datasets provided by 10x Genomics:

View data download code

To download the required data, run the following lines in a shell:

# 500 cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_filtered_peak_bc_matrix.h5

# 1k cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_1k_nextgem/atac_pbmc_1k_nextgem_filtered_peak_bc_matrix.h5

# 5k cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_filtered_peak_bc_matrix.h5

# 10k cell
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_filtered_peak_bc_matrix.h5

When merging multiple single-cell chromatin datasets, it’s important to be aware that if peak calling was performed on each dataset independently, the peaks are unlikely to be exactly the same. We therefore need to create a common set of peaks across all the datasets to be merged.

To create a unified set of peaks we can use functions from the GenomicRanges package. The reduce function from GenomicRanges will merge all intersecting peaks. Another option is to use the disjoin function, that will create distinct non-overlapping sets of peaks. Here is a visual example to illustrate the difference between reduce and disjoin:

gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = c(20, 70, 300), end = c(120, 200, 400)))
gr

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1    20-120      *
##   [2]     chr1    70-200      *
##   [3]     chr1   300-400      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Creating a common peak set

If the peaks were identified independently in each experiment then they will likely not overlap perfectly. We can merge peaks from all the datasets to create a common peak set, and quantify this peak set in each experiment prior to merging the objects.

First we’ll load the peak coordinates for each experiment and convert them to genomic ranges, the use the GenomicRanges::reduce function to create a common set of peaks to quantify in each dataset.

library(Signac)
library(Seurat)
library(GenomicRanges)
library(future)

plan("multicore", workers = 4)
options(future.globals.maxSize = 50000 * 1024^2) # for 50 Gb RAM

# read in peak sets
peaks.500 <- read.table(
  file = "pbmc500/atac_pbmc_500_nextgem_peaks.bed",
  col.names = c("chr", "start", "end")
)
peaks.1k <- read.table(
  file = "pbmc1k/atac_pbmc_1k_nextgem_peaks.bed",
  col.names = c("chr", "start", "end")
)
peaks.5k <- read.table(
  file = "pbmc5k/atac_pbmc_5k_nextgem_peaks.bed",
  col.names = c("chr", "start", "end")
)
peaks.10k <- read.table(
  file = "pbmc10k/atac_pbmc_10k_nextgem_peaks.bed",
  col.names = c("chr", "start", "end")
)

# convert to genomic ranges
gr.500 <- makeGRangesFromDataFrame(peaks.500)
gr.1k <- makeGRangesFromDataFrame(peaks.1k)
gr.5k <- makeGRangesFromDataFrame(peaks.5k)
gr.10k <- makeGRangesFromDataFrame(peaks.10k)

# Create a unified set of peaks to quantify in each dataset
combined.peaks <- reduce(x = c(gr.500, gr.1k, gr.5k, gr.10k))

# Filter out bad peaks based on length
peakwidths <- width(combined.peaks)
combined.peaks <- combined.peaks[peakwidths  < 10000 & peakwidths > 20]
combined.peaks

## GRanges object with 89951 ranges and 0 metadata columns:
##           seqnames            ranges strand
##              <Rle>         <IRanges>  <Rle>
##       [1]     chr1     565153-565499      *
##       [2]     chr1     569185-569620      *
##       [3]     chr1     713551-714783      *
##       [4]     chr1     752418-753020      *
##       [5]     chr1     762249-763345      *
##       ...      ...               ...    ...
##   [89947]     chrY 23422151-23422632      *
##   [89948]     chrY 23583994-23584463      *
##   [89949]     chrY 23602466-23602779      *
##   [89950]     chrY 28816593-28817710      *
##   [89951]     chrY 58855911-58856251      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Create Fragment objects

To quantify our combined set of peaks we’ll need to create a Fragment object for each experiment. The Fragment class is a specialized class defined in Signac to hold all the information related to a single fragment file.

First we’ll load the cell metadata for each experiment so that we know what cell barcodes are contained in each file, then we can create Fragment objects using the CreateFragmentObject function. The CreateFragmentObject function performs some checks to ensure that the file is present on disk and that it is compressed and indexed, computes the MD5 sum for the file and the tabix index so that we can tell if the file is modified at any point, and checks that the expected cells are present in the file.

# load metadata
md.500 <- read.table(
  file = "pbmc500/atac_pbmc_500_nextgem_singlecell.csv",
  stringsAsFactors = FALSE,
  sep = ",",
  header = TRUE,
  row.names = 1
)[-1, ] # remove the first row

md.1k <- read.table(
  file = "pbmc1k/atac_pbmc_1k_nextgem_singlecell.csv",
  stringsAsFactors = FALSE,
  sep = ",",
  header = TRUE,
  row.names = 1
)[-1, ]

md.5k <- read.table(
  file = "pbmc5k/atac_pbmc_5k_nextgem_singlecell.csv",
  stringsAsFactors = FALSE,
  sep = ",",
  header = TRUE,
  row.names = 1
)[-1, ]

md.10k <- read.table(
  file = "pbmc10k/atac_pbmc_10k_nextgem_singlecell.csv",
  stringsAsFactors = FALSE,
  sep = ",",
  header = TRUE,
  row.names = 1
)[-1, ]

# perform an initial filtering of low count cells
md.500 <- md.500[md.500$passed_filters > 500, ]
md.1k <- md.1k[md.1k$passed_filters > 500, ]
md.5k <- md.5k[md.5k$passed_filters > 500, ]
md.10k <- md.10k[md.10k$passed_filters > 1000, ] # sequenced deeper so set higher cutoff

# create fragment objects
frags.500 <- CreateFragmentObject(
  path = "pbmc500/atac_pbmc_500_nextgem_fragments.tsv.gz",
  cells = rownames(md.500)
)

## Computing hash

frags.1k <- CreateFragmentObject(
  path = "pbmc1k/atac_pbmc_1k_nextgem_fragments.tsv.gz",
  cells = rownames(md.1k)
)

## Computing hash

frags.5k <- CreateFragmentObject(
  path = "pbmc5k/atac_pbmc_5k_nextgem_fragments.tsv.gz",
  cells = rownames(md.5k)
)

## Computing hash

frags.10k <- CreateFragmentObject(
  path = "pbmc10k/atac_pbmc_10k_nextgem_fragments.tsv.gz",
  cells = rownames(md.10k)
)

## Computing hash

Quantify peaks in each dataset

We can now create a matrix of peaks x cell for each sample using the FeatureMatrix function. This function is parallelized using the future package. See the parallelization vignette for more information about using future.

pbmc500.counts <- FeatureMatrix(
  fragments = frags.500,
  features = combined.peaks,
  cells = rownames(md.500)
)

pbmc1k.counts <- FeatureMatrix(
  fragments = frags.1k,
  features = combined.peaks,
  cells = rownames(md.1k)
)

pbmc5k.counts <- FeatureMatrix(
  fragments = frags.5k,
  features = combined.peaks,
  cells = rownames(md.5k)
)

pbmc10k.counts <- FeatureMatrix(
  fragments = frags.10k,
  features = combined.peaks,
  cells = rownames(md.10k)
)

Create the objects

We will now use the quantified matrices to create a Seurat object for each dataset, storing the Fragment object for each dataset in the assay.

pbmc500_assay <- CreateChromatinAssay(pbmc500.counts, fragments = frags.500)
pbmc500 <- CreateSeuratObject(pbmc500_assay, assay = "ATAC", meta.data=md.500)

pbmc1k_assay <- CreateChromatinAssay(pbmc1k.counts, fragments = frags.1k)
pbmc1k <- CreateSeuratObject(pbmc1k_assay, assay = "ATAC", meta.data=md.1k)

pbmc5k_assay <- CreateChromatinAssay(pbmc5k.counts, fragments = frags.5k)
pbmc5k <- CreateSeuratObject(pbmc5k_assay, assay = "ATAC", meta.data=md.5k)

pbmc10k_assay <- CreateChromatinAssay(pbmc10k.counts, fragments = frags.10k)
pbmc10k <- CreateSeuratObject(pbmc10k_assay, assay = "ATAC", meta.data=md.10k)

Merge objects

Now that the objects each contain an assay with the same set of features, we can use the standard merge function to merge the objects. This will also merge all the fragment objects so that we retain the fragment information for each cell in the final merged object.

# add information to identify dataset of origin
pbmc500$dataset <- 'pbmc500'
pbmc1k$dataset <- 'pbmc1k'
pbmc5k$dataset <- 'pbmc5k'
pbmc10k$dataset <- 'pbmc10k'

# merge all datasets, adding a cell ID to make sure cell names are unique
combined <- merge(
  x = pbmc500,
  y = list(pbmc1k, pbmc5k, pbmc10k),
  add.cell.ids = c("500", "1k", "5k", "10k")
)
combined[["ATAC"]]

## ChromatinAssay data with 89951 features for 21688 cells
## Variable features: 0 
## Genome: 
## Annotation present: FALSE 
## Motifs present: FALSE 
## Fragment files: 4

combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 20)
combined <- RunSVD(combined)
combined <- RunUMAP(combined, dims = 2:50, reduction = 'lsi')

DimPlot(combined, group.by = 'dataset', pt.size = 0.1)

The merged object contains all four fragment objects, and contains an internal mapping of cell names in the object to the cell names in each fragment file so that we can retrieve information from the files without having to change the cell names in each fragment file. We can check that functions that pull data from the fragment files work as expected on the merged object by plotting a region of the genome:

CoveragePlot(
  object = combined,
  group.by = 'dataset',
  region = "chr14-99700000-99760000"
)

Merging without a common feature set

The above approach requires that we have access to a fragment file for each dataset. In some cases we may not have this data (although we can create a fragment file from the BAM file using sinto). In these cases, we can still create a merged object, with the caveat that the resulting merged count matrix may not be as accurate.

The merge function defined in Signac for ChromatinAssay objects will consider overlapping peaks as equivalent, and adjust the genomic ranges spanned by the peak so that the features in each object being merged become equivalent. Note that this can result in inaccuracies in the count matrix, as some peaks will be extended to cover regions that were not originally quantified. This is the best that can be done without re-quantification, and we recommend always following the procedure outlined above for object merging whenever possible.

Here we demonstrate merging the same four PBMC datasets without creating a common feature set:

# load the count matrix for each object that was generated by cellranger
counts.500 <- Read10X_h5("pbmc500/atac_pbmc_500_nextgem_filtered_peak_bc_matrix.h5")
counts.1k <- Read10X_h5("pbmc1k/atac_pbmc_1k_nextgem_filtered_peak_bc_matrix.h5")
counts.5k <- Read10X_h5("pbmc5k/atac_pbmc_5k_nextgem_filtered_peak_bc_matrix.h5")
counts.10k <- Read10X_h5("pbmc10k/atac_pbmc_10k_nextgem_filtered_peak_bc_matrix.h5")

# create objects
pbmc500_assay <- CreateChromatinAssay(counts = counts.500, sep = c(":", "-"), min.features = 500)
pbmc500 <- CreateSeuratObject(pbmc500_assay, assay = "peaks")
pbmc1k_assay <- CreateChromatinAssay(counts = counts.1k, sep = c(":", "-"), min.features = 500)
pbmc1k <- CreateSeuratObject(pbmc1k_assay, assay = "peaks")
pbmc5k_assay <- CreateChromatinAssay(counts = counts.5k, sep = c(":", "-"), min.features = 500)
pbmc5k <- CreateSeuratObject(pbmc5k_assay, assay = "peaks")
pbmc10k_assay <- CreateChromatinAssay(counts = counts.10k, sep = c(":", "-"), min.features = 1000)
pbmc10k <- CreateSeuratObject(pbmc10k_assay, assay = "peaks")

# add information to identify dataset of origin
pbmc500$dataset <- 'pbmc500'
pbmc1k$dataset <- 'pbmc1k'
pbmc5k$dataset <- 'pbmc5k'
pbmc10k$dataset <- 'pbmc10k'

# merge
combined <- merge(
  x = pbmc500,
  y = list(pbmc1k, pbmc5k, pbmc10k),
  add.cell.ids = c("500", "1k", "5k", "10k")
)

# process 
combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 20)
combined <- RunSVD(combined)
combined <- RunUMAP(combined, dims = 2:50, reduction = 'lsi')

DimPlot(combined, group.by = 'dataset', pt.size = 0.1)

Session Info

sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Singapore
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] future_1.34.0        Seurat_5.1.0         SeuratObject_5.0.2  
##  [4] sp_2.1-4             Signac_1.14.0        patchwork_1.2.0     
##  [7] ggbio_1.50.0         ggplot2_3.5.1        GenomicRanges_1.54.1
## [10] GenomeInfoDb_1.38.8  IRanges_2.36.0       S4Vectors_0.40.2    
## [13] BiocGenerics_0.48.1 
## 
## loaded via a namespace (and not attached):
##   [1] spatstat.sparse_3.0-3       fs_1.6.4                   
##   [3] ProtGenerics_1.34.0         matrixStats_1.3.0          
##   [5] bitops_1.0-8                httr_1.4.7                 
##   [7] RColorBrewer_1.1-3          tools_4.3.1                
##   [9] sctransform_0.4.1           backports_1.5.0            
##  [11] utf8_1.2.4                  R6_2.5.1                   
##  [13] uwot_0.1.16                 lazyeval_0.2.2             
##  [15] withr_3.0.1                 prettyunits_1.2.0          
##  [17] GGally_2.2.1                gridExtra_2.3              
##  [19] progressr_0.14.0            cli_3.6.3                  
##  [21] Biobase_2.62.0              textshaping_0.4.0          
##  [23] spatstat.explore_3.2-7      fastDummies_1.7.3          
##  [25] labeling_0.4.3              sass_0.4.9                 
##  [27] spatstat.data_3.0-4         ggridges_0.5.6             
##  [29] pbapply_1.7-2               pkgdown_2.0.9              
##  [31] Rsamtools_2.18.0            systemfonts_1.1.0          
##  [33] foreign_0.8-86              dichromat_2.0-0.1          
##  [35] parallelly_1.38.0           BSgenome_1.70.2            
##  [37] rstudioapi_0.16.0           RSQLite_2.3.7              
##  [39] generics_0.1.3              BiocIO_1.12.0              
##  [41] spatstat.random_3.2-3       ica_1.0-3                  
##  [43] dplyr_1.1.4                 Matrix_1.6-5               
##  [45] fansi_1.0.6                 abind_1.4-5                
##  [47] lifecycle_1.0.4             yaml_2.3.10                
##  [49] SummarizedExperiment_1.32.0 SparseArray_1.2.4          
##  [51] BiocFileCache_2.10.2        Rtsne_0.17                 
##  [53] grid_4.3.1                  blob_1.2.4                 
##  [55] promises_1.3.0              crayon_1.5.3               
##  [57] miniUI_0.1.1.1              lattice_0.22-6             
##  [59] cowplot_1.1.3               GenomicFeatures_1.54.4     
##  [61] KEGGREST_1.42.0             pillar_1.9.0               
##  [63] knitr_1.48                  rjson_0.2.21               
##  [65] future.apply_1.11.2         codetools_0.2-19           
##  [67] fastmatch_1.1-4             leiden_0.4.3.1             
##  [69] glue_1.7.0                  data.table_1.15.4          
##  [71] vctrs_0.6.5                 png_0.1-8                  
##  [73] spam_2.10-0                 gtable_0.3.5               
##  [75] cachem_1.1.0                xfun_0.47                  
##  [77] S4Arrays_1.2.1              mime_0.12                  
##  [79] survival_3.6-4              RcppRoll_0.3.1             
##  [81] fitdistrplus_1.1-11         ROCR_1.0-11                
##  [83] nlme_3.1-164                bit64_4.0.5                
##  [85] progress_1.2.3              filelock_1.0.3             
##  [87] RcppAnnoy_0.0.22            bslib_0.8.0                
##  [89] irlba_2.3.5.1               KernSmooth_2.23-24         
##  [91] rpart_4.1.23                colorspace_2.1-1           
##  [93] DBI_1.2.3                   Hmisc_5.1-3                
##  [95] nnet_7.3-19                 tidyselect_1.2.1           
##  [97] bit_4.0.5                   compiler_4.3.1             
##  [99] curl_5.2.1                  graph_1.80.0               
## [101] htmlTable_2.4.3             hdf5r_1.3.10               
## [103] xml2_1.3.6                  desc_1.4.3                 
## [105] DelayedArray_0.28.0         plotly_4.10.4              
## [107] rtracklayer_1.62.0          checkmate_2.3.2            
## [109] scales_1.3.0                lmtest_0.9-40              
## [111] RBGL_1.78.0                 rappdirs_0.3.3             
## [113] goftest_1.2-3               stringr_1.5.1              
## [115] digest_0.6.37               spatstat.utils_3.0-4       
## [117] rmarkdown_2.28              XVector_0.42.0             
## [119] htmltools_0.5.8.1           pkgconfig_2.0.3            
## [121] base64enc_0.1-3             MatrixGenerics_1.14.0      
## [123] highr_0.11                  dbplyr_2.5.0               
## [125] fastmap_1.2.0               ensembldb_2.26.0           
## [127] rlang_1.1.4                 htmlwidgets_1.6.4          
## [129] shiny_1.9.1                 farver_2.1.2               
## [131] jquerylib_0.1.4             zoo_1.8-12                 
## [133] jsonlite_1.8.8              BiocParallel_1.36.0        
## [135] VariantAnnotation_1.48.1    RCurl_1.98-1.16            
## [137] magrittr_2.0.3              Formula_1.2-5              
## [139] GenomeInfoDbData_1.2.11     dotCall64_1.1-1            
## [141] munsell_0.5.1               Rcpp_1.0.13                
## [143] reticulate_1.37.0           stringi_1.8.4              
## [145] zlibbioc_1.48.2             MASS_7.3-60.0.1            
## [147] plyr_1.8.9                  ggstats_0.6.0              
## [149] parallel_4.3.1              listenv_0.9.1              
## [151] ggrepel_0.9.5               deldir_2.0-4               
## [153] Biostrings_2.70.3           splines_4.3.1              
## [155] tensor_1.5                  hms_1.1.3                  
## [157] igraph_2.0.3                spatstat.geom_3.2-9        
## [159] RcppHNSW_0.6.0              reshape2_1.4.4             
## [161] biomaRt_2.58.2              XML_3.99-0.17              
## [163] evaluate_0.24.0             biovizBase_1.50.0          
## [165] BiocManager_1.30.23         httpuv_1.6.15              
## [167] polyclip_1.10-6             RANN_2.6.1                 
## [169] tidyr_1.3.1                 purrr_1.0.2                
## [171] scattermore_1.2             xtable_1.8-4               
## [173] restfulr_0.0.15             AnnotationFilter_1.26.0    
## [175] RSpectra_0.16-2             later_1.3.2                
## [177] viridisLite_0.4.2           ragg_1.3.2                 
## [179] OrganismDbi_1.44.0          tibble_3.2.1               
## [181] memoise_2.0.1               AnnotationDbi_1.64.1       
## [183] GenomicAlignments_1.38.2    cluster_2.1.6              
## [185] globals_0.16.3