vignettes/integration.Rmd
integration.Rmd
Here we demonstrate the integration of multiple single-cell ATAC-seq datasets from the adult mouse brain, profiled using 10x Genomics and sci-ATAC-seq. The dataset from 10x Genomics is the same as used in our mouse brain vignette, and you can find links to download the raw data as well as the code used to produce the processed Seurat object used in this vignette on the mouse brain vignette page.
The sci-ATAC-seq dataset was generated by Cusanovich and Hill et al.. Raw data can be downloaded from the authors’ website:
We will demonstrate the use of Seurat v3 integration methods described here on scATAC-seq data, for both dataset integration and label transfer between datasets, as well as use of the harmony package for dataset integration.
First, we load the required packages, read in the data and create a Seurat object for the sci-ATAC-seq data.
library(Signac)
library(Seurat)
library(S4Vectors)
library(patchwork)
set.seed(1234)
# this object was created following the mouse brain vignette
tenx <- readRDS(file = "../vignette_data/adult_mouse_brain.rds")
tenx$tech <- '10x'
tenx$celltype <- Idents(tenx)
sci.metadata <- read.table(
file = "../vignette_data/sci/cell_metadata.txt",
header = TRUE,
row.names = 1,
sep = "\t"
)
# subset to include only the brain data
sci.metadata <- sci.metadata[sci.metadata$tissue == 'PreFrontalCortex', ]
sci.counts <- readRDS(file = "../vignette_data/sci/atac_matrix.binary.qc_filtered.rds")
sci.counts <- sci.counts[, rownames(x = sci.metadata)]
The sci-ATAC-seq data was mapped to the mm9 mouse genome, whereas the 10x data was mapped to mm10, and so the coordinate space of the peaks in each experiment are different. We can lift over the mm9 coordinates to mm10 using the rtracklayer
package, and rename the sci-ATAC-seq peaks with mm10 coordinates. Liftover chains can be downloaded from UCSC. See this post on biostars for more information about lifting over genomic coordinates.
sci_peaks_mm9 <- StringToGRanges(regions = rownames(sci.counts), sep = c("_", "_"))
mm9_mm10 <- rtracklayer::import.chain("/home/stuartt/data/liftover/mm9ToMm10.over.chain")
sci_peaks_mm10 <- rtracklayer::liftOver(x = sci_peaks_mm9, chain = mm9_mm10)
names(sci_peaks_mm10) <- rownames(sci.counts)
# discard any peaks that were mapped to >1 region in mm10
correspondence <- elementNROWS(sci_peaks_mm10)
sci_peaks_mm10 <- sci_peaks_mm10[correspondence == 1]
sci_peaks_mm10 <- unlist(sci_peaks_mm10)
sci.counts <- sci.counts[names(sci_peaks_mm10), ]
# rename peaks with mm10 coordinates
rownames(sci.counts) <- GRangesToString(grange = sci_peaks_mm10)
# create object and perform some basic QC filtering
sci_assay <- CreateChromatinAssay(
counts = sci.counts,
assay = 'sci',
genome = "mm10",
ranges = sci_peaks_mm10,
project = 'sci'
)
sci <- CreateSeuratObject(
counts = sci_assay,
meta.data = sci.metadata,
project = "sci",
assay = "sci"
)
sci <- sci[, sci$nFeature_sci > 2000 &
sci$nCount_sci > 5000 &
!(sci$cell_label %in% c("Collisions", "Unknown"))]
sci$tech <- 'sciATAC'
sci <- RunTFIDF(sci)
sci <- FindTopFeatures(sci, min.cutoff = 50)
sci <- RunSVD(sci)
sci <- RunUMAP(sci, reduction = 'lsi', dims = 2:30)
sci
## An object of class Seurat
## 436131 features across 3344 samples within 1 assay
## Active assay: sci (436131 features, 190944 variable features)
## 2 dimensional reductions calculated: lsi, umap
We now have two scATAC-seq objects containing features in a common coordinate space. However, as the peak calling was performed independently in each experiment, it is unlikely that the peak coordinates overlap perfectly.
In order to have common features across the datasets we want to integrate, we can count reads in the sci-ATAC-seq peaks in the 10x Genomics dataset and create a new assay with these counts. For efficiency, we will downsample the sci-ATAC- seq peaks used. We have found integration results to be quite robust to the number and set of peaks used.
# find peaks that intersect in both datasets
intersecting.regions <- findOverlaps(query = sci, subject = tenx)
# find the coordinates of peaks in the sci-ATAC-seq that intersect peaks in the 10x
intersections.tenx <- unique(queryHits(intersecting.regions))
# choose a subset of intersecting peaks
peaks.use <- sort(granges(sci)[sample(intersections.tenx, size = 10000, replace = FALSE)])
# count fragments per cell overlapping the set of peaks in the 10x data
sci_peaks_tenx <- FeatureMatrix(
fragments = Fragments(tenx),
features = peaks.use,
cells = colnames(tenx)
)
# create a new assay and add it to the 10x dataset
tenx[['sciPeaks']] <- CreateChromatinAssay(
counts = sci_peaks_tenx,
min.cells = 1,
ranges = peaks.use,
genome = 'mm10'
)
DefaultAssay(tenx) <- 'sciPeaks'
tenx <- RunTFIDF(tenx)
Before performing integration it is always a good idea to check if there are dataset-specific difference present that need to be removed. If not, we could simply merge the objects without performing integration. In this case there is a strong difference between the two datasets due to the different technologies used. This effect can be removed using the integration methods in Seurat v3 described in this paper.
# Look at the data without integration first
# create a new assay in the sci-ATAC-seq dataset containing the common peaks
peaknames <- GRangesToString(grange = peaks.use)
sci[['sciPeaks']] <- CreateChromatinAssay(
counts <- GetAssayData(sci, assay = "sci", slot = "counts")[peaknames, ],
ranges = peaks.use,
genome = "mm10"
)
# run TF-IDF for the new assay
DefaultAssay(sci) <- "sciPeaks"
sci <- RunTFIDF(sci)
unintegrated <- merge(sci, tenx)
DefaultAssay(unintegrated) <- "sciPeaks"
unintegrated <- RunTFIDF(unintegrated)
unintegrated <- FindTopFeatures(unintegrated, min.cutoff = 50)
unintegrated <- RunSVD(unintegrated)
unintegrated <- RunUMAP(unintegrated, reduction = 'lsi', dims = 2:30)
p1 <- DimPlot(unintegrated, group.by = 'tech', pt.size = 0.1) + ggplot2::ggtitle("Unintegrated")
# find integration anchors between 10x and sci-ATAC
anchors <- FindIntegrationAnchors(
object.list = list(tenx, sci),
anchor.features = rownames(tenx),
assay = c('sciPeaks', 'sciPeaks'),
k.filter = NA
)
# integrate data and create a new merged object
integrated <- IntegrateData(
anchorset = anchors,
weight.reduction = sci[['lsi']],
dims = 2:30,
preserve.order = TRUE
)
# we now have a "corrected" TF-IDF matrix, and can run LSI again on this corrected matrix
integrated <- RunSVD(
object = integrated,
n = 30,
reduction.name = 'integratedLSI'
)
integrated <- RunUMAP(
object = integrated,
dims = 2:30,
reduction = 'integratedLSI'
)
p2 <- DimPlot(integrated, group.by = 'tech', pt.size = 0.1) + ggplot2::ggtitle("Integrated")
p1 + p2
We can also use Seurat to transfer data from one dataset to another. We previously demonstrated this between scRNA-seq datasets, and between scRNA-seq and scATAC-seq datasets (https://doi.org/10.1016/j.cell.2019.05.031). Here, we demonstrate data transfer between two single-cell ATAC-seq datasets by transferring the cell type label from the 10x Genomics scATAC-seq data to the sci-ATAC-seq data.
transfer.anchors <- FindTransferAnchors(
reference = tenx,
query = sci,
reference.assay = 'sciPeaks',
query.assay = 'sciPeaks',
reduction = 'cca',
features = rownames(tenx),
k.filter = NA
)
predicted.id <- TransferData(
anchorset = transfer.anchors,
refdata = tenx$celltype,
weight.reduction = sci[['lsi']],
dims = 2:30
)
sci <- AddMetaData(
object = sci,
metadata = predicted.id
)
sci$predicted.id <- factor(sci$predicted.id, levels = levels(tenx$celltype)) # to make the colors match
p3 <- DimPlot(tenx, group.by = 'celltype', label = TRUE) + NoLegend() + ggplot2::ggtitle("Celltype labels 10x scATAC-seq")
p4 <- DimPlot(sci, group.by = 'predicted.id', label = TRUE) + NoLegend() + ggplot2::ggtitle("Predicted labels sci-ATAC-seq")
p3 + p4
While Seurat adjusts the underlying high-dimensional data in order to correct for differences between the groups, harmony (developed by the Raychaudhuri lab) adjusts the low-dimensional cell embeddings to to reduce the dependence between cluster assignment and dataset of origin. You can read more about Harmony here.
Harmony requires a single object as input, so here we use the merged the sci-ATAC-seq and scATAC-seq datasets that were merged in a coordinate-aware manner using the merge()
function, and then compute LSI embeddings using the merged object. Once we have computed the LSI embeddings, we can run the RunHarmony()
function from the harmony
package and provide the technology used as a grouping variable to remove the batch difference between the sci-ATAC-seq and 10x Genomics scATAC-seq datasets. This produces a set of “corrected” LSI embeddings that can be used for further dimension reduction with UMAP or tSNE, and for clustering.
library(harmony)
hm.integrated <- RunHarmony(
object = unintegrated,
group.by.vars = 'tech',
reduction = 'lsi',
assay.use = 'sciPeaks',
project.dim = FALSE
)
# re-compute the UMAP using corrected LSI embeddings
hm.integrated <- RunUMAP(hm.integrated, dims = 2:30, reduction = 'harmony')
p5 <- DimPlot(hm.integrated, group.by = 'tech', pt.size = 0.1) + ggplot2::ggtitle("Harmony integration")
p1 + p5
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] harmony_1.0 Rcpp_1.0.6 GenomeInfoDb_1.26.5
## [4] IRanges_2.24.1 patchwork_1.1.1 S4Vectors_0.28.1
## [7] BiocGenerics_0.36.0 SeuratObject_4.0.0 Seurat_4.0.1.9005
## [10] Signac_1.2.0
##
## loaded via a namespace (and not attached):
## [1] fastmatch_1.1-0 systemfonts_1.0.1
## [3] plyr_1.8.6 igraph_1.2.6
## [5] lazyeval_0.2.2 splines_4.0.1
## [7] BiocParallel_1.24.1 listenv_0.8.0
## [9] SnowballC_0.7.0 scattermore_0.7
## [11] ggplot2_3.3.3 digest_0.6.27
## [13] htmltools_0.5.1.1 fansi_0.4.2
## [15] magrittr_2.0.1 memoise_2.0.0
## [17] tensor_1.5 cluster_2.1.2
## [19] ROCR_1.0-11 globals_0.14.0
## [21] Biostrings_2.58.0 matrixStats_0.58.0
## [23] docopt_0.7.1 pkgdown_1.6.1
## [25] spatstat.sparse_2.0-0 colorspace_2.0-0
## [27] ggrepel_0.9.1 textshaping_0.3.3
## [29] xfun_0.22 dplyr_1.0.5
## [31] sparsesvd_0.2 crayon_1.4.1
## [33] RCurl_1.98-1.3 jsonlite_1.7.2
## [35] spatstat.data_2.1-0 survival_3.2-11
## [37] zoo_1.8-9 glue_1.4.2
## [39] polyclip_1.10-0 gtable_0.3.0
## [41] zlibbioc_1.36.0 XVector_0.30.0
## [43] leiden_0.3.7 DelayedArray_0.16.3
## [45] future.apply_1.7.0 abind_1.4-5
## [47] scales_1.1.1 DBI_1.1.1
## [49] miniUI_0.1.1.1 viridisLite_0.4.0
## [51] xtable_1.8-4 reticulate_1.19
## [53] spatstat.core_2.1-2 htmlwidgets_1.5.3
## [55] httr_1.4.2 RColorBrewer_1.1-2
## [57] ellipsis_0.3.1 ica_1.0-2
## [59] XML_3.99-0.6 pkgconfig_2.0.3
## [61] farver_2.1.0 ggseqlogo_0.1
## [63] sass_0.3.1 uwot_0.1.10
## [65] deldir_0.2-10 utf8_1.2.1
## [67] labeling_0.4.2 tidyselect_1.1.0
## [69] rlang_0.4.10 reshape2_1.4.4
## [71] later_1.2.0 munsell_0.5.0
## [73] tools_4.0.1 cachem_1.0.4
## [75] generics_0.1.0 ggridges_0.5.3
## [77] evaluate_0.14 stringr_1.4.0
## [79] fastmap_1.1.0 yaml_2.2.1
## [81] ragg_1.1.2 goftest_1.2-2
## [83] knitr_1.33 fs_1.5.0
## [85] fitdistrplus_1.1-3 purrr_0.3.4
## [87] RANN_2.6.1 pbapply_1.4-3
## [89] future_1.21.0 nlme_3.1-152
## [91] mime_0.10 slam_0.1-48
## [93] RcppRoll_0.3.0 compiler_4.0.1
## [95] plotly_4.9.3 png_0.1-7
## [97] spatstat.utils_2.1-0 tibble_3.1.1
## [99] tweenr_1.0.2 bslib_0.2.4
## [101] stringi_1.5.3 highr_0.9
## [103] RSpectra_0.16-0 desc_1.3.0
## [105] lattice_0.20-41 Matrix_1.3-2
## [107] vctrs_0.3.7 pillar_1.6.0
## [109] lifecycle_1.0.0 spatstat.geom_2.1-0
## [111] lmtest_0.9-38 jquerylib_0.1.4
## [113] RcppAnnoy_0.0.18 data.table_1.14.0
## [115] cowplot_1.1.1 bitops_1.0-7
## [117] irlba_2.3.3 rtracklayer_1.50.0
## [119] httpuv_1.6.0 GenomicRanges_1.42.0
## [121] R6_2.5.0 promises_1.2.0.1
## [123] lsa_0.73.2 KernSmooth_2.23-18
## [125] gridExtra_2.3 parallelly_1.24.0
## [127] codetools_0.2-18 MASS_7.3-53.1
## [129] assertthat_0.2.1 SummarizedExperiment_1.20.0
## [131] rprojroot_2.0.2 GenomicAlignments_1.26.0
## [133] qlcMatrix_0.9.7 sctransform_0.3.2
## [135] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4
## [137] mgcv_1.8-33 grid_4.0.1
## [139] rpart_4.1-15 tidyr_1.1.3
## [141] rmarkdown_2.7 MatrixGenerics_1.2.1
## [143] Rtsne_0.15 ggforce_0.3.3
## [145] Biobase_2.50.0 shiny_1.6.0