vignettes/integrate_atac.Rmd
integrate_atac.Rmd
Here we demonstrate the integration of multiple single-cell chromatin datasets derived from human PBMCs. One dataset was generated using the 10x Genomics multiome technology, and includes DNA accessibility and gene expression information for each cell. The other dataset was profiled using 10x Genomics scATAC-seq, and includes DNA accessibility data only.
We will integrate the two datasets together using the shared DNA accessibility assay, using tools available in the Seurat package. Furthermore, we will demonstrate transferring both continuous (gene expression) and categorical (cell labels) information from a reference to a query single-cell chromatin dataset.
The PBMC multiome and scATAC-seq data can be downloaded from the 10x website:
# multiome
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi
# scATAC
wget https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_fragments.tsv.gz.tbi
Here we’ll load the PBMC multiome data pre-processed in our multiome vignette, and create a new object from the scATAC-seq data:
library(Signac)
library(Seurat)
library(ggplot2)
# load the pre-processed multiome data
pbmc.multi <- readRDS("../vignette_data/pbmc_multiomic.rds")
# process the scATAC data
# first count fragments per cell
fragpath <- "../vignette_data/atac_pbmc_10k_nextgem_fragments.tsv.gz"
fragcounts <- CountFragments(fragments = fragpath)
atac.cells <- fragcounts[fragcounts$frequency_count > 2000, "CB"]
# create the fragment object
atac.frags <- CreateFragmentObject(path = fragpath, cells = atac.cells)
An important first step in any integrative analysis of single-cell chromatin data is to ensure that the same features are measured in each dataset. Here, we quantify the multiome peaks in the ATAC dataset to ensure that there are common features across the two datasets. See the merge vignette for more information about merging chromatin assays.
# quantify multiome peaks in the scATAC-seq dataset
counts <- FeatureMatrix(
fragments = atac.frags,
features = granges(pbmc.multi),
cells = atac.cells
)
# create object
atac.assay <- CreateChromatinAssay(
counts = counts,
min.features = 1000,
fragments = atac.frags
)
pbmc.atac <- CreateSeuratObject(counts = atac.assay, assay = "peaks")
pbmc.atac <- subset(pbmc.atac, nCount_peaks > 2000 & nCount_peaks < 30000)
# compute LSI
pbmc.atac <- FindTopFeatures(pbmc.atac, min.cutoff = 10)
pbmc.atac <- RunTFIDF(pbmc.atac)
pbmc.atac <- RunSVD(pbmc.atac)
Next we can merge the multiome and scATAC datasets together and observe that there is a difference between them that appears to be due to the batch (experiment and technology-specific variation).
# first add dataset-identifying metadata
pbmc.atac$dataset <- "ATAC"
pbmc.multi$dataset <- "Multiome"
# merge
pbmc.combined <- merge(pbmc.atac, pbmc.multi)
# process the combined dataset
pbmc.combined <- FindTopFeatures(pbmc.combined, min.cutoff = 10)
pbmc.combined <- RunTFIDF(pbmc.combined)
pbmc.combined <- RunSVD(pbmc.combined)
pbmc.combined <- RunUMAP(pbmc.combined, reduction = "lsi", dims = 2:30)
p1 <- DimPlot(pbmc.combined, group.by = "dataset")
To find integration anchors between the two datasets, we need to
project them into a shared low-dimensional space. To do this, we’ll use
reciprocal LSI projection (projecting each dataset into the others LSI
space) by setting reduction="rlsi"
. For more information
about the data integration methods in Seurat, see our recent paper and the Seurat website.
Rather than integrating the normalized data matrix, as is typically
done for scRNA-seq data, we’ll integrate the low-dimensional cell
embeddings (the LSI coordinates) across the datasets using the
IntegrateEmbeddings()
function. This is much better suited
to scATAC-seq data, as we typically have a very sparse matrix with a
large number of features. Note that this requires that we first compute
an uncorrected LSI embedding using the merged dataset (as we did
above).
# find integration anchors
integration.anchors <- FindIntegrationAnchors(
object.list = list(pbmc.multi, pbmc.atac),
anchor.features = rownames(pbmc.multi),
reduction = "rlsi",
dims = 2:30
)
# integrate LSI embeddings
integrated <- IntegrateEmbeddings(
anchorset = integration.anchors,
reductions = pbmc.combined[["lsi"]],
new.reduction.name = "integrated_lsi",
dims.to.integrate = 1:30
)
# create a new UMAP using the integrated embeddings
integrated <- RunUMAP(integrated, reduction = "integrated_lsi", dims = 2:30)
p2 <- DimPlot(integrated, group.by = "dataset")
Finally, we can compare the results of the merged and integrated datasets, and find that the integration has successfully removed the technology-specific variation in the dataset while retaining the cell-type-specific (biological) variation.
Here we’ve demonstrated the integration method using two datasets, but the same workflow can be applied to integrate any number of datasets.
In cases where we have a large, high-quality dataset, or a dataset containing unique information not present in other datasets (cell type annotations or additional data modalities, for example), we often want to use that dataset as a reference and map queries onto it so that we can interpret these query datasets in the context of the existing reference.
To demonstrate how to do this using single-cell chromatin reference
and query datasets, we’ll treat the PBMC multiome dataset here as a
reference and map the scATAC-seq dataset to it using the
FindTransferAnchors()
and MapQuery()
functions
from Seurat.
# compute UMAP and store the UMAP model
pbmc.multi <- RunUMAP(pbmc.multi, reduction = "lsi", dims = 2:30, return.model = TRUE)
# find transfer anchors
transfer.anchors <- FindTransferAnchors(
reference = pbmc.multi,
query = pbmc.atac,
reference.reduction = "lsi",
reduction = "lsiproject",
dims = 2:30
)
# map query onto the reference dataset
pbmc.atac <- MapQuery(
anchorset = transfer.anchors,
reference = pbmc.multi,
query = pbmc.atac,
refdata = pbmc.multi$predicted.id,
reference.reduction = "lsi",
new.reduction.name = "ref.lsi",
reduction.model = 'umap'
)
MapQuery()
doing?
MapQuery()
is a wrapper function that runs
TransferData()
, IntegrateEmbeddings()
, and
ProjectUMAP()
for a query dataset, and sets sensible
default parameters based on how the anchor object was generated. For
finer control over the parameters used by each of these functions, you
can pass parameters through MapQuery()
to each function
using the transferdata.args
,
integrateembeddings.args
, and projectumap.args
arguments for MapQuery()
, or you can run each of the
functions yourself. For example:
pbmc.atac <- TransferData(
anchorset = transfer.anchors,
reference = pbmc.multi,
weight.reduction = "lsiproject",
query = pbmc.atac,
refdata = list(
celltype = "predicted.id",
predicted_RNA = "RNA")
)
pbmc.atac <- IntegrateEmbeddings(
anchorset = transfer.anchors,
reference = pbmc.multi,
query = pbmc.atac,
reductions = "lsiproject",
new.reduction.name = "ref.lsi"
)
pbmc.atac <- ProjectUMAP(
query = pbmc.atac,
query.reduction = "ref.lsi",
reference = pbmc.multi,
reference.reduction = "lsi",
reduction.model = "umap"
)
By running MapQuery()
, we have mapped the scATAC-seq
dataset onto the the multimodal reference, and enabled cell type labels
to be transferred from reference to query. We can visualize these
reference mapping results and the cell type labels now associated with
the scATAC-seq dataset:
p1 <- DimPlot(pbmc.multi, reduction = "umap", group.by = "predicted.id", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("Reference")
p2 <- DimPlot(pbmc.atac, reduction = "ref.umap", group.by = "predicted.id", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("Query")
p1 | p2
For more information about multimodal reference mapping, see the Seurat vignette.
Above we transferred categorical information (the cell labels) and
mapped the query data onto an existing reference UMAP. We can also
transfer continuous data from the reference to the query in the same
way. Here we demonstrate transferring the gene expression values from
the PBMC multiome dataset (that measured DNA accessibility and gene
expression in the same cells) to the PBMC scATAC-seq dataset (that
measured DNA accessibility only). Note that we could also transfer these
values using the MapQuery()
function call above by setting
the refdata
parameter to a list of values.
# predict gene expression values
rna <- TransferData(
anchorset = transfer.anchors,
refdata = GetAssayData(pbmc.multi, assay = "RNA", slot = "data"),
weight.reduction = pbmc.atac[["lsi"]],
dims = 2:30
)
# add predicted values as a new assay
pbmc.atac[["predicted"]] <- rna
We can look at some immune marker genes and see that the predicted expression patterns match our expectation based on known expression patterns.
DefaultAssay(pbmc.atac) <- "predicted"
FeaturePlot(
object = pbmc.atac,
features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
pt.size = 0.1,
max.cutoff = 'q95',
reduction = "ref.umap",
ncol = 3
)
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Singapore
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.4.4 SeuratObject_4.1.4 Seurat_4.4.0 Signac_1.11.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.7
## [4] magrittr_2.0.3 spatstat.utils_3.0-3 farver_2.1.1
## [7] rmarkdown_2.25 fs_1.6.3 zlibbioc_1.46.0
## [10] ragg_1.2.6 vctrs_0.6.3 ROCR_1.0-11
## [13] memoise_2.0.1 spatstat.explore_3.2-3 Rsamtools_2.16.0
## [16] RCurl_1.98-1.12 RcppRoll_0.3.0 htmltools_0.5.6.1
## [19] sass_0.4.7 sctransform_0.4.0 parallelly_1.36.0
## [22] KernSmooth_2.23-21 bslib_0.5.1 htmlwidgets_1.6.2
## [25] desc_1.4.2 ica_1.0-3 plyr_1.8.9
## [28] plotly_4.10.2 zoo_1.8-12 cachem_1.0.8
## [31] igraph_1.5.1 mime_0.12 lifecycle_1.0.3
## [34] pkgconfig_2.0.3 Matrix_1.6-1.1 R6_2.5.1
## [37] fastmap_1.1.1 GenomeInfoDbData_1.2.10 fitdistrplus_1.1-11
## [40] future_1.33.0 shiny_1.7.5 digest_0.6.33
## [43] colorspace_2.1-0 patchwork_1.1.3 S4Vectors_0.38.2
## [46] tensor_1.5 rprojroot_2.0.3 irlba_2.3.5.1
## [49] textshaping_0.3.7 GenomicRanges_1.52.0 labeling_0.4.3
## [52] progressr_0.14.0 spatstat.sparse_3.0-2 fansi_1.0.5
## [55] polyclip_1.10-6 abind_1.4-5 httr_1.4.7
## [58] compiler_4.3.1 withr_2.5.1 BiocParallel_1.34.2
## [61] MASS_7.3-60 tools_4.3.1 lmtest_0.9-40
## [64] httpuv_1.6.11 future.apply_1.11.0 goftest_1.2-3
## [67] glue_1.6.2 nlme_3.1-162 promises_1.2.1
## [70] grid_4.3.1 Rtsne_0.16 cluster_2.1.4
## [73] reshape2_1.4.4 generics_0.1.3 gtable_0.3.4
## [76] spatstat.data_3.0-1 tidyr_1.3.0 data.table_1.14.8
## [79] sp_2.1-0 utf8_1.2.3 XVector_0.40.0
## [82] spatstat.geom_3.2-5 BiocGenerics_0.46.0 RcppAnnoy_0.0.21
## [85] ggrepel_0.9.3 RANN_2.6.1 pillar_1.9.0
## [88] stringr_1.5.0 later_1.3.1 splines_4.3.1
## [91] dplyr_1.1.3 lattice_0.21-8 deldir_1.0-9
## [94] survival_3.5-5 tidyselect_1.2.0 Biostrings_2.68.1
## [97] miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.44
## [100] gridExtra_2.3 IRanges_2.34.1 scattermore_1.2
## [103] stats4_4.3.1 xfun_0.40 matrixStats_1.0.0
## [106] stringi_1.7.12 lazyeval_0.2.2 yaml_2.3.7
## [109] evaluate_0.22 codetools_0.2-19 tibble_3.2.1
## [112] cli_3.6.1 uwot_0.1.16 xtable_1.8-4
## [115] reticulate_1.32.0 systemfonts_1.0.5 munsell_0.5.0
## [118] jquerylib_0.1.4 Rcpp_1.0.11 GenomeInfoDb_1.36.3
## [121] spatstat.random_3.1-6 globals_0.16.2 png_0.1-8
## [124] parallel_4.3.1 ellipsis_0.3.2 pkgdown_2.0.7
## [127] bitops_1.0-7 listenv_0.9.0 viridisLite_0.4.2
## [130] scales_1.2.1 ggridges_0.5.4 leiden_0.4.3
## [133] purrr_1.0.2 crayon_1.5.2 rlang_1.1.1
## [136] cowplot_1.1.1 fastmatch_1.1-4