Match DNA sequence characteristics — MatchRegionStats • Signac

Return a vector of genomic regions that match the distribution of a set of query regions for any given set of characteristics, specified in the input meta.feature dataframe.

Usage

MatchRegionStats(
  meta.feature,
  query.feature,
  features.match = c("GC.percent"),
  n = 10000,
  verbose = TRUE,
  ...
)

Arguments

meta.feature: A dataframe containing DNA sequence information for features to choose from.
query.feature: A dataframe containing DNA sequence information for features to match.
features.match: Which features of the query to match when selecting a set of regions. A vector of column names present in the feature metadata can be supplied to match multiple characteristics at once. Default is GC content.
n: Number of regions to select, with characteristics matching the query
verbose: Display messages
...: Arguments passed to other functions

Value

Returns a character vector of the selected region names.

Details

For each requested feature to match, a density distribution is estimated using the stats::density() function, and a weight for each feature in the dataset is estimated from that density distribution. If multiple features are to be matched (for example, GC content and overall accessibility), the features are first transformed so that they are uncorrelated with each other using a Cholesky decomposition, and a joint density distribution is then computed by multiplying the individual feature weights. A set of features with characteristics matching the query regions is then selected using the base::sample() function, with the probability of randomly selecting each feature equal to its joint density distribution weight. If the wrswoR package is available, the wrswoR::sample_int_crank() function is used for faster sampling.
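
The density-weighted sampling idea described above can be illustrated with a minimal base R sketch for a single feature. This is not the Signac implementation: the toy data, the object names (candidates, query, weights, selected), and the use of stats::approx() to look up density values are all assumptions made for illustration, and the Cholesky decorrelation step for multiple features is left out.

```r
set.seed(1)

# Hypothetical toy data: 500 candidate regions with uniform GC content,
# and 50 query regions whose GC content is centred near 55%
candidates <- data.frame(
  GC.percent = runif(500, min = 30, max = 70),
  row.names = paste0("region", seq_len(500))
)
query <- data.frame(GC.percent = rnorm(50, mean = 55, sd = 3))

# Estimate the density of the query feature
dens <- stats::density(query$GC.percent)

# Look up the query density at each candidate's value by linear
# interpolation; candidates outside the estimated range get weight 0
weights <- stats::approx(
  x = dens$x,
  y = dens$y,
  xout = candidates$GC.percent,
  yleft = 0,
  yright = 0
)$y

# Sample candidates without replacement, with selection probability
# proportional to the density weight
n <- 50
selected <- base::sample(
  rownames(candidates),
  size = n,
  replace = FALSE,
  prob = weights
)

# Compare the GC content of the selected regions with the query
summary(candidates[selected, "GC.percent"])
summary(query$GC.percent)
```

Under these assumptions, candidates whose GC content falls where the query density is high are drawn more often, so the selected set should resemble the query distribution rather than the uniform background.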

Examples

metafeatures <- atac_small[["peaks"]][[]]
query.feature <- metafeatures[1:10, ]
features.choose <- metafeatures[11:nrow(metafeatures), ]
MatchRegionStats(
  meta.feature = features.choose,
  query.feature = query.feature,
  features.match = "GC.percent",
  n = 10
)
#> Matching region characteristics using nearest-neighbor distance
#>  [1] "chr1:1012999-1013896" "chr1:1261037-1261825" "chr1:1103886-1104761"
#>  [4] "chr1:1098941-1099797" "chr1:1264764-1265656" "chr1:890356-891196"  
#>  [7] "chr1:1188896-1189774" "chr1:1222590-1223380" "chr1:955190-956101"  
#> [10] "chr1:1259851-1260705"