| Title: | Splitting a Count Matrix into Independent Folds |
|---|---|
| Description: | Implements the count splitting methodology from Neufeld et al. (2022) <doi:10.1093/biostatistics/kxac047> and Neufeld et al. (2023) <arXiv:2307.12985>. Intended for turning a matrix of single-cell RNA sequencing counts, or similar count datasets, into independent folds that can be used for training/testing or cross validation. Assumes that the entries in the matrix are from a Poisson or a negative binomial distribution. |
| Authors: | Anna Neufeld [aut, cre, cph], Mischko Heming [ctb], Joshua Popp [ctb], Sebastian Quadrini [ctb] |
| Maintainer: | Anna Neufeld <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 4.0.2 |
| Built: | 2026-05-31 07:15:58 UTC |
| Source: | https://github.com/anna-neufeld/countsplit |
Takes one matrix of counts and splits it into a specified number of folds. Each fold is a matrix of counts with the same dimension as the original matrix. Summing element-wise across the folds yields the original data matrix.
countsplit(X, folds = 2, epsilon = rep(1/folds, folds), overdisps = NULL)countsplit(X, folds = 2, epsilon = rep(1/folds, folds), overdisps = NULL)
X |
A cell-by-gene matrix of integer counts. Note that this differs from many scRNA-seq packages, where gene-by-cell is the convention. When Poisson count splitting is used ( |
folds |
An integer specifying how many folds you would like to split your data into. |
epsilon |
A vector, which has length |
overdisps |
If NULL, then Poisson count splitting will be performed. Otherwise, this parameter should be a vector of non-negative numbers whose length is equal to the number of columns of |
When the argument overdisps is set to NULL, this function performs the Poisson count splitting methodology outlined in
Neufeld et al. (2022). With this setting, the folds of data are independent only if the original data were drawn from a Poisson distribution.
If the data are thought to be overdispersed relative to the Poisson, then we may instead model them as coming from a negative binomial distribution,
If we assume that , where this parameterization means that and , then
we should pass in overdisps = . If this is the correct assumption, then the resulting folds of data will be independent.
This is the negative binomial count splitting method of Neufeld et al. (2023).
Please see our tutorials and vignettes for more details.
A list of length folds. Each element in the list stores a sparse matrix with the same dimensions as the data X. Each list element is a fold of data.
reference
library(countsplit) library(Matrix) library(Rcpp) # A Poisson count splitting example. n=400 p=2 X <- matrix(rpois(n*p, 7), nrow=n, ncol=p) split <- countsplit(X, folds=2) Xtrain <- split[[1]] Xtest <- split[[2]] cor(Xtrain[,1], Xtest[,1]) cor(Xtrain[,2], Xtest[,2]) # A negative binomial count splitting example. X <- matrix(rnbinom(n*p, mu=7, size=7), nrow=n, ncol=p) split <- countsplit(X, folds=2, overdisps=c(7,7)) Xtrain <- split[[1]] Xtest <- split[[2]] cor(Xtrain[,1], Xtest[,1]) cor(Xtrain[,2], Xtest[,2])library(countsplit) library(Matrix) library(Rcpp) # A Poisson count splitting example. n=400 p=2 X <- matrix(rpois(n*p, 7), nrow=n, ncol=p) split <- countsplit(X, folds=2) Xtrain <- split[[1]] Xtest <- split[[2]] cor(Xtrain[,1], Xtest[,1]) cor(Xtrain[,2], Xtest[,2]) # A negative binomial count splitting example. X <- matrix(rnbinom(n*p, mu=7, size=7), nrow=n, ncol=p) split <- countsplit(X, folds=2, overdisps=c(7,7)) Xtrain <- split[[1]] Xtest <- split[[2]] cor(Xtrain[,1], Xtest[,1]) cor(Xtrain[,2], Xtest[,2])
Takes one matrix of counts and estimates the overdispersion parameter of each row.
est_overdispersions(matrix, min_cells = 5, n_genes = NULL, ...)est_overdispersions(matrix, min_cells = 5, n_genes = NULL, ...)
matrix |
A genes (rows) x cells (columns) count matrix. |
min_cells |
Minimum number of cells expressing a gene for it to be modeled by |
n_genes |
Number of genes to use in |
... |
Additional arguments passed to |
A numeric vector of overdispersion values per row of matrix; genes not modeled return Inf.
if (requireNamespace("sctransform", quietly = TRUE)) { library(Matrix) } n1 <- 500 n2 <- 500 p <- 200 s1 <- 3 # True overdispersion parameter of first set of rows s2 <- 9 # True overdispersion parameter of second set of rows mean1 <- 9 mean2 <- 9 # Make both matrices with the columns (same cells) col_names <- paste0("cell_", seq_len(p)) X1 <- matrix(rnbinom(n1 * p, mu = mean1, size = s1), nrow = n1, ncol = p, dimnames = list(paste0("gene_", seq_len(n1)), col_names)) X2 <- matrix(rnbinom(n2 * p, mu = mean2, size = s2), nrow = n2, ncol = p, dimnames = list(paste0("gene_", (n1 + 1):(n1 + n2)), col_names)) # Stack by rows such that the resultant matrix is 1000 x 200 X <- rbind(X1, X2) # Estimate the overdispersions of the rows of the combined expression matrix estover <- est_overdispersions(X) # Create a vector containing the true overdispersion parameters of first and second set of rows v1 <- rep(s1, times = n1) v2 <- rep(s2, times = n2) trueover <- c(v1, v2) # Calculate the correlation between the estimated and true overdispersions cor(estover, trueover) countsplit(X, estover)if (requireNamespace("sctransform", quietly = TRUE)) { library(Matrix) } n1 <- 500 n2 <- 500 p <- 200 s1 <- 3 # True overdispersion parameter of first set of rows s2 <- 9 # True overdispersion parameter of second set of rows mean1 <- 9 mean2 <- 9 # Make both matrices with the columns (same cells) col_names <- paste0("cell_", seq_len(p)) X1 <- matrix(rnbinom(n1 * p, mu = mean1, size = s1), nrow = n1, ncol = p, dimnames = list(paste0("gene_", seq_len(n1)), col_names)) X2 <- matrix(rnbinom(n2 * p, mu = mean2, size = s2), nrow = n2, ncol = p, dimnames = list(paste0("gene_", (n1 + 1):(n1 + n2)), col_names)) # Stack by rows such that the resultant matrix is 1000 x 200 X <- rbind(X1, X2) # Estimate the overdispersions of the rows of the combined expression matrix estover <- est_overdispersions(X) # Create a vector containing the true overdispersion parameters of first and second set of rows v1 <- rep(s1, times = n1) v2 <- rep(s2, times = n2) trueover <- c(v1, v2) # Calculate the correlation between the estimated and true overdispersions cor(estover, trueover) countsplit(X, estover)