Title: | Longitudinal Regression Trees and Forests |
---|---|
Description: | Builds regression trees and random forests for longitudinal or functional data using a spline projection method. Implements and extends the work of Yu and Lambert (1999) <doi:10.1080/10618600.1999.10474847>. This method allows trees and forests to be built while considering either level and shape or only shape of response trajectories. |
Authors: | Anna Neufeld [aut, cre], Brianna Heggeseth [aut, ths] |
Maintainer: | Anna Neufeld <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2025-03-01 03:36:02 UTC |
Source: | https://github.com/anna-neufeld/splinetree |
Returns the average number of terminal nodes for trees in a forest
avSize(forest)
avSize(forest)
forest |
A model returned by splineForest() |
The average number of terminal nodes in forest
avSize(forest)
avSize(forest)
Given a terminal node number, this function returns the data belonging to this terminal node. If the dataType argument is 'all', returns all rows of data from the original dataset that fall in this node. Otherwise, the flattened data that belongs to this node is returned (one row of data per ID, original responses replaced by spline coefficients).
getNodeData(tree, node, dataType = "all")
getNodeData(tree, node, dataType = "all")
tree |
a model returned from splineTree() |
node |
The number of the node to retrieve data from. Must be valid number of a terminal node. Node numbers can be seen using stPrint(tree) or treeSummary(tree). |
dataType |
If "all", the data returned is from the original dataset (one row per individual observation with original response values). If "flat", the data returned is the flattened data (one row per person/unit), with individual spline coefficients instead of response values. |
A dataframe which holds all the data that falls into this node of the tree.
## Not run: split_formula <- BMI ~ HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, 'ID', nlsySample, degree=1, df=3, intercept=TRUE, cp=0.006, minNodeSize=20) ## End(Not run) node6data <- getNodeData(tree, 6, dataType = 'all') plot(BMI~AGE, data=node6data)
## Not run: split_formula <- BMI ~ HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, 'ID', nlsySample, degree=1, df=3, intercept=TRUE, cp=0.006, minNodeSize=20) ## End(Not run) node6data <- getNodeData(tree, 6, dataType = 'all') plot(BMI~AGE, data=node6data)
Corresponds to plotting only the second panel of stPlot(). If model$intercept==FALSE, estimated intercepts are added to each trajectory so that the trajectories are plotted at the level of reasonable response values.
nodePlot(model, colors = NULL)
nodePlot(model, colors = NULL)
model |
A model returned from splineTree() |
colors |
A list of colors to use. By default, uses colors drawn from a rainbow. |
Given a named vector of variable importance measures, this function makes a barplot of the relative importances. The importances are scaled to sum to 1. An appropriate input is one column of the output from varImpY() or varImpCoeff().
plotImp(importance_vector, ...)
plotImp(importance_vector, ...)
importance_vector |
a named vector where the names are the variables and the vector stores the importances. |
... |
additional arguments to plot, such as "main", "cex", etc. |
imp <- varImpCoeff(forest)[,3] plotImp(imp, main="Standardized Variable Importance")
imp <- varImpCoeff(forest)[,3] plotImp(imp, main="Standardized Variable Importance")
Creates a simple plot of the predicted trajectory at a given node. Option to include the data that falls in the node on the same plot.
plotNode(tree, node, includeData = FALSE, estimateIntercept = TRUE)
plotNode(tree, node, includeData = FALSE, estimateIntercept = TRUE)
tree |
A model returned from splineTree() |
node |
A node number. Must be a valid terminal node for the given spline tree. To view valid terminal node numbers, use stPrint() or treeSummary(). |
includeData |
Would you like to see the data from the node plotted along with the predicted trajectory? |
estimateIntercept |
If the tree was built without an intercept, should the average starting response of all the individuals in the node be added to the trajectory to give the plot interpretable values? Or should the shape of the trajectory be plotted without any regard to the intercept? |
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) plotNode(tree, 6, includeData=TRUE)
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) plotNode(tree, 6, includeData=TRUE)
Returns a matrix of spline coefficients for each observation in the testset. If no testset is provided, returns predicted coefficients for the individuals in training set; in this case, the columns of the returned predictions correspond to the rows of the flattened training dataset (found in tree$parms$flat_data).
predictCoeffs(tree, testset = tree$parms$flat_data)
predictCoeffs(tree, testset = tree$parms$flat_data)
tree |
A model created with splineTree() |
testset |
The dataset to predict coefficients for. Default is the flattened dataset used to make the tree. |
importFrom treeClust rpart.predict.leaves
A matrix of spline coefficients. The dimension of the matrix is the degrees of freedom of the spline by the number of units in the test set. The ith column of the matrix holds the predicted coefficients for the ith row in the testset.
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) preds <- predictCoeffs(tree)
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) preds <- predictCoeffs(tree)
Uses the forest to predict spline coefficients. Returns a matrix of predicted spline coefficients where the columns of the returned matrix correspond to rows of the testdata. The number of rows of the returned matrix is equal to the degrees of freedom of the forest. If no testdata is provided, forest$flat_data is used. When testdata is not provided, predictions will be made according to one of three methods. The "method" parameter must be either "oob", "itb", or "all". This parameter specifies which trees are used in making a prediction for a certain datapoint. This parameter is not relevant when predicting for a testset that is distinct from the training set.
predictCoeffsForest(forest, method = "oob", testdata = NULL)
predictCoeffsForest(forest, method = "oob", testdata = NULL)
forest |
A model returned from splineForest() |
method |
A string; either "oob", "itb", or "all". If "oob" (the default), predictions for a given data point are made only using trees for which this data point was "out of the bag" (not in the random subsample). If "itb", predictions for a given data point are made using only the trees for which this datapoint was "in the bag" (in the random subsample). If "all", all trees are used for every datapoint. |
testdata |
The test data to make predictions for. If this is provided, then all trees are used for all datapoints. |
A matrix of predicted spline coefficients. The dimensions are forest$df x nrow(testdata). Each column of the matrix corresponds to a row of the testdata.
trainingSetPreds <- predictCoeffsForest(forest) newData <- data.frame("WHITE" = 0, "BLACK"=1, "HISP"=0, "Num_sibs"=3, "HGC_MOTHER"=12, "HGC_FATHER"=12, "SEX"=1) predictCoeffsForest(forest, testdata = newData)
trainingSetPreds <- predictCoeffsForest(forest) newData <- data.frame("WHITE" = 0, "BLACK"=1, "HISP"=0, "Num_sibs"=3, "HGC_MOTHER"=12, "HGC_FATHER"=12, "SEX"=1) predictCoeffsForest(forest, testdata = newData)
Returns a vector of predicted responses for the testData. If testData is ommitted, returns predictions for the training data. This function is most meaningful if model$intercept==TRUE.
predictY(model, testData = NULL)
predictY(model, testData = NULL)
model |
A model created with splineTree() |
testData |
The data to return predictions for. If ommitted, uses the training data. |
A vector of predictions with rows corresponding to the testdata.
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) plot(predictY(tree), tree$parms$data[[tree$parms$yvar]])
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) plot(predictY(tree), tree$parms$data[[tree$parms$yvar]])
Uses the forest to make predictions of responses for individuals. This method should only be used on forests where forest$intercept=TRUE. If the testdata parameter is null, makes predictions for each row of the training data. In this case, the methods parameter (which should be set to "oob", "itb", or "all") determines the method used for prediction. If the testdata parameter is not null, the methods parameter is ignored and all trees are used for the prediction of every datapoint.
predictYForest(forest, method = "oob", testdata = NULL)
predictYForest(forest, method = "oob", testdata = NULL)
forest |
A model returned from splineForest() |
method |
A string. Must be either "oob", "itb", or "all". Only relevant when testdata is NULL. The default value is "oob". If "oob", predictions for a given data point are made only using trees for which this data point was "out of the bag" (not in the random subsample). If "itb", predictions for a given data point are made using only the trees for which this datapoint was in the bag (in the random subsample). If "all", all trees are used for every datapoint. |
testdata |
the Test data to make predictions for. If this is provided, then all trees are used for all datapoints. |
A vector of predicted responses. The indices of the vector correspond to rows of the testdata.
trainingSetPreds <- predictYForest(forest) newData <- data.frame("AGE"=21, "WHITE" = 0, "BLACK"=1, "HISP"=0, "Num_sibs"=3, "HGC_MOTHER"=12, "HGC_FATHER"=12, "SEX"=1) predictYForest(forest, testdata = newData)
trainingSetPreds <- predictYForest(forest) newData <- data.frame("AGE"=21, "WHITE" = 0, "BLACK"=1, "HISP"=0, "Num_sibs"=3, "HGC_MOTHER"=12, "HGC_FATHER"=12, "SEX"=1) predictYForest(forest, testdata = newData)
Computes an R^2 measure for a splinetree based on the projected sum of squared errors. Returns 1-SSE/SST. SSE is the sum of projection squared errors between individual smoothed trajectories and predicted smoothed trajectories evaluated on a fixed grid. SST is the sum of projection squared errors between individual smoothed trajectories and the overall population mean trajectory, evaluated on the same fixed grid. If model$intercept==TRUE, then there is the option to ignore the intercept coefficient when computing this metric. When the intercept is ignored, the metric captures how well the model explains variation in shape, and ignores any variation in intercept explained by the model.
projectedR2(model, includeIntercept = FALSE)
projectedR2(model, includeIntercept = FALSE)
model |
a model created with splineTree() |
includeIntercept |
If FALSE and if the model was built with an intercept, the projected squared errors are computed while ignoring the intercept. If the model was built without an intercept, this parameter does not do anything. |
The percentage of variation in projected trajectory explained by the model. Computed as 1-SSE/SST. See description.
r2 <- projectedR2(tree)
r2 <- projectedR2(tree)
Computes an R-squared-like evaluation metric for a spline forest. Goal is to see how well the predicted spline coefficients for each individual match the spline coefficients obtained when fitting a spline only to this individual's data (we call these coefficients the true coefficients). Computes 1-SSE/SST, where SSE is the total sum of squared projection errors of the true coefficients compared to the predicted coefficients, and SST is the total sum of squared projection errors of the true coefficients compared to the population mean coefficients. If this is an intercept forest, have the option to compute these sum of squares either with the intercept included or with the intercept ignored to isolate the shape.
projectedR2Forest(forest, method = "oob", removeIntercept = TRUE)
projectedR2Forest(forest, method = "oob", removeIntercept = TRUE)
forest |
The output of a call to splineForest() |
method |
How would you like to compute this metric? The choices are "oob", "itb", or "all". "oob" means that predictions for a datapoint can only be made using trees for which that datapoint was "out of the bag" (not in the random subsample). "all" means that all trees are used in the prediction for every datapoint. "itb" means that predictions for a datapoint are made using only the trees for which this datapoint was IN the random subsample. |
removeIntercept |
If true, the projection sum of squared error is computed while ignoring the intercept coefficient. This will help capture the tree's performance at clustering based on shape, not based on level. This parameter is only meaningful if this forest was built using an intercept. |
Returns 1-SSE/SST, where SSE is the total sum of squared projection errors of the true coefficients compared to the predicted coefficients, and SST is the total sum of squared projection errors of the true coefficients compared to the population mean coefficients.
projectedR2Forest(forest, method="all", removeIntercept=TRUE) projectedR2Forest(forest, method="all", removeIntercept=FALSE)
projectedR2Forest(forest, method="all", removeIntercept=TRUE) projectedR2Forest(forest, method="all", removeIntercept=FALSE)
Prunes each tree in the list forest$Trees according to the provided complexity parameter. Returns a new forest.
pruneForest(forest, cp)
pruneForest(forest, cp)
forest |
A model returned by splineForest() |
cp |
The complexity parameter that will be used to prune each tree (see rpart package documentation for detailed description of complexity parameter) |
A new spline forest model (named list) where each tree has been pruned to the desired level.
print(avSize(forest)) print(avSize(pruneForest(forest, cp=0.007))) print(avSize(pruneForest(forest, cp=0.01)))
print(avSize(forest)) print(avSize(pruneForest(forest, cp=0.007))) print(avSize(pruneForest(forest, cp=0.01)))
Uses ggplot to create a paneled spaghetti plot of the data, where each panel corresponds to a terminal node in the tree. Allows users to visualize homogeneity of trajectories within the terminal nodes of the tree while also looking at the trajectories of different nodes side by side.
spaghettiPlot(model, colors = NULL)
spaghettiPlot(model, colors = NULL)
model |
a model returned from splineTree() |
colors |
optional argument specifying colors to be used for each panel. |
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySubset, degree = 1, df = 3, intercept = TRUE, cp = 0.005) spaghettiPlot(tree)
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySubset, degree = 1, df = 3, intercept = TRUE, cp = 0.005) spaghettiPlot(tree)
Builds an ensemble of regression trees for longitudinal or functional data using the spline projection method. The resulting model contains a list of spline trees along with some additional information. All parameters are used in the same way that they are used in the splineTree() function. The additional parameter ntree specifies how many trees should be in the ensemble, and prob controls the probability of selecting a given variable for split consideration at a node. This method may take several minutes to run- saving the forest after building it is recommended.
splineForest(splitFormula, tformula, idvar, data, knots = NULL, df = NULL, degree = 3, intercept = FALSE, nGrid = 7, gridPoints = NULL, ntree = 50, prob = 0.3, cp = 0.001, minNodeSize = 1, bootstrap = FALSE)
splineForest(splitFormula, tformula, idvar, data, knots = NULL, df = NULL, degree = 3, intercept = FALSE, nGrid = 7, gridPoints = NULL, ntree = 50, prob = 0.3, cp = 0.001, minNodeSize = 1, bootstrap = FALSE)
splitFormula |
Formula specifying the longitudinal response variable and the time-constant variables that will be used for splitting in the tree. |
tformula |
Formula specifying the longitudinal response variable and the variable that acts as the time variable. |
idvar |
The name of the variable that serves as the ID variable for grouping observations. Must be in quotes |
data |
dataframe that contains all variables specified in the formulas- in long format. |
knots |
Specified locations for internal knots in the spline basis. Defaults to NULL, which corresponds to no internal knots. |
df |
Degrees of freedom of the spline basis. If this is specified but the knots parameter is NULL, then the appropriate number of internal knots will be added at quantiles of the training data. If both df and knots are unspecified, the spline basis will have no internal knots. |
degree |
Specifies degree of spline basis used in the tree. |
intercept |
Specifies whether or not the splitting process will consider the intercept coefficient of the spline projections. Defaults to FALSE, which means that the tree will split based on trajectory shape, ignoring response level. |
nGrid |
Number of grid points to evaluate projection sum of squares at. If gridPoints is not supplied, then this is the number of grid points that will be automatically placed at quantiles of the time variable. The default is 7. |
gridPoints |
Optional. A vector of numbers that will be used as the grid on which to evaluate the projection sum of squares. Should fall roughly within the range of the time variable. |
ntree |
Number of trees in the forest. |
prob |
Probability of selecting a variable to included as a candidate for each split. |
cp |
Complexity parameter passed to the rpart building process. Default is the rpart default of 0.01 |
minNodeSize |
Minimum number of observational units that can be in a terminal node. Controls tree size and helps avoid overfitting. Default is 10. |
bootstrap |
Boolean specifying whether bootstrap sampling should be used when choosing data to use for each tree. When set to FALSE (the default), sampling without replacement is used and 63.5 is used for each tree. When set to TRUE, a bootstrap sample is used for each tree. |
The ensemble method is highly similar to the random forest methodology of Breiman (2001). Each tree in the ensemble is fit to a random sample of 63.5 the subset of variables considered at each node is determined by a random process. The prob parameter specifies the probability that a given variable will be selected at a certain node. Because the method is based on probability, the same number of variables are not considered for splitting at each node (as in the randomForest package). Note that if prob is small and the number of variables in the splitFormula is also small, there is a high probability that no variables will be considered for splitting at a certain node, which is problematic. The fewer total variables there are, the larger prob should be to ensure good results.
A spline forest model, which is a named list with 15 components. The list stores a list of trees (in model$Trees), along with information about the spline basis used (model$intercept, model$innerKnots, model$boundaryKnots, etc.), and information about which datapoints were used to build each tree (model$oob_indices and model$index). Note that each element in model$Trees is an rpart object but it is not the same as a model returned from splineTree() because it does not store all relevant information in model$parms.
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] splitForm <-~HISP+WHITE+BLACK+HGC_MOTHER+HGC_FATHER+SEX+Num_sibs sampleForest <- splineForest(splitForm, BMI~AGE, 'ID', nlsySubset, degree=1, cp=0.005, ntree=10)
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] splitForm <-~HISP+WHITE+BLACK+HGC_MOTHER+HGC_FATHER+SEX+Num_sibs sampleForest <- splineForest(splitForm, BMI~AGE, 'ID', nlsySubset, degree=1, cp=0.005, ntree=10)
Builds a regression tree for longitudinal or functional data using the spline projection method. The underlying tree building process uses the rpart package, and the resulting spline tree is an rpart object with additional stored information. The parameters df, knots, degree, intercept allow for flexibility in customizing the spline basis used for projection. The parameters nGrid and gridPoints allow for flexibility in the grid on which the projection sum of squares is evaluated. The parameters minNodeSize and cp allow for flexibility in controlling the size of the final tree.
splineTree(splitFormula, tformula, idvar, data, knots = NULL, df = NULL, degree = 3, intercept = FALSE, nGrid = 7, gridPoints = NULL, minNodeSize = 10, cp = 0.01)
splineTree(splitFormula, tformula, idvar, data, knots = NULL, df = NULL, degree = 3, intercept = FALSE, nGrid = 7, gridPoints = NULL, minNodeSize = 10, cp = 0.01)
splitFormula |
Formula specifying the longitudinal response variable and the time-constant variables that will be used for splitting in the tree. |
tformula |
Formula specifying the longitudinal response variable and the variable that acts as the time variable. |
idvar |
The name of the variable that serves as the ID variable for grouping observations. Must be a string. |
data |
dataframe in long format that contains all variables specified in the formulas. |
knots |
Specified locations for internal knots in the spline basis. Defaults to NULL, which corresponds to no internal knots. |
df |
Degrees of freedom of the spline basis. If this is specified but the knots parameter is NULL, then the appropriate number of internal knots will be added at quantiles of the training data. If both df and knots are unspecified, the spline basis will have no internal knots. If knots is specified, this parameter will be ignored. |
degree |
Specifies degree of spline basis used for projection. |
intercept |
Specifies whether or not the set of basis functions will include the intercept function. Defaults to FALSE, which means that the tree will split based on trajectory shape, ignoring response level. |
nGrid |
Number of grid points to evaluate projection sum of squares at. If gridPoints is not supplied, this argument will be used and the appropriate number of grid points will be placed at equally spaced quantiles of the time variable. The default is 7. |
gridPoints |
Optional. A vector of numbers that will be used as the grid on which to evaluate the projection sum of squares. Should fall roughly within the range of the time variable. |
minNodeSize |
Minimum number of observational units that can be in a terminal node. Controls tree size and helps avoid overfitting. Defaults to 10. |
cp |
Complexity parameter passed to the rpart building process. Controls tree size. Defaults to the rpart default of 0.01. |
An rpart object with additional splinetree-specific information stored in model$parms. The important attributes of the rpart object include model$frame, model$where, and model$cptable. model$frame holds information about each node in the tree. The ith entry in model$where tells us which row of model$frame describes the node that the ith individual in the flattened dataset falls into. model$parms$flat_data holds the flattened dataset that was used to build the tree. model$cptable displays the complexity parameters that would be needed to prune the tree to various desired sizes. Apart from holding the flattened dataset, model$parms holds the boundary knots and the internal knots of the spline basis used to build the tree. These are sometimes important to recover later.
nlsySample_subset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 500),] splitForm <- ~HISP+WHITE+BLACK+HGC_MOTHER+HGC_FATHER+SEX+Num_sibs tree1 <- splineTree(splitForm, BMI~AGE, 'ID', nlsySample_subset, degree=3, intercept=TRUE, cp=0.005) stPrint(tree1) stPlot(tree1)
nlsySample_subset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 500),] splitForm <- ~HISP+WHITE+BLACK+HGC_MOTHER+HGC_FATHER+SEX+Num_sibs tree1 <- splineTree(splitForm, BMI~AGE, 'ID', nlsySample_subset, degree=3, intercept=TRUE, cp=0.005) stPrint(tree1) stPlot(tree1)
Creates a tree plot of a spline tree. This corresponds to plotting only the first panel of stPlot(). Code for this function was borrowed from the longRPart package on github.
splineTreePlot(model, colors = NULL)
splineTreePlot(model, colors = NULL)
model |
a model returned from splineTree() |
colors |
a list of colors that will be used for the terminal nodes (if NULL, will use a rainbow) |
Creates a two paneled plot of a splinetree that shows both the tree and the trajectories side by side. Note that this function has trouble when the plot window is not wide enough. If nothing shows up in RStudio, try increasing the size of the plot window and trying again. For a tree without an intercept, intercepts are estimated after-the-fact for each node using the average starting value in the data so that the plotted trajectories have reasonable response values.
stPlot(model, colors = NULL)
stPlot(model, colors = NULL)
model |
A model returned from splineTree() |
colors |
A list of colors that will be used for the trajectories (if NULL, will automatically select colors from rainbow color scheme. |
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) stPlot(tree, colors = c("red", "orange", "green", "blue", "cyan", "magenta"))
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) stPlot(tree, colors = c("red", "orange", "green", "blue", "cyan", "magenta"))
The printout provides numbered labels for the terminal nodes, a description of the split at each node, the number of observations found at each node, and the predicted spline coefficients for each node. This code is primarily taken from rpart base code for print.rpart. It has been modified to ensure that the full vector of coefficients is printed for each node.
stPrint(t, cp, digits = getOption("digits"))
stPrint(t, cp, digits = getOption("digits"))
t |
A model returned by splineTree() |
cp |
Optional- if provided, a pruned version of the tree will be printed. The tree will be pruned using the provided cp as the complexity parameter. |
digits |
Specifies how many digits of each coefficient should be printed |
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) stPrint(tree)
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) stPrint(tree)
If no argument is provided for the parameter node
, summaries are printed for every
terminal node. Otherwise, the summary of just the requested node is printed.
terminalNodeSummary(tree, node = NULL)
terminalNodeSummary(tree, node = NULL)
tree |
A model returned by splineTree(). |
node |
The number of the node that you want summarized. To see which nodes correspond to which numbers, see stPrint(tree) or treeSummary(tree). If this parameter is provided, must correspond to a valid terminal node in the tree. |
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) terminalNodeSummary(tree)
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) terminalNodeSummary(tree)
Computes the Adjusted Rand Index of the clusterings of the population created by the two trees. In the case of correlated covariates, two trees that split on entirely different variables may actually describe similar partitions of the population. This metric allows us to detect when two trees are partitioning the population similarly. A value close to 1 indicates a similar clustering.
treeSimilarity(tree1, tree2)
treeSimilarity(tree1, tree2)
tree1 |
a model returned from splineTree() |
tree2 |
a model returned from splineTree() |
The Adjusted Rand Index of the clusterings created by the two trees.
mclust::adjustedRandIndex
splitForm <- ~SEX+Num_sibs+HGC_MOTHER+HGC_FATHER nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] tree1 <- splineTree(splitForm, BMI~AGE, "ID", nlsySubset, degree=1, df=2, intercept=FALSE, cp=0.005) tree2 <- splineTree(splitForm, BMI~AGE, "ID", nlsySubset, degree=1, df=3, intercept=TRUE, cp=0.005) treeSimilarity(tree1, tree2)
splitForm <- ~SEX+Num_sibs+HGC_MOTHER+HGC_FATHER nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] tree1 <- splineTree(splitForm, BMI~AGE, "ID", nlsySubset, degree=1, df=2, intercept=FALSE, cp=0.005) tree2 <- splineTree(splitForm, BMI~AGE, "ID", nlsySubset, degree=1, df=3, intercept=TRUE, cp=0.005) treeSimilarity(tree1, tree2)
Returns number of terminal nodes in a tree.
treeSize(model)
treeSize(model)
model |
A model returned by splineTree(). Also works on any rpart object |
The number of terminal nodes in the tree
## Not run: split_formula <- ~ HISP + WHITE + BLACK + SEX + HGC_FATHER + HGC_MOTHER + Num_sibs tree <- splineTree(split_formula, BMI~AGE, 'ID', nlsySample, degree=1, df=3, intercept=TRUE, cp=0.006, minNodeSize=20) ## End(Not run) treeSize(tree)
## Not run: split_formula <- ~ HISP + WHITE + BLACK + SEX + HGC_FATHER + HGC_MOTHER + Num_sibs tree <- splineTree(split_formula, BMI~AGE, 'ID', nlsySample, degree=1, df=3, intercept=TRUE, cp=0.006, minNodeSize=20) ## End(Not run) treeSize(tree)
Provides a similar output to model$frame, but with the redundant information of yval and yval2 removed. Also omits the deviance, the complexity, and the weight. Useful for viewing node numbers and for extracting coefficients for a given node.
treeSummary(model)
treeSummary(model)
model |
A model built with splineTree() |
A dataframe. The number of rows is the same as the number of nodes in the tree. The row names display the node labels of each node. The "var" attribute either displays the split variable selected at each node, or <leaf> if this node is a terminal node. The "n" attribute displays the number of individuals in the node. The "dev" attribute reports the projected sum of squares at this node; terminal nodes have the smallest values for "dev" because this is what the tree building process is supposed to minimize. The "coeffs" attribute displays the coefficients predicted for each node.
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySubset, degree = 1, df = 3, intercept = TRUE, cp = 0.005) treeSummary(tree)
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),] split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySubset, degree = 1, df = 3, intercept = TRUE, cp = 0.005) treeSummary(tree)
Returns the random forest variable importance based on the permutation accuracy measure, which is calculated as the difference in mean squared error between the original data and from randomly permuting the values of a variable.
varImpCoeff(forest, removeIntercept = TRUE, method = "oob")
varImpCoeff(forest, removeIntercept = TRUE, method = "oob")
forest |
a random forest, generated from splineForest() |
removeIntercept |
a boolean value, TRUE if you want to exclude the intercept in the calculations, FALSE otherwise. |
method |
the method to be used. This must be one of "oob" (out of bag), "all", "itb" (in the bag). |
a matrix of variable importance metrics.
importanceMatrix <- varImpCoeff(forest, removeIntercept=TRUE)
importanceMatrix <- varImpCoeff(forest, removeIntercept=TRUE)
Returns the random forest variable importance based on the permutation accuracy measure, which is calculated as the difference in mean squared error between the original data and from randomly permuting the values of a variable.
varImpY(forest, method = "oob")
varImpY(forest, method = "oob")
forest |
a random forest, generated from splineForest() |
method |
the method to be used. This must be one of "oob" (out of bag), "all", "itb" (in the bag). |
The "method" parameter deals with the way in which forest performance should be measured. Since variable importance is based on a change in performance, the "method" parameter is necessary for a variable importance measure. The choices are "oob" (out of bag), "all", or "itb" (in the bag).
A matrix storing variable importance metrics. The rows correspond to split variables. The columns are different methods of measuring importance. The first column is the absolute importance (mean difference in performance between permuted and unpermuted datasets). The second column measures the mean percent difference in performance. The third column standardizes the differences by dividing them by their standard deviation.
importanceMatrix <- varImpY(forest, method="oob") plotImp(importanceMatrix[,3])
importanceMatrix <- varImpY(forest, method="oob") plotImp(importanceMatrix[,3])
Computes the percentage of variation in response explained by the spline tree. This metric is only meaningful if model$intercept==TRUE. If the tree includes an intercept, the measure will be between 0 and 1.
yR2(model)
yR2(model)
model |
a model created with splineTree() |
An R^2 goodness measure. 1-SSE/SST where SSE is the sum of squared errors between predicted responses and true responses, and SST is sum of squared errors of true responses around population mean. Note that if the tree passed in was built without an intercept, this function will return NULL.
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) yR2(tree)
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER tree <- splineTree(split_formula, BMI~AGE, idvar = "ID", data = nlsySample, degree = 1, df = 3, intercept = TRUE, cp = 0.005) yR2(tree)
Computes the R-squared metric for a spline forest. Goal is to see how well the predicted response values match the actual response values. Note that this function should only be used on forests where the intercept parameter is TRUE. A simple 1-SSE/SST calculation.
yR2Forest(forest, method = "oob")
yR2Forest(forest, method = "oob")
forest |
The output from a call to splineForest() |
method |
How would you like to compute this metric? The choices are "oob", "itb", or "all". "oob" means that predictions for a datapoint can only be made using trees for which that datapoint was "out of the bag" (not in the random subsample). "all" means that all trees are used in the prediction for every datapoint. "itb" means that predictions for a datapoint are made using only the trees for which this datapoint was IN the random subsample. |
Returns 1-SSE/SST, where SSE is the total sum of squared errors of the true responses and predicted responses, and SST is the total sum of squared errors of the responses around their mean. If this forest was not built with an intercept, returns NULL.
yR2Forest(forest, method="all")
yR2Forest(forest, method="all")