I want to estimate the relative importance of variables in explaining a response variable ("dep_var", a numeric variable based on a 4-point Likert scale). I am mostly intersted in the relative importance of six (correlated) factors ("f1" to "f6").
To achieve this, I want to rely on the partykit
package in R
and run a cforest
with varimp(conditional = T)
as suggested by Strebel et al. (2008) and elsewhere. I have NA
values in the data, but according to documentation, cforest
can handle them. (Correspondingly, its default na.action
is na.pass
with surrogate variable selection). However, when I run the varimp
function, I get an error message instead of results.
What I did so far:
Create Random Forest
I first produced a random forest object using the partykit::cforest
function.
library(partykit)
library(tidyverse)
formula <- paste0("dep_var ~ ", paste(c("f1", "f2", "f3", "f4", "f5", "f6", "num1", "dummy1", "dummy2", "group"), collapse = " + ")) %>% as.formula()
set.seed(100)
y <- partykit::cforest(formula = formula, data = d, cores = 6, ntree = 1000, control = partykit::ctree_control(maxsurrogate = 4))
Determine Variable Importance
Subsequently, I run the partykit::varimp
function on this forest object. It should provide me with the relative importance of all variables included in the formula provided in cforest.
partykit::varimp(y, conditional = T)
Error
However, when I run cforests and varimp on my data, I receive this error message:
Error in FUN(model = model, trafo = trafo, data = data, subset = subset, : NROW(Y) == length(ix) is not TRUE
Sample Data to Reproduce Error
library(tidyverse)
d <- structure(list(dep_var = c(4, 4, 2, 1, 2, 2, 1, 1, 1, 4, 2, 4,
3, 1, 4, 1, 1, 1, 2, 2, 2, 1, 2, 3, 4, 3, 4, 2, 3, 2, 4, 3, 2,
3, 4, 4, 3, 3, 2, 3, 4, 4, 1, 2, 3, 4, 2, 3, 4, 3, 3, 2, 2, 4,
4, 1, 2, 3, 2, 1, 4, 1, 2, 3, 2, 3, 3, 1, 4, 3, 2, 4, 3, 1, 1,
2, 1, 2, 1, 3, 4, 1, 4, 2, 3, 2, 1, 2, 4, 1, 2, 4, 2, 4, 2, 1,
4, 1, 3, 4, 3, 4, 3, 1, 3, 2, 4, 3, 3, 3, 2, 2, 2, 2, 2, 4, 3,
2, 4, 3, 3, 2, 3, 2, 4, 1, 3, 3, 2, 3, 2, 1, 3, 2, 1, 1, 4, 3,
2, 4, 4, 1, 4, 3, 3, 4, 2, 1, 2, 2, 2, 3, 4, 2, 3, 4, 1, 3, 3,
2, 2, 4, 2, 3, 1, 4, 4, 3, 2, 3, 1, 3, 2, 4, 2, 3, 3, 1, 2, 1,
2, 2, 1, 3, 2, 2, 3, 1, 3, 3, 3, 2, 2, 2, 3, 4, 3, 3, 3), f1 = structure(c(NA,
1L, 2L, 3L, 2L, 1L, 3L, NA, NA, NA, 3L, NA, 3L, 3L, NA, 2L, NA,
1L, NA, 3L, 3L, NA, 2L, 3L, 2L, 2L, NA, NA, 2L, NA, 3L, 2L, NA,
2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, NA,
NA, 3L, 3L, 2L, 3L, 2L, 3L, 3L, NA, 1L, NA, 3L, 1L, 3L, 2L, 3L,
3L, 2L, 3L, 3L, 1L, 2L, NA, 2L, NA, NA, 3L, 2L, NA, 3L, 3L, 3L,
2L, 3L, 2L, 2L, 2L, 2L, NA, 2L, 3L, 2L, 1L, 2L, NA, NA, 3L, 3L,
3L, 3L, 3L, 3L, NA, NA, NA, 1L, 3L, 3L, 1L, 3L, 3L, NA, 1L, NA,
2L, 1L, 2L, 3L, 1L, NA, 3L, 3L, NA, 2L, NA, NA, 2L, 3L, 3L, 1L,
3L, 2L, 3L, 3L, 2L, 2L, NA, NA, 2L, 3L, 3L, 3L, 3L, 2L, 2L, NA,
3L, 3L, NA, 3L, 2L, NA, 1L, NA, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L,
3L, 2L, 2L, NA, 1L, 3L, 2L, 2L, 3L, 3L, NA, 3L, NA, 3L, 1L, 3L,
2L, 2L, 2L, 2L, 3L, NA, NA, 1L, 3L, 2L, 3L, NA, 2L, 2L, 3L, 3L,
NA, 3L, 2L, 3L, NA, 1L), levels = c("long", "middle", "short"
), class = "factor"), f2 = structure(c(1L, 1L, 2L, 3L, 3L, 1L,
1L, 3L, 1L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 3L,
2L, 2L, 2L, 2L, 3L, NA, 2L, 1L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L,
3L, 3L, 3L, NA, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L,
3L, 1L, 2L, 1L, 2L, NA, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 1L,
2L, NA, 2L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L,
2L, NA, 3L, 3L, 3L, 2L, 3L, NA, NA, 3L, 3L, 2L, 2L, 3L, 3L, 1L,
2L, 3L, 2L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 3L, 1L, 2L, 3L, 1L,
1L, 3L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 3L,
3L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 2L, 3L,
3L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 1L, 2L,
3L, 2L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L,
2L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 1L,
2L), levels = c("long", "middle", "short"), class = "factor"),
f3 = structure(c(1L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, 2L, 3L,
3L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 3L,
3L, NA, 3L, 1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 3L,
1L, 3L, 1L, 3L, 2L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L, 1L,
2L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 3L,
3L, 1L, 3L, 2L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, NA, 1L, 3L,
1L, 3L, 3L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 1L, 3L, 1L,
2L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L,
1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 3L, 3L,
3L, NA, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
3L, 1L, 3L, 3L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 1L, 1L, 1L,
1L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 3L,
1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 3L), levels = c("long", "middle",
"short"), class = "factor"), f4 = structure(c(3L, 3L, 3L,
3L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 2L, 1L,
3L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, NA, 1L, 3L, 2L, 3L, 3L, 1L,
1L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 3L, 2L,
1L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 1L, 2L,
2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L,
3L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 1L, 2L, 3L,
1L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 2L,
3L, 3L, 1L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 2L, 1L,
3L, 3L, 3L, 2L, 3L, 1L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 3L, 2L, 1L,
3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 2L, 2L,
1L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 2L, 3L,
3L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 2L, 1L,
3L), levels = c("long", "middle", "short"), class = "factor"),
f5 = structure(c(3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L,
3L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L,
3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 1L, 2L, 2L, 3L, 2L,
3L, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, NA, 1L, 3L, 3L, 3L,
3L, 3L, 3L, 1L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, NA, 3L, 3L, 1L, NA, 3L, 2L,
3L, 3L, 2L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
2L, 1L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 2L, 1L, 3L, 2L, 3L,
2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, NA, 3L, 3L, 2L,
3L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L,
2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("long", "middle",
"short"), class = "factor"), f6 = structure(c(3L, 3L, 3L,
3L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L,
3L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, NA, 3L, 2L,
2L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 3L, 2L, NA,
2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 2L,
3L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L,
1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 2L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 2L,
3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, NA, 3L, 3L, 2L, 2L,
3L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 2L, NA, 2L,
2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L,
2L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 1L,
2L), levels = c("long", "middle", "short"), class = "factor"),
num1 = c(55, 34, 59, 62, 20, 39, 68, 19, 39, 61, 84, 53,
40, 32, 32, 64, 30, 37, 54, 62, 71, 67, 56, 48, 44, 86, 79,
42, 60, 59, 30, 63, 39, 87, 54, 36, 34, 60, 70, 50, 69, 64,
44, 76, 53, 54, 39, 33, 88, 76, 20, 23, 40, 39, 60, 45, 60,
73, 45, 47, 41, 48, 40, 77, 36, NA, 39, 64, 62, 53, 47, 64,
56, 34, 69, 71, 75, 62, 19, 73, 62, 40, 46, 49, 57, 72, 61,
32, 33, 67, 72, 38, 64, 54, 58, 46, 61, 36, 36, 58, 51, 51,
40, 86, 27, 65, 43, 36, 43, 55, 47, 58, 73, 51, 46, 43, 49,
32, 42, 75, 57, 21, 68, 70, 51, 49, 54, 48, 23, 21, 66, 67,
56, 33, 68, 44, 72, 38, 58, 48, 72, 46, 81, 62, 68, 62, 32,
50, 25, 27, 41, 36, 63, 25, 46, 43, 76, 62, 69, 71, 57, 35,
41, 32, 69, 59, 65, 32, 61, 40, 52, 69, 55, 42, 30, 89, 71,
43, 35, 72, 37, 59, 30, 65, 50, 40, 49, 37, 67, 70, 68, 63,
32, 52, 73, 68, 36, 72, 35), dummy1 = structure(c(1L, 2L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, NA, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L,
1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L,
2L, 1L), levels = c("1", "2"), class = "factor"), dummy2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, NA, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 2L, 1L, 2L, 2L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), levels = c("0", "1"), class = "factor"), group = structure(c(8L,
6L, 8L, 8L, 8L, 2L, 8L, 8L, 1L, 8L, 1L, 3L, 8L, 8L, 5L, 7L,
1L, 8L, 8L, 2L, 6L, 1L, 6L, 1L, 5L, 2L, 2L, NA, 3L, 2L, 8L,
8L, 3L, 8L, 3L, 3L, 6L, 4L, 8L, 2L, 5L, NA, 8L, 1L, 2L, 5L,
2L, 6L, 2L, 1L, 8L, 1L, 2L, 3L, 6L, 6L, 3L, 2L, 2L, 8L, NA,
1L, 8L, 6L, 6L, 8L, 1L, 6L, 8L, 1L, 2L, NA, 2L, 4L, 1L, 8L,
6L, 1L, NA, 2L, 2L, 8L, 8L, 8L, 2L, 8L, 8L, 6L, 3L, 8L, 1L,
1L, 2L, 3L, 2L, 8L, 1L, 1L, 3L, 8L, 2L, 8L, 3L, 2L, 4L, 1L,
3L, 4L, 5L, 1L, 8L, 8L, 3L, 6L, 3L, 3L, 8L, 3L, 6L, 1L, 2L,
1L, 8L, 8L, 1L, 1L, 7L, 8L, 3L, 5L, 4L, 3L, 6L, NA, 4L, 1L,
3L, 8L, 6L, 6L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 6L, 3L, 8L, 4L,
4L, 2L, 2L, 7L, 1L, 2L, 8L, 2L, 2L, 8L, 6L, 8L, 6L, 4L, 4L,
2L, 2L, 3L, 1L, 2L, 3L, 8L, 2L, 3L, 2L, 4L, 2L, 5L, 8L, 3L,
7L, 1L, 1L, 3L, NA, 4L, 2L, 8L, 2L, 1L, 8L, 6L, 4L, 1L, 8L,
3L, 1L, 8L), levels = c("1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor")), row.names = c(NA, -199L), class = c("tbl_df",
"tbl", "data.frame"))
d <- d %>% mutate(across(!all_of(c("dep_var", "num1")), ~ .x %>% as.factor()))
Triangulating the Root of the Problem
1.) I have restarted R
and my computer multiple times without any effect.
2.) I tried excluding all NAs in the dataset – in regard to the NA
values of the response variable as well as in regard to the factors. I did so via manipulating the dataset and via the na.action = na.exclude
option in the cforest function. This remedies my problem mostly.
(Though, even this does not reliably solve my problem: For some other dependent variables and/or with different ntree numbers, I receive the following error then: Warning: scheduled core 2 encountered error in user code, all values of the job will be affectedError in colMeans(ret, na.rm = TRUE) : 'x' must be numeric
. Moreover, for some reason, in with the provided sample code, the variance importance is only reported for a subset of variables in the forest.)
However, excluding observations with NA
values is not what i strive for, as it unnecessarily removes observations from the analysis; and it suggests that surrogates are not properly identified in the different trees/the forest.
I would appreciate any suggestion on where the problem lies to be able to extract the conditional variable importance from a random forest with na.pass
and surrogates.