r - partykit: Error when using varimp on cforest for data set including NA values - Stack Overflow

admin2025-05-02  1

I want to estimate the relative importance of variables in explaining a response variable ("dep_var", a numeric variable based on a 4-point Likert scale). I am mostly intersted in the relative importance of six (correlated) factors ("f1" to "f6"). To achieve this, I want to rely on the partykit package in R and run a cforest with varimp(conditional = T) as suggested by Strebel et al. (2008) and elsewhere. I have NAvalues in the data, but according to documentation, cforest can handle them. (Correspondingly, its default na.action is na.pass with surrogate variable selection). However, when I run the varimp function, I get an error message instead of results.

What I did so far:

Create Random Forest
I first produced a random forest object using the partykit::cforest function.

library(partykit)
library(tidyverse)

formula <- paste0("dep_var ~ ", paste(c("f1", "f2", "f3", "f4", "f5", "f6", "num1", "dummy1", "dummy2", "group"), collapse = " + ")) %>% as.formula()

set.seed(100)

y <- partykit::cforest(formula = formula, data = d, cores = 6, ntree = 1000, control = partykit::ctree_control(maxsurrogate = 4))

Determine Variable Importance
Subsequently, I run the partykit::varimp function on this forest object. It should provide me with the relative importance of all variables included in the formula provided in cforest.

partykit::varimp(y, conditional = T)

Error
However, when I run cforests and varimp on my data, I receive this error message:

Error in FUN(model = model, trafo = trafo, data = data, subset = subset, : NROW(Y) == length(ix) is not TRUE

Sample Data to Reproduce Error

library(tidyverse)

d <- structure(list(dep_var = c(4, 4, 2, 1, 2, 2, 1, 1, 1, 4, 2, 4, 
3, 1, 4, 1, 1, 1, 2, 2, 2, 1, 2, 3, 4, 3, 4, 2, 3, 2, 4, 3, 2, 
3, 4, 4, 3, 3, 2, 3, 4, 4, 1, 2, 3, 4, 2, 3, 4, 3, 3, 2, 2, 4, 
4, 1, 2, 3, 2, 1, 4, 1, 2, 3, 2, 3, 3, 1, 4, 3, 2, 4, 3, 1, 1, 
2, 1, 2, 1, 3, 4, 1, 4, 2, 3, 2, 1, 2, 4, 1, 2, 4, 2, 4, 2, 1, 
4, 1, 3, 4, 3, 4, 3, 1, 3, 2, 4, 3, 3, 3, 2, 2, 2, 2, 2, 4, 3, 
2, 4, 3, 3, 2, 3, 2, 4, 1, 3, 3, 2, 3, 2, 1, 3, 2, 1, 1, 4, 3, 
2, 4, 4, 1, 4, 3, 3, 4, 2, 1, 2, 2, 2, 3, 4, 2, 3, 4, 1, 3, 3, 
2, 2, 4, 2, 3, 1, 4, 4, 3, 2, 3, 1, 3, 2, 4, 2, 3, 3, 1, 2, 1, 
2, 2, 1, 3, 2, 2, 3, 1, 3, 3, 3, 2, 2, 2, 3, 4, 3, 3, 3), f1 = structure(c(NA, 
1L, 2L, 3L, 2L, 1L, 3L, NA, NA, NA, 3L, NA, 3L, 3L, NA, 2L, NA, 
1L, NA, 3L, 3L, NA, 2L, 3L, 2L, 2L, NA, NA, 2L, NA, 3L, 2L, NA, 
2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, NA, 
NA, 3L, 3L, 2L, 3L, 2L, 3L, 3L, NA, 1L, NA, 3L, 1L, 3L, 2L, 3L, 
3L, 2L, 3L, 3L, 1L, 2L, NA, 2L, NA, NA, 3L, 2L, NA, 3L, 3L, 3L, 
2L, 3L, 2L, 2L, 2L, 2L, NA, 2L, 3L, 2L, 1L, 2L, NA, NA, 3L, 3L, 
3L, 3L, 3L, 3L, NA, NA, NA, 1L, 3L, 3L, 1L, 3L, 3L, NA, 1L, NA, 
2L, 1L, 2L, 3L, 1L, NA, 3L, 3L, NA, 2L, NA, NA, 2L, 3L, 3L, 1L, 
3L, 2L, 3L, 3L, 2L, 2L, NA, NA, 2L, 3L, 3L, 3L, 3L, 2L, 2L, NA, 
3L, 3L, NA, 3L, 2L, NA, 1L, NA, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 
3L, 2L, 2L, NA, 1L, 3L, 2L, 2L, 3L, 3L, NA, 3L, NA, 3L, 1L, 3L, 
2L, 2L, 2L, 2L, 3L, NA, NA, 1L, 3L, 2L, 3L, NA, 2L, 2L, 3L, 3L, 
NA, 3L, 2L, 3L, NA, 1L), levels = c("long", "middle", "short"
), class = "factor"), f2 = structure(c(1L, 1L, 2L, 3L, 3L, 1L, 
1L, 3L, 1L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 
2L, 2L, 2L, 2L, 3L, NA, 2L, 1L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L, 
3L, 3L, 3L, NA, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 3L, 3L, 
3L, 1L, 2L, 1L, 2L, NA, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 1L, 
2L, NA, 2L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 
2L, NA, 3L, 3L, 3L, 2L, 3L, NA, NA, 3L, 3L, 2L, 2L, 3L, 3L, 1L, 
2L, 3L, 2L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 3L, 1L, 2L, 3L, 1L, 
1L, 3L, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 3L, 
3L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 2L, 3L, 
3L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 1L, 2L, 
3L, 2L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 
2L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 1L, 
2L), levels = c("long", "middle", "short"), class = "factor"), 
    f3 = structure(c(1L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, 2L, 3L, 
    3L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 3L, 
    3L, NA, 3L, 1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 3L, 
    1L, 3L, 1L, 3L, 2L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 
    2L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 3L, 
    3L, 1L, 3L, 2L, 3L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, NA, 1L, 3L, 
    1L, 3L, 3L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 1L, 3L, 1L, 
    2L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 
    1L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 3L, 3L, 
    3L, NA, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 
    3L, 1L, 3L, 3L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 1L, 1L, 1L, 
    1L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 
    1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 3L), levels = c("long", "middle", 
    "short"), class = "factor"), f4 = structure(c(3L, 3L, 3L, 
    3L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 2L, 1L, 
    3L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, NA, 1L, 3L, 2L, 3L, 3L, 1L, 
    1L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 3L, 2L, 
    1L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 1L, 2L, 
    2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 
    3L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 1L, 3L, 1L, 2L, 3L, 
    1L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 2L, 
    3L, 3L, 1L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 2L, 1L, 
    3L, 3L, 3L, 2L, 3L, 1L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 2L, 2L, 
    3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 3L, 2L, 1L, 
    3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 2L, 2L, 
    1L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 2L, 3L, 
    3L, 2L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 2L, 1L, 
    3L), levels = c("long", "middle", "short"), class = "factor"), 
    f5 = structure(c(3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 
    3L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 
    3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 1L, 2L, 2L, 3L, 2L, 
    3L, 3L, 2L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, NA, 1L, 3L, 3L, 3L, 
    3L, 3L, 3L, 1L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
    2L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, NA, 3L, 3L, 1L, NA, 3L, 2L, 
    3L, 3L, 2L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
    2L, 1L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 2L, 1L, 3L, 2L, 3L, 
    2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 
    3L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, NA, 3L, 3L, 2L, 
    3L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 
    3L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 
    2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("long", "middle", 
    "short"), class = "factor"), f6 = structure(c(3L, 3L, 3L, 
    3L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 3L, 3L, 1L, 3L, 1L, 2L, 
    3L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, NA, 3L, 2L, 
    2L, 2L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 
    2L, 2L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 3L, 2L, NA, 
    2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 2L, 
    3L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 
    1L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 
    2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 2L, 
    3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 
    3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, NA, 3L, 3L, 2L, 2L, 
    3L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 2L, NA, 2L, 
    2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 
    2L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 1L, 
    2L), levels = c("long", "middle", "short"), class = "factor"), 
    num1 = c(55, 34, 59, 62, 20, 39, 68, 19, 39, 61, 84, 53, 
    40, 32, 32, 64, 30, 37, 54, 62, 71, 67, 56, 48, 44, 86, 79, 
    42, 60, 59, 30, 63, 39, 87, 54, 36, 34, 60, 70, 50, 69, 64, 
    44, 76, 53, 54, 39, 33, 88, 76, 20, 23, 40, 39, 60, 45, 60, 
    73, 45, 47, 41, 48, 40, 77, 36, NA, 39, 64, 62, 53, 47, 64, 
    56, 34, 69, 71, 75, 62, 19, 73, 62, 40, 46, 49, 57, 72, 61, 
    32, 33, 67, 72, 38, 64, 54, 58, 46, 61, 36, 36, 58, 51, 51, 
    40, 86, 27, 65, 43, 36, 43, 55, 47, 58, 73, 51, 46, 43, 49, 
    32, 42, 75, 57, 21, 68, 70, 51, 49, 54, 48, 23, 21, 66, 67, 
    56, 33, 68, 44, 72, 38, 58, 48, 72, 46, 81, 62, 68, 62, 32, 
    50, 25, 27, 41, 36, 63, 25, 46, 43, 76, 62, 69, 71, 57, 35, 
    41, 32, 69, 59, 65, 32, 61, 40, 52, 69, 55, 42, 30, 89, 71, 
    43, 35, 72, 37, 59, 30, 65, 50, 40, 49, 37, 67, 70, 68, 63, 
    32, 52, 73, 68, 36, 72, 35), dummy1 = structure(c(1L, 2L, 
    2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, NA, 1L, 1L, 
    1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 
    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 
    1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
    2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 
    1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 
    2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 
    1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 
    2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 
    2L, 1L), levels = c("1", "2"), class = "factor"), dummy2 = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
    2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 
    1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
    2L, 2L, 2L, 1L, 1L, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, NA, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 
    1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 
    1L, 2L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), levels = c("0", "1"), class = "factor"), group = structure(c(8L, 
    6L, 8L, 8L, 8L, 2L, 8L, 8L, 1L, 8L, 1L, 3L, 8L, 8L, 5L, 7L, 
    1L, 8L, 8L, 2L, 6L, 1L, 6L, 1L, 5L, 2L, 2L, NA, 3L, 2L, 8L, 
    8L, 3L, 8L, 3L, 3L, 6L, 4L, 8L, 2L, 5L, NA, 8L, 1L, 2L, 5L, 
    2L, 6L, 2L, 1L, 8L, 1L, 2L, 3L, 6L, 6L, 3L, 2L, 2L, 8L, NA, 
    1L, 8L, 6L, 6L, 8L, 1L, 6L, 8L, 1L, 2L, NA, 2L, 4L, 1L, 8L, 
    6L, 1L, NA, 2L, 2L, 8L, 8L, 8L, 2L, 8L, 8L, 6L, 3L, 8L, 1L, 
    1L, 2L, 3L, 2L, 8L, 1L, 1L, 3L, 8L, 2L, 8L, 3L, 2L, 4L, 1L, 
    3L, 4L, 5L, 1L, 8L, 8L, 3L, 6L, 3L, 3L, 8L, 3L, 6L, 1L, 2L, 
    1L, 8L, 8L, 1L, 1L, 7L, 8L, 3L, 5L, 4L, 3L, 6L, NA, 4L, 1L, 
    3L, 8L, 6L, 6L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 6L, 3L, 8L, 4L, 
    4L, 2L, 2L, 7L, 1L, 2L, 8L, 2L, 2L, 8L, 6L, 8L, 6L, 4L, 4L, 
    2L, 2L, 3L, 1L, 2L, 3L, 8L, 2L, 3L, 2L, 4L, 2L, 5L, 8L, 3L, 
    7L, 1L, 1L, 3L, NA, 4L, 2L, 8L, 2L, 1L, 8L, 6L, 4L, 1L, 8L, 
    3L, 1L, 8L), levels = c("1", "2", "3", "4", "5", "6", "7", 
    "8"), class = "factor")), row.names = c(NA, -199L), class = c("tbl_df", 
"tbl", "data.frame"))

d <- d %>% mutate(across(!all_of(c("dep_var", "num1")), ~ .x %>% as.factor()))

Triangulating the Root of the Problem

1.) I have restarted R and my computer multiple times without any effect.

2.) I tried excluding all NAs in the dataset – in regard to the NAvalues of the response variable as well as in regard to the factors. I did so via manipulating the dataset and via the na.action = na.exclude option in the cforest function. This remedies my problem mostly. (Though, even this does not reliably solve my problem: For some other dependent variables and/or with different ntree numbers, I receive the following error then: Warning: scheduled core 2 encountered error in user code, all values of the job will be affectedError in colMeans(ret, na.rm = TRUE) : 'x' must be numeric. Moreover, for some reason, in with the provided sample code, the variance importance is only reported for a subset of variables in the forest.)

However, excluding observations with NAvalues is not what i strive for, as it unnecessarily removes observations from the analysis; and it suggests that surrogates are not properly identified in the different trees/the forest.

I would appreciate any suggestion on where the problem lies to be able to extract the conditional variable importance from a random forest with na.pass and surrogates.

转载请注明原文地址:http://anycun.com/QandA/1746117633a91912.html