I have a large dataframe df1 with string values in multiple columns:
df1 <-
  data.frame(col1 = rep(c("A", "B", "C"), 3),
             col2 = rep(c("C", "A", "B"), 3),
             col3 = 1:9)
col1 col2 col3
1 A C 1
2 B A 2
3 C B 3
4 A C 4
5 B A 5
6 C B 6
7 A C 7
8 B A 8
9 C B 9
I want to replace some of the string values with alternate values.
I have a second dataframe df2 with the values to be changed (col1) and the alternate values (col2).
df2 <- data.frame(col1 = c("A", "B"),
                  col2 = c("D", "E"))
col1 col2
1 A D
2 B E
So in this example I want all instances of "A" and "B" appearing in df1 to be replaced with "D" and "E" respectively, as per df2.
The final output would look like:
col1 col2 col3
1 D C 1
2 E D 2
3 C E 3
4 D C 4
5 E D 5
6 C E 6
7 D C 7
8 E D 8
9 C E 9
I have tried using across and lapply, but I am having trouble linking to the second dataframe, where I would normally use a join.
You could make use of the superseded function recode(). This is one use case that I have been unable to replicate with case_match()/case_when(), and recode() makes it easy. Note that you could also reverse the columns of df2, i.e. rev(df2), and use fct_recode(), which is not superseded yet.
library(dplyr)
library(tibble) # for deframe()
df1 %>%
  mutate(across(col1:col2, ~recode(.x, !!!deframe(df2))))
col1 col2 col3
1 D C 1
2 E D 2
3 C E 3
4 D C 4
5 E D 5
6 C E 6
7 D C 7
8 E D 8
9 C E 9
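The fct_recode() alternative mentioned above could look roughly like the sketch below. It assumes the forcats package is installed and that df1 and df2 are defined as in the question. The name res and the as.character() wrapper are my additions: fct_recode() returns a factor, so the wrapper converts the columns back to character.

```r
library(dplyr)
library(tibble)   # deframe()
library(forcats)  # fct_recode()

df1 <- data.frame(col1 = rep(c("A", "B", "C"), 3),
                  col2 = rep(c("C", "A", "B"), 3),
                  col3 = 1:9)
df2 <- data.frame(col1 = c("A", "B"),
                  col2 = c("D", "E"))

# fct_recode() expects new = "old", so the lookup is reversed first:
# deframe(rev(df2)) is the named vector c(D = "A", E = "B").
res <- df1 %>%
  mutate(across(col1:col2,
                ~ as.character(fct_recode(.x, !!!deframe(rev(df2))))))
res
```

If factor columns are acceptable, the as.character() call can be dropped.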
NB: If anyone knows how to replicate this using case_match()/case_when(), please go ahead and add the solution.
You could also use str_replace_all(). Although it works in this scenario, it is not advisable, since it may replace parts of a string rather than only whole strings:
library(stringr)
df1 %>%
  mutate(across(col1:col2, ~str_replace_all(.x, deframe(df2))))
col1 col2 col3
1 D C 1
2 E D 2
3 C E 3
4 D C 4
5 E D 5
6 C E 6
7 D C 7
8 E D 8
9 C E 9
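To make the partial-replacement caveat concrete, here is a small sketch on a made-up vector x (not part of the question's data). It compares the unanchored lookup with one whose patterns are wrapped in ^ and $. The names loose and strict are illustrative, setNames() stands in for deframe(df2), and the anchored version assumes the keys contain no regex metacharacters.

```r
library(stringr)

df2 <- data.frame(col1 = c("A", "B"),
                  col2 = c("D", "E"))

# Hypothetical values: two exact keys and two longer strings containing them
x <- c("A", "AB", "B", "CAB")

# Unanchored patterns also rewrite the "A" and "B" inside longer strings
loose <- str_replace_all(x, setNames(df2$col2, df2$col1))

# Anchoring each pattern with ^ and $ restricts it to whole-string matches
strict <- str_replace_all(x, setNames(df2$col2, paste0("^", df2$col1, "$")))

loose   # "D" "DE" "E" "CDE"
strict  # "D" "AB" "E" "CAB"
```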
This is easy with package data.table:
library(data.table)
setDT(df1)
setDT(df2)
#reshape to long format
df1 <- melt(df1, id.vars = "col3")
#update join
df1[df2, value := i.col2, on = c("value==col1")]
#reshape to wide format
dcast(df1, col3 ~ variable)
#Key: <col3>
# col3 col1 col2
# <int> <char> <char>
#1: 1 D C
#2: 2 E D
#3: 3 C E
#4: 4 D C
#5: 5 E D
#6: 6 C E
#7: 7 D C
#8: 8 E D
#9: 9 C E
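The reshape route above relies on col3 uniquely identifying each row. If that cannot be guaranteed, one possible variant is to run the same update join once per column, without melting. This is only a sketch: dt1, dt2 and the loop variable col are names I introduced, and the on string mirrors the "value==col1" syntax used above.

```r
library(data.table)

dt1 <- data.table(col1 = rep(c("A", "B", "C"), 3),
                  col2 = rep(c("C", "A", "B"), 3),
                  col3 = 1:9)
dt2 <- data.table(col1 = c("A", "B"),
                  col2 = c("D", "E"))

# One update join per column: match dt1[[col]] against dt2$col1 and
# overwrite the matched cells with dt2$col2 (referred to as i.col2).
for (col in c("col1", "col2")) {
  dt1[dt2, (col) := i.col2, on = paste0(col, "==col1")]
}
dt1
```

Rows without a match in dt2 are left untouched, and the column order of dt1 is preserved.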
It's surprising that there isn't a nice helper function for this, but it's simple to write one:
library(dplyr)
find_replace = function(x, find, replace) {
  stopifnot(length(find) == length(replace))
  for(i in seq_along(find)) {
    x[x == find[i]] = replace[i]
  }
  x
}
df1 |>
  mutate(across(c(col1, col2), \(x)
    find_replace(x, find = df2$col1, replace = df2$col2)
  ))
# col1 col2 col3
# 1 D C 1
# 2 E D 2
# 3 C E 3
# 4 D C 4
# 5 E D 5
# 6 C E 6
# 7 D C 7
# 8 E D 8
# 9 C E 9
You can use a named vector as a lookup dictionary, and then coalesce() to fall back to the original value where there is no match:
dict <- with(df2, setNames(col2, col1))
df1 %>%
mutate(across(1:2, ~ coalesce(dict[.x], .x)))
which gives
col1 col2 col3
1 D C 1
2 E D 2
3 C E 3
4 D C 4
5 E D 5
6 C E 6
7 D C 7
8 E D 8
9 C E 9
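To show what the lookup does on its own, here is a base R sketch on a short made-up vector x. Indexing a named vector by name returns NA for keys it does not contain, which is why a fallback to the original value is needed. ifelse() stands in for coalesce() here, and the names looked_up and result are illustrative.

```r
df2 <- data.frame(col1 = c("A", "B"),
                  col2 = c("D", "E"))
dict <- setNames(df2$col2, df2$col1)   # c(A = "D", B = "E")

x <- c("A", "B", "C", "A")

# Look every value up once; keys missing from dict come back as NA
looked_up <- unname(dict[x])           # "D" "E" NA "D"

# Fall back to the original value where the lookup found nothing
result <- ifelse(is.na(looked_up), x, looked_up)
result                                 # "D" "E" "C" "D"
```

Because each value is looked up exactly once, a replacement value is never itself replaced again.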
I would use unlist() + match() + [ and stay with base R. You might want to generalise this.
v0.1
df1[c("col1", "col2")] = local({
  U = unlist(df1[c("col1", "col2")], use.names = FALSE)
  i = match(U, df2$col1, nomatch = 9999L) # 9999L := placeholder for "no match"
  j = i == 9999L
  M = df2$col2[i]
  M[j] = U[j]
  M
})
> df1
col1 col2 col3
1 D C 1
2 E D 2
3 C E 3
4 D C 4
5 E D 5
6 C E 6
7 D C 7
8 E D 8
9 C E 9
local({ .. }) is convenient if we want neither to create a custom function nor to clutter the environment with a lot of variables that are used only once.
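One way to generalise the match() idea is a small base R helper along the following lines. This is a sketch, not a tested utility: the function name replace_values and its arguments are hypothetical, and it assumes the target columns are character vectors, not factors.

```r
# Replace values in the chosen columns of `data` using parallel
# vectors `from` (values to find) and `to` (their replacements).
replace_values <- function(data, cols, from, to) {
  stopifnot(length(from) == length(to))
  data[cols] <- lapply(data[cols], function(x) {
    i <- match(x, from)      # position in `from`, NA where there is no match
    hit <- !is.na(i)
    x[hit] <- to[i[hit]]     # overwrite only the matched elements
    x
  })
  data
}

df1 <- data.frame(col1 = rep(c("A", "B", "C"), 3),
                  col2 = rep(c("C", "A", "B"), 3),
                  col3 = 1:9,
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "B"),
                  col2 = c("D", "E"),
                  stringsAsFactors = FALSE)

out <- replace_values(df1, c("col1", "col2"), df2$col1, df2$col2)
out
```

Using NA as the "no match" marker avoids the need for a numeric placeholder, and returning a modified copy leaves df1 itself unchanged.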
For a small number of replacement options, you could use the vectorized case_when() function from the dplyr package:
df1$col1 <- case_when(
  df1$col1 == "A" ~ "D", # A -> D
  df1$col1 == "B" ~ "E", # B -> E
  TRUE ~ df1$col1        # otherwise don't change the value
)
# same logic for df1$col2