I would like to identify all rows in a data frame (or matrix) whose values in columns 1 and 2 match a specific pair. For example, if I have a matrix
testmat = rbind(c(1,1), c(1,2), c(1,4), c(2,1), c(2,4), c(3,4), c(3,10))
I would like to identify the rows that contain any of the following pairs, i.e. all rows that contain a combination of either 1,2 or 2,4 in their first and second columns
of_interest = rbind(c(1,2), c(2,4))
The following does not work
which(testmat[, 1] %in% of_interest[, 1] & testmat[, 2] %in% of_interest[, 2])
because, as expected, it returns all combinations of 1,2 in the first column and 2,4 in the second (i.e. rows 2, 3, 5 rather than just rows 2 and 5 as desired), so that the row [1,4] is included even though this is not one of the pairs I'm querying for. There must be some simple way to use which( ... %in% ...) to match specific pairs like this, but I haven't been able to find a working example.
Note that I need the positions/row numbers of the rows which match the desired condition.
I assume that, as you're using which(), you want the position rather than just whether there is a match. You can cbind() the row number to testmat and then merge() this with of_interest.
merge(
cbind(testmat, seq_len(nrow(testmat))),
of_interest
) |> setNames(c("x", "y", "row_num"))
# x y row_num
# 1 1 2 2
# 2 2 4 5
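If the data are held in data frames rather than matrices, the same idea can be sketched as below. The objects df and pairs and the column names x and y are made up for illustration. Note that merge() sorts its result by the key columns by default, so the recorded row numbers are sorted explicitly to return them in their original order.

```r
# Hypothetical data frames mirroring testmat and of_interest
df <- data.frame(x = c(1, 1, 1, 2, 2, 3, 3), y = c(1, 2, 4, 1, 4, 4, 10))
pairs <- data.frame(x = c(1, 2), y = c(2, 4))

# Record each row's position before merging, since merge() reorders rows
df$row_num <- seq_len(nrow(df))

# Keep only rows of df whose (x, y) pair occurs in pairs
matched <- merge(df, pairs, by = c("x", "y"))

# Positions of the matching rows, in increasing order
row_positions <- sort(matched$row_num)
row_positions
```

Naming the key columns in by keeps the helper column row_num out of the join condition.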
Rcpp approach with very large matrix
You mention in your comment that you have 1e8 rows. This makes me think two things:
1. You should avoid merge(), as this will coerce matrices to data frames, i.e. copy each column into a memory-contiguous vector, which will be very expensive.
2. If of_interest is also large, you want to break the loop as soon as a match is found rather than continuing to iterate. See this question for performance advantages.
Given this I would avoid using which() or other approaches which do not exit early. Here's some Rcpp code that should be much faster than merge() with large datasets:
Rcpp::cppFunction("
IntegerVector get_row_position(NumericMatrix testmat, NumericMatrix of_interest) {
    const R_xlen_t nrow_testmat = testmat.nrow();
    const R_xlen_t nrow_of_interest = of_interest.nrow();
    IntegerVector result;
    // loop through the rows of testmat
    for (R_xlen_t i = 0; i < nrow_testmat; ++i) {
        for (R_xlen_t j = 0; j < nrow_of_interest; ++j) {
            if (testmat(i, 0) == of_interest(j, 0) && testmat(i, 1) == of_interest(j, 1)) {
                result.push_back(i + 1); // because of 1-indexing
                break; // leave inner loop early
            }
        }
    }
    return result;
}
")
get_row_position(testmat, of_interest)
# [1] 2 5
Note: This previously accessed rows as sub-matrices, e.g. NumericMatrix::Row test_row = testmat(i, _);, which is more idiomatic Rcpp code than a double for-loop with matrix indexing, but after benchmarking it turned out to be much slower, so I've updated it to compare elements directly. See the edit history for the previous version.
I updated the above function after the nice answer from jblood94, which showed the previous approach was slower than some base R approaches. I use the 100m row benchmark from that answer against the rowmatch3() function, which was the fastest (and around 8 times faster than my previous answer). This slightly updated approach is around 5 times faster than rowmatch3().
testmat <- `dim<-`(as.numeric(sample(4e3, 2e8, 1)), c(1e8, 2))
matchmat <- unique(`dim<-`(sample(4e3, 10, 1), c(5, 2)))
microbenchmark::microbenchmark(
get_row_position = get_row_position(testmat, matchmat),
rowmatch3 = rowmatch3(testmat, matchmat),
check = "identical",
unit = "relative",
times = 10L
)
# Unit: relative
# expr min lq mean median uq max neval cld
# get_row_position 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
# rowmatch3 5.262158 5.309956 5.405731 5.426385 5.469428 5.321671 10 b
Here are three other base R options that outperform the earlier asplit solution in terms of speed:
tic1 <- function(testmat, of_interest) {
  u <- c(1, 1i)
  which(tcrossprod(u, testmat) %in% tcrossprod(u, of_interest))
}
tic2 <- function(testmat, of_interest) {
  p <- testmat[, 1] + 1i * testmat[, 2]
  q <- of_interest[, 1] + 1i * of_interest[, 2]
  which(p %in% q)
}
tic3 <- function(testmat, of_interest) {
  p <- match(testmat[, 1], of_interest[, 1])
  q <- match(testmat[, 2], of_interest[, 2])
  which(p == q)
}
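One caveat worth checking before relying on tic3: match() returns only the first position of each value, so comparing positions column by column only gives the right answer when no value is repeated within a column of of_interest. The small sketch below uses a made-up of_interest2, in which the value 1 occurs twice in the first column, and compares tic3 against a pasted-key lookup used here purely as a reference.

```r
testmat <- rbind(c(1, 1), c(1, 2), c(1, 4), c(2, 1), c(2, 4), c(3, 4), c(3, 10))
# Hypothetical query: the value 1 appears twice in the first column
of_interest2 <- rbind(c(1, 2), c(1, 4))

tic3 <- function(testmat, of_interest) {
  p <- match(testmat[, 1], of_interest[, 1])
  q <- match(testmat[, 2], of_interest[, 2])
  which(p == q)
}

# Reference result: compare whole rows via pasted keys
keys <- function(m) paste(m[, 1], m[, 2], sep = "_")
by_keys <- which(keys(testmat) %in% keys(of_interest2))

by_tic3 <- tic3(testmat, of_interest2)

by_keys  # rows 2 and 3
by_tic3  # row 2 only: the pair c(1, 4) is missed
```

For row c(1, 4), match() reports position 1 in the first column and position 2 in the second, so the positions disagree even though the pair is present. With the original of_interest, where each column holds distinct values, the problem does not arise.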
Borrowing the benchmarking template from @SamR's excellent answer, you can see
set.seed(0)
testmat <- `dim<-`(sample(4e3, 2e8, 1), c(1e8, 2))
matchmat <- unique(`dim<-`(sample(4e3, 10, 1), c(5, 2)))
microbenchmark::microbenchmark(
get_row_position = get_row_position(testmat, matchmat),
tic1 = tic1(testmat, matchmat),
tic2 = tic2(testmat, matchmat),
tic3 = tic3(testmat, matchmat),
check = "identical",
unit = "relative",
times = 10L
)
gives
Unit: relative
expr min lq mean median uq max neval
get_row_position 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
tic1 5.713719 5.659044 5.397984 5.542336 5.078150 5.169045 10
tic2 5.722275 5.484632 4.993905 5.141519 4.672854 4.069125 10
tic3 2.933148 3.002827 2.879779 2.829355 2.761182 3.007946 10
Here is an approach with which + asplit
> which(asplit(testmat, 1) %in% asplit(of_interest, 1))
[1] 2 5
which might be a bit inefficient due to asplit, but should work well for small datasets if speed is not one of your concerns.
Here are a few base R functions with a focus on performance. They are variations on the same theme. The third iteratively reduces the search and is fastest for large matrices (though still slower than @SamR's updated Rcpp function--see benchmarks below). I also added the complex matching solution from @alexis_laz's comment.
rowmatch1 <- function(mat, matchmat) {
  u <- unique(`dim<-`(matchmat, NULL))
  m <- array(FALSE, rep(length(u), ncol(matchmat)))
  m[`dim<-`(match(matchmat, u), dim(matchmat))] <- TRUE
  which(m[`dim<-`(match(mat, u), dim(mat))])
}
rowmatch2 <- function(mat, matchmat) {
  u <- apply(matchmat, 2, unique, simplify = FALSE)
  m <- array(FALSE, lengths(u))
  m[mapply(\(i) match(matchmat[,i], u[[i]]), 1:ncol(matchmat))] <- TRUE
  which(m[mapply(\(i) match(mat[,i], u[[i]]), 1:ncol(matchmat))])
}
rowmatch3 <- function(mat, matchmat) {
  u <- apply(matchmat, 2, unique, simplify = FALSE)
  m <- array(FALSE, lengths(u))
  m[mapply(\(i) match(matchmat[,i], u[[i]]), 1:ncol(matchmat))] <- TRUE
  i <- which(mat[,1] %in% u[[1]])
  for (j in (1:length(u))[-1]) i <- i[mat[i, j] %in% u[[j]]]
  mat <- mat[i,]
  i[which(m[mapply(\(i) match(mat[,i], u[[i]]), 1:ncol(matchmat))])]
}
rowmatchCmplx <- function(mat, matchmat) {
  stopifnot(ncol(matchmat) == 2L)
  which(complex(real = mat[,1], imaginary = mat[,2]) %in%
          complex(real = matchmat[,1], imaginary = matchmat[,2]))
}
Testing:
rowmatch1(testmat, of_interest)
#> [1] 2 5
rowmatch2(testmat, of_interest)
#> [1] 2 5
rowmatch3(testmat, of_interest)
#> [1] 2 5
rowmatchCmplx(testmat, of_interest)
#> [1] 2 5
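To make the shared idea behind rowmatch1 to rowmatch3 more concrete, here is a step-by-step sketch on the toy data for the two-column case only. The intermediate names u1, u2 and idx are introduced just for this illustration. It relies on the base R rule that a row of an index matrix containing NA yields NA in the result, which which() then ignores.

```r
testmat <- rbind(c(1, 1), c(1, 2), c(1, 4), c(2, 1), c(2, 4), c(3, 4), c(3, 10))
of_interest <- rbind(c(1, 2), c(2, 4))

# Distinct values seen in each column of the query pairs
u1 <- unique(of_interest[, 1])
u2 <- unique(of_interest[, 2])

# Logical lookup array: m[a, b] is TRUE when (u1[a], u2[b]) is a wanted pair
m <- array(FALSE, c(length(u1), length(u2)))
m[cbind(match(of_interest[, 1], u1), match(of_interest[, 2], u2))] <- TRUE

# Translate every row of testmat into a position in m;
# values not present in u1 or u2 become NA
idx <- cbind(match(testmat[, 1], u1), match(testmat[, 2], u2))

# Rows with an NA index give NA, which which() drops
res <- which(m[idx])
res
```

Here m is a 2 x 2 array with TRUE on its diagonal, and only rows 2 and 5 of testmat map onto a TRUE cell.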
Benchmarking on a 10M-row matrix (including @SamR's Rcpp function):
testmat <- `dim<-`(as.numeric(sample(2e3, 2e7, 1)), c(1e7, 2))
matchmat <- unique(`dim<-`(as.numeric(sample(2e3, 10, 1)), c(5, 2)))
microbenchmark::microbenchmark(
get_row_position = get_row_position(testmat, matchmat),
rowmatch1 = rowmatch1(testmat, matchmat),
rowmatch2 = rowmatch2(testmat, matchmat),
rowmatch3 = rowmatch3(testmat, matchmat),
rowmatchCmplx = rowmatchCmplx(testmat, matchmat),
check = "identical",
times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> get_row_position 39.6565 39.7937 40.65146 40.22035 41.3174 42.6404 10 a
#> rowmatch1 341.5442 346.9554 360.70051 352.25465 364.7652 405.6459 10 b
#> rowmatch2 496.5627 515.0797 528.95796 524.12235 547.5906 561.9820 10 c
#> rowmatch3 207.5945 233.3698 243.04387 242.53215 247.8733 296.7575 10 d
#> rowmatchCmplx 426.5008 465.4813 480.95106 487.71605 496.6567 520.5847 10 e
Benchmark on a 100M-row matrix:
testmat <- `dim<-`(as.numeric(sample(4e3, 2e8, 1)), c(1e8, 2))
matchmat <- unique(`dim<-`(as.numeric(sample(4e3, 10, 1)), c(5, 2)))
microbenchmark::microbenchmark(
get_row_position = get_row_position(testmat, matchmat),
rowmatch1 = rowmatch1(testmat, matchmat),
rowmatch2 = rowmatch2(testmat, matchmat),
rowmatch3 = rowmatch3(testmat, matchmat),
rowmatchCmplx = rowmatchCmplx(testmat, matchmat),
check = "identical",
times = 1
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> get_row_position 405.7012 405.7012 405.7012 405.7012 405.7012 405.7012 1
#> rowmatch1 3832.2640 3832.2640 3832.2640 3832.2640 3832.2640 3832.2640 1
#> rowmatch2 5949.3731 5949.3731 5949.3731 5949.3731 5949.3731 5949.3731 1
#> rowmatch3 2475.2071 2475.2071 2475.2071 2475.2071 2475.2071 2475.2071 1
#> rowmatchCmplx 5238.6490 5238.6490 5238.6490 5238.6490 5238.6490 5238.6490 1
The functions work for an arbitrary number of columns:
testmat <- `dim<-`(as.numeric(sample(1e2, 3e7, 1)), c(1e7, 3))
matchmat <- unique(`dim<-`(as.numeric(sample(1e2, 15, 1)), c(5, 3)))
microbenchmark::microbenchmark(
rowmatch1 = rowmatch1(testmat, matchmat),
rowmatch2 = rowmatch2(testmat, matchmat),
rowmatch3 = rowmatch3(testmat, matchmat),
check = "identical",
times = 1
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> rowmatch1 643.4180 643.4180 643.4180 643.4180 643.4180 643.4180 1
#> rowmatch2 786.1225 786.1225 786.1225 786.1225 786.1225 786.1225 1
#> rowmatch3 271.2028 271.2028 271.2028 271.2028 271.2028 271.2028 1
Note that this approach works best if matchmat is relatively small. If it gets very large, the matching array (m) will blow up. In this case, it would be better to build m as a sparse array.
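One possible way to avoid materialising the dense array, sketched here in base R, is to store only the linear indices of the cells that would be TRUE and look rows up by that index. The function rowmatch_keys and its helper lin_key are hypothetical names, not part of the functions above, and the sketch assumes that the product of the per-column unique counts stays below 2^53 so the keys remain exact as doubles.

```r
rowmatch_keys <- function(mat, matchmat) {
  stopifnot(ncol(mat) == ncol(matchmat))
  # Distinct values per column of the query rows
  u <- lapply(seq_len(ncol(matchmat)), function(k) unique(matchmat[, k]))

  # Zero-based linear index that a row would have in the dense array m.
  # Rows containing a value absent from u[[k]] get NA.
  lin_key <- function(x) {
    key <- numeric(nrow(x))
    stride <- 1
    for (k in seq_along(u)) {
      key <- key + (match(x[, k], u[[k]]) - 1) * stride
      stride <- stride * length(u[[k]])
    }
    key
  }

  # Only the occupied cells are kept, as a plain vector of keys
  which(lin_key(mat) %in% lin_key(matchmat))
}

testmat <- rbind(c(1, 1), c(1, 2), c(1, 4), c(2, 1), c(2, 4), c(3, 4), c(3, 10))
of_interest <- rbind(c(1, 2), c(2, 4))
res_keys <- rowmatch_keys(testmat, of_interest)
res_keys
```

Memory use then grows with the number of rows in matchmat rather than with the product of the unique counts, at the cost of a second %in% lookup over the keys.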
You could paste() the values from your example (testmat and of_interest) into a single value and then do one %in% evaluation. For example:
testmat_keys <- paste(testmat[, 1], testmat[, 2], sep = "_")
of_interest_keys <- paste(of_interest[, 1], of_interest[, 2], sep = "_")
which(testmat_keys %in% of_interest_keys) #returns [1] 2 5
If %in% is not fast enough for you, consider trying %fin% or fmatch() from fastmatch as a faster alternative to %in%.
#install.packages('fastmatch')
library(fastmatch)
matches <- which(fmatch(testmat_keys, of_interest_keys, nomatch = 0) > 0)
collapse::%iin% is fast. First, convert the matrices to data frames using collapse::qDF (faster than as.data.frame, not shown).
library(collapse)
qDF(testmat) %iin% qDF(matchmat)
# toy data (numeric, as expected by SamR Rcpp)
testmat <- `dim<-`(as.numeric(sample(4e3, 2e8, 1)), c(1e8, 2))
matchmat <- unique(`dim<-`(as.numeric(sample(4e3, 10, 1)), c(5, 2)))
microbenchmark::microbenchmark(
get_row_position = get_row_position(testmat, matchmat),
rowmatch3 = rowmatch3(testmat, matchmat),
clps = qDF(testmat) %iin% qDF(matchmat),
check = "identical",
unit = "relative",
times = 10L
)
# Unit: relative
# expr min lq mean median uq max neval
# get_row_position 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# rowmatch3 3.488175 3.520942 3.542119 3.570613 3.581646 3.483877 10
# clps 1.284572 1.414591 1.486958 1.502061 1.562192 1.595528 10
I don't have enough reputation to moderate, but there are several related posts (since OP asks both about matrices and data frames), without explicit emphasis on performance though:
How to find row index of common rows between two matrices in R; Get row numbers where two matrices have equal rows; Find indexes of matching rows in two matrices of different size; Finding rows of a large matrix that match specific values; How do I tag rows with two variables that match rows in a second data frame?; Get indices of common rows from two different dataframes; How to find indices of specific rows in dataframe
We might treat it as interval data and use {ivs}.
(0) Set-up
testmat = rbind(c(1,1), c(1,2), c(1,4), c(2,1), c(2,4), c(3,4), c(3,10))
of_interest = rbind(c(1,2), c(2,4))
n = seq_len(nrow(testmat))
(1) Index
i = testmat[, 1] < testmat[, 2]
since the documentation of ivs::iv() states
This means that start < end is a requirement to generate an interval vector. In particular, empty intervals with start == end are not allowed.
For further reading, you might want to start here.
(Notice that ordering rows would create a different problem!)
(2) Compare
library(ivs) # start end
w = iv_overlaps(iv(testmat[i, 1], testmat[i, 2]),
iv(of_interest[, 1], of_interest[, 2]),
type = "equals")
(3) Index again
n[i][w]
[1] 2 5
merge(testmat,of_interest)
– one Commented Jan 2 at 21:22
An approach using which() is the following, but I doubt it is faster than merge: which(apply(testmat,1,paste,collapse="_")%in%apply(of_interest,1,paste,collapse="_"))
– one Commented Jan 2 at 21:37
complex(real = testmat[, 1], imaginary = testmat[, 2]) %in% complex(real = of_interest[, 1], imaginary = of_interest[, 2])
– alexis_laz Commented Jan 3 at 6:42