performance - Identifying data frame rows in R with specific pairs of values in two columns - Stack Overflow


I would like to identify all rows in a data frame (or matrix) whose values in column 1 and 2 match a specific pair. For example, if I have a matrix

testmat = rbind(c(1,1), c(1,2), c(1,4), c(2,1), c(2,4), c(3,4), c(3,10))

I would like to identify the rows that match any of the following pairs, i.e. all rows whose first and second columns form either the pair 1,2 or the pair 2,4:

of_interest = rbind(c(1,2), c(2,4))

The following does not work:

which(testmat[, 1] %in% of_interest[, 1] & testmat[, 2] %in% of_interest[, 2])

because, as expected, it returns all combinations of 1,2 in the first column and 2,4 in the second (i.e. rows 2, 3 and 5 rather than just rows 2 and 5 as desired), so that row [1,4] is included even though it is not one of the pairs I'm querying for. There must be some simple way to use which(... %in% ...) to match specific pairs like this, but I haven't been able to find a working example.

Note that I need the positions/row numbers of the rows which match the desired condition.

asked Jan 2 by Max, edited Jan 4 by ThomasIsCoding. Comments:
  • 1 Common approach is to join two data frames. i.e. merge(testmat,of_interest) – one Commented Jan 2 at 21:22
  • 1 Merge just takes the intersection and returns of_interest (which is a subset of testmat in my example), without indicating the positions of the matching rows. – Max Commented Jan 2 at 21:25
  • 2 I see! I did miss the last paragraph... One way to use which() is the following but I doubt it is faster than merge. which(apply(testmat,1,paste,collapse="_")%in%apply(of_interest,1,paste,collapse="_")) – one Commented Jan 2 at 21:37
  • 2 Thanks - you didn't miss it, I added it for clarification after reading your post! I had thought of pasting the two column elements into a single string, but was hoping that there was a more efficient way to do this, perhaps there isn't. – Max Commented Jan 2 at 21:42
  • 3 Are you working with pairs only? In that case you could utilize "complex" numbers: complex(real = testmat[, 1], imaginary = testmat[, 2]) %in% complex(real = of_interest[, 1], imaginary = of_interest[, 2]) – alexis_laz Commented Jan 3 at 6:42

6 Answers


Standard approach

I assume, since you're using which(), that you want the positions rather than just whether there is a match. You can cbind() the row number to testmat and then merge() this with of_interest.

merge(
    cbind(testmat, seq_len(nrow(testmat))),
    of_interest
) |> setNames(c("x", "y", "row_num"))

#   x y row_num
# 1 1 2       2
# 2 2 4       5
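
If you only need the row numbers themselves, you can pull the third column out of the merged result. A small follow-on sketch (merged is just a name introduced here; sort() restores ascending order, since merge() orders by the key columns rather than by original position):

merged <- merge(
    cbind(testmat, seq_len(nrow(testmat))),
    of_interest
) |> setNames(c("x", "y", "row_num"))

sort(merged$row_num)
# [1] 2 5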

Rcpp approach with very large matrix

You mention in your comment that you have 1e8 rows. This makes me think two things:

  1. Don't merge() as this will coerce matrices to data frames, i.e. copy each column into a memory-contiguous vector, which will be very expensive.
  2. If of_interest is also large, you want to break the loop as soon as a match is found rather than continuing to iterate. See this question for performance advantages.

Given this I would avoid using which() or other approaches which do not exit early. Here's some Rcpp code that should be much faster than merge() with large datasets:

Rcpp::cppFunction("
IntegerVector get_row_position(NumericMatrix testmat, NumericMatrix of_interest) {
    const R_xlen_t nrow_testmat = testmat.nrow();
    const R_xlen_t nrow_of_interest = of_interest.nrow();
    IntegerVector result;

    // loop through the rows of testmat
    for (R_xlen_t i = 0; i < nrow_testmat; ++i) {
        for (R_xlen_t j = 0; j < nrow_of_interest; ++j) {
            if (testmat(i, 0) == of_interest(j, 0) && testmat(i, 1) == of_interest(j, 1)) {
                result.push_back(i + 1); // because of 1-indexing
                break; // leave inner loop early
            }
        }
    }
    return result;
}
")

get_row_position(testmat, of_interest)
# [1] 2 5

Note: This previously accessed rows as sub-matrices, e.g. NumericMatrix::Row test_row = testmat(i, _);, which is more idiomatic Rcpp than a double for-loop with matrix indexing, but benchmarking showed it to be much slower, so I've updated it to compare elements directly. See the edit history for the previous version.
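
For reference, here is a rough sketch of what such a row-slice variant looks like. This is a reconstruction based on the description above, not the code from the edit history, and get_row_position_rowslice() is a name made up here; the per-iteration row copies and sugar comparison are a plausible reason it benchmarks slower.

Rcpp::cppFunction("
IntegerVector get_row_position_rowslice(NumericMatrix testmat, NumericMatrix of_interest) {
    IntegerVector result;
    for (R_xlen_t i = 0; i < testmat.nrow(); ++i) {
        NumericVector test_row = testmat(i, _);            // copies the row
        for (R_xlen_t j = 0; j < of_interest.nrow(); ++j) {
            NumericVector match_row = of_interest(j, _);   // copied on every iteration
            if (is_true(all(test_row == match_row))) {
                result.push_back(i + 1);                   // 1-based row number
                break;                                     // leave inner loop early
            }
        }
    }
    return result;
}
")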

A quick benchmark

I updated the above function after the nice answer from jblood94, which showed the previous approach was slower than some base R approaches. I use the 100M-row benchmark from that answer against the rowmatch3() function, which was the fastest there (and around 8 times faster than my previous answer). This slightly updated approach is around 5 times faster than rowmatch3().

testmat <- `dim<-`(as.numeric(sample(4e3, 2e8, 1)), c(1e8, 2))
matchmat <- unique(`dim<-`(sample(4e3, 10, 1), c(5, 2)))
microbenchmark::microbenchmark(
    get_row_position = get_row_position(testmat, matchmat),
    rowmatch3 = rowmatch3(testmat, matchmat),
    check = "identical",
    unit = "relative",
    times = 10L
)

# Unit: relative
#              expr      min       lq     mean   median       uq      max neval cld
#  get_row_position 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 
#         rowmatch3 5.262158 5.309956 5.405731 5.426385 5.469428 5.321671    10   b

Update

Here are three other base R options that outperform the earlier asplit solution in terms of speed:

tic1 <- function(testmat, of_interest) {
  u <- c(1, 1i)
  which(tcrossprod(u, testmat) %in% tcrossprod(u, of_interest))
}

tic2 <- function(testmat, of_interest) {
  p <- testmat[, 1] + 1i * testmat[, 2]
  q <- of_interest[, 1] + 1i * of_interest[, 2]
  which(p %in% q)
}

tic3 <- function(testmat, of_interest) {
  p <- match(testmat[, 1], of_interest[, 1])
  q <- match(testmat[, 2], of_interest[, 2])
  which(p == q)
}
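
As a quick sanity check, these should reproduce the rows found earlier when run on the toy data from the question:

testmat <- rbind(c(1,1), c(1,2), c(1,4), c(2,1), c(2,4), c(3,4), c(3,10))
of_interest <- rbind(c(1,2), c(2,4))

tic1(testmat, of_interest)
# [1] 2 5
tic2(testmat, of_interest)
# [1] 2 5
tic3(testmat, of_interest)
# [1] 2 5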

Borrowing the benchmarking template from @SamR's excellent answer, you can see

set.seed(0)
testmat <- `dim<-`(sample(4e3, 2e8, 1), c(1e8, 2))
matchmat <- unique(`dim<-`(sample(4e3, 10, 1), c(5, 2)))

microbenchmark::microbenchmark(
  get_row_position = get_row_position(testmat, matchmat),
  tic1 = tic1(testmat, matchmat),
  tic2 = tic2(testmat, matchmat),
  tic3 = tic3(testmat, matchmat),
  check = "identical",
  unit = "relative",
  times = 10L
)

gives

Unit: relative
             expr      min       lq     mean   median       uq      max neval
 get_row_position 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
             tic1 5.713719 5.659044 5.397984 5.542336 5.078150 5.169045    10
             tic2 5.722275 5.484632 4.993905 5.141519 4.672854 4.069125    10
             tic3 2.933148 3.002827 2.879779 2.829355 2.761182 3.007946    10

Earlier solution (Simple, but INEFFICIENT)

Here is an approach with which() + asplit():

> which(asplit(testmat, 1) %in% asplit(of_interest, 1))
[1] 2 5

This might be a bit inefficient due to asplit(), but it should work well for small datasets if speed is not one of your concerns.

Here are a few base R functions with a focus on performance. They are variations on the same theme. The third iteratively reduces the search space and is fastest for large matrices (though still slower than @SamR's updated Rcpp function; see benchmarks below). I have also added the complex matching solution from @alexis_laz's comment.

rowmatch1 <- function(mat, matchmat) {
  u <- unique(`dim<-`(matchmat, NULL))
  m <- array(FALSE, rep(length(u), ncol(matchmat)))
  m[`dim<-`(match(matchmat, u), dim(matchmat))] <- TRUE
  which(m[`dim<-`(match(mat, u), dim(mat))])
}

rowmatch2 <- function(mat, matchmat) {
  u <- apply(matchmat, 2, unique, simplify = FALSE)
  m <- array(FALSE, lengths(u))
  m[mapply(\(i) match(matchmat[,i], u[[i]]), 1:ncol(matchmat))] <- TRUE
  which(m[mapply(\(i) match(mat[,i], u[[i]]), 1:ncol(matchmat))])
}

rowmatch3 <- function(mat, matchmat) {
  u <- apply(matchmat, 2, unique, simplify = FALSE)
  m <- array(FALSE, lengths(u))
  m[mapply(\(i) match(matchmat[,i], u[[i]]), 1:ncol(matchmat))] <- TRUE
  i <- which(mat[,1] %in% u[[1]])
  for (j in (1:length(u))[-1]) i <- i[mat[i, j] %in% u[[j]]]
  mat <- mat[i,]
  i[which(m[mapply(\(i) match(mat[,i], u[[i]]), 1:ncol(matchmat))])]
}

rowmatchCmplx <- function(mat, matchmat) {
  stopifnot(ncol(matchmat) == 2L)
  which(complex(real = mat[,1], imaginary = mat[,2]) %in%
          complex(real = matchmat[,1], imaginary = matchmat[,2]))
}

Testing:

rowmatch1(testmat, of_interest)
#> [1] 2 5
rowmatch2(testmat, of_interest)
#> [1] 2 5
rowmatch3(testmat, of_interest)
#> [1] 2 5
rowmatchCmplx(testmat, of_interest)
#> [1] 2 5

Benchmarking on a 10M-row matrix (including @SamR's Rcpp function):

testmat <- `dim<-`(as.numeric(sample(2e3, 2e7, 1)), c(1e7, 2))
matchmat <- unique(`dim<-`(as.numeric(sample(2e3, 10, 1)), c(5, 2)))

microbenchmark::microbenchmark(
  get_row_position = get_row_position(testmat, matchmat),
  rowmatch1 = rowmatch1(testmat, matchmat),
  rowmatch2 = rowmatch2(testmat, matchmat),
  rowmatch3 = rowmatch3(testmat, matchmat),
  rowmatchCmplx = rowmatchCmplx(testmat, matchmat),
  check = "identical",
  times = 10
)
#> Unit: milliseconds
#>              expr      min       lq      mean    median       uq      max neval   cld
#>  get_row_position  39.6565  39.7937  40.65146  40.22035  41.3174  42.6404    10 a    
#>         rowmatch1 341.5442 346.9554 360.70051 352.25465 364.7652 405.6459    10  b   
#>         rowmatch2 496.5627 515.0797 528.95796 524.12235 547.5906 561.9820    10   c  
#>         rowmatch3 207.5945 233.3698 243.04387 242.53215 247.8733 296.7575    10    d 
#>     rowmatchCmplx 426.5008 465.4813 480.95106 487.71605 496.6567 520.5847    10     e

Benchmark on a 100M-row matrix:

testmat <- `dim<-`(as.numeric(sample(4e3, 2e8, 1)), c(1e8, 2))
matchmat <- unique(`dim<-`(as.numeric(sample(4e3, 10, 1)), c(5, 2)))

microbenchmark::microbenchmark(
  get_row_position = get_row_position(testmat, matchmat),
  rowmatch1 = rowmatch1(testmat, matchmat),
  rowmatch2 = rowmatch2(testmat, matchmat),
  rowmatch3 = rowmatch3(testmat, matchmat),
  rowmatchCmplx = rowmatchCmplx(testmat, matchmat),
  check = "identical",
  times = 1
)
#> Unit: milliseconds
#>              expr       min        lq      mean    median        uq       max neval
#>  get_row_position  405.7012  405.7012  405.7012  405.7012  405.7012  405.7012     1
#>         rowmatch1 3832.2640 3832.2640 3832.2640 3832.2640 3832.2640 3832.2640     1
#>         rowmatch2 5949.3731 5949.3731 5949.3731 5949.3731 5949.3731 5949.3731     1
#>         rowmatch3 2475.2071 2475.2071 2475.2071 2475.2071 2475.2071 2475.2071     1
#>     rowmatchCmplx 5238.6490 5238.6490 5238.6490 5238.6490 5238.6490 5238.6490     1

The functions work for an arbitrary number of columns:

testmat <- `dim<-`(as.numeric(sample(1e2, 3e7, 1)), c(1e7, 3))
matchmat <- unique(`dim<-`(as.numeric(sample(1e2, 15, 1)), c(5, 3)))

microbenchmark::microbenchmark(
  rowmatch1 = rowmatch1(testmat, matchmat),
  rowmatch2 = rowmatch2(testmat, matchmat),
  rowmatch3 = rowmatch3(testmat, matchmat),
  check = "identical",
  times = 1
)
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq      max neval
#>  rowmatch1 643.4180 643.4180 643.4180 643.4180 643.4180 643.4180     1
#>  rowmatch2 786.1225 786.1225 786.1225 786.1225 786.1225 786.1225     1
#>  rowmatch3 271.2028 271.2028 271.2028 271.2028 271.2028 271.2028     1

Note that this approach works best if matchmat is relatively small. If it gets very large, the matching array (m) will blow up. In this case, it would be better to build m as a sparse array.
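
As a rough illustration of that sparse idea for the two-column case, here is a sketch using the Matrix package; rowmatch_sparse() is a name introduced here, not one of the functions above, and it has not been benchmarked:

library(Matrix)

rowmatch_sparse <- function(mat, matchmat) {
  u1 <- unique(matchmat[, 1])
  u2 <- unique(matchmat[, 2])
  # sparse logical matrix marking the (col1, col2) pairs of interest
  m <- sparseMatrix(
    i = match(matchmat[, 1], u1),
    j = match(matchmat[, 2], u2),
    x = TRUE,
    dims = c(length(u1), length(u2))
  )
  i <- match(mat[, 1], u1)          # NA where the value never occurs in matchmat
  j <- match(mat[, 2], u2)
  keep <- which(!is.na(i) & !is.na(j))
  keep[m[cbind(i[keep], j[keep])]]  # rows whose (col1, col2) pair is marked
}

rowmatch_sparse(testmat, of_interest)
# [1] 2 5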

You could paste() the values from your example (testmat and of_interest) into a single value and then do one %in% evaluation. For example:

testmat_keys <- paste(testmat[, 1], testmat[, 2], sep = "_")
of_interest_keys <- paste(of_interest[, 1], of_interest[, 2], sep = "_")

which(testmat_keys %in% of_interest_keys) #returns [1] 2 5

If %in% is not fast enough for you, consider %fin% or fmatch() from the fastmatch package as a faster alternative.

#install.packages('fastmatch')   
library(fastmatch)

matches <- which(fmatch(testmat_keys, of_interest_keys, nomatch = 0) > 0)
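
For completeness, the %fin% operator mentioned above is a drop-in replacement for %in%, so the same which() one-liner should work (a small sketch reusing the keys built earlier):

library(fastmatch)

# %fin% behaves like %in% but hashes of_interest_keys, so lookups are fast
which(testmat_keys %fin% of_interest_keys)
# [1] 2 5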

The %iin% operator from the collapse package is fast. First, convert the matrices to data frames using collapse::qDF() (faster than as.data.frame(); comparison not shown).

library(collapse)
qDF(testmat) %iin% qDF(matchmat)

# benchmark data (numeric, as expected by SamR's Rcpp function)
testmat <- `dim<-`(as.numeric(sample(4e3, 2e8, 1)), c(1e8, 2))
matchmat <- unique(`dim<-`(as.numeric(sample(4e3, 10, 1)), c(5, 2)))

microbenchmark::microbenchmark(
    get_row_position = get_row_position(testmat, matchmat),
    rowmatch3 = rowmatch3(testmat, matchmat),
    clps = qDF(testmat) %iin% qDF(matchmat),
    check = "identical",
    unit = "relative",
    times = 10L
)

# Unit: relative
#              expr      min       lq     mean   median       uq      max neval
#  get_row_position 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
#         rowmatch3 3.488175 3.520942 3.542119 3.570613 3.581646 3.483877    10
#              clps 1.284572 1.414591 1.486958 1.502061 1.562192 1.595528    10

I don't have enough reputation to moderate, but there are several related posts (since the OP asks about both matrices and data frames), though without an explicit emphasis on performance:

  • How to find row index of common rows between two matrices in R
  • Get row numbers where two matrices have equal rows
  • Find indexes of matching rows in two matrices of different size
  • Finding rows of a large matrix that match specific values
  • How do I tag rows with two variables that match rows in a second data frame?
  • Get indices of common rows from two different dataframes
  • How to find indices of specific rows in dataframe

We might treat it as interval data and use {ivs}.

(0) Set-up

testmat = rbind(c(1,1), c(1,2), c(1,4), c(2,1), c(2,4), c(3,4), c(3,10))
of_interest = rbind(c(1,2), c(2,4))

n = seq_len(nrow(testmat))

(1) Index

i = testmat[, 1] < testmat[, 2] 

since the documentation of ivs::iv() states

This means that start < end is a requirement to generate an interval vector. In particular, empty intervals with start == end are not allowed.

For further reading, you might want to start here.

(Notice that ordering rows would create a different problem!)

(2) Compare

library(ivs)     # start          end 
w = iv_overlaps(iv(testmat[i, 1], testmat[i, 2]), 
                iv(of_interest[, 1], of_interest[, 2]), 
                type = "equals")

(3) Index again

n[i == TRUE][w]
[1] 2 5