databricks - Delta Lake - multiple logical files referencing the same data file path - Stack Overflow


I have a delta table with 2 versions:

  1. Add txn: path = "a.parquet" numRecords = 10 deletionVector = null

  2. Add txn: path = "a.parquet" numRecords = 10 deletionVector = (..., cardinality = 2)

Please note both transactions point to the same physical path ("a.parquet"), without any remove transaction.

From my understanding of the Delta protocol, since these are two separate logical files residing in two different versions, this is a legal Delta table that, when queried, should return 18 rows (10 from the first add, plus 10 - 2 = 8 from the second).

Could you please confirm my understanding?

Testing on Databricks, select * and count(*) seem to be inconsistent: select * returns 18 rows, while count(*) returns 8.
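Both numbers can be reproduced with a toy model of log replay, which may help narrow down where the two query paths diverge. The sketch below is not Databricks' or Delta's implementation: `Add`, `count_rows`, and `dv_id` are hypothetical names, and `"dv-1"` is a placeholder for the deletion vector descriptor that the question elides as "(...)". It assumes each add contributes numRecords minus the deletion vector cardinality, and it compares two possible identities for a logical file: the path together with the deletion vector id, or the path alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Add:
    """Hypothetical, simplified stand-in for a Delta add action."""
    version: int
    path: str
    num_records: int
    dv_id: Optional[str] = None   # placeholder for the deletion vector's id
    dv_cardinality: int = 0       # rows masked out by the deletion vector

    @property
    def logical_rows(self) -> int:
        # Rows visible to a reader: physical rows minus deleted rows.
        return self.num_records - self.dv_cardinality


def count_rows(adds, key):
    """Replay adds in version order; a later add with the same key replaces an earlier one."""
    live = {}
    for add in sorted(adds, key=lambda a: a.version):
        live[key(add)] = add
    return sum(a.logical_rows for a in live.values())


adds = [
    Add(version=1, path="a.parquet", num_records=10),
    Add(version=2, path="a.parquet", num_records=10, dv_id="dv-1", dv_cardinality=2),
]

# Identity = (path, deletion vector id): both logical files stay live.
rows_by_path_and_dv = count_rows(adds, key=lambda a: (a.path, a.dv_id))
# Identity = path only: the second add replaces the first.
rows_by_path_only = count_rows(adds, key=lambda a: a.path)

print(rows_by_path_and_dv)  # 18
print(rows_by_path_only)    # 8
```

Under these assumptions, 18 corresponds to treating the two adds as distinct logical files and 8 to treating the second add as a replacement of the first. Whether either query path actually behaves this way would need to be confirmed against the protocol's action reconciliation rules and the engine itself.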

asked Jan 6 at 23:56 by Shani Solomon

  • Can you try using the Delta table history and metadata to inspect the deletionVector behavior explicitly? – Dileep Raj Narayan Thumula, Jan 7 at 2:34
  • Which is the most recent version, 1 or 2? – JayashankarGS, Jan 7 at 5:22

1 Answer

Usually, when you run a delete on a table with enableDeletionVectors enabled, it creates a single commit that contains both a remove action and an add action, all in the same version. Please check the JSON commit file for a remove action and compare the row counts.
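A minimal sketch of that check, assuming the commit file is newline-delimited JSON with one action per line, that `numRecords` lives inside the add action's `stats` string, and that the deletion vector descriptor carries a `cardinality` field. `summarize_commit` is a hypothetical helper, and the sample commit is made up and simplified for illustration; it is not taken from the table in the question.

```python
import json


def summarize_commit(commit_text):
    """Return one summary dict per add/remove action found in a commit file."""
    summaries = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        for kind in ("add", "remove"):
            body = action.get(kind)
            if body is None:
                continue
            # stats is itself a JSON-encoded string and may be absent.
            stats = json.loads(body["stats"]) if body.get("stats") else {}
            dv = body.get("deletionVector") or {}
            summaries.append({
                "action": kind,
                "path": body["path"],
                "numRecords": stats.get("numRecords"),
                "dvCardinality": dv.get("cardinality"),
            })
    return summaries


# Made-up commit showing what a delete with deletion vectors might look like:
# the old logical file is removed and the same path is re-added with a DV.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "DELETE"}}),
    json.dumps({"remove": {"path": "a.parquet", "dataChange": True}}),
    json.dumps({"add": {"path": "a.parquet", "dataChange": True,
                        "stats": json.dumps({"numRecords": 10}),
                        "deletionVector": {"cardinality": 2}}}),
])

summary = summarize_commit(sample_commit)
for row in summary:
    print(row)
```

For a real table, the text of each commit file under the table's `_delta_log` directory (for example `00000000000000000002.json`) could be passed to the same function. If version 2 shows an add for "a.parquet" with no matching remove, that matches the situation described in the question.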

You say that 10 records were added initially, and that the next version has an add action with 10 records and a deletion vector of cardinality 2, so you expect a total of 18 rows. However, it may be that no rows were inserted in the second version and existing rows were only updated or deleted, so check the inserted, updated, and deleted row counts for each version carefully.

Another possible reason for the different row counts is that the two queries are reading the table at different Delta versions, so please check that as well.
