databricks - Does cloudFiles.backfillInterval Reprocess Every File in Source Every Time Autoloader Runs? - Stack Overflow

I'm struggling to understand how to control the backfill process baked into Autoloader: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#trigger-regular-backfills-using-cloudfilesbackfillinterval

If I set cloudFiles.backfillInterval to '1 day', the autoloader stream will scan every file in source to check if anything has been missed.

Then 1 day later, it will scan every file in source again...

As the number of files in source grows over time, surely this process is going to take longer and longer...

I experimented with cloudFiles.maxFileAge, assuming if I set it to something like '1 year' the backfill process would only re-scan files less than a year old, but alas that does not seem to be the case.

Am I missing something? Is there another way to control the backfill process, or is the way it works out of the box, scanning every file in source, just what I'll have to account for?

asked Jan 2 at 16:07 by Andy McWilliams

2 Answers


Based on answers provided by Databricks, cloudFiles.backfillInterval triggers a check of all files at whatever interval you set. So it is indeed supposed to scan every file in the source, just at that interval.

References: https://community.databricks.com/t5/data-engineering/what-does-autoloader-s-cloudfiles-backfillinterval-do/td-p/7709

https://community.databricks.com/t5/data-engineering/databricks-auto-loader-cloudfiles-backfillinterval/td-p/37915

Corrected answer based on inputs from Databricks support.


So if I want to configure my job such that once a day it looks at the last 24 hours and processes any files that were missed, would this configuration help?

df = spark.readStream.format("cloudFiles") \
           .options(**autoloader_config) \
           .option("cloudFiles.backfillInterval", "1 day") \
           .load("s3a://bucket/path/")

The primary confusion is: if I set it to "1 day", will autoloader scan all files in "s3a://bucket/path/" to look for missed files, or only the files newer than "1 day"? The concern is that if autoloader scans ALL files under the input path, then over time it'll be scanning millions of files, which doesn't make sense.


Then 1 day later, it will scan every file in source again...

Yes, it will trigger a full directory listing every "1 day". So yes, as the number of files grows over time, this job might get slower.

Customers typically use a lifecycle policy on the bucket, such as "files older than 30 days can be deleted or archived to another path".
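
For example, a rough boto3 sketch of such a rule (the bucket name, prefix, and 30-day window are placeholders you would adjust to match your Auto Loader input path):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-autoloader-input",
                "Filter": {"Prefix": "path/"},   # same prefix Auto Loader reads from
                "Status": "Enabled",
                "Expiration": {"Days": 30},      # delete objects older than 30 days
            }
        ]
    },
)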

We have a feature in private preview that will do this for you now. It's called CleanSource.
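
I haven't confirmed the exact API, but assuming the preview exposes it as Auto Loader options under a cloudFiles.cleanSource prefix, usage would look roughly like this (the option names and values below are assumptions and may change before release):

# Assumed preview options: have Auto Loader archive processed files out of the
# input path once they are older than the retention window.
df = (spark.readStream.format("cloudFiles")
        .options(**autoloader_config)
        .option("cloudFiles.cleanSource", "MOVE")                               # or "DELETE"
        .option("cloudFiles.cleanSource.moveDestination", "s3a://bucket/archive/")
        .option("cloudFiles.cleanSource.retentionDuration", "30 days")
        .load("s3a://bucket/path/"))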

To summarize, the advice is to use:

  1. File Notification mode whenever possible (over Directory Listing) https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/file-detection-modes.html#file-notification-mode
  2. Then set cloudFiles.backfillInterval as in the code you showed, once a day/week (this part will be a directory listing). That is what the note in the docs refers to: it ensures no files are missed if the file notification service ever drops an event. See the sketch after this list. Since the number of files this step has to list can grow over time, you can then:
  3. Use a bucket lifecycle policy to remove older files, or use the CleanSource preview feature (described in the private preview documentation; it won't work until you're enrolled in the preview).
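
Putting points 1 and 2 together, a minimal sketch (assuming AWS, the same example bucket/path as above, and that autoloader_config carries the rest of your options):

df = (spark.readStream.format("cloudFiles")
        .options(**autoloader_config)
        .option("cloudFiles.useNotifications", "true")    # file notification mode
        .option("cloudFiles.backfillInterval", "1 week")  # periodic directory-listing backfill
        .load("s3a://bucket/path/"))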