I'm struggling to understand how to control the backfill process baked into Auto Loader: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#trigger-regular-backfills-using-cloudfilesbackfillinterval
If I set cloudFiles.backfillInterval to '1 day', the Auto Loader stream will scan every file in the source to check whether anything has been missed.
Then 1 day later, it will scan every file in the source again...
As the number of files in the source grows over time, surely this process is going to take longer and longer...
I experimented with cloudFiles.maxFileAge, assuming that if I set it to something like '1 year' the backfill process would only re-scan files less than a year old, but that does not seem to be the case.
Am I missing something? Is there another way to control the backfill process, or is scanning every file in the source just how it works out of the box, and something I'll have to account for?
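For reference, this is roughly how the stream is configured; the file format, schema location, and paths below are placeholders rather than my real values:

# Sketch of the Auto Loader stream described above.
# Format, schema location, and paths are placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                            # placeholder format
    .option("cloudFiles.schemaLocation", "s3a://bucket/_schemas/")  # placeholder
    .option("cloudFiles.backfillInterval", "1 day")                 # periodic backfill listing
    .option("cloudFiles.maxFileAge", "1 year")                      # what I hoped would bound the backfill scan
    .load("s3a://bucket/path/")                                     # placeholder input path
)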
Based on answers provided by Databricks, backfillInterval triggers a check of all files at the interval you set. In other words, it is supposed to check all files, just at the specified interval.
References: https://community.databricks.com/t5/data-engineering/what-does-autoloader-s-cloudfiles-backfillinterval-do/td-p/7709
https://community.databricks.com/t5/data-engineering/databricks-auto-loader-cloudfiles-backfillinterval/td-p/37915
Corrected answer based on inputs from Databricks support.
So if I want to configure my job such that once a day it looks at the "last 24 hours" and processes any files that were missed, would this configuration help?
df = spark.readStream.format("cloudFiles") \
    .options(**autoloader_config) \
    .option("cloudFiles.backfillInterval", "1 day") \
    .load("s3a://bucket/path/")
My primary confusion is: if I set it to "1 day", will Auto Loader scan all files in "s3a://bucket/path/" to look for missed files, or only the files newer than "1 day"? The concern is that if Auto Loader scans ALL files under the input path, then over time it will be scanning millions of files, which doesn't make sense.
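For context, I intend to run this once a day as a scheduled job; here is a sketch of the write side, assuming an availableNow trigger (the checkpoint location and target table are placeholders):

# Daily scheduled run: drain whatever files are available, then stop.
# Checkpoint location and target table are placeholders.
(
    df.writeStream
    .option("checkpointLocation", "s3a://bucket/_checkpoints/my_stream/")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.my_table")
)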
"Then 1 day later, it will scan every file in source again..."
Yes, it will trigger a full directory listing every "1 day". So as the number of files grows, this backfill listing may well get slower over time.
Customers typically use a lifecycle policy on the bucket, such as "files older than 30 days can be deleted or archived to another path".
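For example, here is a minimal sketch of such a rule applied with boto3 (the bucket name, prefix, and rule ID are placeholders; in practice this is usually set up through the AWS console or infrastructure-as-code):

import boto3

# Expire objects under the Auto Loader input prefix once they are 30 days old.
# Bucket name, prefix, and rule ID are placeholders.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-autoloader-input",
                "Filter": {"Prefix": "path/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)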
We have a feature in private preview that will do this for you now. It's called CleanSource.
To summarize, the advice is to use a lifecycle policy on the bucket (and, once it is available, the CleanSource feature) to keep the number of files under the input path bounded.