I have a small confusion about the Delta Lake transaction log. The documentation mentions that the default retention policy is 30 days and that it can be modified with the property delta.logRetentionDuration = <interval-string>.
But I don't understand when the actual log files are deleted from the _delta_log folder. Does this happen when we run some operation, maybe VACUUM? However, it is mentioned that the VACUUM operation only deletes data files, not logs. So will anything delete logs older than the specified log retention duration?
Reference: https://docs.databricks.com/delta/delta-batch.html#data-retention
delta-io/delta PROTOCOL.md:
By default, the reference implementation creates a checkpoint every 10 commits.
There is an async process that runs on every 10th commit to the _delta_log folder. It creates a checkpoint file and cleans up the .crc and .json files that are older than delta.logRetentionDuration.
Checkpoints.scala has checkpoint > checkpointAndCleanupDeltaLog > doLogCleanup. MetadataCleanup.scala has doLogCleanup > cleanUpExpiredLogs.
The value of the option is an interval literal. There is no way to specify a literal "infinite", and months and years are not allowed for this particular option (for a reason). However, nothing stops you from saying interval 1000000000 weeks; 19 million years is effectively infinite.
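As a concrete example, here is a minimal sketch of changing it on an existing table (assuming a Spark session with Delta Lake; the table path and the 90-day value are only illustrations):

# Hypothetical table path and retention value; months/years are not accepted,
# so express the interval in days or weeks.
spark.sql("""
    ALTER TABLE delta.`/mnt/datalake/events`
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 90 days')
""")
# The cleanup itself only runs as part of checkpointing (roughly every 10th
# commit), not immediately after the property is set.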
Related
I have a directory in S3 containing millions of small files. They are small (<10 MB) and gzip-compressed, which I know is inefficient for Spark. I am running a simple batch job to convert these files to Parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in the whole directory, there isn't as much delay at the beginning while the driver indexes the files (it looks like around 20 minutes before the batch starts). In the UI for the single directory, there is one job after this 20 minutes which looks like the conversion itself.
However, with individual filenames, the indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files, there are two jobs: (1) the first lists leaf files for the 8 million paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory,
so when given a list of paths, it has to do a LIST call on each,
which for S3 means 8M LIST calls against the S3 servers,
which are rate limited to about 3K/second, ignoring details like thread count on the client, HTTP connections, etc.
With LIST billed at $0.005 per 1,000 calls, 8M requests comes to $40.
Oh, and as the LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call per file, doubling execution time and adding a few more dollars to the query cost.
In contrast,
listing a directory with 8M entries kicks off a single LIST request for the first 1,000 entries,
and 7,999 follow-ups.
The s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread to fetch, one to process. That will cost you about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
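To make the difference concrete, here is a minimal PySpark sketch of the two patterns (the output bucket name and file-name pattern are placeholders; input_bucket_name comes from the question above):

# Preferred: pass the directory. The listing is paginated, roughly 8,000 LIST
# calls for 8M objects at 1,000 keys per page.
df = spark.read.csv("s3://input_bucket_name/data/")

# Avoid: pass millions of individual paths. Each path gets its own LIST
# (plus a HEAD fallback), i.e. millions of S3 requests before any work starts.
# paths = [f"s3://input_bucket_name/data/part-{i}.csv.gz" for i in range(8_000_000)]
# df = spark.read.csv(paths)

df.write.mode("overwrite").parquet("s3://output_bucket_name/data-parquet/")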
Given two containers:
Source: An Azure StorageV2 account with two containers named A and B, containing blob files stored flat in the root directory of each container.
Destination: An Azure Data Lake Storage Gen2 account (for simplification purposes, consider it another storage account with a single destination container).
Objective: I am trying to copy/ingest all files within the currently active source container at the top of the month. For the remainder of that month, any newly added or overwritten files inside the active source container need to be ingested as well.
For each month, there will only be one active container that we care about. So January would use Container A, Feb would use Container B, March would use Container A, etc. Using Azure Data Factory, I’ve already figured out how to accomplish this logic of swapping containers by using a dynamic expression in the file path.
@if(equals(mod(int(formatDateTime(utcnow(),'%M')), 2), 0), 'containerB', 'containerA')
What I've tried so far: I set up a Copy pipeline using a Tumbling Window approach, where a trigger runs daily and checks for new/changed files based on LastModifiedDate, as described here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-lastmodified-copy-data-tool. However, I ran into a conundrum: the files to be ingested at the top of the month will, by nature, have a LastModifiedDate that falls before the trigger's start window, because the container is prepared ahead of time in the days leading up to the turn of the month, right before the containers are swapped. Because the LastModifiedDate is earlier than the trigger's start window, those existing files are never copied on the 1st of the month; only files added/changed after the trigger start date are. If I manually fire the trigger with a hardcoded earlier start date, then any files added to the container mid-month get ingested for the remainder of the month as expected.
So how do I solve that base case for files modified before the start date? If this can be solved, then everything can happen in one pipeline and one trigger. Otherwise, I will have to figure out another approach.
And in general, I am open to ideas as to what the best approach is here. The files will be ~2 GB, around 20,000 in number.
You can do this by scheduling your trigger at the end of each day and copying all the new/updated files for that day based on the last modified date, as shown below.
This assumes that no files are uploaded to the second container while the first container is active.
Please follow the below steps:
Go to Data Factory and drag a Copy activity into your pipeline.
Create the source dataset by creating the linked service. Supply your container condition by clicking Add dynamic content in the source dataset.
@if(equals(mod(int(formatDateTime(utcnow(),'%M')), 2), 0), 'containerb', 'containera')
Then select Wildcard file path as the File path type and give * as the file path to copy multiple files.
Here I am copying files that are new or updated within the last 24 hours. Go to Filter by last modified and give @adddays(utcNow(),-1) as the start time and @utcNow() as the end time.
As we are scheduling this with a trigger at the end of each day, it will pick up files that are new or updated within the last 24 hours.
Give the container of the other storage account as the sink dataset.
Now, click Add trigger and create a Tumbling Window trigger.
You can set the trigger's start date to an end-of-day time that suits your pipeline execution.
Please make sure you publish the pipeline and trigger before execution.
If your second container also has new/modified files while the first container is active, then you can try something like this for the start time of the last-modified filter.
@if(equals(int(formatDateTime(utcNow(),'dd')), 1), adddays(utcNow(),-31), adddays(utcNow(),-1))
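To sanity-check that expression, here is the same lookback logic sketched in plain Python (illustration only, not ADF expression syntax):

from datetime import datetime, timedelta, timezone

def last_modified_start(now: datetime) -> datetime:
    # On the 1st of the month, look back 31 days so files staged before the
    # container swap still fall inside the window; otherwise only the last 24 hours.
    lookback = timedelta(days=31) if now.day == 1 else timedelta(days=1)
    return now - lookback

print(last_modified_start(datetime(2023, 2, 1, tzinfo=timezone.utc)))   # 2023-01-01
print(last_modified_start(datetime(2023, 2, 15, tzinfo=timezone.utc)))  # 2023-02-14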
I see in the Data tab of Databricks that the number of files used by the Delta table is 20,000 (size: 1.6 TB).
But the actual file count on the Azure Blob Storage location where Delta stores the files is 13.5 million (size: 31 TB).
The following checks were done:
VACUUM runs every day with the default 7-day retention interval (it takes approx. 4 hours each day).
The transaction logs cover the last 30 days.
Questions:
What are these extra files that exist beyond those used by the Delta table?
We would like to delete these extra files and free up the storage space. How can we isolate the files that are used by the Delta table? Is there a command to list them?
Note: I am using Azure Databricks and am currently trying out the VACUUM DRY RUN command to see if it helps (will update soon).
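For reference, this is the dry run I am trying (a minimal sketch; the table path is a placeholder). It only lists the files VACUUM would delete, without removing anything:

# Placeholder path; DRY RUN reports candidate files instead of deleting them.
spark.sql("VACUUM delta.`/mnt/datalake/my_table` DRY RUN").show(truncate=False)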
Thanks in advance.
I noticed that I have only 2 checkpoint files in a Delta Lake folder. Every 10 commits, a new checkpoint is created and the oldest one is removed.
For instance, this morning I had 2 checkpoints, 340 and 350, and was able to time travel from 340 to 359.
Now, after a "write" action, I have 2 checkpoints: 350 and 360. I'm now able to time travel from 350 to 360.
What can remove the old checkpoints? How can I prevent that?
I'm using Azure Databricks 7.3 LTS ML.
The ability to perform time travel isn't directly related to checkpoints. A checkpoint is just an optimization that allows quick access to the metadata as a Parquet file without needing to scan the individual transaction log files. This blog post describes the details of the transaction log in more depth.
The commit history is retained by default for 30 days and can be customized as described in the documentation. Please note that VACUUM may remove deleted data files that are still referenced in the commit history, because deleted data is retained for only 7 days by default. So it's better to check the corresponding settings.
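For example, a minimal sketch of checking and adjusting those settings on a path-based table (the path placeholder and the 60-day values are only illustrations):

# Inspect the current table details, including any retention property overrides.
spark.sql("DESCRIBE DETAIL delta.`path`").select("properties").show(truncate=False)

# Keep commit history and deleted data files long enough for the time travel you need.
spark.sql("""
    ALTER TABLE delta.`path`
    SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 60 days',
        'delta.deletedFileRetentionDuration' = 'interval 60 days'
    )
""")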
If you perform the following test, you can see that you have history for more than 10 versions:
df = spark.range(10)
for i in range(20):
    df.write.mode("append").format("delta").save("/tmp/dtest")
    # uncomment if you want to see the content of the log after each operation
    # print(dbutils.fs.ls("/tmp/dtest/_delta_log/"))
Then check the files in the log; you should see both checkpoints and files for individual transactions:
%fs ls /tmp/dtest/_delta_log/
also check the history - you should have at least 20 versions:
%sql
describe history delta.`/tmp/dtest/`
And you should be able to go back to an early version:
%sql
SELECT * FROM delta.`/tmp/dtest/` VERSION AS OF 1
If you want to keep your checkpoints for X days, you can set delta.checkpointRetentionDuration to X days like this:
spark.sql("""
    ALTER TABLE delta.`path`
    SET TBLPROPERTIES (
        'delta.checkpointRetentionDuration' = 'X days'
    )
""")
I want to set the correlation strategy on an aggregator so that it uses a date from the incoming file (message) name to correlate files, so that all files with today's date belong to the same group. Since I might have multiple days' worth of data, it's possible that I have aggregated two days of files. I want to base the release strategy on a "done" file (message) whose filename also includes the date, so essentially each day will have a bunch of data files and one done file. Ingesting the done file should release that day's files from the aggregator but still keep the other day's files until the done file for that day is ingested.
So in this scenario correlation is obviously simple, but what I am not sure about is how to release not all but only some specific messages from the group based on the correlation key. The documentation talks about the MessageGroupStoreReaper, but that goes into message store details, and I want to do all of this in memory.
Let me elaborate with an example. I have these files in a directory which I'm polling with a file inbound channel adapter:
file-1-2014.04.27.dat
file-2-2014.04.27.dat
file-3-2014.04.27.dat
done-2014.04.27.dat
file-1-2014.04.28.dat
file-2-2014.04.28.dat
done-2014.04.28.dat
As these files are being polled in, I have an aggregator in the flow where all incoming files are aggregated. To correlate them, I was thinking I could extract the date and put it in the correlation_id header, so that the first 3 files are considered to belong to one group and the next 2 files belong to a second group. Now, once I consume the done-2014.04.27.dat file, I want to release the first 3 files to be further processed in the flow, but hold on to
file-1-2014.04.28.dat
file-2-2014.04.28.dat
until I receive the
done-2014.04.28.dat
and then release these 2 files.
Any help would be appreciated.
Thanks
I am not sure what you mean when you say "correlation is simple" but then go on to say you only want to release part of the group. If they have different dates then they will be in different groups, so there's no need to release part of a group, just release the whole group by running the reaper just after midnight (or any time the next day). It's not at all clear why you need a "done" message.
By default, the aggregator uses an in-memory message store (SimpleMessageStore).
EDIT:
Just put the done file in the same group and have your release strategy detect the presence of the done file. You could use an expression, but if the group can be large, it would be more efficient to implement ReleaseStrategy and iterate over MessageGroup.getMessages() looking for the done file.
The next step depends on what's downstream of the aggregator. If you use a splitter to split them back to separate files, you can simply add a filter to drop the done file. If you deal with the collection of files directly, either ignore the done file, or add a transformer to remove it from the collection.
With respect to the reaper; assuming files arrive in real time, I was simply suggesting that if you, say, run the reaper once a day (say at 01:00) with a group timeout of, say 30 minutes, then the reaper will release yesterday's files (without the need for a done file).
EDIT:
See my comment on your "answer" below - you have 2 subscribers on filesLogger.