Can Databricks Autoloader Keep Track of File Upload Time

Is it possible to keep track of S3 file upload time with Databricks Autoloader? It looks like Autoloader adds columns for the file name and processing time, but in our use case we need to know the order in which the files were uploaded to S3.

When you load the data, you can query the _metadata column (or a specific attribute inside it); it includes a file_modification_time field that represents the time of the last file modification (which should match the upload time).
Just do:
df.select("*", "_metadata.file_modification_time")
to get access to that field. See the documentation for details.
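For a more complete picture, here is a minimal Auto Loader sketch in PySpark; the JSON format and the bucket/schema paths are assumptions for illustration only, not details from the original question:

# Minimal sketch - the source format and all paths below are hypothetical.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")                                # adjust to your file format
      .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas")     # hypothetical schema location
      .load("s3://my-bucket/landing/"))                                   # hypothetical source path

# Surface the hidden _metadata column so downstream jobs can order rows by upload time.
df_with_time = df.select("*",
                         "_metadata.file_name",
                         "_metadata.file_modification_time")

Downstream you can then sort or window by file_modification_time to recover the order in which the files were uploaded to S3.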

Related

Copying data using Copy Data into individual files for blob storage

I am entirely new to Azure, so if this is easy please just tell me to RTFM, but I'm not used to the terminology yet so I'm struggling.
I've created a data factory and pipeline to copy data, using a simple query, from my source data. The target data is a .txt file in my blob storage container. This part is all working quite well.
Now, what I'm attempting to do is to store each row that's returned from my query in an individual file in blob storage. This is where I'm getting stuck. It seems like something that should be pretty easy, but as I said I'm new to Azure and so far I'm not sure where to look.
You can type 1 in the Max rows per file setting of the sink and leave the file name unset in the sink dataset. If you need to, you can specify a file name prefix in the File name prefix setting.
Screenshots (not reproduced here) showed the sink dataset, the sink settings in the Copy Data activity, and the resulting files.

Azure Data Factory: iterate over millions of files

Previously I had a problem with merging several JSON files into a single file, which I was able to resolve with the answer to this question.
At first, I tried with just some files by using wildcards in the file name in the connection section of the input dataset. But when I remove the file name, in theory all of the files in all folders should be loaded recursively, since I checked the Copy recursively option in the source section of the copy activity.
The problem is that when I manually trigger the pipeline after removing the file name from the input dataset, only some of the files get loaded: the task ends successfully but loads only around 400+ files, while each folder has 1M+ files. I want to create big CSV files by merging all the small JSON files in the source (I was already able to create a CSV file by mapping the schemas in the copy activity).
It is probably stopping due to a timeout or an out-of-memory exception.
One solution is to loop over the contents of the directory using
Directory.EnumerateFiles(searchDir)
This way you can process all the files without holding the full list of files in memory at the same time.
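Note that Directory.EnumerateFiles is a .NET API. As a rough sketch of the same lazy-enumeration idea on the JVM (the directory path and the per-file processing below are hypothetical), java.nio.file.Files.walk also streams paths lazily instead of materializing the whole listing:

// Java analogue of the Directory.EnumerateFiles idea: stream paths lazily.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LazyListing {
    public static void main(String[] args) throws IOException {
        Path searchDir = Paths.get("/data/json");    // hypothetical directory
        try (Stream<Path> paths = Files.walk(searchDir)) {
            paths.filter(Files::isRegularFile)
                 .forEach(LazyListing::process);     // handle one file at a time
        }
    }

    private static void process(Path p) {
        // Placeholder for the per-file work (e.g., appending to a merged CSV).
        System.out.println(p);
    }
}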

Azure Data Factory - Recording file name when reading all files in folder from Azure Blob Storage

I have a set of CSV files stored in Azure Blob Storage. I am reading the files into a database table using the Copy Data task. The Source is set to the folder where the files reside, so it grabs each file and loads it into the database. The issue is that I can't seem to map the file name in order to read it into a column. I'm sure there are more complicated ways to do it, for instance reading the metadata first and then reading the files in a loop, but surely the file metadata should be available to use while traversing through the files?
Thanks
This is not possible in a regular Copy activity. Mapping Data Flows has this capability; it's still in preview, but maybe it can help you out. If you check the documentation, you will find an option to specify a column to store the file name.

Export SharePoint list to .csv and upload to Azure Data Lake using Flow

I am trying to use Microsoft Flow to export a SharePoint list to Azure Data Lake.
I want it so that anytime a particular online list is changed, its entire contents are loaded into a file in Data Lake. If the file already exists, I want to overwrite it. Can someone please explain how I can go about doing this? I have tried multiple ways, but they are not getting the job done.
Thanks
I was able to get the items from the SharePoint list working to near perfection. I will post the Flow here in case anyone needs it in the future.
So what I did is that every 5 minutes I "create" a file in Azure Data Lake, which overwrites the file if it already exists. The content of the file cannot be blank, so I added a newline as the initial content. Then I use Get Items to retrieve all the items in the SharePoint list. From there, using an Apply to each loop, I append the content of the current row of the SharePoint list to the Data Lake file (fields separated by | and ending with a newline after all the content is added). This works to near perfection, with the only caveat being the newline at the beginning of the file, which I eliminate using Power Query.
This is exactly what I needed. If anybody sees a way to make this better, please post it so that we can get this to perfection.

Get creation date of an Excel file via PackageProperties class

I generate a heavy Excel file using Apache POI, which takes up to 10 minutes. To minimize memory usage and time, I generate the file only when I detect a change in my record; if not, I just fetch the most recent existing Excel file for download. The problem is: how can I get the creation date of that old Excel file? I am thinking of using Apache POI's PackageProperties class, although I don't know how to achieve that.
1. Get the file via FileInputStream.
2. Read that file via new XSSFWorkbook(FileInputStream inputStream).
3. (I don't know the next step here to connect #2 and #4.)
4. Get the PackageProperties attribute of that workbook.
5. Use PackageProperties.getCreatedProperty() to get the creation date.
If a change is detected after the creation date of the file, we start generating the file, overwrite the old version with the new one, and then proceed to the download. If no changes are detected, we proceed to download the previous file.
Now, how can I get that PackageProperties attribute of the workbook?
I have checked this other entry with a similar case (but using a CSV rather than an Excel file), but it seems the last-modified property of the file is not always identical to the creation date.
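For reference, here is a minimal Java sketch of one way to connect #2 and #4. The file name below is hypothetical, and you do not even need to build the full XSSFWorkbook just to read the metadata, since the OPC package can be opened directly. Depending on the POI version, getCreatedProperty() returns either an Optional<Date> (POI 4.x and later) or POI's own Nullable<Date>, so treat the exact unwrapping as an assumption:

import java.io.File;
import java.util.Date;
import java.util.Optional;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.openxml4j.opc.PackageProperties;

public class ExcelCreationDate {
    public static void main(String[] args) throws Exception {
        // Open the package read-only; this exposes the core properties
        // without loading the whole workbook into memory.
        OPCPackage pkg = OPCPackage.open(new File("report.xlsx"), PackageAccess.READ);
        try {
            PackageProperties props = pkg.getPackageProperties();
            Optional<Date> created = props.getCreatedProperty();   // Nullable<Date> on older POI versions
            System.out.println("Created: " + created.orElse(null));
        } finally {
            pkg.revert();   // discard rather than save - the package was opened read-only
        }
    }
}

If you already have the workbook open from step #2, workbook.getProperties().getCoreProperties().getCreated() should give you the same date without going through PackageProperties directly.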
