Read files in the order of their timestamp from Azure

I need to read files from a folder on an Azure VM in the same order in which they were created and load the data into an Azure SQL DB.
Using Logic Apps/ADF, when 1000 files are loaded at once they get picked up in random order. How can I read them sequentially, in timestamp order, and process them?
Note: every file has a timestamp in its name (Filename_Timestamp.xml).
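For illustration, here is a minimal C# sketch of the ordering logic, e.g. inside an Azure Function or custom activity that feeds files to the pipeline one by one; the folder path and the timestamp format (yyyyMMddHHmmss) are assumptions, not taken from the question.

// Minimal sketch: list files named Filename_Timestamp.xml and order them by
// the timestamp embedded in the name before processing. The folder path and
// the timestamp format (yyyyMMddHHmmss) are assumptions.
using System;
using System.Globalization;
using System.IO;
using System.Linq;

class OrderedFileReader
{
    static DateTime TimestampFromName(string path)
    {
        // "Report_20240131093000.xml" -> 2024-01-31 09:30:00
        var stamp = Path.GetFileNameWithoutExtension(path).Split('_').Last();
        return DateTime.ParseExact(stamp, "yyyyMMddHHmmss", CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        var files = Directory.EnumerateFiles(@"C:\data\incoming", "*.xml")
                             .OrderBy(TimestampFromName);   // oldest first

        foreach (var file in files)
        {
            // Process/load each file sequentially, e.g. bulk insert into Azure SQL DB.
            Console.WriteLine(file);
        }
    }
}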

Related

Iterate through files in Data Factory

I have a Data Lake Gen1 with the folder structure /Test/{currentyear}/{Files}.
{Files} example format:
2020-07-29.csv
2020-07-30.csv
2020-07-31.csv
Every day one new file gets added to the folder.
I need to create an ADF pipeline to load the files into SQL Server.
Conditions:
When my ADF runs for the first time, it needs to iterate over all files and load them into SQL Server.
When ADF executes from the second time onward (once daily), it needs to pick up only today's file and load it into SQL Server.
Can anyone tell me how to design the ADF pipeline with the above conditions?
This should be designed in two parts.
When my ADF runs for the first time, it needs to iterate over all files and load them into SQL Server.
You should create a temporary pipeline to achieve this. (I think you know how to do this, so I won't cover that part.)
When ADF executes from the second time onward (once daily), it needs to pick up only today's file and load it into SQL Server.
This requires another pipeline that runs continuously.
Two points to achieve this:
First, trigger this pipeline with an event trigger (when a file is uploaded, the pipeline fires).
Second, filter the files by a specific format. For your requirement, the expression should be @{formatDateTime(utcnow(),'yyyy-MM-dd')}; on 2020-07-30, for example, it resolves to 2020-07-30, so only 2020-07-30.csv is picked up.
This works on my side; please give it a try on yours.

Copy n number of files from Azure Datalake to SFTP location using Logic Apps

I have a scenario:
I have some files in Azure Data Lake. A job pushes these files to an ADLS location.
These files need to be uploaded to an SFTP location as input data.
An application consumes the input file, performs some operations, and later saves the processed data as an output file in an output directory at the same SFTP location.
With the help of Logic Apps, I want to upload this output file to an ADLS location.
The application that consumes the input file has a limitation: it cannot consume more than 10,000 records at a time.
That is, if my source file has more than 10,000 records, I have to split it into multiple files (depending on the number of rows) and then replicate these files to the SFTP location.
The replication has to be performed so that only after one job completes is the next file copied to the SFTP location.
To upload the files I want to use Azure Logic Apps.
As far as I understand, Azure Logic Apps does not provide a trigger for 'a file is added or modified' at an ADLS location, but it has a similar feature for Blob Storage, so I decided to use a blob container.
Once my raw files are uploaded to the ADLS location, I will upload a file to the blob location; my Logic App keeps polling this specific directory, so whenever a new file arrives it immediately triggers the file copy job.
Now the problem:
My ADLS directory may have one or more files.
How do I create a copy activity in Logic Apps to replicate these files to the SFTP location?
How do I identify how many CSV files are available in the ADLS directory, so that my Logic App can decide the number of iterations needed to copy the files?
Thanks in advance.
You can use the List File action on the ADLS connector.
The output of this action is documented here: https://learn.microsoft.com/en-us/connectors/azuredatalake/#folderresponse
It is basically an array of FileStatus objects. You can loop through this array, extract information from each status object, and use it to copy the data wherever you want.
FileStatus has this info:
Name              | Path             | Type    | Description
File name         | pathSuffix       | string  | File or folder name.
Type              | type             | string  | Type of item (directory or file).
Block Size        | blockSize        | integer | Block size of folder or file.
Access Time       | accessTime       | integer | Unix (Epoch) time when the item was last accessed.
Modification Time | modificationTime | integer | Unix (Epoch) time when the item was last modified.
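For illustration, a minimal C# sketch of what "loop through the FileStatus array" amounts to: parse the List File action's FolderResponse, keep only the .csv entries, and iterate them to drive one copy per file. The JSON shape is assumed to follow the WebHDFS-style response in the linked docs, and the sample file names are made up.

// Minimal sketch: filter the FolderResponse for CSV files and iterate them.
// The JSON shape and file names below are assumptions for illustration.
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class FolderResponseExample
{
    static void Main()
    {
        // Example payload in the shape returned by the ADLS List File action (assumed).
        var json = @"{ ""fileStatuses"": { ""fileStatus"": [
            { ""pathSuffix"": ""input_1.csv"", ""type"": ""FILE"", ""modificationTime"": 1596067200000 },
            { ""pathSuffix"": ""notes.txt"",  ""type"": ""FILE"", ""modificationTime"": 1596067300000 }
        ] } }";

        var csvFiles = JObject.Parse(json)["fileStatuses"]["fileStatus"]
            .Where(f => (string)f["type"] == "FILE" &&
                        ((string)f["pathSuffix"]).EndsWith(".csv", StringComparison.OrdinalIgnoreCase))
            .ToList();

        Console.WriteLine($"CSV files to copy: {csvFiles.Count}");   // drives the number of iterations
        foreach (var f in csvFiles)
            Console.WriteLine((string)f["pathSuffix"]);              // copy each file to SFTP here
    }
}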

Azure DF: Get metadata of millions of files located in a VM and call a stored procedure to update file details in a DB

I have created a Get Metadata activity in an Azure pipeline to fetch the details of files located in a VM, and I am iterating over the output of the Get Metadata activity using a ForEach loop.
In the ForEach loop, I am calling a stored procedure to update the file details in the database.
If I have 2K files in the VM, the stored procedure is called 2K times, which I feel is not good practice.
Is there any way to update all the file details in one shot?
As far as I know, you could use the Get Metadata activity to get the output and then pass it into an Azure Function activity.
Inside the Azure Function, you could loop over the output and use an SDK (such as a Java SQL library) to update the tables in one batch.
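For illustration, a minimal C# sketch of the batching idea (the answer mentions a Java SQL library, but any SDK works): an HTTP-triggered Azure Function that receives the Get Metadata child items and writes them to SQL in a single SqlBulkCopy call. The function name, target table dbo.FileDetails, column names, and the SqlConnectionString setting are assumptions.

// Minimal sketch: receive the Get Metadata output (a JSON array of child items)
// and write all rows to SQL in one batch via SqlBulkCopy. Names such as
// "FileDetails" and "SqlConnectionString" are hypothetical.
using System.Data;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Data.SqlClient;
using Newtonsoft.Json.Linq;

public static class UpdateFileDetails
{
    [FunctionName("UpdateFileDetails")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        // Expected body: [{ "name": "file1.xml", "type": "File" }, ...]
        var items = JArray.Parse(await new StreamReader(req.Body).ReadToEndAsync());

        var table = new DataTable();
        table.Columns.Add("FileName", typeof(string));
        table.Columns.Add("FileType", typeof(string));
        foreach (var item in items)
            table.Rows.Add((string)item["name"], (string)item["type"]);

        var connStr = System.Environment.GetEnvironmentVariable("SqlConnectionString");
        using (var conn = new SqlConnection(connStr))
        {
            await conn.OpenAsync();
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.FileDetails" })
            {
                await bulk.WriteToServerAsync(table);   // one round trip for all rows
            }
        }
        return new OkObjectResult($"Inserted {table.Rows.Count} rows");
    }
}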

Azure DML Data slices?

I have 40 million blobs (about 10 TB) in blob storage. I am using DML CopyDirectory to copy these into another storage account for backup purposes. It took nearly 2 weeks to complete. Now I am wondering up to which date the blobs were copied to the target directory: is it the date the job started or the date the job finished?
Does DML use anything like data slices?
Now I am wondering up to which date the blobs were copied to the target directory: is it the date the job started or the date the job finished?
As far as I know, when you start the CopyDirectory method, it just sends requests telling the Azure storage account to copy files from the other storage account. All of the copy work happens inside Azure Storage.
When the method starts copying the directory, Azure Storage first creates each target file with a size of 0.
After the job finishes, the file's size changes to the real size.
So when the job starts, the file is created in the target directory with a size of 0, and its last-modified time reflects that moment.
Azure Storage then continues copying the file content to the target directory.
When the copy of a file finishes, its last-modified time is updated.
So the DML SDK just tells the storage service to copy the files, and then keeps sending requests to Azure Storage to check each file's copy status.
Thanks. But what happens if files are added to the source directory during this copy operation? Do the new files also get copied to the target directory?
In short, yes.
DML does not fetch the whole blob list and send copy requests for all files at once.
It first gets a portion of your file name list and sends requests telling the storage service to copy those files.
The list is sorted by file name.
For example, suppose DML has already copied the files whose names start with 0.
If you then add another file whose name starts with 0 to the source folder, it will not be copied.
If you add a file whose name sorts after the position DML has scanned so far, it will be copied to the target folder.
During those 2 weeks at least a million blobs must have been added to the container, with very random names. So does DML not work well for large containers?
As far as I know, DML is designed for high-performance uploading, downloading, and copying of Azure Storage blobs and files.
When you use DML CopyDirectoryAsync to copy blob files, it first sends a request to list the folder's current files, then sends requests to copy them.
By default, each listing request returns 250 file names.
After getting a list page, it generates a marker (continuation token) pointing at the next blob names to search; it then lists the next page of file names in the folder and starts copying again.
By default, the .NET HTTP connection limit is 2, which implies that only two concurrent connections can be maintained.
This means that if you don't raise the .NET HTTP connection limit, CopyDirectoryAsync will only fetch about 500 records (2 connections x 250) before it starts copying.
After those copies complete, the operation moves on to the next files.
I suggest first raising the maximum number of HTTP connections so that more blob files are detected:
ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;
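For context, a minimal C# sketch of a DML directory copy with that setting applied; the connection strings, container, and directory names are placeholders, and the exact CopyDirectoryAsync overload depends on the Microsoft.Azure.Storage.DataMovement package version.

// Minimal sketch of a service-side directory copy with DML, assuming the
// Microsoft.Azure.Storage.DataMovement and Microsoft.Azure.Storage.Blob packages.
// Account/container/directory names are placeholders.
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.Storage.DataMovement;

class DirectoryBackup
{
    static async Task Main()
    {
        // Raise the .NET HTTP connection limit so DML can list/copy more blobs in parallel.
        ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;

        var sourceDir = CloudStorageAccount.Parse("<source-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("source-container")
            .GetDirectoryReference("data");
        var destDir = CloudStorageAccount.Parse("<dest-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("backup-container")
            .GetDirectoryReference("data");

        var options = new CopyDirectoryOptions { Recursive = true };
        var context = new DirectoryTransferContext();
        context.FileTransferred += (s, e) => Console.WriteLine($"Copied {e.Source}");

        // Service-side copy: the storage service moves the bytes, DML tracks status.
        await TransferManager.CopyDirectoryAsync(
            sourceDir, destDir, CopyMethod.ServiceSideAsyncCopy, options, context);
    }
}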
Besides that, I suggest creating multiple folders to store the files.
For example, you could create a folder that stores one week's files.
The next week, you could start a new folder.
Then you can back up the old folder's files without new files being added to that folder.
Finally, you could also write your own code to achieve this. First, get the list of the folder's files; the maximum number of results per list request is 5000.
Then send requests telling the storage service to copy each file.
Files uploaded to the folder after you take the list will not be copied to the new folder.
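For illustration, a minimal C# sketch of that do-it-yourself approach: page through the source container 5000 blobs at a time and start a server-side copy for each one. The container references and the flat-listing choice are assumptions about the setup.

// Minimal sketch: list blobs in pages of up to 5000 and start a server-side
// copy for each one. Container references are placeholders.
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

class ManualCopy
{
    static async Task CopyContainerAsync(CloudBlobContainer source, CloudBlobContainer dest)
    {
        BlobContinuationToken token = null;
        do
        {
            // One list request returns at most 5000 results plus a continuation token.
            var segment = await source.ListBlobsSegmentedAsync(
                null, true, BlobListingDetails.None, 5000, token, null, null);
            token = segment.ContinuationToken;

            foreach (var item in segment.Results)
            {
                if (item is CloudBlockBlob srcBlob)
                {
                    // StartCopyAsync asks the storage service to copy; it returns before the copy finishes.
                    var dstBlob = dest.GetBlockBlobReference(srcBlob.Name);
                    await dstBlob.StartCopyAsync(srcBlob);
                }
            }
        } while (token != null);
    }
}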

Switching production azure tables powering cloud service

I would like to know the best way to handle the following scenario.
I have an Azure cloud service that uses an Azure Storage table to look up data for incoming requests. The data in the table is generated offline periodically (once a week).
When new data is generated offline, I need to upload it into a separate table, change the table name (stored as a constant in my code) so that the service picks up data from the new table, and re-deploy the service. (Every time the data changes I change the table name and re-deploy.)
The other way would be to keep a configuration parameter for my Azure web role that specifies the name of the table holding the current production data. Then, within the service, I read the config variable for every request, get a reference to the table, and fetch data from there.
Is the second approach OK, or would it incur a performance hit because I read the config and create a table client on every request that comes to the service? (The SLA for my service is less than 2 seconds.)
To answer your question, the 2nd approach is definitely better than the 1st one. I don't think you will take a performance hit, because the config settings are cached on first read (I read that in one of the threads here), and creating a table client does not add network overhead: unless you execute some methods on the table client, the object just sits in memory. One possibility would be to read the value from config and put it in a static variable. When you change the config setting, capture the role environment changing event and update the static variable with the new value from the config file.
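For illustration, a minimal C# sketch of that caching idea: the table reference is resolved once into a static field and refreshed when the role environment changes. The setting names LookupTableName and StorageConnectionString are hypothetical.

// Minimal sketch: cache the table reference in a static field and refresh it
// on RoleEnvironment.Changed. Setting names are hypothetical.
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class LookupTableProvider
{
    private static CloudTable _table = Resolve();

    static LookupTableProvider()
    {
        // When the config is updated in the portal, refresh the cached reference
        // instead of re-reading the setting on every request.
        RoleEnvironment.Changed += (s, e) => _table = Resolve();
    }

    public static CloudTable Current => _table;

    private static CloudTable Resolve()
    {
        var tableName = RoleEnvironment.GetConfigurationSettingValue("LookupTableName");
        var account = CloudStorageAccount.Parse(
            RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
        return account.CreateCloudTableClient().GetTableReference(tableName);
    }
}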
A 3rd alternative could be to soft-code the table name in another table and have your application read the table name from there. You could update the table name as part of your upload process: first upload the data, then update this table with the name of the table where the data has been uploaded.
