Iterate through files in Data Factory - Azure

I have a Data Lake Gen1 with the folder structure /Test/{currentyear}/{Files}
{Files} example format:
2020-07-29.csv
2020-07-30.csv
2020-07-31.csv
Every day one new file gets added to the folder.
I need to create an ADF pipeline to load the files into SQL Server.
Conditions:
When my ADF runs for the first time it needs to iterate over all files and load them into SQL Server.
When ADF runs from the second time onwards (daily, once a day) it needs to pick up only today's file and load it into SQL Server.
Can anyone tell me how to design the ADF pipeline with the above conditions?

This should be designed in two parts.
When my ADF runs for the first time it needs to iterate over all files and
load them into SQL Server.
You should create a temporary pipeline to achieve this. (I think you know how to do this, so I won't talk about this part.)
When ADF runs from the second time onwards (daily, once a day) it needs to
pick up only today's file and load it into SQL Server.
This needs another pipeline that runs continuously.
Two points to achieve this:
First, trigger the pipeline with an event trigger (when a file is uploaded, the pipeline is triggered).
Second, filter the file by its specific name format:
For your requirement, the expression should be #{formatDateTime(utcnow(),'yyyy-MM-dd')}.
On my side I can do this successfully. Please give it a try on your side.
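For clarity, here is a minimal Python sketch of what that expression resolves to at run time; the path layout is taken from the question, everything else is purely illustrative:

```python
from datetime import datetime, timezone

# Today's expected file name, i.e. what the ADF expression
# #{formatDateTime(utcnow(),'yyyy-MM-dd')} resolves to when the trigger fires.
today = datetime.now(timezone.utc)
file_name = today.strftime("%Y-%m-%d") + ".csv"

# Folder layout taken from the question: /Test/{currentyear}/{Files}
source_path = f"/Test/{today.year}/{file_name}"
print(source_path)   # e.g. /Test/2020/2020-07-31.csv
```

The event-triggered pipeline then only has to copy that single file, while the one-off backfill pipeline iterates over the whole /Test/{currentyear} folder.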

Related

Read files in the order of the timestamp from Azure

I need to read files from a folder that is present on the Azure VM in the same order in which they are created and load the data to Azure SQL DB.
Using Logic Apps/ADF, whenever 1000 files are loaded at once, they get picked up in a random order. How can I read them sequentially, in the order of the timestamp, and process them?
Note: every file will have a timestamp in its name (Filename_Timestamp.xml).
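Assuming the timestamp embedded in Filename_Timestamp.xml follows a sortable pattern such as yyyyMMddHHmmss (an assumption; the question does not give the exact format), the ordering step can be sketched in Python like this:

```python
import os
import re

FOLDER = r"C:\incoming"   # placeholder path on the Azure VM

def embedded_timestamp(name: str) -> str:
    # Pull the trailing timestamp out of "Filename_Timestamp.xml";
    # the 14-digit pattern is an assumption for illustration.
    match = re.search(r"_(\d{14})\.xml$", name)
    return match.group(1) if match else ""

ordered = sorted((f for f in os.listdir(FOLDER) if f.endswith(".xml")),
                 key=embedded_timestamp)

for file_name in ordered:
    print(file_name)   # load each file into Azure SQL DB in this order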

Archiving Azure Search Service

Need a suggestion on archiving unused data from the search service and reloading it when needed (the reload will be done later).
The initial design draft looks like this:
Find the keys from the search service, based on some conditions (e.g. inactive, how old), that need to be archived.
Run an archiver job (need a suggestion here; it could be a WebJob or a Function App).
Fetch the data, insert it into blob storage, and delete it from the search service.
Ideally the job should run in a pool and be asynchronous.
There's no right/wrong answer for this question. What you need to do is perform batch queries (up to 1000 docs) and schedule the archiving of past data (e.g. run an Azure Function on a schedule that searches for docs whose createdDate is older than your cutoff).
Then persist that data somewhere (it can be Cosmos DB, or a blob in a storage account). Once you need to upload it again, I would treat it as a new insert, so it should follow your current insert process.
You can also take a look at this tool, which helps copy data from your index pretty quickly:
https://github.com/liamca/azure-search-backup-restore
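As a rough sketch of that scheduled archive step, here is what it could look like with the azure-search-documents and azure-storage-blob Python SDKs; the service, index, container, key, and the createdDate/id field names are all assumptions:

```python
import json
from datetime import datetime, timedelta, timezone

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.storage.blob import BlobServiceClient

# Placeholder endpoint, index, key, and container names.
search = SearchClient("https://<service>.search.windows.net", "my-index",
                      AzureKeyCredential("<admin-key>"))
container = BlobServiceClient.from_connection_string(
    "<storage-connection-string>").get_container_client("search-archive")

# Archive documents older than one year (the cutoff is just an example).
cutoff = datetime.now(timezone.utc) - timedelta(days=365)
results = search.search(search_text="*",
                        filter=f"createdDate lt {cutoff:%Y-%m-%dT%H:%M:%SZ}",
                        select=["id", "createdDate", "content"])

# Stay within the 1000-document batch limit mentioned above.
batch = [dict(doc) for doc in results][:1000]
if batch:
    # Persist the batch as a blob, then remove the documents from the index.
    container.upload_blob(f"archive-{cutoff:%Y%m%d}.json",
                          json.dumps(batch, default=str))
    search.delete_documents(documents=[{"id": d["id"]} for d in batch])
```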

Process multiple files in Azure Blob Storage simultaneously based on custom logic

My exact requirement is like this:
I have a bunch of files for my different customers in my blob storage "B1". Let's say 10 files for customer "C1", 10 files for customer "C2", and 10 files for customer "C3".
I want to perform some operation on each file and move them to blob storage "B2". This can take 30 seconds to 5 minutes depending on the data in the file.
Now I want to process one file for each customer simultaneously, but not more than one file for the same customer at the same time.
So one file for customer "C1", one for "C2", and one for "C3" must be processed simultaneously, so that the processing time of "C1" does not affect "C2" and "C3".
But the next file of "C1" will be processed only when the first is completed.
What is the best architecture, using Microsoft Azure functionalities, to implement this?
For example, I have implemented it like this with Azure Functions V1:
Blob-triggered Azure Function: this simply adds the file name with the customer ID to an Azure table as soon as any file is placed in the blob container. The table contains one more column, "InQueue", which is FALSE by default.
Time-triggered Azure Function: this checks the Azure table and takes the first file for each customer whose files all have InQueue = FALSE (meaning: no file in process). For those it sets InQueue = TRUE and adds their names to an Azure queue.
Queue-triggered Azure Function: this is triggered as soon as any file is in the Azure queue and does the processing on it. Once the processing is completed it deletes the entry for that file from the Azure table, so for that customer all remaining entries have InQueue = FALSE (no file in process).
So in the above architecture the time-triggered Azure Function takes care of one file per customer, but it also pushes multiple files of different customers into the queue. And since the queue-triggered Azure Function can run multiple instances at the same time, files of different customers are processed simultaneously.
Is my architecture good or bad, and how can I improve it? What other options could make my process faster, easier, or require fewer steps?
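For reference, the time-triggered step described above could look roughly like the following Python sketch. It is conceptual only (not the original Azure Functions V1 code); the table name, queue name, and the PartitionKey = customer / RowKey = file name layout are assumptions:

```python
from azure.data.tables import TableClient
from azure.storage.queue import QueueClient

CONN = "<storage-connection-string>"   # placeholder

# Assumed schema: PartitionKey = customer id, RowKey = file name, InQueue = bool.
table = TableClient.from_connection_string(CONN, table_name="FileTracking")
queue = QueueClient.from_connection_string(CONN, queue_name="files-to-process")

# Customers that already have a file in flight are skipped this round.
busy = {e["PartitionKey"] for e in table.query_entities("InQueue eq true")}

picked = {}
for entity in sorted(table.query_entities("InQueue eq false"),
                     key=lambda e: e["RowKey"]):
    customer = entity["PartitionKey"]
    if customer in busy or customer in picked:
        continue                      # at most one file per customer
    picked[customer] = entity

for customer, entity in picked.items():
    entity["InQueue"] = True
    table.update_entity(entity)                           # mark as in flight
    queue.send_message(f"{customer}/{entity['RowKey']}")  # hand off to the worker
```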
As I understand it, the main problem confusing you is that you want to execute multiple functions simultaneously. If so, I suggest trying a Logic App with parallel branches.
There is a tutorial on how to create a Logic App with parallel branches, and you can add an Azure Function as an action.
Here I used a Recurrence (time schedule) trigger, but you could use others. After each branch you can also add further actions.
Hope this helps; if you still have other questions, please let me know.
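Logic App branches are configured in the designer rather than in code, but the pattern the answer is after (customers in parallel, files of one customer strictly in sequence) can be sketched in plain Python; transform_and_move and the input dictionary below are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def transform_and_move(file_name: str) -> None:
    # Placeholder for the real work: read from blob "B1", process, write to "B2".
    print(f"processed {file_name}")

def process_customer(customer: str, files: list[str]) -> None:
    # Files of one customer run strictly one after another.
    for f in files:
        transform_and_move(f)

files_by_customer = {                      # placeholder input
    "C1": ["c1_01.csv", "c1_02.csv"],
    "C2": ["c2_01.csv"],
    "C3": ["c3_01.csv", "c3_02.csv"],
}

# One "branch" per customer, all branches running in parallel.
with ThreadPoolExecutor(max_workers=len(files_by_customer)) as pool:
    futures = [pool.submit(process_customer, c, fs)
               for c, fs in files_by_customer.items()]
    for fut in futures:
        fut.result()
```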

Azure Logic App FTP Connector not running for files modified before the current date

I am working with the FTP connector of an Azure Logic App, and it works fine if I upload a file whose last-modified date is today.
But the FTP connector is not triggered for files that were modified before the current date.
I have found in the trigger history that the trigger is skipped and status code 202 is returned.
Kindly suggest a solution so that the FTP connector triggers whenever any file (even one modified a year ago) is added to the FTP server.
The FTP connector maintains a trigger state, which is always the last date it ran, or the date it was created (for the very first run). Thus, it only triggers if there are files with a modified date later than that trigger state.
A potential solution is not to use the FTP trigger but a Recurrence trigger, and then use the FTP connector's "List files in folder" action. This will give you all files that exist there. Then you can get the content of each one and, if your processing succeeds, delete the file.
HTH
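To make the "Recurrence plus List files in folder" idea concrete outside of Logic Apps, here is a minimal sketch using Python's built-in ftplib; the host, credentials, folder, and process() step are placeholders:

```python
from ftplib import FTP
from io import BytesIO

def process(data: bytes) -> None:
    print(f"received {len(data)} bytes")   # placeholder for the real processing

ftp = FTP("ftp.example.com")               # placeholder host
ftp.login("user", "password")              # placeholder credentials
ftp.cwd("/incoming")

for name in ftp.nlst():                    # every file, regardless of modified date
    buffer = BytesIO()
    ftp.retrbinary(f"RETR {name}", buffer.write)   # "get file content"
    process(buffer.getvalue())
    ftp.delete(name)                       # delete only after successful processing

ftp.quit()
```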

Switching production Azure tables powering a cloud service

Would like to know what would be the best way to handle the following scenario.
I have an Azure cloud service that uses an Azure storage table to look up data for requests. The data in the table is generated offline periodically (once a week).
When new data is generated offline I need to upload it into a separate table, make a config change (change the table name) so the service picks up data from the new table, and re-deploy the service. (Every time the data changes I change the table name, which is stored as a constant in my code, and re-deploy.)
The other way would be to keep a configuration parameter for my Azure web role which specifies the name of the table that holds the current production data. Then, within the service, I read the config variable for every request, get a reference to the table, and fetch data from there.
Is the second approach OK, or would it take a performance hit because I read the config and create a table client for every request that comes to the service? (The SLA for my service is less than 2 seconds.)
To answer your question, the 2nd approach is definitely better than the 1st one. I don't think you will take a performance hit, because config settings are cached on the first read (I read that in one of the threads here), and creating a table client does not add network overhead: unless you execute some methods on the table client, the object just sits in memory. One possibility would be to read the value from config and put it in a static variable. When you change the config setting, capture the role-environment-changing event and update the static variable with the new value from the config file.
A 3rd alternative could be to soft-code the table name in another table and have your application read the table name from there. You could update it as part of your upload process: first upload the data, then update this table with the name of the table the data was uploaded to.
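The third alternative is language-agnostic; a minimal sketch of that "pointer table" idea with the Python azure-data-tables SDK (all names below are placeholders):

```python
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")

# A small config table whose single well-known row stores the name of the table
# that currently holds production data.
config = service.get_table_client("Config")
pointer = config.get_entity(partition_key="lookup", row_key="current")

# Creating the client for the live table is cheap; no network call happens
# until a query is actually executed against it.
live = service.get_table_client(pointer["CurrentTableName"])
row = live.get_entity(partition_key="<pk-from-request>", row_key="<rk-from-request>")
```

The upload job simply writes the new table and then updates the "current" row, so the service switches tables on the next read without a re-deploy.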
