Process multiple files in Azure Blob Storage simultaneously based on custom logic - Azure

My exact requirement is like this:
I have a bunch of files for my different customers in my blob storage "B1". Let's say 10 files for customer "C1", 10 files for customer "C2" and 10 files for customer "C3".
I want to perform some operation on each file and move it to blob storage "B2". This can take 30 seconds to 5 minutes depending on the data in the file.
Now I want to process one file per customer simultaneously, but not more than one file for the same customer at the same time.
That is, one file for customer "C1", one for "C2" and one for "C3" should be processed simultaneously, so the processing time of "C1" does not affect "C2" and "C3".
But the next file of "C1" will be processed only when its first file has completed.
What is the best architecture using Microsoft Azure services to implement this?
For example, I have implemented it like this with Azure Functions V1:
Blob-triggered Azure Function: this simply adds the file name and customer ID to an Azure Table as soon as a file is placed in the blob container. The table has one more column, "InQueue", which is FALSE by default.
Timer-triggered Azure Function: this checks the Azure Table and takes the first file for each customer for whom all files have InQueue = FALSE (meaning no file of theirs is in process). For those files it sets InQueue = TRUE and adds their names to an Azure queue.
Queue-triggered Azure Function: this is triggered as soon as a file name appears in the Azure queue and processes the file. Once processing completes, it deletes that file's entry from the Azure Table, so all remaining entries for that customer have InQueue = FALSE again (no file in process).
So in the above architecture the timer-triggered function takes care of "one file per customer", but it still pushes files of different customers into the queue, and since the queue-triggered function can run multiple instances at the same time, files of different customers are processed simultaneously.
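For reference, here is a rough sketch of my timer-triggered dispatcher. It is simplified: the table name "FileTracking", the queue name "files-to-process", the schedule and the use of the current in-process WebJobs attributes (rather than the exact V1 bindings) are placeholders for illustration.

using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Microsoft.WindowsAzure.Storage.Table;

public class FileEntity : TableEntity
{
    // PartitionKey = customer ID, RowKey = file name
    public bool InQueue { get; set; }
}

public static class DispatchOneFilePerCustomer
{
    [FunctionName("DispatchOneFilePerCustomer")]
    public static async Task Run(
        [TimerTrigger("0 */1 * * * *")] TimerInfo timer,               // runs every minute
        [Table("FileTracking")] CloudTable trackingTable,              // assumed table name
        [Queue("files-to-process")] IAsyncCollector<string> queue,     // assumed queue name
        ILogger log)
    {
        // Read the pending entries (one unpaged segment is enough for small tables;
        // loop on the continuation token for larger ones).
        var segment = await trackingTable.ExecuteQuerySegmentedAsync(new TableQuery<FileEntity>(), null);

        foreach (var customer in segment.Results.GroupBy(e => e.PartitionKey))
        {
            // Skip customers that already have a file in flight.
            if (customer.Any(e => e.InQueue))
                continue;

            // Mark the next file as "in queue" and enqueue it for processing.
            var next = customer.OrderBy(e => e.RowKey).First();
            next.InQueue = true;
            await trackingTable.ExecuteAsync(TableOperation.Replace(next));
            await queue.AddAsync($"{next.PartitionKey}/{next.RowKey}");
            log.LogInformation($"Enqueued {next.RowKey} for customer {next.PartitionKey}");
        }
    }
}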
Is my architecture good or bad? How can I improve it? What other options could make my process faster, simpler, or involve fewer steps?

If I understand correctly, the main problem is that you want to execute multiple functions simultaneously. If so, I suggest trying a Logic App with parallel branches.
There is a tutorial on how to create a Logic App with parallel branches, and you can add an Azure Function as an action in each branch.
Here I used a Recurrence (time schedule) trigger, but you could use another one, and after each branch you can also add further actions.
Hope this helps. If you still have other questions, please let me know.

Related

Logic app to copy files from a blob even if there is no change

I have a logic app that is triggered when there is a change to the blob. This works fine, but what if I want this process to run and overwrite these files anyway, is that possible? I seem to lose all dynamic options as soon as I change the blob-modified trigger.
I am not sure if I understand your question, but if you just want a scheduled-job kind of process to pick up or put files from/to Azure Blob Storage, you can use blob actions rather than the trigger. You can use a 'Recurrence' trigger to start the logic app and one of the appropriate blob actions to do the required operation. Let me know if you are looking for something else.

Iterate through files in Data Factory

I have a Data Lake Gen1 with the folder structure /Test/{currentyear}/{Files}
{Files} Example format
2020-07-29.csv
2020-07-30.csv
2020-07-31.csv
Every day one new file gets added to the folder.
I need to create an ADF pipeline to load the files into SQL Server.
Conditions:
When my ADF pipeline runs for the first time it needs to iterate over all files and load them into SQL Server.
When it runs from the second time onward (once daily) it needs to pick up only today's file and load it into SQL Server.
Can anyone tell me how to design the ADF pipeline with the above conditions?
This should be designed in two parts.
When my ADF runs for the first time it needs to iterate all files and load into SQL Server
You should create a temporary, one-off pipeline to achieve this. (I think you know how to do this, so I will not cover that part.)
When ADF executes from the second time onward (daily once) it needs to pick up only today's file and load into SQL Server
This needs you to create another pipeline which runs continuously.
Two points to achieve this:
First, trigger this pipeline with an event trigger (when a file is uploaded, the pipeline fires).
Second, filter the files by a specific name format.
For your requirement, the expression should be @{formatDateTime(utcnow(),'yyyy-MM-dd')}; combined with the .csv suffix in the dataset's file name it resolves, for example, to 2020-07-31.csv on 2020-07-31.
This works successfully on my side; please give it a try.

Archiving Azure Search Service

I need suggestions on archiving unused data from a search service and reloading it when needed (the reload will be done later).
The initial design draft looks like this:
Find the keys in the search service, based on some conditions (e.g. inactive, or how old), for the documents that need to be archived.
Run an archiver job (need a suggestion here; it could be a WebJob or a Function App).
Fetch the data, insert it into blob storage, and delete it from the search service.
Ideally the job should run in a worker pool and be asynchronous.
There's no right/wrong answer for this question. What you need to do is perform batch queries (up to 1,000 docs per page) and schedule the job to archive past data (e.g. run a scheduled Azure Function that searches for docs whose createdDate is older than your archiving cutoff).
Then persist that data somewhere (it can be Cosmos DB, or blobs in a storage account). Once you need to load it again, I would treat it as a new insert, so it should follow your current insert process.
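A minimal sketch of that flow, assuming the Azure.Search.Documents and Azure.Storage.Blobs SDKs, an index key field named "id" and a filterable createdDate field (these names and the cutoff are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;
using Azure.Storage.Blobs;

public static class SearchArchiver
{
    public static async Task ArchiveBatchAsync(SearchClient searchClient, BlobContainerClient archiveContainer)
    {
        var cutoff = DateTimeOffset.UtcNow.AddMonths(-6);    // archive docs older than 6 months (placeholder)
        var options = new SearchOptions
        {
            Filter = $"createdDate lt {cutoff.UtcDateTime:o}",
            Size = 1000                                      // one batch of up to 1000 docs
        };

        var response = await searchClient.SearchAsync<SearchDocument>("*", options);
        var batch = new List<SearchDocument>();
        await foreach (var result in response.Value.GetResultsAsync())
            batch.Add(result.Document);

        if (batch.Count == 0)
            return;

        // Persist the batch to blob storage as JSON, then delete the docs from the index.
        var blobName = $"archive-{DateTime.UtcNow:yyyyMMddHHmmss}.json";
        using var stream = new MemoryStream(Encoding.UTF8.GetBytes(JsonSerializer.Serialize(batch)));
        await archiveContainer.UploadBlobAsync(blobName, stream);
        await searchClient.DeleteDocumentsAsync("id", batch.Select(d => d["id"].ToString()));
    }
}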
You can also take a look at this tool, which helps to copy data from your index pretty quickly:
https://github.com/liamca/azure-search-backup-restore

Azure DML Data slices?

I have 40 million blobs, about 10 TB, in blob storage. I am using DML CopyDirectory to copy these into another storage account for backup purposes. It took nearly 2 weeks to complete. Now I am wondering up to which date the blobs were copied to the target directory. Is it the date when the job started or the date the job finished?
Does DML use anything like data slices?
Now I am wondering up to which date the blobs were copied to the target directory. Is it the date when the job started or the date the job finished?
As far as I know, when you start the CopyDirectory method, it just sends requests to tell Azure Storage to copy the files from the other storage account; the copy operation itself happens inside Azure Storage.
When you run the method to start copying the directory, Azure Storage first creates each target file with size 0.
After the copy job finishes, the file size is updated to the real size.
So the result is: when the job starts, the file is created in the target directory with size 0, and its last-modified time is the creation time. Azure Storage then continues copying the file content to the target directory, and when the copy finishes it updates the file's last-modified time.
So the DML SDK just tells the storage service to copy the files, and then keeps sending requests to Azure Storage to check each file's copy status.
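If you want to check this yourself, here is a small sketch (assuming the classic Microsoft.Azure.Storage.Blob SDK) of how to read a target blob's copy status:

using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob;

public static class CopyStatusHelper
{
    // Refresh the target blob's properties and return its server-side copy status.
    public static async Task<CopyStatus> GetCopyStatusAsync(CloudBlockBlob targetBlob)
    {
        await targetBlob.FetchAttributesAsync();   // populates targetBlob.CopyState
        return targetBlob.CopyState.Status;        // Pending, Success, Failed or Aborted
    }
}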
Thanks. But what happens if files are added to the source directory during this copy operation? Do the new files also get copied to the target directory?
In short: yes.
The DML doesn't get the whole blob list and send requests to copy all the files at one time.
It first gets part of the file-name list and sends requests to tell the storage service to copy those files.
The list is sorted by file name.
For example, if the DML has already moved past the point in the listing where a file name starting with "0" would appear, and you then add such a "0..." file to the source folder, it will not be copied.
If you add a file towards the end of the listing order and the DML hasn't scanned that far yet, it will be copied to the new folder.
So during those 2 weeks at least a million blobs must have been added to the container, with very random names. So I think DML doesn't work well in the case of large containers?
As far as I know, the DML is designed for high-performance uploading, downloading and copying of Azure Storage blobs and files.
When you use DML CopyDirectoryAsync to copy blob files, it first sends a request to list the folder's current files, and then sends requests to copy them.
By default, one listing request returns 250 file names.
After getting a list, it generates a marker, which is the starting point for the next listing request; it then lists the next batch of file names in the folder and starts copying again.
Also, by default the .NET HTTP connection limit is 2, which means only two concurrent connections can be maintained.
So if you don't raise the .NET HTTP connection limit, CopyDirectoryAsync will only fetch around 500 records before it starts copying; after those copies complete, it moves on to the next batch of files.
I suggest you first raise the max HTTP connections so that more blob files are listed:
ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;
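Along with the connection-limit line above, two more settings are commonly tuned (a small sketch, assuming the Microsoft.Azure.Storage.DataMovement package; the values are only illustrative):

using System.Net;
using Microsoft.Azure.Storage.DataMovement;

// Turning off 100-Continue saves one round trip per request.
ServicePointManager.Expect100Continue = false;
// Number of concurrent transfers the DML schedules (illustrative value).
TransferManager.Configurations.ParallelOperations = 64;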
Besides, I suggest creating multiple folders to store the files. For example, you could create a folder that stores one week's files and start a new folder the next week. Then you can back up the old folder's files without new files being added to that folder while the copy runs.
Finally, you could also write your own code to achieve this: first get the list of the folder's files (the maximum result of one listing request is 5000), then send a request to tell the storage service to copy each file. Files uploaded to the folder after you take the listing will simply not be copied to the target folder.
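A rough sketch of that last option, assuming the classic blob SDK (Microsoft.Azure.Storage.Blob); the connection strings, container names and prefix are placeholders:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

public static class ManualBackup
{
    public static async Task CopyWithPrefixAsync()
    {
        var sourceContainer = CloudStorageAccount.Parse("<source-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("source-container");
        var targetContainer = CloudStorageAccount.Parse("<target-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("backup-container");

        BlobContinuationToken token = null;
        do
        {
            // List up to 5000 blobs per request under a prefix (placeholder prefix).
            var segment = await sourceContainer.ListBlobsSegmentedAsync(
                "2021-week-01/",        // prefix
                true,                   // flat listing
                BlobListingDetails.None,
                5000,                   // max results per listing request
                token, null, null);

            foreach (var item in segment.Results)
            {
                if (item is CloudBlockBlob source)
                {
                    // Server-side asynchronous copy; poll CopyState on the target if you
                    // need to know when each copy finishes. For a private source in a
                    // different account, copy via a SAS URI instead.
                    var target = targetContainer.GetBlockBlobReference(source.Name);
                    await target.StartCopyAsync(source);
                }
            }
            token = segment.ContinuationToken;
        } while (token != null);
    }
}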

Switching production Azure tables powering a cloud service

I would like to know the best way to handle the following scenario.
I have an Azure cloud service that uses an Azure Storage table to look up data for incoming requests. The data in the table is generated offline periodically (once a week).
Currently, when new data is generated offline, I upload it into a separate table, change the table name (stored as a constant in my code) so the service picks up data from the new table, and re-deploy the service. (Every time the data changes I change the table name and re-deploy.)
The other way would be to keep a configuration parameter for my Azure web role that specifies the name of the table holding the current production data. Then, within the service, I read the config variable for every request, get a reference to the table and fetch data from there.
Is the second approach OK, or would it take a performance hit because I read the config and create a table client on every request that comes to the service? (The SLA for my service is less than 2 seconds.)
To answer your question, the 2nd approach is definitely better than the 1st one. I don't think you will take a performance hit, because configuration settings are cached after the first read (I read this in one of the threads here), and creating a table client does not add network overhead: until you execute some method on the table client, the object just sits in memory. One possibility would be to read the setting from config and put it in a static variable; when you change the config setting, capture the role environment changed event and update the static variable with the new value from the config file.
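A small sketch of that idea for a classic web/worker role (the setting name "CurrentTableName" is a placeholder):

using Microsoft.Azure;                          // CloudConfigurationManager
using Microsoft.WindowsAzure.ServiceRuntime;    // RoleEnvironment

public static class TableNameProvider
{
    private static string _tableName = CloudConfigurationManager.GetSetting("CurrentTableName");

    public static string TableName => _tableName;

    // Call once at role start (e.g. in OnStart) so the cached value is refreshed
    // whenever the cloud service configuration is updated.
    public static void HookConfigChanges()
    {
        RoleEnvironment.Changed += (sender, e) =>
        {
            _tableName = CloudConfigurationManager.GetSetting("CurrentTableName");
        };
    }
}

You would then read TableNameProvider.TableName when creating the table client for each request.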
A 3rd alternative could be to soft-code the table name in another table and have your application read the table name from there. You could update that entry as part of your upload process: first upload the data, then update this table with the name of the new table where the data has been uploaded.
