I have a requirement to move files in ADLS Gen 2 from directory 'A' to directory 'B' or 'C' based on two conditions: move to 'C' if the file is not a CSV or its size is 0, otherwise move to 'B'.
I am planning to use Event Grid (to fire as soon as a file lands in location 'A') plus an Azure Function (to run the checks and move the file to location 'B' or 'C').
If 100 files land per day, this approach will trigger the Azure Function 100 times.
Is there a better way to do this? Can this logic be built using just one service (such as Event Hubs instead of Event Grid + Function) so that there is less overhead to maintain?
Thanks for your time.
If you want a low-effort option, try Logic Apps.
What you want is to create a Logic App with a blob trigger, which fires whenever new blobs arrive. That takes care of the trigger.
For the action, you can use "Copy blob" if you like. I'm not sure whether a "Move blob" action is supported, but if it isn't and "Copy blob" isn't good enough for you, you can provide a custom JavaScript snippet as an inline-code action.
A couple of notes:
If your Azure Functions are called only 100 times a day and they only do a small check before moving the blob, then under the Consumption plan you'll probably pay less than 1 USD per month.
With Azure Functions you'll have a lot more control, but it'll take you a lot longer (compared to Logic Apps) to develop and operate.
Can this logic be built using just one service?
Of course: you can directly use the blob trigger of Azure Functions.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob-trigger?tabs=csharp
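For the routing logic itself, here is a minimal sketch of such a blob-triggered function, assuming the Python v1 programming model, a trigger path in function.json such as "data/A/{name}", a placeholder ADLS_CONNECTION_STRING app setting, and the azure-storage-file-datalake SDK's rename_file for the server-side move (none of these names come from the question):

```python
import logging
import os

import azure.functions as func
from azure.storage.filedatalake import DataLakeServiceClient


def main(myblob: func.InputStream) -> None:
    # myblob.name includes the container, e.g. "data/A/myfile.csv"
    container, _, path = myblob.name.partition("/")
    file_name = path.rsplit("/", 1)[-1]

    # Route to 'C' if the file is not a CSV or is empty, otherwise to 'B'
    is_csv = file_name.lower().endswith(".csv")
    is_empty = (myblob.length or 0) == 0
    target_dir = "B" if (is_csv and not is_empty) else "C"

    service = DataLakeServiceClient.from_connection_string(
        os.environ["ADLS_CONNECTION_STRING"])  # placeholder app setting
    fs = service.get_file_system_client(file_system=container)

    # rename_file performs a server-side move; the new path must be
    # prefixed with the file system (container) name
    fs.get_file_client(path).rename_file(f"{container}/{target_dir}/{file_name}")
    logging.info("Moved %s to directory %s", myblob.name, target_dir)
```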
If 100 files land per day, this approach will trigger the Azure Function 100 times.
You can use an Azure Function with a timer trigger to do a daily check instead of using Event Grid to trigger the function.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-timer?tabs=csharp
Just put the logic in the body of the function.
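Below is a minimal sketch of such a daily sweep, assuming the same routing rule as the question (non-CSV or zero-byte files go to 'C', the rest to 'B'), a placeholder container name and connection-string setting, and the azure-storage-file-datalake SDK; the schedule itself lives in function.json (e.g. "0 0 2 * * *" for 02:00 UTC daily):

```python
import logging
import os

import azure.functions as func
from azure.storage.filedatalake import DataLakeServiceClient

CONTAINER = "data"  # placeholder container (file system) name


def main(mytimer: func.TimerRequest) -> None:
    service = DataLakeServiceClient.from_connection_string(
        os.environ["ADLS_CONNECTION_STRING"])  # placeholder app setting
    fs = service.get_file_system_client(file_system=CONTAINER)

    # Sweep everything still sitting in directory 'A'
    for item in fs.get_paths(path="A"):
        if item.is_directory:
            continue
        file_name = item.name.rsplit("/", 1)[-1]
        is_good = file_name.lower().endswith(".csv") and (item.content_length or 0) > 0
        target_dir = "B" if is_good else "C"
        fs.get_file_client(item.name).rename_file(
            f"{CONTAINER}/{target_dir}/{file_name}")
        logging.info("Moved %s to directory %s", item.name, target_dir)
```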
Related
I'm having a hard time understanding how ADF will be charged in the following scenario:
One pipeline with two activities, one of which is a ForEach that loops 1000+ times. The activity inside the ForEach is a stored procedure.
So will this be 2 activity runs or more than 1000?
You can see the result in the ADF monitor, but it will be 1,002 activity runs at $1 per 1,000 (or whatever the current rate is).
It's much cheaper (if you mind the dollars) if you can pass the whole list into your proc; the lookup.output field is just JSON with your table list in it, which you could parse inside the proc.
https://learn.microsoft.com/en-us/azure/data-factory/pricing-concepts
Given a source and a destination:
Source: an Azure StorageV2 account with two containers named A and B, containing blob files stored flat in the root directory of each container.
Destination: an Azure Data Lake Gen2 account (for simplification purposes, consider it another storage account with a single destination container).
Objective: I am trying to copy/ingest all files within the currently active source container at the top of the month. For the remainder of that month, any newly added or overwritten files inside the active source container need to be ingested as well.
For each month, there will only be one active container that we care about. So January would use Container A, February would use Container B, March would use Container A, etc. Using Azure Data Factory, I've already figured out how to accomplish this logic of swapping containers by using a dynamic expression in the file path.
@if(equals(mod(int(formatDateTime(utcnow(),'%M')), 2), 0), 'containerB', 'ContainerA')
What I've tried so far: I set up a Copy pipeline using a Tumbling Window approach, where a trigger runs daily to check for new/changed files based on the LastModifiedDate, as described here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-lastmodified-copy-data-tool. However, I ran into a conundrum: the files to be ingested at the top of the month will by nature have a LastModifiedDate in the past relative to the trigger's window start date, because the container is prepared ahead of time in the days leading up to the turn of the month, right before the containers are swapped. Because the LastModifiedDate falls before the trigger's window start, those existing files on the 1st of the month never get copied; only files added or changed after the trigger start date do. If I manually fire the trigger by hardcoding an earlier start date, then any files added to the container mid-month end up getting ingested for the remainder of the month as expected.
So how do I solve that base case for files modified before the start date? If this can be solved, then everything can happen in one pipeline and one trigger. Otherwise, I will have to figure out another approach.
And in general, I am open to ideas as to the best approach to take here. The files will be ~2 GB each and around 20,000 in number.
You can do this by setting your trigger to run at the end of each day and copying all of that day's new/updated files using the last modified date, as described below.
This assumes no files are uploaded to the second container while the first container is active.
Please follow the steps below:
Go to Data Factory and drag a Copy activity into your pipeline.
Create the source dataset by creating the linked service. Provide your container condition by clicking Add dynamic content in the source dataset.
@if(equals(mod(int(formatDateTime(utcnow(),'%M')), 2), 0), 'containerb', 'containera')
Then select Wildcard file path as the File path type and give * as the file path to copy multiple files.
Here I am copying files that are new or updated in the last 24 hours. Go to Filter by last modified and give @adddays(utcNow(),-1) as the start time and @utcNow() as the end time.
As we are scheduling this with a trigger at the end of each day, it will pick up files that are new or updated in the 24 hours up to the start time.
Give the container of the other storage account as the sink dataset.
Now, click Add trigger and create a Tumbling Window trigger.
You can set the trigger's start date to whatever end-of-day time suits your pipeline execution time.
Please make sure you publish the pipeline and trigger before execution.
If your second container also receives new/modified files while the first container is active, then you can try something like this as the start time of the last-modified filter:
@if(equals(int(formatDateTime(utcNow(),'dd')),1), adddays(utcNow(),-31), adddays(utcNow(),-1))
I have an Azure Function app on the Linux Consumption plan that has two queue triggers. Both queue triggers have the batchSize parameter set to 1 because each can use about 500 MB of memory and I don't want to exceed the 1.5 GB memory limit, so they should only be allowed to pick up one message at a time.
If I want to allow both of these queue triggers to run concurrently, but don't want them to scale beyond that, is setting the functionAppScaleLimit to 2 enough to achieve that?
Edit: added new examples; thank you @Hury Shen for providing the framework for these examples.
Please see @Hury Shen's answer below for more details. I've tested three queue trigger scenarios. All use the following legend:
QueueTrigger with no functionAppScaleLimit
QueueTrigger with functionAppScaleLimit set to 2
QueueTrigger with functionAppScaleLimit set to 1
For now, I think I'm going to stick with the last example, but in the future I think I can safely set my functionAppScaleLimit to 2 or 3 if I upgrade to the Premium plan. I am also going to test two queue triggers that listen to different storage queues with a functionAppScaleLimit of 2, but I suspect the safest thing for me to do is to create a separate Azure Function app for each queue trigger in that scenario.
Edit 2: added examples for two queue triggers within one function app.
Here are the results when using two queue triggers within one Azure Function app, listening on two different storage queues. This is the legend for both queue triggers:
Both queue triggers running concurrently with functionAppScaleLimit set to 2
Both queue triggers running concurrently with functionAppScaleLimit set to 1
In the example where two queue triggers run concurrently with functionAppScaleLimit set to 2, it looks like the scale limit is not working. Can someone from Microsoft please explain? There is no warning in the official documentation (https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale#limit-scale-out) that this setting is in preview, yet we can clearly see that the Azure Function is scaling out to 4 instances when the limit is set to 2. In the following example, it looks like the limit is being respected, but the functionality is not what I want, and we still see the waiting that is present in @Hury Shen's answer.
Conclusion
To limit concurrency and control scaling in Azure Functions with queue triggers, you must limit yourself to one queue trigger per function app and use the batchSize and functionAppScaleLimit settings. If you use more than one queue trigger, you will encounter race conditions and waiting that may lead to timeouts.
Yes, you just need to set functionAppScaleLimit to 2. But there are some things about how the Consumption plan works that you need to know. I tested it on my side with batchSize set to 1 and functionAppScaleLimit set to 2 (I set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT to 2 in the function app's "Application settings" instead of setting functionAppScaleLimit; they are the same). I tested with the code below:
```python
import logging
import time

import azure.functions as func


def main(msg: func.QueueMessage) -> None:
    logging.info('=========sleep start')
    time.sleep(30)  # simulate 30 seconds of work per message
    logging.info('=========sleep end')
    logging.info('Python queue trigger function processed a queue item: %s',
                 msg.get_body().decode('utf-8'))
```
Then I added messages to the queue. I sent 10 messages (111, 222, 333, 444, 555, 666, 777, 888, 999, 000), one by one. The function was triggered successfully, and after a few minutes we can see the logs in "Monitor". Clicking one of the log entries in "Monitor", the logs show as:
I marked four red boxes on the right of the screenshot above and named the four log entries "s1", "s2", "s3", "s4" (steps 1-4), then summarized the logs in Excel for your reference:
I made the cells from "s2" to "s4" yellow because this period of time refers to the execution time of the function task.
According to the Excel screenshot, we can infer the following points:
1. The maximum number of instances only ever extends to 2, because no row of the Excel table contains more than two yellow cells. So the function cannot scale beyond 2 instances, as you mentioned in your question.
2. You want to allow both of these queue triggers to run concurrently, and that can happen. But the instances are scaled out by the Consumption plan's own mechanism. In simple terms, when one function instance has been triggered by a message and hasn't completed its task, and another message comes in, there is no guarantee that another instance will be used. The second message might wait on the first instance. We cannot control whether another instance is spun up or not.
===============================Update==============================
As I'm not completely clear on your description, I'm not sure how many storage queues you want to listen to or how many function apps and QueueTrigger functions you have created on your side. I summarize my test results below for your reference:
1. For your question about whether the Maximum Burst you described on the Premium plan would behave differently than this: I think if we choose the Premium plan, the instances will also be scaled out by the same mechanism as on the Consumption plan.
2. If you have two storage queues that need to be listened to, of course we should create two QueueTrigger functions, one listening to each storage queue.
3. If you just have one storage queue that needs to be listened to, I tested three cases (with the maximum scale-out instances set to 2 in all three cases):
A) Create one QueueTrigger function in one function app to listen to the storage queue. This is what I tested in my original answer; the Excel table shows that the instances scale out by the Consumption plan's mechanism and we cannot control it.
B) Create two QueueTrigger functions in one function app to listen to the same storage queue. The result is almost the same as case A; we cannot control how many instances are used to process the messages.
C) Create two function apps, each with one QueueTrigger function listening to the same storage queue. The result is also similar to cases A and B; the difference is that the maximum can scale to 4 instances, because I created two function apps (each of which can scale to 2 instances).
So, in a word, I think all three cases are similar. Even if we choose case C and create two function apps with one QueueTrigger function in each, we still cannot make sure the second message is dealt with immediately; it may still be routed to the first instance and wait for the first instance to finish processing the first message.
So the answer to the question in this post, "is setting the functionAppScaleLimit to 2 enough to achieve that?", is: if you want both instances to be guaranteed to run concurrently, we can't make sure of that. If you just want at most two instances processing the messages, the answer is yes.
With an Azure Data Factory "Tumbling Window" trigger, is it possible to limit the hours of each day during which it fires (adding a window, you might say)?
For example, I have a Tumbling Window trigger that runs a pipeline every 15 minutes. This currently runs 24/7, but I'd like it to run only during business hours (0700-1900) to reduce costs.
Edit:
I played around with this, and found another option which isn't ideal from a monitoring perspective, but it appears to work:
Create a new pipeline with a single "If Condition" activity with a dynamic expression like this:
@and(greater(int(formatDateTime(utcnow(),'HH')),6),less(int(formatDateTime(utcnow(),'HH')),20))
In the true case activity, add an Execute Pipeline activity executing your original pipeline (with "Wait on completion" ticked).
In the false case activity, add a Wait activity which sleeps for X minutes.
The longer you sleep, the further you can potentially encroach on your window, so adjust that to match.
I need to give it a couple of days before I check the billing on the portal to see if it has reduced costs. At the moment I'm assuming a job which just sleeps for 15 minutes won't incur the costs that one running and processing data would.
There is no easy way, but you can create two deployment pipelines for the same job in Azure DevOps, and as soon as your 0700-1900 window expires you replace that job with a dummy job using an Azure DevOps pipeline.
I have a client request for my Data Factory solution.
They want to run my Data Factory whenever the input file is available in Blob Storage (or any other location). To be very clear, they don't want to run the solution on a scheduled basis, because some days the file won't show up. So I need some intelligence to check whether a file is available to be processed in the location or not. If yes, then I have to run my Data Factory solution to process that file; otherwise there is no need to run the Data Factory.
Thanks in Advance
Jay
I think you've currently got 3 options for dealing with this, none of which are exactly what you want...
Option 1 - use C# to create a custom activity that does some sort of check on the directory before proceeding with other downstream pipelines (a rough sketch of such an existence check is shown after this list).
Option 2 - add a long delay to the activity so the processing retries for the next X days. Sadly, only a maximum of 10 long retries is currently allowed.
Option 3 - wait for a newer version of Azure Data Factory that might allow more event-driven activities, rather than using a scheduled time-slice approach.
Apologies this isn't exactly the answer you want, but it gives you the current options.
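For what the existence check itself could look like (independent of ADF, e.g. in a small script or function that decides whether to start the pipeline), here is a rough sketch using the azure-storage-blob SDK; the connection-string setting, container, and blob name are placeholders rather than anything from the question:

```python
import os

from azure.storage.blob import BlobClient

# Placeholders: adjust the connection-string setting, container and blob name
blob = BlobClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"],
    container_name="input",
    blob_name="incoming/data.csv")

if blob.exists():
    print("Input file found - safe to kick off the Data Factory run")
else:
    print("No input file yet - skip this run")
```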