I want to schedule Azure Search indexers with a custom schedule as below.
The first indexer should run every 2 hours from Sunday to Monday (not on Saturday) for delta updates.
The second indexer should run every Saturday for a full load of data.
I created an Azure Search index and set up an indexer to load data into it, but the Azure Portal only allows selecting a fixed interval (in minutes) on which to run the indexer on a schedule.
For your first indexer request, there is no way to run an indexer on a schedule only from Sunday to Monday without some sort of intervention. The options would be:
Update the indexer to remove the schedule when you want it to stop, and re-add it when you want it to start again.
Mark the indexer as disabled: true when you want it to stop, and change it back to disabled: false when you want it to start again.
Don't use the built-in indexer scheduling; instead create your own automation logic that runs the indexer on demand on whatever day-of-the-week schedule you prefer (see the sketch below).
For your second indexer request, this is also not doable with the built-in schedule, because the maximum interval we allow between indexer runs is 1 day, so a 1-week gap is not supported. Your best option here is to set up your own automation logic that runs weekly and runs the indexer on demand on the schedule you prefer. Also note that for this use case, if you want to ensure that the indexer always processes the full load of data, you should make sure that the data source does not have a change tracking policy. Some of our data sources implement a change tracking policy by default, so you would need to remove it manually if the behavior you want is a full re-index on each run.
Relevant documentation: https://learn.microsoft.com/en-us/rest/api/searchservice/create-indexer#request-body
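If you go the automation route for either schedule, the on-demand run itself is a single REST call (the Run Indexer operation in the documentation above). Below is a minimal Python sketch; the service name, indexer name, and the environment variable holding the admin key are placeholders, and you would invoke it from whatever scheduler you control (cron, an Azure Functions timer trigger, Logic Apps, and so on).

import os
import requests

# Placeholders: substitute your own search service, indexer name, and admin key.
SERVICE = "my-search-service"      # hypothetical service name
INDEXER = "delta-indexer"          # hypothetical indexer name
API_KEY = os.environ["SEARCH_ADMIN_KEY"]
API_VERSION = "2020-06-30"

def run_indexer():
    # Run Indexer REST operation: queues an on-demand indexer run (202 Accepted).
    url = f"https://{SERVICE}.search.windows.net/indexers/{INDEXER}/run?api-version={API_VERSION}"
    resp = requests.post(url, headers={"api-key": API_KEY})
    resp.raise_for_status()

if __name__ == "__main__":
    run_indexer()

Schedule the caller every 2 hours on the days you want for the delta indexer, and once a week (Saturdays) for the full-load indexer.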
Related
Given two containers:
Source: An Azure StorageV2 Account with two containers named A and B containing blob files that will be stored flat in the root directory in the container.
Destination: An Azure Data Lake Gen2 account (for simplification purposes, consider it another Storage Account with a single destination container).
Objective: I am trying to copy/ingest all files within the currently active source container at the top of the month. For the remainder of that month, any newly added/overwritten files inside the active source container need to be ingested as well.
For each month, there will only be one active container that we care about. So January would use Container A, Feb would use Container B, March would use Container A, etc. Using Azure Data Factory, I’ve already figured out how to accomplish this logic of swapping containers by using a dynamic expression in the file path.
@if(equals(mod(int(formatDateTime(utcnow(),'%M')), 2), 0), 'containerB', 'containerA')
What I've tried so far: I set up a Copy pipeline using a Tumbling Window approach, where a trigger runs daily to check for new/changed files based on LastModifiedDate, as described here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-lastmodified-copy-data-tool. However, I ran into a conundrum: the files to be ingested at the top of the month will, by nature, have a LastModifiedDate in the past relative to the trigger's start window, because the container is prepared ahead of time in the days leading up to the turn of the month, right before the containers are swapped. Because their LastModifiedDate precedes the trigger's start window, those pre-existing files never get copied on the 1st of the month; only files added or changed after the trigger start date are picked up. If I manually fire the trigger with a hardcoded earlier start date, then files added to the container mid-month do get ingested for the remainder of the month, as expected.
So how do I solve that base case for files modified before the start date? If this can be solved, then everything can happen in one pipeline and one trigger. Otherwise, I will have to figure out another approach.
And in general, I am open to ideas as to the best approach to take here. The files will be ~2 GB each and around 20,000 in number.
You can do it by setting your trigger to run at the end of each day and copying all the new/updated files for that day based on their last modified date, as below.
This assumes that no files are uploaded to the second container while the first container is active.
Please follow the below steps:
Go to Data Factory and drag the Copy activity into your pipeline.
Create the source dataset by creating the linked service. Give your container condition by clicking on Add dynamic content in the source dataset.
@if(equals(mod(int(formatDateTime(utcnow(),'%M')), 2), 0), 'containerb', 'containera')
Then select Wildcard file path as the File path type and give * as the file path to copy multiple files.
Here I am copying files that are new or updated in the last 24 hours. Go to Filter by last modified and give @addDays(utcNow(),-1) as the start time and @utcNow() as the end time.
As we are scheduling this with a trigger at the end of each day, it will pick up the files that were added or changed in the 24 hours before each run.
Give the container in the other storage account as the sink dataset.
Now, click on Add trigger and create a Tumbling Window trigger.
You can set the trigger's start date to whatever end-of-day time suits your pipeline execution.
Please make sure you publish the pipeline and trigger before execution.
If your second container also receives new/modified files while the first container is active, then you can try something like this for the start time of the last-modified filter.
@if(equals(int(formatDateTime(utcNow(),'%d')),1), addDays(utcNow(),-31), addDays(utcNow(),-1))
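For comparison, the same incremental logic can be sketched outside Data Factory with the azure-storage-blob Python SDK, including the first-of-month case from the question. The connection strings, container names, and the 31-day lookback are assumptions for illustration only; for ~2 GB files you would normally prefer a server-side copy (start_copy_from_url with a SAS) rather than streaming bytes through the client as done here.

from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

# Placeholders for the source and destination accounts.
source = BlobServiceClient.from_connection_string("<source-connection-string>")
dest = BlobServiceClient.from_connection_string("<destination-connection-string>")

now = datetime.now(timezone.utc)
# Even months -> containerb, odd months -> containera (same rule as the expression above).
active_container = "containerb" if now.month % 2 == 0 else "containera"

# On the 1st, look back far enough to catch files staged before the container swap;
# on every other day, only pick up the last day's changes.
lookback = timedelta(days=31) if now.day == 1 else timedelta(days=1)
cutoff = now - lookback

src_container = source.get_container_client(active_container)
dst_container = dest.get_container_client("destination")  # hypothetical destination container

for blob in src_container.list_blobs():
    if blob.last_modified >= cutoff:
        data = src_container.get_blob_client(blob.name).download_blob().readall()
        dst_container.upload_blob(blob.name, data, overwrite=True)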
Context
I have created a streaming job using the Azure portal which aggregates data using a day-wise TUMBLINGWINDOW. I have attached a code snippet below, modified from the docs, which shows similar logic.
SELECT
DATEADD(day, -1, System.Timestamp()) AS WindowStart,
System.Timestamp() AS WindowEnd,
TollId,
COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(day, 1), TollId
Now that the job has been running and I can see it producing output, I want to reduce costs, ideally by setting up some sort of time scheduling so that the job can still produce the same output without being on all the time.
The only real constraint being that the aggregated output at the end of each TUMBLINGWINDOW has to remain the same as if it were running all the time (no impact of stop-starting on output).
This then brings me to my question.
Update: 2022-02-28
Before going into the question, another thing that motivated me: through the Azure portal you can manually start and stop a job, and when you start/restart a job you can set a custom start time for the job/query. With this level of control, say I start a job (or have one running) and then stop it for the majority of the day, then turn it on at, say, 11:30pm each day with a custom start time of midnight of the current day. It would then be on for roughly 30 minutes before it outputs the results, yet (to my understanding) still produce the same aggregation results as if it had been on the whole day up to that point. The job could then be paused again at 00:30am the next day, staying paused for the majority of the day (1380 minutes in total, until 11:30pm again), at which point the same logic applies.
This way it remains off for the majority of the day yet can still produce the same output for each day-wise window (correct me if I am wrong in my thinking). The only issue seems to be that someone would have to perform this manually, which drove me to the docs looking for a way to automate it.
Question
How can I start and stop a job in an automated fashion such that the required output remains intact, but the job doesn't have to stay on all the time (as it currently does)?
Does the documentation linked above suffice given the context? If so, what are some possible arrangements for the N minutes (on) and M minutes (off) time variables for this to work?
Is this possible given that I want to aggregate on a one-day TUMBLINGWINDOW, whereby I want each window to start and end at midnight of each day, as per its default behaviour?
Eg
Window start: 2022-02-20 00:00:00 Window end: 2022-02-21 00:00:00 (aggregation performed),
Window start: 2022-02-21 00:00:00 Window end: 2022-02-22 00:00:00 (aggregation performed),
Window start: 2022-02-22 00:00:00 Window end: 2022-02-23 00:00:00 (aggregation performed),
....so on
Thoughts
I found this documentation from Microsoft regarding auto-pausing jobs using a few methods.
However, I came across a paragraph (quoted below) which made me doubt whether this is reasonable in my particular use case (a 1-day tumbling window, as described in the Question section).
Note
There are downsides to auto-pausing a job. The main ones being the loss of the low latency /real time capabilities, and the potential risks from allowing the input event backlog to grow unsupervised while a job is paused. Auto-pausing should not be considered for most production scenarios running at scale.
Could this method still be reasonable for my scenario despite that warning?
There are 3 ways to lower costs:
downscale your job: you will have higher latency but a lower cost, up to the point where your job crashes because it runs out of memory over time and/or can't catch up with its backlog. Here you need to keep an eye on your metrics to make sure you can react before it's too late
going further, you can regroup multiple queries into a single job. This job most likely won't be aligned in partitions, so it won't be able to scale linearly (adding SUs is not guaranteed to give you better performance). Same comment as above, plus you need to remember that when you need to scale back up, you will probably have to break that job back into multiple jobs to again be able to scale in a linear fashion
finally, you can auto-pause a job, one way to implement that being explained in the doc you linked. I wrote that doc, and what I meant by that comment is that here again you are taking the risk of overloading the job if it can't run long enough to process the backlog of events. This is a risky proposition for most production scenarios
But if you know what you are doing, and are closely monitoring the appropriate metrics (as explained in the doc), this is definitely something you should explore.
Finally, all of these approaches, including the auto-pause one, will deal with tumbling windows transparently for you.
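To make the auto-pause option concrete, here is a rough Python sketch of the underlying stop/start calls against the Stream Analytics management REST API. The subscription, resource group, job name, and api-version are placeholders, and this is only one possible shape of the automation described in the linked doc; the key point is starting with outputStartMode set to CustomTime (or 'when last stopped') so the daily tumbling windows stay aligned across restarts.

from datetime import datetime, timezone
import requests
from azure.identity import DefaultAzureCredential

# Placeholders for the job to pause and resume.
SUB, RG, JOB = "<subscription-id>", "<resource-group>", "<asa-job-name>"
BASE = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
        f"/providers/Microsoft.StreamAnalytics/streamingjobs/{JOB}")
API = "api-version=2020-03-01"  # assumed api-version

def _headers():
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

def stop_job():
    requests.post(f"{BASE}/stop?{API}", headers=_headers()).raise_for_status()

def start_job(output_start: datetime):
    # CustomTime pins the output start (e.g. midnight) so daily windows stay aligned;
    # LastOutputEventTime would resume from wherever the job last produced output.
    body = {"outputStartMode": "CustomTime",
            "outputStartTime": output_start.astimezone(timezone.utc).isoformat()}
    requests.post(f"{BASE}/start?{API}", headers=_headers(), json=body).raise_for_status()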
Update: 2022-03-03 following comments here
Update: 2022-03-04 following comments there
There are 3 time dimensions here:
When the job is running or not: the wall clock
When the time window is expected to output results: Tumbling(day,1) -> 00:00AM every day, this is absolute (on the day, on the hour, on the minute...) and independent of the job start time below
What output you want produced from the job, via the job start time
Let's say you have the job running 24/7 for multiple months, and decide to stop it at noon (12:00PM) on the 1st day of March.
It already has generated an output for the last day of February, at 00:00AM Mar1.
You won't see a difference in output until the following day, 00:00AM Mar2, when you expect to see the daily window of Mar1, but it's not output because the job is stopped.
Let's start the job at 01:00AM Mar2 wall clock time. If you want the missing time window, you should either pick a start time of 'when last stopped' (noon the day before) or a custom time anywhere before 23:59 on Mar1. What you are driving is the output window you want; here you are telling ASA you want all the windows from that point onward.
ASA will then reload all the data it needs to generate that window (make sure the event hub has enough retention for that; we don't cache data between restarts in the job): Azure Stream Analytics will automatically look back at the data in the input source. For instance, if you start a job "Now" and your query uses a 5-minute Tumbling Window, Azure Stream Analytics will seek data from 5 minutes ago in the input. The first possible output event will have a timestamp equal to or greater than the current time, and ASA guarantees that all input events that may logically contribute to that output have been accounted for.
With an Azure Data Factory "Tumbling Window" trigger, is it possible to limit the hours of each day during which it fires (adding a window, you might say)?
For example, I have a Tumbling Window trigger that runs a pipeline every 15 minutes. This currently runs 24/7, but I'd like it to run only during business hours (0700-1900) to reduce costs.
Edit:
I played around with this, and found another option which isn't ideal from a monitoring perspective, but it appears to work:
Create a new pipeline with a single "If Condition" step with a dynamic Expression like this:
@and(greater(int(formatDateTime(utcnow(),'HH')),6),less(int(formatDateTime(utcnow(),'HH')),20))
In the true case activity, add an Execute Pipeline step executing your original pipeline (with "Wait on completion" ticked)
In the false case activity, add a wait step which sleeps for X minutes
The longer you sleep for, the more you can potentially encroach on your window, so adjust that to match.
I need to give it a couple of days before I check the billing on the portal to see if it has reduced costs. At the moment I'm assuming a job which just sleeps for 15 minutes won't incur the costs that one running and processing data would.
There is no easy way, but you can create two deployment pipelines for the same job in Azure DevOps, and as soon as your 0700-1900 window expires you replace that job with a dummy job using an Azure DevOps pipeline.
I have a client request for my Data Factory solution.
They want to run my Data Factory whenever the input file is available in Blob Storage (or any other location). To be very clear, they don't want to run the solution on a schedule, because some days the file won't show up. So I want some intelligence that checks whether a file is available to be processed in that location or not. If yes, I have to run my Data Factory solution to process that file; otherwise, there is no need to run it.
Thanks in Advance
Jay
I think you've currently got 3 options for dealing with this. None of them is exactly what you want...
Option 1 - Use C# to create a custom activity that does some sort of checking on the directory before proceeding with other downstream pipelines.
Option 2 - Add a long delay to the activity so the processing retries for the next X days. Sadly only a maximum of 10 long retries is currently allowed.
Option 3 - Wait for a newer version of Azure Data Factory that might allow the possibility of more event-driven activities, rather than using a scheduled time slice approach.
Apologies this isn't exactly the answer you want, but these are the current options.
I am searching for some data on Splunk over a 5-minute time range. I want this query to run every 5 minutes in Splunk on its own. How can this be done? I tried finding it in Splunk, but all I can see is how to schedule alerts and reports. And after the query is activated, how can we access the results it produces?
Technically you can have a scheduled search, but it only makes sense to talk about a report or an alert. Your scheduled approach is actually the best practice (there is also the possibility of a real-time search over the last 5 minutes).
If you just want a report, you tell Splunk to email it to you either as an HTML table or as a PDF document.
If you only want to be alerted if some condition matches (i.e. more than X results) then you want to set up an alert.
Scheduled searches are available, but they are a bit tricky to access (imho)
In the alerts/reports schedule options you have to set the following:
Earliest: -6m@m
Latest: -1m@m
Cron expression: */5 * * * *
Don't forget to set some trigger condition (for an alert) or a delivery method (for the report) ;)
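If you also want to pull the results of a scheduled search programmatically rather than having them delivered by email, one option is the Splunk SDK for Python (splunk-sdk). A rough sketch, with the host, credentials, and saved-search name as placeholders:

import splunklib.client as client
import splunklib.results as results

# Placeholders: your search head and the name of the scheduled report/alert.
service = client.connect(host="localhost", port=8089, username="admin", password="changeme")
saved = service.saved_searches["my_5min_search"]

# history() lists the search jobs the scheduler has dispatched for this saved search.
jobs = saved.history()
if jobs:
    job = jobs[0]  # pick the run you care about
    for row in results.JSONResultsReader(job.results(output_mode="json")):
        if isinstance(row, dict):  # skip informational messages
            print(row)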