Azure Data Factory: Process previous date data

I have been trying to find a way to dynamically set the start and end properties of the pipeline. The reason for this is to process a time series of files 5 days before the current day of the pipeline execution.
I try to set this in the pipeline JSON:
"start": "Date.AddDays(SliceStart, -5)"
"start": "Date.AddDays(SliceEnd, -5)"
and when publishing through VS2015, I get the error below:
Unable to convert 'Date.AddDays(SliceEnd, -5)' to a DateTime value.
Please use ISO8601 DateTime format such as "2014-10-01T13:00:00Z" for
UTC time, or "2014-10-01T05:00:00-8:00" for Pacific Standard Time. If
the timezone designator is omitted, the system will consider it
denotes UTC time by default. Hence, "2014-10-01" will be converted to
"2014-10-01T00:00:00Z" automatically. (error code: InputIsMalformedDetailed)
What could be the other ways to do this?

Rather than trying to set this dynamically at the pipeline level, which won't work, you need to deal with it when you provision the time slices against the dataset and in the activity.
Use the JSON attribute called Offset within the availability block for the dataset and within the scheduler block for the activity.
This takes the time slice start value produced by the configured interval and frequency and shifts it by the given timespan (days, hours, minutes, etc.).
For example (in the dataset):
// etc....
},
"availability": {
    "frequency": "Day",
    "interval": 1,
    "style": "StartOfInterval",
    "offset": "-5.00:00:00" // minus 5 days
}
// etc....
You'll need to configure this in both places, otherwise the activity will fail validation at deployment time.
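For reference, here's a sketch of the matching scheduler block on the activity side; the activity name and type below are illustrative, but the scheduler attributes mirror the dataset's availability block:
"activities": [
    {
        "name": "CopyFiveDaysBack", // hypothetical activity name
        "type": "Copy",
        // etc....
        "scheduler": {
            "frequency": "Day",
            "interval": 1,
            "style": "StartOfInterval",
            "offset": "-5.00:00:00" // must match the dataset availability offset
        }
    }
]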
Check out this Microsoft article for details on all the attributes you can use to configure more complex time slice scenarios.
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-create-pipelines
Hope this helps.

Related

Get Last Value of a Time Series with Azure Time Series Insights

How can I query the last (most recent) event, along with its timestamp, within a time series?
The approach described here does not work for me, as I cannot guarantee that the most recent event is within a fixed time window; in my case the event might have been received hours or days ago.
The LAST() function returns the last events, and according to the documentation the Get Series API should preserve the actual event timestamps, but I am confused by the results I get back from this API: I get multiple results (sometimes not even sorted by timestamp) and have to work out the latest value on my own.
I also noticed that the query result does not actually reflect the latest ingested value; it only appears in the result set if I ingest it multiple times.
Is there a more straightforward or reliable way to get the last value of a time series with Azure Time Series Insights?
The most reliable way to get the last known value, at the moment, is to use the AggregateSeries API.
You can use the last() aggregation in a variable calculating the last event property and the last timestamp property. You must provide a search span in the query, so you will still have to "guess" when the latest value could have occurred.
Some options: always use a larger search span than you expect to need (e.g. if a sensor sends data every day, a search span of a week is safe), or use the Availability API to get the time range and distribution of the entire data set across all TSIDs and use that as the search span. Keep in mind that large search spans will affect query performance.
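For reference, the Availability API is a plain GET against the environment; the response shape below is approximate, but the range it returns can be fed straight into the searchSpan:
// GET https://<environmentFqdn>/availability?api-version=2020-07-31
{
    "availability": {
        "intervalSize": "PT1H",
        "distribution": {
            // event counts per bucket
        },
        "range": {
            "from": "2020-02-01T00:00:00.000Z",
            "to": "2020-02-07T00:00:00.000Z"
        }
    }
}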
Here's an example of an LKV (last-known-value) query:
"aggregateSeries": {
"searchSpan": {
"from": "2020-02-01T00:00:00.000Z",
"to": "2020-02-07T00:00:00.000Z"
},
"timeSeriesId": [
"motionsensor"
],
"interval": "P30D",
"inlineVariables": {
"LastValue": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event['motion_detected'].Bool)"
}
},
"LastTimestamp": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event.$ts)"
}
}
},
"projectedVariables": [
"LastValue",
"LastTimestamp"
]
}
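Note that the interval above (P30D) is deliberately larger than the six-day search span, so the whole span collapses into a single bucket and last() returns exactly one value per projected variable. With an interval smaller than the span you would get one result per bucket and be back to picking out the latest one yourself.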

How to change WindowStart/WindowEnd in Azure Data Factory?

I have a problem with the Azure Data Factory pipeline time zone. I want to create a regularly scheduled copy job that transports data every 15 minutes from SQL Server to Azure SQL Data Warehouse.
The whole copy process runs, but there's a problem with WindowStart. The schedule (pipeline) uses UTC time but we are in Germany (UTC+1), so WindowStart is always 1 hour behind our local time.
For example, our local time is 16:00 and I need to update the data from 15:45 to 16:00, but the schedule will set WindowStart = 14:45 and WindowEnd = 15:00.
Does anyone have an idea?
There are two properties that allow you to shift the schedule (see Dataset Availability):
anchorDateTime: Defines the absolute position in time used by the scheduler to compute dataset slice boundaries.
offset: Timespan by which the start and end of all dataset slices are shifted.
You can play around with these two. I think they are supported in both dataset/properties/availability and pipeline/properties/activities/scheduler configuration sections.
It looks like a negative offset is supported, so you can try this:
"availability": {
"frequency": "Minute",
"interval": 15,
"offset": "-01:00:00"
}
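A note on choosing between the two: with "frequency": "Minute" and "interval": 15, anchorDateTime can only move the slice boundaries within the 15-minute grid (slices land at the anchor plus whole multiples of the interval), so for a whole-hour shift like this the offset property is the more natural choice.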

Azure Data Factory falling behind schedule silently

I have a number of activities running on ADF with some running every day, some hourly and one every 15 minutes.
I found how to set up alerts in ADF so that failing activities trigger an email. I have not, however, found a way to create more detailed custom alerts.
In this case, a task that runs every 15 minutes
"scheduler": {
"frequency": "Minute",
"interval": 15
}
was set to run one at a time:
"policy": {
"concurrency": 1
},
Unfortunately the activity became locked indefinitely for a couple of days, probably on a resource lock. This caused all subsequent time slices to stay in a pending state, waiting on concurrency. Since the initial activity slice did not fail, I got no alert and no warning.
Does anyone have an idea how to monitor failures that aren't failures in ADF, like a slice missing its schedule?
One way to do it is to turn your issues into failures.
You can add a timeout property to the activity's policy block:
"policy": {
"concurrency": 1,
"timeout":"00:15:00"
}
With this timeout, the activity execution and the related dataset slice will be marked as failed after 15 minutes.
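Since you already have email alerts wired up for failing activities, a slice that silently misses its schedule will now surface through the same channel once the timeout marks it as failed.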

Azure Data Factory: Configure Frequency

I am planning to use Azure Data Factory as a trigger for data lake analytics jobs in a project.
The data lake jobs will calculate key figures based on sensor input data that is processed by StreamAnalytics and stored in Data Lake.
These jobs should calculate the values every ~5 minutes.
According to Microsoft documentation it is not possible to configure intervals / frequencies smaller than 15 minutes.
Has anybody faced the same problem and found a solution, or is it better to use a different tool than Azure Data Factory in this scenario?
As you already noticed, the minimal configurable interval is 15 minutes. If you need smaller intervals you should look at streaming solutions and not Data Factory. Because there is not much real context in your question, I cannot suggest which service you should look at, but Azure Logic Apps may be a good candidate, as there you can have intervals down to 1 minute.
In ADF (v1) a frequency of less than 15 minutes isn't possible:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
For your use case you can take a look at Azure Stream Analytics, which is meant for streaming ingestion from IoT:
https://learn.microsoft.com/en-us/azure/stream-analytics/
You could do this with multiple copies of your pipeline with different names and different availability configurations in your output datasets. For example, 3 pipelines with 3 datasets set to these 3 availability configs would cover each 5-min interval:
"availability":
{
"frequency": "Minute",
"interval": 15,
"anchorDateTime":"2017-10-01T00:00:00"
}
"availability":
{
"frequency": "Minute",
"interval": 15,
"anchorDateTime":"2017-10-01T00:00:05"
}
"availability":
{
"frequency": "Minute",
"interval": 15,
"anchorDateTime":"2017-10-01T00:00:10"
}
Note you might need to implement some kind of synchronization lock if you don't want the executions to overlap.
I am using ADF v2, where you can have a frequency of less than 15 minutes; for example, I have a trigger that fires every minute.
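For illustration, a minimal sketch of such an ADF v2 schedule trigger (the trigger name, pipeline name, and start time are placeholders):
{
    "name": "EveryMinuteTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 1,
                "startTime": "2017-10-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "CalculateKeyFigures" // hypothetical pipeline name
                }
            }
        ]
    }
}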

Azure Data Factory - performing full IDL for the first slice

I'm working on a Data Factory POC to replace an existing data integration solution that loads data from one system to another. The existing solution extracts all data available up to the present point in time and then, on consecutive runs, extracts the new/updated data that changed since the last run. Basically IDL (initial data load) first, and then updates.
Data Factory works somewhat similarly and extracts data in slices. However, I need the first slice to include all the data from the beginning of time. I could say that the pipeline start time is "the beginning of time", but that would create too many slices.
For example, I want it to run daily and grab daily increments, but first I want to extract data for the last 10 years. I don't want 3,650 slices created just to catch up. I want the first slice to have its WindowStart parameter overridden and set to some predetermined point in the past, and then consecutive slices to use the normal WindowStart-WindowEnd time interval.
Is there a way to accomplish that?
Thanks!
How about creating two pipelines: one as a "run once" pipeline that transfers all the initial data, and then a clone of it, so you copy all the datasets and linked service references. Add the schedule to the clone, plus a SQL query that fetches only new data using the slice date variables. You'll need something like this in the second pipeline:
"source":
{
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('SELECT * FROM yourTable WHERE createdDate > \\'{0:yyyyMMdd-HH}\\'', SliceStart)"
},
"sink":
{
...
}
Hope that makes sense.
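For the first, run-once pipeline, a sketch of the corresponding source (the table name is a placeholder, as above); with no slice predicate it simply pulls everything available at the time it runs:
"source":
{
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT * FROM yourTable" // IDL: no WHERE clause, full extract
}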
