Azure Data Factory - performing full IDL for the first slice

I'm working on a Data Factory POC to replace an existing data integration solution that loads data from one system to another. The existing solution extracts all data available up to the present point in time, and then on consecutive runs extracts new/updated data that changed since the last run. Basically, an IDL (initial data load) first and then updates.
Data Factory works somewhat similarly and extracts data in slices. However, I need the first slice to include all the data from the beginning of time. I could say that the pipeline start time is "the beginning of time", but that would create too many slices.
For example, I want it to run daily and grab daily increments, but I want to extract data for the last 10 years first. I don't want 3,650 slices created just to catch up. I want the first slice to have its WindowStart parameter overridden and set to some predetermined point in the past, and then have consecutive slices use the normal WindowStart-WindowEnd time interval.
Is there a way to accomplish that?
Thanks!

How about you create two pipelines: one as a "run once" that transfers all the initial data, then clone it so you copy all the dataset and linked service references in the pipeline. Then add the schedule to the clone, plus a SQL query that fetches only new data using the date variables. You'll need something like this in the second pipeline:
"source":
{
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('SELECT * FROM yourTable WHERE createdDate > \\'{0:yyyyMMdd-HH}\\'', SliceStart)"
},
"sink":
{
...
}
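As an aside, if this is Data Factory v1, the "run once" pipeline doesn't need a schedule at all; as far as I remember the v1 schema, it can be marked as a one-time pipeline in its properties (do verify the property name against the current schema):
"properties": {
    "pipelineMode": "OneTime",
    ...
}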
Hope that makes sense.

Get Last Value of a Time Series with Azure Time Series Insights

How can I query the last (most recent) event along with its timestamp within a time series?
The approach described here does not work for me, as I cannot guarantee that the most recent event is within a fixed time window. The event might have been received hours or days ago in my case.
The LAST() function returns the last events, and the Get Series API should preserve the actual event timestamps according to the documentation, but I am a bit confused about the results I am getting back from this API. I get multiple results (sometimes not even sorted by timestamp) and have to find the latest value on my own.
I also noticed that the query result does not actually reflect the latest ingested value. The latest ingested value is only contained in the result set if I ingest it multiple times.
Is there any more straightforward or reliable way to get the last value of a time series with Azure Time Series Insights?
The most reliable way to get the last known value, at the moment, is to use the AggregateSeries API.
You can use the last() aggregation in inline variables that calculate the last event property and the last timestamp property. You must provide a search span in the query, so you will still have to "guess" when the latest value could have occurred.
Some options are to always use a larger search span than you may need (e.g. if a sensor sends data every day, you might use a search span of a week to be safe), or to use the Availability API to get the time range and distribution of the entire data set across all TSIDs and use that as the search span. Keep in mind that large search spans will affect query performance.
Here's an example of a last-known-value (LKV) query:
"aggregateSeries": {
"searchSpan": {
"from": "2020-02-01T00:00:00.000Z",
"to": "2020-02-07T00:00:00.000Z"
},
"timeSeriesId": [
"motionsensor"
],
"interval": "P30D",
"inlineVariables": {
"LastValue": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event['motion_detected'].Bool)"
}
},
"LastTimestamp": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event.$ts)"
}
}
},
"projectedVariables": [
"LastValue",
"LastTimestamp"
]
}
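If it helps, here is a sketch of submitting that query over REST with Java's built-in HTTP client (requires Java 15+ for text blocks). The endpoint shape, api-version, environment FQDN, and token handling below are my assumptions based on the Gen2 Query API, so verify them against your environment:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TsiLastValueQuery {
    public static void main(String[] args) throws Exception {
        // Assumptions: your TSI Gen2 environment FQDN and an AAD bearer
        // token acquired elsewhere; the api-version may differ for you.
        String fqdn = "your-env-id.env.timeseries.azure.com";
        String token = System.getenv("TSI_BEARER_TOKEN");

        // The aggregateSeries payload from the answer above, as the request body.
        String body = """
                {
                  "aggregateSeries": {
                    "searchSpan": {
                      "from": "2020-02-01T00:00:00.000Z",
                      "to": "2020-02-07T00:00:00.000Z"
                    },
                    "timeSeriesId": ["motionsensor"],
                    "interval": "P30D",
                    "inlineVariables": {
                      "LastValue": {
                        "kind": "aggregate",
                        "aggregation": { "tsx": "last($event['motion_detected'].Bool)" }
                      },
                      "LastTimestamp": {
                        "kind": "aggregate",
                        "aggregation": { "tsx": "last($event.$ts)" }
                      }
                    },
                    "projectedVariables": ["LastValue", "LastTimestamp"]
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://" + fqdn + "/timeseries/query?api-version=2020-07-31"))
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Print the raw JSON response; the projected variables hold the LKV.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}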

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries, but that is more data to pull from the server and more time and memory to plot. When looking at larger trends, the small movements are noise anyway, so I would be better off retrieving the same number of documents (720) but only every 3rd, still covering 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document where:
index % n != 0
This at least keeps the rest of the client from dealing with all the data, but it seems extremely inefficient and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo for your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in collections that each hold no more than weekly or monthly data. A new month/week means storing the data in a different collection. That way you won't be doing a full collection scan and won't be fetching every single document, as you mentioned in your problem. Your application code decides which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more of a general-purpose database. Time-series data (storing something every 5 seconds) can be handled well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0", and in the background run a weekly scheduler that moves the data from "week0" to history collections "week1", "week2", and so on. The fragmentation logic depends on your requirements.
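To make "the application code decides which collection to query" concrete, here is a minimal sketch using the official MongoDB Java driver; the readings_<year>_w<week> naming scheme and the database name are my own choices, not something prescribed:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.time.LocalDate;
import java.time.temporal.IsoFields;

public class WeeklyCollectionRouter {
    // Derives the collection name for a given date, e.g. "readings_2014_w17".
    static String collectionFor(LocalDate date) {
        return String.format("readings_%d_w%02d",
                date.get(IsoFields.WEEK_BASED_YEAR),
                date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR));
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Writes always go to the current week's collection; reads pick
            // the collection(s) covering the requested time range the same way.
            MongoCollection<Document> current = client.getDatabase("telemetry")
                    .getCollection(collectionFor(LocalDate.now()));
            current.insertOne(new Document("temperature", 21.5)
                    .append("epochSeconds", System.currentTimeMillis() / 1000));
        }
    }
}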
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
    {
        $bucketAuto: {
            groupBy: "$_id", // here you'll put the field you need, in your example 'temperature'
            buckets: 5 // this is the number of documents you want back, so for a sample of 500 documents, put 500 here
        }
    }
])
Each document in the result for the above query would be something like this,
"_id": {
"max": 3,
"min": 1
},
"count": 2
If you had grouped by temperature, then each document would have the minimum and maximum temperature found in that bucket.
You might have another problem. The docs state not to rely on natural ordering:
"This ordering is an internal implementation feature, and you should not rely on any particular structure within it."
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
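As a rough illustration of that idea, here is a sketch with the MongoDB Java driver; the epochSeconds field, the database/collection names, and the assumption that writes land exactly on the 5-second grid are all mine:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.Arrays;

public class EveryNthReading {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> readings =
                    client.getDatabase("telemetry").getCollection("readings");

            // With one reading every 5 seconds, keeping only documents whose
            // epochSeconds is divisible by 15 returns every 3rd reading.
            Document everyThird = new Document("$eq", Arrays.asList(
                    new Document("$mod", Arrays.asList("$epochSeconds", 15L)), 0L));

            // Newest first, capped at 720 documents (~3 hours at this stride).
            for (Document doc : readings.find(Filters.expr(everyThird))
                    .sort(Sorts.descending("epochSeconds"))
                    .limit(720)) {
                System.out.println(doc.toJson());
            }
        }
    }
}
One caveat: $expr with $mod cannot use an index, so on very large collections this still scans; it just moves the filtering to the server instead of the client.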

Azure Data Factory: Process previous date data

I have been trying to find a way to dynamically set the start and end properties of the pipeline. The reason for this is to process a time series of files 5 days before the current day of the pipeline execution.
I try to set this in the pipeline JSON:
"start": "Date.AddDays(SliceStart, -5)"
"start": "Date.AddDays(SliceEnd, -5)"
and when publishing through VS2015, I get the error below:
Unable to convert 'Date.AddDays(SliceEnd, -5)' to a DateTime value. Please use ISO8601 DateTime format such as "2014-10-01T13:00:00Z" for UTC time, or "2014-10-01T05:00:00-8:00" for Pacific Standard Time. If the timezone designator is omitted, the system will consider it denotes UTC time by default. Hence, "2014-10-01" will be converted to "2014-10-01T00:00:00Z" automatically. (code: InputIsMalformedDetailed)
What could be the other ways to do this?
Rather than trying to set this dynamically at the pipeline level (which won't work), you need to deal with it when you provision the time slices against the dataset and in the activity.
Use the JSON attribute called offset within the availability block for the dataset and within the scheduler block for the activity.
This takes the time slice start value configured by the interval and frequency and offsets it by the given value in days/hours/minutes etc.
For example (in the dataset):
// etc....
},
"availability": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval",
"offset": "-5.00:00:00" //minus 5 days
}
//etc....
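And correspondingly in the activity (same attributes in the scheduler block, to the best of my recollection of the v1 schema; check the article linked below):
"scheduler": {
    "frequency": "Day",
    "interval": 1,
    "style": "StartOfInterval",
    "offset": "-5.00:00:00"
}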
You'll need to configure this in both places, otherwise the activity will fail validation at deployment time.
Check out this Microsoft article for details on all the attributes you can use to configure more complex time slice scenarios.
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-create-pipelines
Hope this helps.

Azure Data Factory Data-Set Slicing

I have some trouble understanding slicing (dataset availability) in Azure Data Factory. Let's say I have a source dataset which never changes. Then, for some reason, I set up hourly slicing for my source dataset. Will each slice then be identical? What is the point of using slices at all in such a case (i.e. why is it required)?
Or another case: let's say my source dataset is appended with new data continuously (for example an event log), and each morning I want to run some analysis on the full history of that log. Should I then set up daily slicing? Will each slice include the full history or just the last day?
The slices are the intervals in which the pipeline is executed within the period defined by the start and end properties of the pipeline.
If you have a fixed source and you execute an activity more than once, it will always use the same source (because it does not change). Let's say you set the start time and end time to span a day and set the frequency to 1 hour - the activity will be executed 24 times. You will have 24 slices, all using the same data source.
For your second scenario, if the data keeps changing, you can set the frequency to once a day. What gets processed depends on the activity you define in the pipeline - say the pipeline deletes the old source once it finishes processing, or there's logic in the activity that takes only the new data.

Aggregator that releases partial group based on correlation but holds on to rest of the messages

I want to set the correlation strategy on an aggregator so that it uses a date from the incoming file (message) name to correlate files, so that all files with today's date belong to the same group. Since I might have multiple days' worth of data, it's possible that I have aggregated two days of files. I want to base the release strategy on a "done" file (message) that includes the date in its filename as well, so essentially each day will have a bunch of files plus a done file. Ingesting a done file should release that day's files from the aggregator but still keep the other days' files until their own done file is ingested.
So in this scenario correlation is obviously simple - but what I am not sure about is how to release not all but only some specific messages from the group based on the correlation key. The documentation talks about the message reaper, but that goes into message store territory and I want to do all of this in memory.
Let me elaborate with an example.
I have these files in a directory which I'm polling with a file inbound channel adapter:
file-1-2014.04.27.dat
file-2-2014.04.27.dat
file-3-2014.04.27.dat
done-2014.04.27.dat
file-1-2014.04.28.dat
file-2-2014.04.28.dat
done-2014.04.28.dat
As these files are polled in, I have an aggregator in the flow where all incoming files are aggregated. To correlate them, I was thinking I could extract the date and put it in the correlation_id header, so that the first three files are considered to belong to one group and the next two files to a second group. Now, once I consume the done-2014.04.27.dat file, I want to release the first three files for further processing in the flow but hold on to
file-1-2014.04.28.dat
file-2-2014.04.28.dat
until I receive the
done-2014.04.28.dat
and then release these 2 files.
Any help would be appreciated.
Thanks
I am not sure what you mean when you say "correlation is simple" but then go on to say you only want to release part of the group. If the files have different dates, they will be in different groups, so there's no need to release part of a group; just release the whole group by running the reaper shortly after midnight (or any time the next day). It's not at all clear why you need a "done" message.
By default, the aggregator uses an in-memory message store (SimpleMessageStore).
EDIT:
Just put the done file in the same group and have your release strategy detect the presence of the done file. You could use an expression, but if the group can be large, it would be more efficient to implement ReleaseStrategy and iterate over MessageGroup.getMessages() looking for the done file.
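For example, here is a minimal sketch of such a release strategy; the class name and the done- filename prefix are mine, taken from the example filenames above:
import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.store.MessageGroup;
import org.springframework.messaging.Message;

import java.io.File;

public class DoneFileReleaseStrategy implements ReleaseStrategy {

    @Override
    public boolean canRelease(MessageGroup group) {
        // Release the whole group as soon as its done marker has arrived.
        for (Message<?> message : group.getMessages()) {
            Object payload = message.getPayload();
            if (payload instanceof File && ((File) payload).getName().startsWith("done-")) {
                return true;
            }
        }
        return false;
    }
}
You'd reference this bean from the aggregator's release-strategy attribute.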
The next step depends on what's downstream of the aggregator. If you use a splitter to split them back to separate files, you can simply add a filter to drop the done file. If you deal with the collection of files directly, either ignore the done file, or add a transformer to remove it from the collection.
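For the splitter route, the filter could be as small as this (the channel names here are hypothetical):
import org.springframework.integration.annotation.Filter;
import org.springframework.stereotype.Component;

import java.io.File;

@Component
public class DoneFileFilter {

    @Filter(inputChannel = "splitFiles", outputChannel = "processFiles")
    public boolean accept(File file) {
        // Drop the done marker; pass the real data files along.
        return !file.getName().startsWith("done-");
    }
}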
With respect to the reaper: assuming files arrive in real time, I was simply suggesting that if you run the reaper once a day (say at 01:00) with a group timeout of, say, 30 minutes, it will release yesterday's files (without the need for a done file).
EDIT:
See my comment on your "answer" below - you have 2 subscribers on filesLogger.