Get Last Value of a Time Series with Azure Time Series Insights

How can I query the last (most recent) event, along with its timestamp, within a time series?
The approach described here does not work for me, as I cannot guarantee that the most recent event falls within a fixed time window. In my case the event might have been received hours or days ago.
The LAST() function returns the last events, and according to the documentation the Get Series API should preserve the actual event timestamps, but I am confused by the results I get back from this API: I receive multiple results (sometimes not even sorted by timestamp) and have to work out the latest value on my own.
I also noticed that the query result does not actually reflect the latest ingested value; it only appears in the result set if I ingest that value multiple times.
Is there a more straightforward or reliable way to get the last value of a time series with Azure Time Series Insights?

The most reliable way to get the last known value, at the moment, is to use the AggregateSeries API.
You can use the last() aggregation in inline variables that calculate both the last event property and the last timestamp. You must provide a search span in the query, so you will still have to "guess" when the latest value could have occurred.
Some options: always use a larger search span than you expect to need (e.g. if a sensor sends data every day, a search span of a week is safe), or use the Availability API to get the time range and distribution of the entire data set across all time series IDs and use that as the search span (the Node.js sketch after the example query below shows this). Keep in mind that large search spans will affect query performance.
Here's an example of an LKV query:
"aggregateSeries": {
"searchSpan": {
"from": "2020-02-01T00:00:00.000Z",
"to": "2020-02-07T00:00:00.000Z"
},
"timeSeriesId": [
"motionsensor"
],
"interval": "P30D",
"inlineVariables": {
"LastValue": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event['motion_detected'].Bool)"
}
},
"LastTimestamp": {
"kind": "aggregate",
"aggregation": {
"tsx": "last($event.$ts)"
}
}
},
"projectedVariables": [
"LastValue",
"LastTimestamp"
]
}
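For completeness, here is a minimal Node.js sketch (Node 18+, built-in fetch) of how such a query might be sent, using the Availability API response as the search span so you don't have to guess the range. The environment FQDN, the api-version value, and the way the AAD bearer token is obtained are assumptions on my part; check them against your own environment.

const fqdn = process.env.TSI_ENVIRONMENT_FQDN; // e.g. "<environment-id>.env.timeseries.azure.com" (assumed env var)
const token = process.env.TSI_BEARER_TOKEN;    // AAD token for the TSI data plane (assumed env var)

async function getLastKnownValue() {
    const headers = { Authorization: `Bearer ${token}`, "Content-Type": "application/json" };

    // Ask the Availability API for the time range of the whole data set.
    const availRes = await fetch(`https://${fqdn}/availability?api-version=2020-07-31`, { headers });
    const { availability } = await availRes.json();

    const body = {
        aggregateSeries: {
            searchSpan: availability.range, // { from, to } covering all ingested events
            timeSeriesId: ["motionsensor"],
            interval: "P30D", // widen this if you want the whole span to land in fewer buckets
            inlineVariables: {
                LastValue: { kind: "aggregate", aggregation: { tsx: "last($event['motion_detected'].Bool)" } },
                LastTimestamp: { kind: "aggregate", aggregation: { tsx: "last($event.$ts)" } }
            },
            projectedVariables: ["LastValue", "LastTimestamp"]
        }
    };

    // POST the query to the Time Series Insights Query API.
    const res = await fetch(`https://${fqdn}/timeseries/query?api-version=2020-07-31`, {
        method: "POST",
        headers,
        body: JSON.stringify(body)
    });
    return res.json();
}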

Related

What is the most efficient way of frequently getting the last tweets from 1000+ accounts using Twitter API?

I have a list of approximately 1,500 Twitter accounts (that may or may not have tweeted) for which I want to retrieve the last (max 100) tweets every ~20 minutes. Considering the rate limits of Twitter API v2, what is the most efficient way of doing this without hitting the rate limits (https://developer.twitter.com/en/docs/twitter-api/rate-limits)?
As far as I understand, there is no way to get tweets from multiple users in a single request using https://api.twitter.com/2/users/<twitter id>/tweets, and iterating through the 1,500 accounts to get their latest tweets will make me hit the rate limit of ~900 requests per 15 minutes.
Is there a bulk request that can do this? Is adding them all to a Twitter list and getting the latest tweets from there the only real option here?
I need this for a Node.js application, but the issue is really about how to solve it at the Twitter API level.
The Twitter search API is available at /2/tweets/search/all (full-archive search). You can also use /2/tweets/search/recent, which covers the last 7 days.
Using either of these, you can search for tweets from multiple accounts at once with the OR operator:
(from:twitter OR from:elonmusk)
Returns:
{
    "data": [
        {
            "id": "1540059169771978754",
            "text": "we would know"
        },
        {
            "id": "1540058653155278849",
            "text": "ratios build character"
        },
        {
            "id": "1539759270501023744",
            "text": "RT @NASA: The landmark law #TitleIX opened up a universe of possibility for women, including Janet Petro, the 1st woman director of @NASAKe…"
        }
        // ...
    ]
}
Note that this has a stricter rate limit, and there is a limit on how many characters you can use in your search query (probably 512).
You can add extra fields such as author_id via tweet.fields if you need them.
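If it helps, here is a rough Node.js sketch (Node 18+, built-in fetch) of splitting the 1,500 accounts into OR groups that stay under the query-length limit and calling the recent search endpoint for each group. The 512-character limit, the env-var name, and the chunking helper are my assumptions, not something prescribed by the API.

const BEARER = process.env.TWITTER_BEARER_TOKEN; // assumed env var
const MAX_QUERY_LENGTH = 512; // adjust to your access level's actual limit

// Split usernames into "(from:a OR from:b OR ...)" groups that fit the limit.
function buildQueries(usernames) {
    const queries = [];
    let current = [];
    for (const name of usernames) {
        const candidate = `(${[...current, `from:${name}`].join(" OR ")})`;
        if (candidate.length > MAX_QUERY_LENGTH && current.length > 0) {
            queries.push(`(${current.join(" OR ")})`);
            current = [];
        }
        current.push(`from:${name}`);
    }
    if (current.length > 0) queries.push(`(${current.join(" OR ")})`);
    return queries;
}

async function fetchLatestTweets(usernames) {
    const tweets = [];
    for (const query of buildQueries(usernames)) {
        const url = new URL("https://api.twitter.com/2/tweets/search/recent");
        url.searchParams.set("query", query);
        url.searchParams.set("max_results", "100");
        url.searchParams.set("tweet.fields", "author_id,created_at");
        const res = await fetch(url, { headers: { Authorization: `Bearer ${BEARER}` } });
        const body = await res.json();
        if (body.data) tweets.push(...body.data);
    }
    return tweets;
}

With roughly 25 from: clauses per query this works out to around 60 requests per 20-minute cycle, which should sit well under the recent-search limits, but verify against your own access tier.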
If you cannot get by with this, then you may be able to combine API endpoints, since rate limits are applied per endpoint. For example, fetch half of the accounts via the search endpoint and the other half via the individual user timeline endpoints.
If this still doesn't work, then you're right (from everything that I've found): you will need to either
Increase your cache time from 20 minutes to something more like 30-45 minutes
Create a list

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2,160 entries, but that takes more time to pull from the server and more time and memory to plot. When looking at larger trends the small movements are noise, so I would be better off retrieving the same number of documents (720) but only every 3rd one, still covering 3 hours with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from processing all the data, but it seems extremely inefficient and I would rather have the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection that houses no more than a week's or a month's worth of data. A new month/week means storing your data in a different collection. That way you won't be doing a full collection scan and won't be fetching every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more of a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternative fragmentation (update):
Always write your current data to collection "week0" and run a weekly scheduler in the background that moves the data from "week0" into history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
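A rough Node.js sketch of that rotation follows; the collection names, the retention count, and the helper itself are illustrative assumptions, and you would run it from whatever scheduler you already have.

async function rotateWeeklyCollections(db, keepWeeks = 4) {
    // Drop the oldest history collection, if it exists.
    await db.collection(`week${keepWeeks}`).drop().catch(() => {});
    // Shift week3 -> week4, week2 -> week3, ..., week0 -> week1.
    for (let i = keepWeeks - 1; i >= 0; i--) {
        const exists = await db.listCollections({ name: `week${i}` }).toArray();
        if (exists.length > 0) {
            await db.collection(`week${i}`).rename(`week${i + 1}`);
        }
    }
    // New writes land in a freshly created "week0".
    await db.createCollection("week0");
}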
I think the $bucketAuto stage might help you with this.
You can do something like:
db.collection.aggregate([
    {
        $bucketAuto: {
            groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
            buckets: 5 // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
        }
    }
])
Each document in the result of the above query will look something like this:
{
    "_id": {
        "max": 3,
        "min": 1
    },
    "count": 2
}
If you had grouped by temperature, then each document would contain the minimum and maximum temperature found in that bucket.
You might have another problem. The docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
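A sketch of what that could look like with the Node driver, assuming each document stores its insertion time as epoch seconds in a field called ts (the field name and the 5-second cadence are assumptions based on the question):

const n = 3;          // keep every nth sample (every 3rd here -> ~3 hours in 720 documents)
const sampleSecs = 5; // seconds between samples

const docs = await db.collection(collection)
    .find({
        // Keep documents whose sample index is divisible by n; the filtering
        // happens server side, so only 1/n of the documents cross the wire.
        $expr: { $eq: [{ $mod: [{ $floor: { $divide: ["$ts", sampleSecs] } }, n] }, 0] }
    })
    .sort({ ts: -1 }) // an index on ts keeps the sort cheap
    .limit(720)
    .toArray();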

Azure Data Factory: Process previous date data

I have been trying to find a way to dynamically set the start and end properties of the pipeline. The reason for this is to process a time series of files from 5 days before the current day of the pipeline execution.
I tried to set this in the pipeline JSON:
"start": "Date.AddDays(SliceStart, -5)"
"end": "Date.AddDays(SliceEnd, -5)"
and when publishing through VS2015, I get the error below:
Unable to convert 'Date.AddDays(SliceEnd, -5)' to a DateTime value. Please use ISO8601 DateTime format such as "2014-10-01T13:00:00Z" for UTC time, or "2014-10-01T05:00:00-8:00" for Pacific Standard Time. If the timezone designator is omitted, the system will consider it denotes UTC time by default. Hence, "2014-10-01" will be converted to "2014-10-01T00:00:00Z" automatically. (error code: InputIsMalformedDetailed)
What other ways are there to do this?
Rather than trying to set this dynamically at the pipeline level, which won't work, you need to deal with it when you provision the time slices against the dataset and in the activity.
Use the JSON attribute called Offset within the availability block for the dataset and within the scheduler block for the activity.
This takes the time slice start value configured by the interval and frequency and offsets it by the given value in days/hours/minutes etc.
For example (in the dataset):
// etc....
},
"availability": {
    "frequency": "Day",
    "interval": 1,
    "style": "StartOfInterval",
    "offset": "-5.00:00:00" // minus 5 days
}
// etc....
You'll need to configure this in both places, otherwise the activity will fail validation at deployment time.
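For example, the matching block on the activity might look like this (a sketch only; the values simply mirror the dataset's availability block above):

"scheduler": {
    "frequency": "Day",
    "interval": 1,
    "style": "StartOfInterval",
    "offset": "-5.00:00:00"
}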
Check out this Microsoft article for details on all the attributes you can use to configure more complex time slice scenarios.
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-create-pipelines
Hope this helps.

azure data factory - performing full IDL for the first slice

I'm working on a Data Factory POC to replace an existing data integration solution that loads data from one system to another. The existing solution extracts all data available up to the present point in time, and then on subsequent runs extracts only the new/updated data that has changed since the last run. Basically an IDL (initial data load) first and then increments.
Data Factory works in a somewhat similar way and extracts data in slices. However, I need the first slice to include all the data from the beginning of time. I could set the pipeline start time to "the beginning of time", but that would create too many slices.
For example, I want it to run daily and grab daily increments, but I want to extract the last 10 years of data first. I don't want 3,650 slices created to catch up. I want the first slice to have its WindowStart parameter overridden and set to some predetermined point in the past, and then subsequent slices to use the normal WindowStart-WindowEnd interval.
Is there a way to accomplish that?
Thanks!
How about creating two pipelines: one as a "run once" that transfers all the initial data, and then a clone of it (so you copy all the dataset and linked service references) with a schedule added and a SQL query that uses the slice date variables to fetch only the new data. You'll need something like this in the second pipeline:
"source":
{
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('SELECT * FROM yourTable WHERE createdDate > \\'{0:yyyyMMdd-HH}\\'', SliceStart)"
},
"sink":
{
...
}
Hope that makes sense.

Aggregating data with CouchDB reduce function

I have a process which posts documents similar to the one below to CouchDB:
{
    "timestamp": [2010, 8, 4, 9, 25, 24],
    "type": "quote",
    "bid": 95.0,
    "offer": 96.5
}
Many such documents are posted over the course of a day, each timestamped appropriately.
I want to create a CouchDB view which returns the last quote stored every day.
I've been reading View Cookbook for SQL Jockeys on how to create complex views but I have trouble seeing how to combine map and reduce functions to achieve the desired result. The map function is easy; it's the reduce function I'm having trouble with.
Any pointers gratefully received.
Create a map function that emits all documents for a given time period under the same key. For example, emit all documents in the 17th hour of the day with key 17.
Create a reduce function that returns only the latest bid for that hour. Your view will then return 24 rows, and your client-side code will do the final merge.
There are many ways to accomplish this. You could retrieve a single latest bid by emitting a single key from your map function and reducing over all bids, but I'm not sure how that would perform for extremely large sets, such as those you'd encounter with a bidding system.
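For illustration, here is one possible shape of that map/reduce pair. It is a sketch only: it keys each quote by [year, month, day, hour] and keeps just the fields needed so the reduce output stays small; the field names come from the example document above.

// Map: key each quote by its calendar day and hour.
function (doc) {
    if (doc.type === "quote") {
        emit(doc.timestamp.slice(0, 4), {   // key: [year, month, day, hour]
            timestamp: doc.timestamp,
            bid: doc.bid,
            offer: doc.offer
        });
    }
}

// Reduce: keep the value with the latest timestamp in each group.
// Element-wise comparison of the [y, m, d, h, min, s] arrays avoids string-coercion pitfalls,
// and the output has the same shape as the input, so rereduce works unchanged.
function (keys, values, rereduce) {
    var latest = values[0];
    for (var i = 1; i < values.length; i++) {
        var a = values[i].timestamp, b = latest.timestamp;
        for (var j = 0; j < a.length; j++) {
            if (a[j] !== b[j]) {
                if (a[j] > b[j]) latest = values[i];
                break;
            }
        }
    }
    return latest;
}

Querying the view with group_level=4 gives the per-hour rows described above (24 per day, merged client side), while group_level=3 rolls them up to one row per day directly.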
Update
http://wiki.apache.org/couchdb/View_Snippets#Computing_simple_summary_statistics_.28min.2Cmax.2Cmean.2Cstandard_deviation.29
