Allowing a subset of devices' data through - Azure

I have a dynamic fleet of devices self-registering with IoT Hub and feeding data into Azure Stream Analytics - each device has a uniquely generated ID. I would like to be able to randomly pick 10 of them and output this filtered dataset to Power BI for visualisation purposes. I'm using streaming datasets.
How do I go about constructing this subset...? WHERE deviceId LIKE isn't the right approach since the device ID is uniquely generated.
Thanks!

The easiest thing would be to keep the list of devices you want to output as reference data somewhere, and use Stream Analytics to augment the stream with it.
You could then flag the events that match the reference set and send them to a second Stream Analytics output with a WHERE clause on that flag.
What benefit will this activity have though? Maybe something like an average of all devices would be better? I don't know what the business driver is :-)
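A minimal sketch of that reference-data approach, assuming a streaming input named DeviceStream, a reference data input named AllowedDevices that holds the ten chosen device IDs in a deviceId column, and a Power BI output named PowerBIOutput (all names, and the sensorValue/eventTime fields, are illustrative):

-- only events whose deviceId appears in the reference list reach Power BI
SELECT s.deviceId, s.sensorValue, s.eventTime
INTO PowerBIOutput
FROM DeviceStream s
JOIN AllowedDevices a          -- reference data join, no time window needed
  ON s.deviceId = a.deviceId

Swapping the ten devices then only requires updating the reference data blob, not the query.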

RAND is not directly supported in the ASA query language, but it can be implemented with a JavaScript UDF (user-defined function).
However, we don't recommend using a random generator in ASA, since it affects repeatability in recovery scenarios.
Anthony's suggestion to use reference data or an aggregate function may be the best option.
Thanks!
JS (Azure Stream Analytics team)

Related

How do I find the right data design and the right tools/database/query for the requirement below

I have a requirement but am not able to figure out how to solve it. I have datasets in the format below:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, like:
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123 and grade=A
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
I also need to do further computation on top of this. What format should I choose for the dataset,
and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elastic; if you have any ideas, please give a brief explanation. Or any other tool - Spark, Hadoop, Cassandra - any pointers?
I am trying things out, but any help is appreciated.
Choosing the right technology depends heavily on your SLA: how much latency can your queries tolerate? What types of queries do you need? Does your data qualify as big data? Is the data updatable? Do you expect late events? Do you need the historical raw data later, or can you use techniques like rollup? To clarify my answer: window functions can probably solve your problems. For example, you can store your data in any of the tools you mentioned and query it with the Presto SQL engine to get the desired results, but not all of those tools are optimal. Furthermore, these kinds of problems usually cannot be solved with a single tool; a set of tools may be needed to cover all the requirements.
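For instance, the window-function query I have in mind looks roughly like this, written as standard SQL against a hypothetical events(id, atime, grade) table (exact timestamp arithmetic varies by engine: Postgres allows direct subtraction, Presto/Trino would use date_diff instead):

-- time difference between the last two rows where id = 123
SELECT id,
       atime,
       atime - LAG(atime) OVER (PARTITION BY id ORDER BY atime) AS gap_to_previous
FROM events
WHERE id = 123        -- add AND grade = 'A' for the "id=123 and grade=A" variant
ORDER BY atime DESC
LIMIT 1;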
tl;dr: the text below does not arrive at a definitive solution; it introduces a way to think about data modeling and choosing tools.
Let me try to model the problem as if we had to pick a single tool. I assume your data is not updatable, you need low-latency responses, we don't expect any late events, and we face a large-volume data stream that must be kept as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by ID. The first, second, and last requirements all rely on ID as the identifier, and there seems to be no need for joins or for global ordering on other fields such as time. So we can choose ID as the partition key (physical or logical) and atime as the clustering part; for each ID, events are ordered by time.
The third requirement is a bit vague: do you want the result over all data, or per ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on these notes, it seems we should choose a tool with good support for random access queries. Cassandra, Postgres, Druid, MongoDB, and Elasticsearch are the tools that come to mind. Let's check them:
Cassandra: great response times on random access queries, handles huge amounts of data easily, and has no single point of failure. But sadly it does not support window functions, and you have to design your data model very carefully, so it does not seem like a good choice here (given the future need for raw data). We could work around some of these limitations by using Spark alongside Cassandra, but for now we would prefer not to add another tool to the stack.
Postgres: great on random access queries over indexed columns, and it supports window functions. We can shard data (horizontal partitioning) across multiple servers, and by choosing ID as the shard key we get data locality for the computations. But there is a problem: ID is not unique, so we cannot use ID alone as the primary key, and we run into some trouble with random access. We can use ID plus atime (as a timestamp column) as a compound primary key, but that does not fully save us; a sketch of this modeling follows after this list.
Druid: a great OLAP tool. Because of the way Druid stores data (segment files), with the right data model you can run analytic queries over huge volumes of data in sub-second time. It does not support window functions, but with rollup and some other functions (like EARLIEST) we could answer the questions. However, rollup discards the raw data, which we need.
MongoDB: supports random access queries and sharding. Its aggregation framework gives us a kind of window function by letting us define pipelines for aggregations. It also supports capped collections, which we could use to store the last 10 events per ID if the cardinality of the ID column is not too high. It seems this tool can cover all of the requirements.
Elasticsearch: great on random access, maybe the greatest. With certain filter aggregations we can get a kind of window function, and it handles large amounts of data through sharding. But its query language is hard: I can imagine answering the first and second questions with ES, but I can't form the query in my head right now; finding the right solution with it would take time.
So it seems MongoDB and Elasticsearch could meet the requirements, but there are a lot of 'if's along the way. I don't think there is a straightforward solution with a single tool; maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
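A rough sketch of the Postgres modeling mentioned above, with table and column names that are only illustrative:

-- id alone is not unique, so (id, atime) serves as a compound primary key
CREATE TABLE events (
    id    bigint      NOT NULL,
    atime timestamptz NOT NULL,
    grade text,
    PRIMARY KEY (id, atime)
);

-- last 10 records for a particular id (the fourth requirement)
SELECT id, atime, grade
FROM events
WHERE id = 123
ORDER BY atime DESC
LIMIT 10;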

Using a single time series per device containing multiple sensors in Azure Time Series Insights

We are using Azure Time Series Insights to store time series from various sensors on our devices.
In order to facilitate querying multiple related sensors at once, I use only the deviceId (from IoT Hub) as the Time Series ID.
This works for my backend queries, and I can use the filter expression to filter by sensor ID.
The only downside I discovered is that I lose TSI Explorer support, because the value property of a time series then contains values from multiple sensors, and displaying it as a single graph makes no sense.
I thought maybe I could use the TSI model to create fields that filter by sensor ID, so I would get the same experience in TSI Explorer as if I used one time series per device and sensor, but I didn't figure out how to do so.
So my questions are:
Does the approach of merging multiple sensors of a device into a single time series have any downsides (besides the apparently lost TSI Explorer support)?
Is there a way to model the time series to get TSI Explorer support back?
#Markus, I am sure adding a variable with a filter expression like "$event.sensorId.String = ''" will allow plotting the time series in TSI Explorer. However, this would have serious performance implications, especially when you have a large number of time series instances.
The preferred method is to model each sensor as its own time series (using a composite TSID) and build a hierarchy model to organize/contextualize the data, so that sensors are listed under their device as the parent in TSI Explorer.
From the TSI team, we would like to better understand your use case. If I may suggest something, please raise a support ticket in the Azure portal with the necessary contact details; we can then get in touch to understand your use case and better assist you in reaching your objective.

Creating instances and types in Azure Time Series Insights

I am new to Time Series Insights. Imagine we have a smart meter which transmits different types of telemetry data, such as consumption, indoor temperature, etc. The structure of the messages is the same, but a type property distinguishes the different telemetry types.
How can I create TSI instances and types?
One type per telemetry type? If so, how do I distinguish them from each other?
I'd appreciate it if anyone experienced could walk me through the process.
I am not sure how last-first would give you the consumption! I would be interested to know more.
However, you could try modeling an aggregate variable in TSI.
Example here.
Here is the corresponding documentation.
https://learn.microsoft.com/en-us/azure/time-series-insights/concepts-variables
#Mori, thank you for choosing Time Series Insights. By types I assume you are referring to https://learn.microsoft.com/en-us/azure/time-series-insights/concepts-model-overview#time-series-model-types. These types can't be created automatically based on a type field sent in the JSON payload; typically customers create them using the TSI Explorer, or use our API to create them with scripts. I am happy to walk you through the type creation and assignment process. Please let me know a good time to connect.

Solution for a large amount of time series data with many attributes

I am working on a news analytics project: we retrieve events from real-time news streams and compute sentiment for certain financial instruments. Currently we generate only one sentiment time series per instrument, aggregated from over 100 types of events and many news websites. We use Postgres to store the structured data, and we pre-calculate/aggregate sentiment and store it in Influx to support real-time streaming on the frontend.
We are thinking of expanding the functionality so that users can select the in-scope event types and news sources, meaning every user can have a different sentiment stream. A user should also be able to break sentiment down further to only a certain event type or source. The ideal solution would let a user define the scope and receive aggregated sentiment on the fly.
I can hardly imagine the aggregation being done completely on the fly without any pre-calculation. On the other hand, the most atomic time series is per event type per news source, but that way we would need to maintain (100 event types * 100 news sources * 1000 instruments) 10 million series, and adding more news sources would make the system impossible to maintain.
Can someone please share some thoughts on what architecture or technical solution might support these requirements?
If all event types and sources share the same instruments, then you can create one stream and make event type and source attributes of the stream (series). You can then filter the stream by attributes as needed.
However, if different sources have different instruments and event types, then you can have one stream (time series) per instrument and add source and event type as attributes to each stream, so you can still filter by attributes.
In general, try to reduce the number of streams and encode that information as attributes.
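As a plain-SQL illustration of that idea (table and column names are purely hypothetical; the same shape applies to tag-based filtering in a time series store such as Influx), keep one atomic event stream with event_type and source as attributes and turn each user's scope into a filter that is aggregated on the fly:

-- per-minute sentiment for one instrument, restricted to a user-selected scope
SELECT date_trunc('minute', event_time) AS bucket,
       avg(sentiment)                   AS avg_sentiment
FROM sentiment_events
WHERE instrument = 'AAPL'
  AND event_type IN ('earnings', 'guidance')   -- user-selected event types
  AND source     IN ('reuters', 'bloomberg')   -- user-selected news sources
GROUP BY date_trunc('minute', event_time)
ORDER BY bucket;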

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. You then write a simple integration piece to go between the channel and the data source, and another between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so customizing the integration pieces was a lot less work than it sounds. The application could then query the channel, which would handle accessing the various data sources, transform the results into normalized XML, and return them to the application.
This had a lot of advantages for our situation. We could include new data sources in existing queries simply by connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data coming from the channel. Since data can be pushed to or pulled from the channel, a data source could also update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search would then work against this index, and the search results could contain a unique identifier plus the system each record came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that sits in front of all the different systems, so you can provide a single interface for querying. If you do that, this is where I'd apply the previously mentioned caching and parallelization optimizations.
However, with all of that you will need to weigh up the development time/deployment time/long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so that may not be a very viable option in the short term.
EDIT: in response to data going out of date: you can consider caching if you don't need the data to always match the source systems in real time. Also, if some data doesn't change very often (e.g. dates of birth), then you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include or exclude from the cache, and give each table/column its own cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script that runs nightly, or even every hour, to update your local copy. Then write your query against your local MySQL database and display the results.
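For example, once the batch job has landed the searchable fields in one place, the search itself becomes a single query against the local copy (table and column names are hypothetical):

-- query the consolidated local copy instead of the slow source systems
SELECT person_id, full_name, date_of_birth, sales_region, source_system
FROM person_search
WHERE date_of_birth BETWEEN '1980-01-01' AND '1989-12-31'
  AND sales_region = 'EMEA';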
