Input data from multiple sources into an Azure streaming job

I need to send data from two devices to my Azure IoT Hub.
Both devices transmit data in different JSON formats. The common field between them is TimeStamp.
I now need to consume and combine these two inputs and output the data to Power BI.
Can anybody suggest an approach or a link to refer to?
Prateek Raina

To implement such a scenario you might want to use Azure Stream Analytics (ASA) to reconcile the two different data types. The query language for ASA is SQL-like and pretty straightforward, so it shouldn't be too hard to do this considering both of your data sources are JSON.
Note that you can easily set up Power BI as the output for Stream Analytics as well.

I suggest you send them separately. But if you want to get a single data table, you can either join the two inputs on TimeStamp or stack them with a UNION.
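A minimal ASA query sketch of both approaches (pick one or the other), assuming two inputs named device1 and device2 and a Power BI output named powerbioutput; the temperature and humidity fields are hypothetical stand-ins for whatever your devices actually send:

-- Join the two streams on TimeStamp; ASA stream-to-stream joins require a DATEDIFF bound
SELECT
    d1.TimeStamp,
    d1.temperature,
    d2.humidity
INTO powerbioutput
FROM device1 d1 TIMESTAMP BY TimeStamp
JOIN device2 d2 TIMESTAMP BY TimeStamp
    ON DATEDIFF(second, d1, d2) BETWEEN 0 AND 5
    AND d1.TimeStamp = d2.TimeStamp

-- Or stack both inputs into one table with UNION, tagging each row with its source
WITH combined AS (
    SELECT TimeStamp, 'device1' AS source, temperature AS reading FROM device1
    UNION
    SELECT TimeStamp, 'device2' AS source, humidity AS reading FROM device2
)
SELECT * INTO powerbioutput FROM combined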

Related

Polymorphic data transformation techniques / data lake / big data

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in XML format and contains multiple independent groups of information with a lot of nested elements. Clients run different versions, and as a result the data is ingested into different but similar schemas in the data lake. For instance, a startDate field can be a string or an object containing a date. Our goal is to visualise the accumulated information in a BI tool.
Questions:
What are the best practices for dealing with polymorphic data? The two options we are considering are:
Process and transform the required piece of data (a reduced version) into a uni-schema file using a programming language, then process it with Spark on Databricks and consume it in a BI tool.
Decompose the data into meaningful groups, then process and join them (using the data relationships) with Spark on Databricks.
I would appreciate your comments, opinions, and experiences on this topic, especially from subject matter experts and data engineers. It would be super nice if you could also share some useful resources about this particular topic.
Cheers!
One of the tags you have selected for this thread points out that you would like to use Databricks for this transformation. Databricks is one of the tools I am using, and I think it is powerful and effective enough for this kind of data processing. Since the data processing platforms I have used the most are Azure and Cloudera, my answer will rely on the Azure stack because it is integrated with Databricks. Here is what I would recommend based on my experience.
The first thing you have to do is define data layers and create a platform for them. In particular, for your case, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.
Landing Zone
Will be used for polymorphic data ingestion from your clients. This can be done with just Azure Data Factory (ADF) connecting the clients to Azure Blob Storage. I recommend that, in this layer, we don't put any transformation into the ADF pipeline, so that we can create one common pipeline for ingesting raw files. If you have many clients that can send data into Azure Storage, that is fine; you can create dedicated folders for them as well.
Normally, I create folders aligned with client types. For example, if I have 3 types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap, followed by server/database/client names.
Additionally, it seems you may have to set up an Azure IoT Hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.
Staging Area
Will be an area for schema cleanup. I would have multiple ADF pipelines that transform the polymorphic data from the Landing Zone and ingest it into the Staging Area. You will have to create a schema (Delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake, as it makes the data easy to manage and retrieve.
The transformation options you will have are:
Use only ADF transformations. This will allow you to unnest some of the nested XML columns as well as do some data cleansing and wrangling from the Landing Zone, so that the same dataset can be inserted into the same table.
For your case, you may have to create a dedicated ADF pipeline for each dataset multiplied by client versions.
Use an additional common ADF pipeline that runs a Databricks transformation based on the dataset and client version. This allows more complex transformations that ADF transformations are not capable of.
For your case, there will also be a dedicated Databricks notebook for each dataset multiplied by client versions.
As a result, different versions of one particular dataset will be extracted from the raw files, cleaned up in terms of schema, and ingested into one table per data source (a small sketch of such a staging table follows below). There will be some duplicated data for master datasets across different data sources.
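A minimal Databricks SQL sketch of that idea, assuming one Delta table per decomposed dataset in a staging schema; every table, column, and view name here is a hypothetical placeholder:

-- one Delta table per decomposed dataset; all version-specific pipelines write into it
CREATE TABLE IF NOT EXISTS staging.telemetry_metrics (
    client_id      STRING,
    client_version STRING,
    event_time     TIMESTAMP,
    start_date     DATE,      -- already reconciled from the string/object variants
    metric_name    STRING,
    metric_value   DOUBLE
) USING DELTA;

-- each version-specific ADF/Databricks pipeline appends its cleaned-up output,
-- e.g. from a hypothetical temporary view produced by the version-2 notebook
INSERT INTO staging.telemetry_metrics
SELECT client_id, 'v2' AS client_version, event_time, start_date, metric_name, metric_value
FROM v2_cleaned_batch;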
ODS Area
Will be an area for the so-called single source of truth of your data. Multiple data sources will be merged into one; all duplicated data gets eliminated and the relationships between datasets get clarified, which corresponds to the second option in your question. If you have just one data source, this will also be an area for applying more data cleansing, such as validation and consistency checks. As a result, one dataset will be stored in one table.
I recommend using ADF running Databricks, but this time we can use a SQL notebook instead of Python, because the data is already well inserted into tables in the Staging Area.
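A hedged sketch of that merge step in Databricks SQL, assuming two staging tables that hold the same master dataset from different sources; all names are hypothetical:

-- consolidate duplicated master data from two sources into one ODS table
MERGE INTO ods.device_master AS t
USING (
    SELECT device_id, device_name FROM staging.device_source_a
    UNION
    SELECT device_id, device_name FROM staging.device_source_b
) AS s                       -- assumes each device_id appears once in the combined source
ON t.device_id = s.device_id
WHEN MATCHED THEN UPDATE SET t.device_name = s.device_name
WHEN NOT MATCHED THEN INSERT (device_id, device_name) VALUES (s.device_id, s.device_name);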
The data at this stage can be consumed by Power BI. Read more about Power BI integration with Databricks.
Furthermore, if you still want a data warehouse or a star schema for advanced analytics, you can do further transformation (again via ADF and Databricks) and utilize Azure Synapse.
Source Control
Fortunately, the tools I mentioned above are already integrated with source code version control, thanks to Microsoft's acquisition of GitHub. Databricks notebooks and ADF pipeline source code can be versioned. Check Azure DevOps.
Many thanks for your comprehensive answer PhuriChal! Indeed the data sources are always the same software, but with various different versions, and unfortunately the data properties do not always remain steady across those versions. Would it be an option to process the raw data after ingestion in order to unify and resolve unmatched properties using a high-level programming language, before processing them further in Databricks? (We may have many of these processing scripts to refine the raw data for specific purposes.) I have added an example in the original post.
Version 1:
{
  "theProperty": 8
}
Version 2:
{
  "data": {
    "myProperty": 10
  }
}
Processing =>
Refined version:
[
  { "property": 8 },
  { "property": 10 }
]
So the inconsistencies are resolved before the data reaches Databricks for further processing. Can this also be an option?
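For what it's worth, the reconciliation in that example could also be expressed inside Databricks itself; a hedged Spark SQL sketch, assuming the raw payloads land as JSON strings in a payload column of a hypothetical landing.raw_events table:

SELECT
    COALESCE(
        get_json_object(payload, '$.data.myProperty'),   -- version 2 shape
        get_json_object(payload, '$.theProperty')        -- version 1 shape
    ) AS property
FROM landing.raw_events;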

Select best Azure storage for visualization and analysis

I am writing a tool to analyze data coming from a race simulator. I have two use cases:
Display live telemetry on a chart - so mostly visualization of incoming data, to manually detect anomalies
Calculate my own metrics, analyze the data and suggest actions based on it - this can be done after a session and doesn't have to be calculated live. Right now I am focusing solely on storing data, but I have to keep in mind that it later needs to be analyzed.
I was thinking about utilizing Event Hubs to handle the streaming of events. The question is how to visualize the data in the easiest way, and what the optimal storage is for the second use case - I believe it has to be a big data solution, as there will be many data points to analyze.
For visualization you can use Power BI or another visualization tool running on containers.
For storage, you can go with Azure Time Series Insights, or just sink from Event Hubs to Azure Cosmos DB and then connect Power BI to it to create your charts.
https://learn.microsoft.com/en-us/azure/time-series-insights/overview-what-is-tsi
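If you go the Event Hubs to Cosmos DB route, one common way to wire up that sink (my assumption, not something prescribed above) is a small Stream Analytics job; a minimal sketch, assuming an Event Hubs input named telemetryhub and a Cosmos DB output named cosmosout:

-- pass-through: every telemetry event from the hub becomes a document in Cosmos DB
SELECT *
INTO cosmosout
FROM telemetryhub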

How can I decide if I should use the Power BI API to push data into my streaming dataset or Azure Stream Analytics?

I am very new to Azure. I need to create a Power BI dashboard to visualize some data produced by a sensor. The dashboard needs to be updated "almost" real-time. I have identified that I need a push dataset, as I want to visualize some historic data on a line chart. However, from an architecture point of view, I could use the Power BI REST API (which would be completely fine in my case, as we process the data with a Python app and I could use it to call Power BI) or Azure Stream Analytics (which could also work: I could dump the data to Azure Blob Storage from the Python app and then stream it).
Can you tell me generally speaking, what are the advantages/disadvantages of the two approaches?
Azure Stream Analytics lets you have multiple sources and define multiple targets, and those targets can include Power BI and Blob storage at the same time. You can also apply windowing functions to the data as it comes in, and it gives you a visual way of managing your pipeline, including the windowing functions.
In your case you are essentially replicating the incoming data, first to Blob and then to Power BI. But if you have a use case for applying a windowing function (1 minute or so) as your data comes in from multiple sources, e.g. more than one sensor, or a sensor and another source, you would have to fiddle around a lot to get it working manually, whereas in Stream Analytics you can do it easily (see the sketch below).
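A minimal sketch of such a windowed query, assuming an input named sensorinput with hypothetical deviceId, eventTime, and temperature fields, and a Power BI output named powerbioutput:

-- average each device's readings over 1-minute tumbling windows before pushing to Power BI
SELECT
    deviceId,
    AVG(temperature) AS avgTemperature,
    System.Timestamp() AS windowEnd
INTO powerbioutput
FROM sensorinput TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(minute, 1)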
The following article highlights some of the pros and cons of Azure Stream Analytics:
https://www.axonize.com/blog/iot-technology/the-advantages-and-disadvantages-of-using-azure-stream-analytics-for-iot-applications/
If possible, I would recommend streaming the data to IoT Hub first; ASA can then pick it up and render it in Power BI. This will give you better latency than streaming data from Blob to ASA and then to Power BI. It is the recommended IoT pattern for remote monitoring, predictive maintenance, etc., and gives you longer-term options to add a lot of logic to the real-time pipeline (ML scoring, windowing, custom code, etc.).

Azure IoT + Stream Analytics with blob data

We are currently trying to evaluate whether or not we should port our business logic to Azure IoT Hub.
So far this looks promising, but I have a question about Stream Analytics.
Let's say we have IoT devices in the field that send their data as CSV files.
Currently our back end has huge problems going through this data, analysing it, and injecting it into our database systems with decent performance.
I want to try to use Azure for that.
If I use IoT Hub, I want to send this CSV format to the hub. We assume that the CSV format is fixed, so I can't just port it to the D2C communication format.
Can the Stream Analytics service work with this CSV format, and can it put the embedded data into specific tables in Table Storage?
This would be really important. Are there any examples of that out there that might clear things up for me?
I guess Azure has its libraries for handling CSV files. What if we use no CSV format but instead another industry-standard format that Azure might not know about?
I hope you can help me here.
Azure Stream Analytics (ASA) does support CSV as input:
Event serialization format: The serialization format (JSON, CSV, or Avro) of the incoming data stream.
And yes, it also supports Azure Table Storage as output. See the docs.
When you create an ASA job you can upload your CSV file to test the query, so you can easily try it out if you create a sample file.
They have some example CSV data on GitHub.
I suggest you create a small proof of concept based on your sample data.
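A hedged sketch of what such a proof-of-concept query could look like, assuming an IoT Hub input named iothubinput configured with CSV serialization and a Table Storage output named tablestorageoutput; the column names are hypothetical and would come from your CSV header, and the partition and row keys are configured on the output itself rather than in the query:

SELECT
    deviceId,
    measurementTime,
    CAST(sensorValue AS float) AS sensorValue
INTO tablestorageoutput
FROM iothubinput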
If, for some reason (like the data being in an unsupported format), ASA does not fit, you can always retrieve the IoT Hub data using different techniques, for example using an EventProcessorHost. This way you have complete control over the data and you can output it to anything you want, and it will still be scalable (although of course this also depends on the data destination). See this post for a rough idea; it seems a bit outdated, but the concept is still valid today.
The official docs about other possible options for reading data from the Event Hub can be found here.

How to get Data from Node-RED/Octoblu to Power BI

Trying to get a live dashboard of the state of my gates (ON, OFF)
The JSON format of my payload is
"msg": {
"time_on": 1437773972742,
"time_off": 1437773974231,
}
Does anyone have experience with how to send the states to Power BI without using Azure Stream Analytics or Event Hub?
Edit:
I am trying to send two JSON packages from Node-RED to Power BI to get live updates on my dashboard.
If you want to use Stream Analytics you will need to flatten the properties by doing SELECT msg.time_on, msg.time_off FROM Input.
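Spelled out as a full job query (the input name Input comes from the answer above; the Power BI output name powerbioutput is an assumption):

SELECT
    msg.time_on,
    msg.time_off
INTO powerbioutput
FROM Input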
If you don't want to use Stream Analytics, you will either need to push the data to one of the sources that Power BI can periodically pull from, such as SQL Azure (note: this will not be real time), or integrate with the Power BI push API by going through the resources here: http://dev.powerbi.com.
Ziv.
I'm not aware of Node-RED either, but there are pretty good samples here: https://github.com/PowerBI. You can also use our API Console (http://docs.powerbi.apiary.io/) to play with the API. The console can generate code for you in common languages like JavaScript, Ruby, Python, C#, etc.
Look at Create Dataset:
http://docs.powerbi.apiary.io/#reference/datasets/datasets-collection/create-a-dataset
and add rows to a table:
http://docs.powerbi.apiary.io/#reference/datasets/table-rows/add-rows-to-a-table-in-a-dataset
HTH, Lukasz
