I'm trying to build a real-time reporting service on top of Azure SQL Data Warehouse. Currently I have a SQL Server instance with about 5 TB of data. I want to stream the data into the data warehouse and use Azure SQL DW's compute power to generate real-time reports from it. Are there any ready-to-use solutions or best practices for doing this?
One approach I was considering is to load the data into Kafka and then stream it into Azure SQL DW via Spark Streaming. However, that approach is more near-real-time than real-time. Is there any way to use SQL Server Change Data Capture (CDC) to stream data into the data warehouse?
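For reference, this is roughly the kind of CDC-based polling I have in mind. It is only a sketch: it assumes pyodbc, a hypothetical dbo_Orders capture instance, and a placeholder forwarding step.

```python
# Sketch only: poll SQL Server CDC change tables and forward new rows.
# Assumes CDC is enabled and the capture instance is named "dbo_Orders";
# the connection string and forward() target are placeholders.
import time
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"
)

def forward(row):
    # Placeholder: in practice this would produce to Kafka / Event Hubs.
    print(row)

def poll_changes(from_lsn):
    cur = conn.cursor()
    to_lsn = cur.execute("SELECT sys.fn_cdc_get_max_lsn()").fetchval()
    if from_lsn is None:
        from_lsn = cur.execute(
            "SELECT sys.fn_cdc_get_min_lsn('dbo_Orders')").fetchval()
    for row in cur.execute(
            "SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(?, ?, 'all')",
            from_lsn, to_lsn):
        forward(row)
    # In production the next lower bound should be sys.fn_cdc_increment_lsn(to_lsn)
    # so the boundary LSN is not re-read.
    return to_lsn

last_lsn = None
while True:
    last_lsn = poll_changes(last_lsn)
    time.sleep(5)   # micro-batch interval, i.e. near-real-time rather than real-time
```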
I don't personally see Azure SQL Data Warehouse in a real-time architecture. It's a batch MPP system optimised for shredding billions of rows over multiple nodes. Such a pattern is not synonymous with sub-second or real-time performance, in my humble opinion. Real-time architectures tend to look more like Event Hubs > Stream Analytics in Azure. The low concurrency available (i.e. currently a max of 32 concurrent users) is also not a good fit for reporting.
As an alternative you might consider Azure SQL Database in-memory tables for fast load and then hand off to the warehouse at a convenient point.
You could use Azure SQL Data Warehouse in a so-called Lambda architecture with both a batch and a real-time element, where it supports the batch stream. See here for further reading:
https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx
If you're looking for a SQL-based SaaS solution to power real-time reporting applications: we recently released an HTTP API product called Stride, which is based on PipelineDB, the open-source streaming-SQL database we build, and it can handle this type of workload.
The Stride API enables developers to run continuous SQL queries on streaming data and store the results of those continuous queries in tables that are incrementally updated as new data arrives. This might be a simpler way to add the kind of real-time analytics layer you mentioned above.
Feel free to check out the Stride technical docs for more detail.
We are using Azure Stream Analytics to build out a new IoT product. The data is successfully streaming to Power BI, but there is no way to implement Row-Level Security, which we need so we can display this data back to a customer limited to only that customer's data. I am considering adding an Azure SQL DB between ASA and Power BI and switching the Power BI dataset from a streaming dataset to DirectQuery with a high page refresh rate, but that seems like a very intense workload for an Azure SQL DB to handle. There is the potential, as the product grows, for multiple inserts per second and querying every couple of seconds. Streaming seems like the better answer apart from the missing RLS. Any tips?
There is the potential, as the product grows, for multiple inserts per second and querying every couple of seconds.
A small Azure SQL Database should handle that load: 1,000 inserts/sec is simple; 100,000/sec is probably too much.
And ASA can ensure that the output streams are not too frequent.
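As a rough sketch of keeping that write load cheap on the Azure SQL side, the incoming readings can be batched so that "multiple inserts per second" become one round trip per second or so. This assumes pyodbc and a hypothetical dbo.Readings table; the connection string is a placeholder.

```python
# Sketch only: batch sensor rows into Azure SQL Database.
# The connection string and dbo.Readings(DeviceId, Ts, Value) schema are placeholders.
from datetime import datetime
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=iot;UID=user;PWD=secret"
)
cursor = conn.cursor()
cursor.fast_executemany = True   # send the whole batch in a single round trip

def flush(batch):
    cursor.executemany(
        "INSERT INTO dbo.Readings (DeviceId, Ts, Value) VALUES (?, ?, ?)", batch)
    conn.commit()

# Usage: accumulate incoming readings and flush roughly once per second.
flush([
    ("device-1", datetime.utcnow(), 21.5),   # naive UTC timestamps
    ("device-2", datetime.utcnow(), 19.8),
])
```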
I have a use case and need help with the best available approach.
I use Azure Databricks to run data transformations and create tables in the presentation/gold layer. The underlying data for these tables is in an Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer, e.g. a table that stores single-customer-view data.
An external application from a different system needs access to this data, i.e. that application initiates an API call for details about a customer, and I need to send back a response with the matching customer details by querying the single-customer-view table.
Questions:
Is the Databricks SQL API the solution for this?
As it is a Spark table, I assume the response will not be quick. Is this correct, or is there a better solution?
Is Databricks designed for such use cases, or is it a better approach to copy this gold-layer table into an operational database such as Azure SQL DB after the transformations are done in PySpark via Databricks?
What are the cons of that approach? One is that the Databricks cluster would have to be up and running all the time, i.e. an interactive cluster. Anything else?
It's possible to use Databricks for that, although it heavily depends on the SLAs, i.e. how fast the response needs to be. Answering your questions in order:
There is no standalone API for executing queries and getting results back (yet). But you can create a thin wrapper using one of the drivers that work with Databricks: Python, Node.js, Go, or JDBC/ODBC (see the sketch at the end of this answer).
Response time heavily depends on the size of the data, whether the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL Warehouses can also cache query results, so they won't reprocess the data if the same query was already executed.
Storing the data in an operational database is also an approach often used by customers, but again it depends on the size of the data and other factors: if you have a huge gold layer, a SQL database may not be the best solution from a cost/performance perspective.
For such queries it's recommended to use Databricks SQL, which is more cost-efficient than an always-running interactive cluster. Also, some cloud platforms already support serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if queries against the gold layer don't happen very often, you can configure the warehouse with auto-termination and pay only when it is used.
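As a rough illustration of the thin-wrapper idea mentioned above, something like the following could sit in front of a Databricks SQL Warehouse. This is a sketch only: it assumes the databricks-sql-connector and Flask packages (and a recent connector version with named query parameters), and the hostname, HTTP path, token and table name are all placeholders.

```python
# Sketch only: a thin HTTP wrapper that queries a gold table on a SQL Warehouse.
from flask import Flask, jsonify
from databricks import sql   # pip install databricks-sql-connector

app = Flask(__name__)

def query_customer(customer_id: str):
    with sql.connect(server_hostname="adb-1234567890123456.7.azuredatabricks.net",
                     http_path="/sql/1.0/warehouses/abcdef1234567890",
                     access_token="dapi-placeholder") as connection:
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT * FROM gold.single_customer_view WHERE customer_id = :cid",
                {"cid": customer_id})
            columns = [c[0] for c in cursor.description]
            return [dict(zip(columns, row)) for row in cursor.fetchall()]

@app.route("/customers/<customer_id>")
def get_customer(customer_id):
    return jsonify(query_customer(customer_id))

if __name__ == "__main__":
    app.run(port=8080)
```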
I am very new to Azure. I need to create a Power BI dashboard to visualize data produced by a sensor, and the dashboard needs to update in "almost" real time. I have identified that I need a push dataset, because I want to visualize some historic data on a line chart. From an architecture point of view, however, I could either use the Power BI REST APIs (which would be completely fine in my case, as we process the data with a Python app and I could use it to call Power BI) or Azure Stream Analytics (which could also work: I could dump the data to Azure Blob storage from the Python app and then stream it).
Can you tell me, generally speaking, what the advantages and disadvantages of the two approaches are?
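For context, the REST API option from the Python app would look roughly like the sketch below; the push URL is a placeholder taken from the dataset's "API info" page in Power BI, and the column names are only illustrative.

```python
# Sketch only: push rows from the existing Python app into a Power BI push/streaming dataset.
from datetime import datetime, timezone
import requests

PUSH_URL = "https://api.powerbi.com/beta/<tenant>/datasets/<dataset-id>/rows?key=<key>"

def push_reading(sensor_id: str, value: float):
    rows = [{
        "sensorId": sensor_id,
        "value": value,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }]
    response = requests.post(PUSH_URL, json=rows)
    response.raise_for_status()   # Power BI answers 200 OK once the rows are accepted

push_reading("sensor-42", 21.7)
```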
Azure Stream Analytics lets you define multiple sources and multiple targets, and Power BI and Blob storage can both be targets at the same time. You can also apply windowing functions to the data as it comes in, and it gives you a visual way of managing your pipeline, including those windowing functions.
In your case you are essentially replicating the incoming data, first to Blob and then to Power BI. But if you have a use case that needs a windowing function (one minute or so) over data coming in from multiple sources, e.g. more than one sensor, or a sensor plus another source, you would have to fiddle around a lot to get that working manually, whereas in Stream Analytics you can do it easily.
The following article highlights some of the pros and cons of Azure Stream Analytics:
https://www.axonize.com/blog/iot-technology/the-advantages-and-disadvantages-of-using-azure-stream-analytics-for-iot-applications/
If possible, I would recommend streaming the data to IoT Hub first; ASA can then pick it up and render it in Power BI. This gives you better latency than streaming data from Blob to ASA and then to Power BI. It is the recommended IoT pattern for remote monitoring, predictive maintenance, etc., and gives you longer-term options to add a lot of logic to the real-time pipeline (ML scoring, windowing, custom code and so on).
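As a rough sketch of that path, the Python app could send each reading to IoT Hub as shown below; this assumes the azure-iot-device package and a placeholder device connection string.

```python
# Sketch only: send sensor readings from the Python app to Azure IoT Hub,
# where ASA can pick them up and push them on to Power BI.
import json
from azure.iot.device import IoTHubDeviceClient, Message

CONN_STR = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONN_STR)
client.connect()

def send_reading(sensor_id: str, value: float):
    msg = Message(json.dumps({"sensorId": sensor_id, "value": value}))
    msg.content_type = "application/json"   # lets downstream services parse the body as JSON
    msg.content_encoding = "utf-8"
    client.send_message(msg)

send_reading("sensor-42", 21.7)
client.shutdown()
```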
I am trying to design an IoT platform using the technologies mentioned above. I would be happy if someone could comment on the architecture and whether it is good and scalable!
I get IoT sensor data over MQTT, which I will receive through Spark Streaming (there is an MQTT connector for Spark Streaming that does this). I only have to subscribe to the topics; a third-party server publishes the IoT data to them.
I then parse the data and insert it into AWS DynamoDB. Yes, the whole setup will run on AWS.
I may have to process/transform the data in the future, depending on the IoT use cases, so I thought Spark might be useful. I have also heard that Spark Streaming is blazing fast.
It's a simple overview and I am not sure if it's a good architecture. Would it be overkill to use Spark Streaming? Are there other ways to store the data received from MQTT directly in DynamoDB?
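For context, the "without Spark" variant I am asking about might look roughly like the sketch below; it assumes a paho-mqtt 1.x style client and boto3, and the broker, topic, table name and payload schema are placeholders.

```python
# Sketch only: subscribe to the MQTT topic directly and write each message to DynamoDB.
import json
from decimal import Decimal

import boto3
import paho.mqtt.client as mqtt

table = boto3.resource("dynamodb").Table("SensorReadings")

def on_message(client, userdata, msg):
    # DynamoDB rejects Python floats, so parse numbers as Decimal.
    payload = json.loads(msg.payload, parse_float=Decimal)
    table.put_item(Item={
        "deviceId": payload["deviceId"],   # assumed partition key
        "ts": payload["ts"],               # assumed sort key
        "reading": payload,
    })

client = mqtt.Client()            # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("broker.example.com", 1883)
client.subscribe("sensors/#")
client.loop_forever()
```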
I cannot say whether your components will result in a scalable architecture, since you did not elaborate on how you will scale them, what estimated load such a system should handle, or whether there will be load peaks.
If you are talking about scalability in terms of performance, you should also consider scalability in terms of pricing which may be important to your project.
For instance, DynamoDB is a very scalable NoSQL database service, which offers elastic performance with very efficient pricing. I do not know much about Apache Spark, and even though it has been designed to be very efficient at scale, how will you distribute incoming data? Will you host multiple instances on EC2 and use auto-scaling to manage them?
My advice would be to break your needs down into components in order to conduct a successful analysis. To summarize your statements:
You need to ingest incoming sensor telemetry at scale using MQTT.
You need to transform or enrich these data on the fly.
You need to insert these data (probably as time-series) into DynamoDB in order to build an event-sourcing system.
Since you mentioned Apache Spark, I imagine you would need to perform some analysis of these data, either in near real-time, or in batch, to build value out of your data.
My advice would be to use serverless, managed services in AWS so that you only pay for what you really use, can forget about maintenance and scalability, and can focus on your project.
AWS IoT is a platform built into AWS which will allow you to securely ingest data at any scale using MQTT.
This platform also embeds a rules engine, which allows you to build your business rules in the cloud: for example, intercepting incoming messages, enriching them, and calling other AWS services as a result (e.g. calling a Lambda function to do some processing on the ingested data).
The rules engine has a native connector to DynamoDB, which will allow you to insert your enriched or transformed data into a table.
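For reference, a hedged sketch of defining such a rule with boto3 is shown below; the rule name, topic filter, table name and role ARN are placeholders.

```python
# Sketch only: create an AWS IoT topic rule that writes every matching MQTT
# message into a DynamoDB table via the dynamoDBv2 action.
import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="StoreSensorReadings",
    topicRulePayload={
        "sql": "SELECT *, topic(2) AS deviceId, timestamp() AS ts FROM 'sensors/+/telemetry'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [{
            "dynamoDBv2": {
                "roleArn": "arn:aws:iam::123456789012:role/iot-dynamodb-role",
                "putItem": {"tableName": "SensorReadings"},
            }
        }],
    },
)
```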
The rules engine also has a connector to the new Amazon Machine Learning service, if you want to get predictions on sensor data in real time.
You can then use other services such as EMR + Spark to batch-process your data once a day, week, month.
The advantage here is that you assemble your components and use them as you go, meaning that you do not need the full-featured stack when you are beginning, yet you still have the flexibility to make changes in the future.
An overview of the AWS IoT service.
Our team has just recently started using Application Insights to add telemetry data to our Windows desktop application. This data is sent almost exclusively in the form of events (rather than page views etc.). Application Insights is useful only up to a point; to answer anything other than basic questions we are exporting to Azure Storage and then using Power BI.
My question is one of data structure. We are new to analytics in general and have just been reading about star/snowflake structures for data warehousing. This looks like it might help in providing the answers we need.
My question is quite simple: is this the right approach, or have we overcomplicated things? My current feeling is that a better approach would be to pull the latest data and transform it into a SQL database of facts and dimensions for Power BI to query. Does this make sense? Is this what other people are doing? We have realised that this is more work than we initially thought.
Definitely pursue Michael Milirud's answer, if your source product has suitable analytics you might not need a data warehouse.
Traditionally, a data warehouse has three advantages: it integrates information from different data sources, both internal and external; data is cleansed and standardised across sources; and the history of change over time ensures that data is available in its historic context.
What you are describing is becoming a very common case in data warehousing, where star schemas are created for access by tools like Power BI, Qlik or Tableau. In smaller scenarios the entire warehouse might be held in the Power BI data engine, but larger data might need pass-through queries.
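As a toy illustration of that fact/dimension split, the sketch below derives a small dimension and fact table from flattened telemetry events; it assumes pandas, and the column names are purely illustrative.

```python
# Sketch only: split flat Application Insights events into a dimension and a fact table.
import pandas as pd

events = pd.DataFrame([
    {"eventName": "ReportOpened",  "userId": "u1", "timestamp": "2016-05-01T10:00:00"},
    {"eventName": "ReportOpened",  "userId": "u2", "timestamp": "2016-05-01T10:05:00"},
    {"eventName": "ExportClicked", "userId": "u1", "timestamp": "2016-05-01T10:06:00"},
])

# Dimension: one row per distinct event name, with a surrogate key.
dim_event = (events[["eventName"]].drop_duplicates()
             .reset_index(drop=True)
             .rename_axis("eventKey")
             .reset_index())

# Fact: one row per telemetry event, referencing the dimension by its key.
fact_event = events.merge(dim_event, on="eventName")[["eventKey", "userId", "timestamp"]]

# Both frames could then be written to SQL (e.g. DataFrame.to_sql) for Power BI to query.
print(dim_event)
print(fact_event)
```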
In your scenario, you might be interested in some tools that appear to handle at least some of the migration of Application Insights data:
https://sesitai.codeplex.com/
https://github.com/Azure/azure-content/blob/master/articles/application-insights/app-insights-code-sample-export-telemetry-sql-database.md
Our product Ajilius automates the development of star schema data warehouses, cutting development time to days or weeks. There are a number of other products doing a similar job; we maintain a complete list of industry competitors to help you choose.
I would continue with Power BI - it actually has a very sophisticated and powerful data integration and modeling engine built in. Historically I've worked with SQL Server Integration Services and Analysis Services for these tasks - Power BI Desktop is superior in many aspects. The design approaches remain consistent - star schemas etc, but you build them in-memory within PBI. It's way more flexible and agile.
Also are you aware that AI can be connected directly to PBI Web? This connects to your AI data in minutes and gives you PBI content ready to use (dashboards, reports, datasets). You can customize these and build new reports from the datasets.
https://powerbi.microsoft.com/en-us/documentation/powerbi-content-pack-application-insights/
What we ended up doing was not sending events from our WinForms app directly to AI, but to an Azure Event Hub.
We then created a job that reads from the Event Hub and sends the data to:
- AI, using the SDK
- Blob storage, for later processing
- Azure Table storage, to create Power BI reports
You can of course add more destinations.
So basically all events are sent to one destination and from there stored in many destinations, each for its own purpose. We definitely did not want to be restricted to 7 days of raw data; storage is cheap, and Blob storage can be used with many Azure and Microsoft analytics solutions.
The Event Hub can be linked to Stream Analytics as well.
More information about Event Hubs can be found at https://azure.microsoft.com/en-us/documentation/articles/event-hubs-csharp-ephcs-getstarted/
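As a rough sketch of the consumer side of such a fan-out job, the code below reads from the Event Hub and dispatches each event; it assumes the azure-eventhub Python package, and the connection string and destination functions are placeholders.

```python
# Sketch only: read events from the Event Hub and fan them out to several destinations.
from azure.eventhub import EventHubConsumerClient

CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>"

consumer = EventHubConsumerClient.from_connection_string(
    conn_str=CONN_STR,
    consumer_group="$Default",
    eventhub_name="telemetry",
)

# Placeholder destinations; each would call the relevant SDK in the real job.
def send_to_application_insights(body): print("AI:", body)
def write_to_blob(body): print("Blob:", body)
def write_to_table_storage(body): print("Table:", body)

def on_event(partition_context, event):
    body = event.body_as_str()
    send_to_application_insights(body)
    write_to_blob(body)
    write_to_table_storage(body)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")  # "-1" = from the beginning
```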
You can start using the recently released Application Insights Analytics feature. In Application Insights we now let you write any query you would like, so that you can get more insight out of your data. Analytics runs your queries in seconds, lets you filter / join / group by any property, and you can also run these queries from Power BI.
More information can be found at https://azure.microsoft.com/en-us/documentation/articles/app-insights-analytics/