Azure SQL can't handle incoming Stream Analytics data

I have a scenario where an Event Hub receives data every 10 seconds, which is passed to Stream Analytics and then on to Azure SQL. The technical team raised the concern that Azure SQL cannot handle that much data: once the table reaches about 2,00,00,000 (20 million) rows, it stops working.
Can you please tell me whether this is an actual limitation of Azure SQL, and if so, suggest a solution?

Keep in mind that 4 TB is the absolute maximum size of an Azure SQL Premium instance. If you plan to store all events for your use case, this will fill up very quickly. Consider using Cosmos DB or Event Hubs Capture if you really need to store the messages indefinitely, and use SQL for aggregates after processing with SQL DW or ADLS.
Remember that to get good throughput from Event Hubs you must have a partitioning strategy. See the docs.
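As a hedged illustration of the partitioning point, here is a minimal sketch of sending events with a partition key using the azure-eventhub Python SDK (the connection string, hub name and payload are placeholders):

```python
# pip install azure-eventhub
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details -- substitute your own namespace/hub.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-namespace-connection-string>",
    eventhub_name="<event-hub-name>",
)

with producer:
    # Events that share a partition key land on the same partition, which
    # preserves per-device ordering while spreading load across partitions.
    batch = producer.create_batch(partition_key="device-42")
    batch.add(EventData('{"deviceId": "device-42", "pressure": 3.1}'))
    producer.send_batch(batch)
```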

Related

Processing an Event Stream in Azure Databricks

I am looking to implement a solution to populate some tables in Azure SQL based on events that are flowing through Azure Event Hubs into Azure Data Lake Storage (Gen2) using Event Hubs Capture.
The current ingestion architecture is attached:
Current Architecture
I need to find an efficient way of processing each event that lands in the ADLS and writing it into a SQL database whilst joining it with other tables in the same database using Azure Databricks. The flow in Databricks should look like this:
1. Read the event from ADLS
2. Validate the schema of the event
3. Load the event data into an Azure SQL table (Table 1)
4. Join certain elements of Table 1 with other tables in the same database
5. Load the joined data into a new table (Table 2)
Repeat steps 1-5 for each incoming event
Does anyone have a reference implementation that has delivered against a similar requirement? I have looked at using Azure Data Factory to pick up and trigger a notebook whenever an event lands in ADLS (note: there is very low throughput of events, roughly one every 10 seconds); however, that solution would be too costly.
I am considering the following options:
Using Stream Analytics to stream the data into SQL (however, the joining part is quite complex and requires multiple tables)
Streaming from the Event Hub into Databricks (however this solution would require a new Event Hub, and to my knowledge would not make use of the existing data capture architecture)
Use Event Grid to trigger a Databricks Notebook for each Event that lands in ADLS (this could be the best solution, but I am not sure if it is feasible)
Any suggestions and working examples would be greatly appreciated.
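Not a full reference implementation, but here is a rough sketch of how this flow could look in Databricks with Structured Streaming: Auto Loader picks up the capture files as they land in ADLS, and a foreachBatch writer pushes each micro-batch into Azure SQL over JDBC. The paths, schema, table names, JDBC URL and credentials below are all placeholders, and the join step is only indicated in a comment:

```python
# Databricks notebook sketch -- assumes the `spark` session is available.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical event schema used for validation; replace with the real one.
event_schema = StructType([
    StructField("deviceId", StringType(), False),
    StructField("pressure", DoubleType(), True),
])

events = (
    spark.readStream
    .format("cloudFiles")                 # Auto Loader over the capture files
    .option("cloudFiles.format", "json")
    .schema(event_schema)                 # schema validation (step 2)
    .load("abfss://capture@<storageaccount>.dfs.core.windows.net/events/")
)

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
jdbc_props = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

def write_to_sql(batch_df, batch_id):
    # Step 3: append the raw events to Table 1.
    batch_df.write.jdbc(jdbc_url, "dbo.Table1", mode="append", properties=jdbc_props)
    # Steps 4-5: the join with the other tables and the write to Table 2 could
    # be done here (e.g. reading the reference tables over JDBC and joining),
    # or pushed down to a stored procedure inside Azure SQL.

(
    events.writeStream
    .foreachBatch(write_to_sql)
    .option("checkpointLocation",
            "abfss://capture@<storageaccount>.dfs.core.windows.net/_checkpoints/events")
    .start()
)
```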

Storing IoT Data in Azure: SQL vs Cosmos vs Other Methods

The project I am working on as an architect has an IoT setup where lots of sensors send data such as water pressure and temperature to an FTP server (can't change this, as we have no control over it due to security). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here is my observation with respect to this architecture:
Problem 1: the 1 TB limit in Azure SQL. With a higher tier it can go to 4 TB, but that's the maximum, so it does not appear to be infinitely scalable, and as the size grows, query performance could become an issue. Columnstore indexes and partitioning seem to be options, but the size limit and DTUs are a deal breaker.
Problem 2: the IoT data and the SQL Database (downstream storage) seem to be tightly coupled. If a customer wants to extract a few months of data, or even more, with millions of rows, the DB will get busy and possibly throttle other customers due to DTU exhaustion.
I would like some ideas on scaling this further. SQL DB is great, and with JSON support it is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
All the messages should be consumed from the FTP server by Azure IoT Hub by some means.
From the central hub, I want to push all messages to Azure Blob Storage in 128 MB files for later analysis at low cost.
At the same time, I would like all messages to go to IoT Hub and from there to Azure Cosmos DB (for long-term storage) or Azure SQL DB (also long term, but I am not sure because of the size restriction).
I am keeping the data in Blob Storage because, if the client wants to (or hires a machine learning team to) create some models, I would prefer them to pull the data from Blob Storage rather than hitting my DB.
Kindly suggest a few ideas on this. Thanks in advance!
Chandan Jha
First, Azure SQL DB does have Hyperscale, which supports databases much larger than 4 TB. That said, there is a tipping point where it makes sense to consider alternative architectures once you grow bigger than what one machine can handle for your solution. While Cosmos DB gives you a horizontal sharding solution, you can do the same with N SQL Databases (there are libraries to help there).
Stepping back, it is actually pretty important to understand what you want to do with the data once it is in a database. Both Cosmos DB and SQL DB are set up for OLTP-style operations (with some limited forms of broader queries; SQL DB supports columnstore and batch mode, for example, which means you could run a reasonably sized data mart just fine there too). If you are just storing things in the database in the hope of needing to support future data scientists, then you may or may not really need either of these two OLTP stores.
Synapse SQL is set up for analytics and generally has support to read from data in formats in Azure Storage. So, this may be a better strategy if you want to support arbitrarily-large IoT data and do analytics/ML processing over it.
If you know your solution will never grow beyond a certain size, you may not need to consider something like Synapse, but it is set up for those scenarios if you are of sufficient size.
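To illustrate the "N SQL Databases" point, here is a minimal, hypothetical sketch of application-level fan-out that hashes a device ID to a shard (using pyodbc; the shard connection strings, table and columns are made up, and the .NET Elastic Database client library offers a proper shard-map abstraction for this):

```python
# pip install pyodbc
import hashlib
import pyodbc

# Hypothetical shard connection strings; in practice these would come from a shard map.
SHARDS = [
    "Driver={ODBC Driver 18 for SQL Server};Server=shard0.database.windows.net;Database=iot;UID=<user>;PWD=<pwd>",
    "Driver={ODBC Driver 18 for SQL Server};Server=shard1.database.windows.net;Database=iot;UID=<user>;PWD=<pwd>",
]

def shard_for(device_id: str) -> str:
    """Route a device to a shard by hashing its ID."""
    idx = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % len(SHARDS)
    return SHARDS[idx]

def insert_reading(device_id: str, pressure: float) -> None:
    # Hypothetical table and columns; each reading lands on its device's shard.
    conn = pyodbc.connect(shard_for(device_id), autocommit=True)
    try:
        conn.execute(
            "INSERT INTO dbo.Readings (DeviceId, Pressure) VALUES (?, ?)",
            device_id, pressure,
        )
    finally:
        conn.close()
```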
Option - 1:
Why don't you extract and serialize the data based on the partition ID (device ID), send it over to IoT Hub, and have Azure Functions or Logic Apps deserialize the data into files that are stored in blob containers?
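A minimal sketch of that Option 1 flow in Python, with a plain Event Hubs consumer standing in for the Function or Logic App (IoT Hub exposes an Event Hub-compatible endpoint; all connection strings and names below are placeholders):

```python
# pip install azure-eventhub azure-storage-blob
import uuid
from azure.eventhub import EventHubConsumerClient
from azure.storage.blob import BlobServiceClient

# Placeholder connection details.
blob_container = BlobServiceClient.from_connection_string(
    "<storage-connection-string>").get_container_client("raw-telemetry")

def on_event(partition_context, event):
    # Write each event body out as a blob, grouped by partition key (device ID).
    name = f"{event.partition_key or 'unknown'}/{uuid.uuid4()}.json"
    blob_container.upload_blob(name=name, data=event.body_as_str())

consumer = EventHubConsumerClient.from_connection_string(
    "<iot-hub-event-hub-compatible-connection-string>",
    consumer_group="$Default",
    eventhub_name="<hub-name>",
)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
```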
Option - 2:
You can also attempt to create a module that extracts the data into an Excel file, which is then sent to IoT Hub to be stored in the storage containers.

Data flow: Azure Event Hubs to Cosmos db VS Cosmos db to Azure Event Hub - Better option?

I want to monitor some events coming from my application.
One option is to send the data to Azure Event Hubs and use Stream Analytics to do some post-processing and write the data into Cosmos DB.
Another option is to write to Cosmos DB from the application and run a periodic Azure Function to do the processing and store it back.
What is the right way to do it? Is there a better way to do it?
The best architectural approach is Event Hubs to Cosmos DB. I have done the same implementation using Application -> Event Hub -> Change Feed Azure Function -> Cosmos DB.
You can read about it here.
The change feed is offered by Azure Cosmos DB out of the box for this case. It works as a trigger on Cosmos DB changes.
It depends on the kind of processing you would like to do with the ingested events. If it is event-at-a-time processing, a simple Azure Function with the Cosmos DB change feed processor might be enough. If you would like to do stateful processing such as windowing or event-order-based computation, Azure Stream Analytics would be better. Stream Analytics also provides native integration with Power BI dashboards; the same job can send the data both to Cosmos DB and to Power BI. If you are going to use Azure Stream Analytics, you will have to use Event Hubs for event ingestion. Using Event Hubs for ingestion also has other benefits, such as being able to archive events to Blob storage.
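As a small, hedged illustration of the change feed idea, here is a sketch that reads changes with the azure-cosmos Python SDK and writes the processed documents to a second, hypothetical container (inside an Azure Function you would normally use the Cosmos DB trigger instead; the endpoint, key and names are placeholders):

```python
# pip install azure-cosmos
from azure.cosmos import CosmosClient

# Placeholder endpoint, key and names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("telemetry")
source = db.get_container_client("events")       # raw events written by the ingestion path
target = db.get_container_client("processed")    # hypothetical container for post-processed docs

# Pull changed documents from the source container's change feed and post-process them.
for doc in source.query_items_change_feed(is_start_from_beginning=True):
    target.upsert_item({**doc, "processed": True})
```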

Maintaining relationships between data stored in Azure SQL and Table Storage

Working on a project in which high-volume data will be moved from SQL Server (running on an Azure VM) to Azure Table Storage for scaling and cheaper storage reasons. There are several foreign keys within the data being moved to Table Storage, which are GUIDs (primary keys) in the SQL tables. Obviously, there is no means to ensure referential integrity, as transactions do not span different Azure storage types. I would like to know if anyone has had any success with this storage design. Are there any transaction management solutions that allow transactions to be created that span SQL Server and Azure Table Storage? What are the implications of having queries that read from both databases (SQL and Table Storage)?
If you're trying to perform a transaction that spans both SQL Server and Azure Tables, your best bet will be using the eventually consistent transaction pattern.
In a nutshell, you put your updates into a queue message, then have a worker process (whether a WebJob, a Worker Role, or something running on your VM) dequeue the message with peek-lock, make sure that all steps within the transaction are executed, and then call complete on the message.
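A minimal sketch of that worker loop using Azure Service Bus from Python (the queue name and the apply_all_steps helper are hypothetical; each step should be idempotent so a redelivered message is harmless):

```python
# pip install azure-servicebus
from azure.servicebus import ServiceBusClient

def apply_all_steps(payload: str) -> None:
    # Hypothetical: perform the SQL update and the Table Storage update here,
    # making each step idempotent so the message can be safely redelivered.
    ...

client = ServiceBusClient.from_connection_string("<service-bus-connection-string>")
with client:
    # Receivers use peek-lock mode by default.
    receiver = client.get_queue_receiver(queue_name="pending-updates")
    with receiver:
        for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
            apply_all_steps(str(msg))
            receiver.complete_message(msg)  # only after every step has succeeded
```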
If you're looking to perform a transaction on just an Azure Table, you can do so with batch updates as long as your entities live within the same partition.
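And a short sketch of the single-partition batch case with the azure-data-tables SDK (the table name and entity fields are made up):

```python
# pip install azure-data-tables
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<storage-connection-string>", table_name="Orders")

# All entities in one transaction must share the same PartitionKey.
operations = [
    ("upsert", {"PartitionKey": "customer-1", "RowKey": "order-1", "Status": "Paid"}),
    ("upsert", {"PartitionKey": "customer-1", "RowKey": "order-2", "Status": "Shipped"}),
]
table.submit_transaction(operations)  # succeeds or fails as a unit
```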

Where is Azure Event Hub messages stored?

I generated a SAS signature using this RedDog tool and successfully sent a message to an Event Hub using the Event Hubs API reference. I know it was successful because I got a 201 Created response from the endpoint.
This tiny success brought about a question that I have not been able to find an answer to:
I went to the Azure portal and could not see the messages I created anywhere. Further reading revealed that I needed to create a storage account; I stumbled on some C# examples (EventProcessorHost) which require storage account credentials, etc.
Question is, are there any APIs I can use to persist the data? I do not want to use the C# tool.
Please correct me if my approach is wrong, but my aim is to be able to post telemetries to EventHub, persist the data and perform some analytics operations on it. The telemetry data should be viewable on Azure.
You don't have direct access to the transient storage used for Event Hubs messages, but you could write a consumer that reads from the Event Hub continuously and persists the messages to Azure Table storage or Azure Blob storage.
The closest thing you will find to a way to automatically persist messages (in the sense of Amazon Kinesis Firehose vs. Amazon Kinesis, which Event Hubs is basically equivalent to) would be to use Azure Stream Analytics configured to write the output either to Azure Blob or to Azure Table. This example shows how to set up a Stream Analytics job that passes the data through and stores it in SQL, but you can see the UI where you can choose an output such as Azure Table. Or you can get an idea of the options from the output API.
Of course, you should also be aware of the serialization requirements that led to this question.
Event Hubs stores data for a maximum of 7 days, and that only in the Standard pricing tier. If you want to persist the data for longer in a storage account, you can use the Event Hubs Capture feature. You don't have to write a single line of code to achieve this; you can configure it through the portal or an ARM template. This is described in this document: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview
Event Hubs stores its transient data in Azure Storage. It doesn't give any more detail about that data storage. This is evident from this documentation: https://learn.microsoft.com/en-us/azure/event-hubs/configure-customer-managed-key
The storage account you need for EventProcessorHost is only used for checkpointing, i.e. maintaining the offset of the last event read in each partition.
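For completeness, a brief sketch of a Python consumer that uses a blob container only for checkpoints while the events themselves are persisted elsewhere (connection strings and names are placeholders):

```python
# pip install azure-eventhub azure-eventhub-checkpointstoreblob
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# This storage account holds only offsets/checkpoints, not the events themselves.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<storage-connection-string>", container_name="checkpoints")

def on_event(partition_context, event):
    # Persist event.body_as_str() to your store of choice here (Blob, Table, SQL, ...),
    # then record how far this partition has been read.
    partition_context.update_checkpoint(event)

consumer = EventHubConsumerClient.from_connection_string(
    "<event-hub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<hub-name>",
    checkpoint_store=checkpoint_store,
)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
```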
