Processing an Event Stream in Azure Databricks

I am looking to implement a solution to populate some tables in Azure SQL based on events that are flowing through Azure Event Hubs into Azure Data Lake Storage (Gen2) using Event Hubs Capture.
The current ingestion architecture is shown in the attached diagram ("Current Architecture").
I need to find an efficient way of processing each event that lands in ADLS and writing it into a SQL database, whilst joining it with other tables in the same database, using Azure Databricks. The flow in Databricks should look like this (a rough sketch follows the list):
1. Read the event from ADLS
2. Validate the schema of the event
3. Load the event data into an Azure SQL table (Table 1)
4. Join certain elements of Table 1 with other tables in the same database
5. Load the joined data into a new table (Table 2)
Repeat steps 1-5 for each incoming event.
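For concreteness, this is a rough PySpark sketch of the loop I have in mind (Auto Loader reading the capture files, then per-batch writes over JDBC). All paths, table names, credentials and the join key are placeholders, not a working reference:

```python
# Rough sketch only: Auto Loader picks up new Event Hubs Capture files from ADLS Gen2,
# then foreachBatch loads them into Table 1, joins against an existing SQL table and
# appends the result to Table 2. All paths, table names and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

capture_path = "abfss://capture@<storageaccount>.dfs.core.windows.net/events/"
checkpoint_path = "abfss://capture@<storageaccount>.dfs.core.windows.net/_checkpoints/events"

jdbc_options = {
    "url": "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>",
    "user": "<user>",
    "password": "<password>",
}

def process_batch(batch_df, batch_id):
    # Step 3: load the raw event data into Table 1
    (batch_df.write.format("jdbc")
        .options(**jdbc_options, dbtable="dbo.Table1")
        .mode("append")
        .save())

    # Steps 4-5: join selected columns with an existing table and load into Table 2
    # ("DeviceMetadata" and the "deviceId" key are only examples)
    lookup_df = (spark.read.format("jdbc")
        .options(**jdbc_options, dbtable="dbo.DeviceMetadata")
        .load())
    (batch_df.join(lookup_df, "deviceId")
        .write.format("jdbc")
        .options(**jdbc_options, dbtable="dbo.Table2")
        .mode("append")
        .save())

# Steps 1-2: stream new capture files; Event Hubs Capture writes Avro, and the inferred
# schema is kept at the checkpoint location (explicit schema validation could be added
# here or at the top of process_batch before the first write).
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "avro")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(capture_path)
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", checkpoint_path)
    .start())
```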
Does anyone have a reference implementation that has delivered against a similar requirement? I have looked at using Azure Data Factory to pick up and trigger a notebook whenever an event lands in ADLS (note: the throughput of events is very low, roughly one every 10 seconds), but that solution would be too costly.
I am considering the following options:
Using Stream Analytics to stream the data into SQL (however, the joining part is quite complex and requires multiple tables)
Streaming from the Event Hub into Databricks (however, this solution would require a new Event Hub and, to my knowledge, would not make use of the existing capture architecture)
Using Event Grid to trigger a Databricks Notebook for each event that lands in ADLS (this could be the best solution, but I am not sure if it is feasible; a rough sketch of what I have in mind follows below)
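For option 3, this is roughly the glue I imagine: an Event Grid-triggered Azure Function (assuming the Python v2 Functions programming model) that starts a pre-defined Databricks job for each BlobCreated event. The host, token, job ID and parameter names are placeholders:

```python
# Rough sketch of option 3 (assumptions: Python v2 Functions model and a Databricks job
# already defined for the processing notebook; all settings below are placeholders).
import os

import azure.functions as func
import requests

app = func.FunctionApp()

@app.event_grid_trigger(arg_name="event")
def start_databricks_run(event: func.EventGridEvent):
    # Microsoft.Storage.BlobCreated events carry the blob URL in their data payload
    blob_url = event.get_json().get("url")

    resp = requests.post(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        json={
            "job_id": int(os.environ["DATABRICKS_JOB_ID"]),
            "notebook_params": {"event_path": blob_url},
        },
        timeout=30,
    )
    resp.raise_for_status()
```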
Any suggestions and working examples would be greatly appreciated.

Related

Multi-tenant IoT application with Azure

I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub or different ones?
It really depends on the data being processed and on your actual requirements. If sharing the IoT Hub resource details with other customers is not an issue, then you can use the same IoT Hub; otherwise, use individual IoT Hubs.
I would like to populate a multi-tenant DB (single DB, multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it in a table of a specific schema (the schema associated with a specific customer) in my DB. Is it possible to do this, or is the only way to keep customer data separate to have several DBs (instead of having one DB and multiple schemas)?
SQL output in Azure Stream Analytics supports writing in parallel as an option. This option allows for fully parallel job topologies, where multiple output partitions write to the destination table in parallel. Enabling this option in Azure Stream Analytics, however, may not be sufficient to achieve higher throughputs, as it depends significantly on your database configuration and table schema. The choice of indexes, clustering key, index fill factor, and compression has an impact on the time to load tables. For more information about how to optimize your database to improve query and load performance based on internal benchmarks, see SQL Database performance guidance. Ordering of writes is not guaranteed when writing in parallel to SQL Database.
See Increase throughput performance to Azure SQL Database from Azure Stream Analytics for more details.

Looking for an alternative solution to processing tens of thousands of JSONs from Azure Blob to Azure SQL DB

I currently have pipelines developed that leverage Azure Data Factory for orchestration and Azure Databricks for its compute to perform the following actions: I receive tens of thousands of single-record JSON files into Azure Blob on a real-time basis, and on a 15-minute basis I check the folders for any new files. Once found, I load them into a DataFrame using Databricks and load these into a single table in SQL DB, before having other ADF jobs trigger stored procedures which then transform my data into the final SQL tables. We are looking to move away from Databricks, as we are not using it for its true capabilities but are of course paying the Databricks costs. I am looking for ideas on other solutions to load tens of thousands of JSONs into SQL DB (with minimal to no transformations) on a periodic (i.e. 15-minute) basis. We are a Microsoft shop, so we are not necessarily looking to move away from Azure tools.
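For reference, the Databricks step being replaced is roughly of this shape (the mount path, JDBC URL, credentials and staging table name below are placeholders, not the actual pipeline):

```python
# Rough illustration of the current Databricks step: read any new single-record JSON
# files into a DataFrame and append them to one staging table in SQL DB.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

new_events_df = spark.read.json("/mnt/landing/incoming/*.json")

(new_events_df.write.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.StagingEvents")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())
```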
Here are a few ideas:
use Azure Functions + Blob Trigger / Event Grid to process the JSON files in real time (every time a new JSON file arrives, it will trigger your function). Then you could insert either into the final table or into a temporary table (a rough sketch follows this list).
another idea would be to combine Azure Functions + Blob Trigger / Event Grid to sink the data to a data lake, and then use ADF to sink it into the final SQL tables.
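A minimal sketch of the first idea, assuming the Python v2 Functions programming model and pyodbc; the container path, app setting names and staging table are placeholders:

```python
# Sketch only: blob-triggered function that inserts each incoming single-record JSON
# file into a staging table. "incoming-json", the app setting names and the table name
# are placeholders; inserting into a final table or using an Event Grid trigger instead
# would follow the same shape.
import json
import logging
import os

import azure.functions as func
import pyodbc

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="incoming-json/{name}",
                  connection="BlobStorageConnection")
def load_json_to_sql(blob: func.InputStream):
    record = json.loads(blob.read())

    conn = pyodbc.connect(os.environ["SqlConnectionString"])
    try:
        with conn:  # commits on clean exit
            conn.execute(
                "INSERT INTO dbo.StagingEvents (FileName, Payload) VALUES (?, ?)",
                blob.name,
                json.dumps(record),
            )
    finally:
        conn.close()

    logging.info("Loaded %s into dbo.StagingEvents", blob.name)
```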
Azure SQL DB is actually pretty capable as far as JSON goes, so you could just use OPENROWSET to import the data directly from blob store and OPENJSON to shred it. You could then use a Logic App running on a schedule to call the proc, say, every 15 minutes; you wouldn't even need ADF as part of the solution.
I've worked up a couple of similar answers previously, e.g. here and here, but let me know if you want to progress further down this route and we can work up something more detailed.

Combine static and real time data in Azure Stream Analytics

I am looking into combining data stored in Azure SQL with real-time stream data coming via IoT Hub in Stream Analytics. One way I found is to use blob storage to copy the Azure SQL data and use it as the "Reference Data" input type, then JOIN it with the streaming data in the Stream Analytics query editor, which works fine. However, I am looking into whether it is possible to use the JavaScript UDF capability in Stream Analytics to get data from Azure SQL and combine it with the streaming IoT data. I also don't know which of these is the suggested approach for combining these types of data.
Thanks
UDFs in Stream Analytics won't allow you to call out to external services like SQL. They're used for things like basic data manipulation, regex, Math, etc. If your SQL data is slow-moving in nature, the approach you've outlined here, of using something like Data Factory to move the SQL information into Blob storage and then using it as reference data inside your Stream Analytics query, is the correct way (and currently the only way) to solve your problem.
If it's fast-moving data in SQL, you'd want to investigate hooking into the SQL database changes and then publishing them onto Event Hubs. You could then pull this in as a second Data Stream input type and do the appropriate joins in your query.

Azure SQL can't handle incoming stream analytics data

I have a scenario where an Event Hub receives data every 10 seconds, which is passed to Stream Analytics and then on to Azure SQL. The technical team raised the concern that Azure SQL is unable to handle that much data; once the data grows to around 20,000,000 records, it stops working.
Can you please advise whether this is an actual limitation of Azure SQL, and if it is, suggest a solution?
Keep in mind that 4 TB is the absolute maximum size of an Azure SQL Premium instance. If you plan to store all events for your use case, this will fill up very quickly. Consider using Cosmos DB or Event Hubs Capture if you really need to store the messages indefinitely, and use SQL for aggregates after processing with SQL DW or ADLS.
Remember that to optimise throughput in Event Hubs you must have a partitioning strategy. See the docs.
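For example, with the Python SDK a partition key keeps related events (say, per device) on a single partition while spreading the overall load; the connection string, hub name and payload below are placeholders:

```python
# Illustration only: send with a partition key so events for one device stay ordered
# on a single partition while different devices spread across partitions.
from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="<hub-name>")

with producer:
    batch = producer.create_batch(partition_key="device-42")
    batch.add(EventData('{"deviceId": "device-42", "temperature": 21.5}'))
    producer.send_batch(batch)
```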

Azure IoT data warehouse updates

I am building an Azure IoT solution for my BI project. For now, I have an application that, once per set time window, sends a .csv blob to Azure Blob Storage with an incrementing number in the name. So after some time I will have files in my storage such as 'data1.csv', 'data2.csv', 'data3.csv', etc.
Now I need to load this data into a database, which will be my warehouse, using an Azure Stream Analytics job. The issue is that the .csv files will have overlapping data: they will be sent every 4 hours and contain data for the past 24 hours. I need to always read only the last file (the one with the highest number) and prepare a lookup so that it properly updates the data in the warehouse. What would be the best approach to make Stream Analytics read only the latest file and to update records in the DB?
EDIT:
To clarify, I am fully aware that ASA is not capable of being an ETL job. My question is what the best approach for my case would be using IoT tools.
I would suggest one of these two ways:
use ASA to write into a temporary SQL table, and then use a SQL trigger to update the main table of the DW with the diff.
Or remove duplicates by adding a unique constraint as described here: https://blogs.msdn.microsoft.com/streamanalytics/2017/01/13/how-to-achieve-exactly-once-delivery-for-sql-output/
Thanks,
JS - Azure Stream Analytics
