Sending data from Table Storage to Event Hub - Azure

It so happens that the systems we have to work with use Table Storage as an immutable batch layer. We would like new records added to Table Storage to be forwarded to Event Hub so that we can process them further.
Is there a way to continuously forward new records written to Table Storage to Event Hub?
Otherwise, is it possible to use either the Python or Java SDK for Azure Storage to read newly added records?
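For context, the approach we have in mind with the Python SDK is a small polling bridge along the following lines. This is only a rough sketch: the connection strings, table and hub names are placeholders, and filtering on the service-assigned Timestamp property is our assumption.

```python
# Hypothetical polling bridge: read entities added since the last poll and
# forward them to Event Hub. Names and connection strings are placeholders.
import json
import time
from datetime import datetime, timedelta, timezone

from azure.data.tables import TableClient
from azure.eventhub import EventHubProducerClient, EventData

table = TableClient.from_connection_string("<storage-connection-string>", table_name="batchlayer")
producer = EventHubProducerClient.from_connection_string(
    "<eventhub-namespace-connection-string>", eventhub_name="records")

# Timestamp is assigned by the Table service, so start slightly in the past
# to avoid missing rows written around startup.
last_seen = datetime.now(timezone.utc) - timedelta(minutes=5)

while True:
    cutoff = datetime.now(timezone.utc)
    new_rows = table.query_entities(
        query_filter="Timestamp gt @since",
        parameters={"since": last_seen},
    )
    batch = producer.create_batch()
    added = 0
    for row in new_rows:
        # add() raises when the batch is full; a real bridge would flush and continue
        batch.add(EventData(json.dumps(dict(row), default=str)))
        added += 1
    if added:
        producer.send_batch(batch)
    last_seen = cutoff
    time.sleep(30)  # poll interval; tune to your latency requirements
```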

Related

How to create a trigger (to Data Factory, Azure Function or Databricks) when 500 new files land in an Azure storage blob

I have an Azure storage container where I will be getting many files on a daily basis. My requirement is to trigger Azure Data Factory or Databricks each time 500 new files have arrived, so I can process them.
In Data Factory we have an event trigger that fires for each new file (with filename and path), but is it possible to get multiple new files and their details at the same time?
Which Azure services can I use for this scenario: Event Hub? Azure Functions? Queues?
One of the characteristics of a serverless architecture is to execute something whenever a new event occurs. Based on that, you can't use those services alone.
Here's what I would do:
#1 Azure Functions with a Blob Trigger, to execute whenever a new file arrives. This would not start the processing of the file, but just 'increment' the count of files, which would be stored in Cosmos DB.
#2 Azure Cosmos DB also offers a Change Feed, which works like event sourcing: it emits a notification whenever something changes in a collection. As the document created / modified in #1 will hold the count, you can use another Azure Function #3 to consume the change feed.
#3 This function will just contain an if statement that "monitors" the current count and, if it's above the threshold, starts the processing.
After that, you just need to update the document and reset the count.
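A rough sketch of that counter pattern, assuming the Azure Functions v2 Python programming model and the azure-cosmos SDK; the container names, the 500-file threshold, and the downstream "start processing" call are all placeholders:

```python
# function_app.py - minimal sketch of the blob-trigger + change-feed counter.
# Connection settings, database/container names and the threshold are placeholders.
import azure.functions as func
from azure.cosmos import CosmosClient

app = func.FunctionApp()
cosmos = CosmosClient.from_connection_string("<cosmos-connection-string>")
counter = cosmos.get_database_client("metadata").get_container_client("filecounter")

@app.blob_trigger(arg_name="blob", path="incoming/{name}", connection="AzureWebJobsStorage")
def on_new_file(blob: func.InputStream):
    # #1: increment the running count atomically with a partial-document patch
    counter.patch_item(
        item="counter", partition_key="counter",
        patch_operations=[{"op": "incr", "path": "/count", "value": 1}],
    )

@app.cosmos_db_trigger(arg_name="docs", connection="CosmosConnection",
                       database_name="metadata", container_name="filecounter",
                       lease_container_name="leases",
                       create_lease_container_if_not_exists=True)
def on_count_changed(docs: func.DocumentList):
    # #3: the change feed delivers the updated counter document
    for doc in docs:
        if doc.get("count", 0) >= 500:
            # start the batch processing (e.g. call ADF / Databricks), then reset
            counter.upsert_item({"id": "counter", "count": 0})
```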

Processing an Event Stream in Azure Databricks

I am looking to implement a solution to populate some tables in Azure SQL based on events that are flowing through Azure Event Hubs into Azure Data Lake Service (Gen2) using Data Capture.
The current ingestion architecture is attached as a diagram (Current Architecture).
I need to find an efficient way of processing each event that lands in the ADLS and writing it into a SQL database whilst joining it with other tables in the same database using Azure Databricks. The flow in Databricks should look like this:
Read event from ADLS
Validate schema of event
Load event data into Azure SQL table (Table 1)
Join certain elements of Table 1 with other tables in the same database
Load joined data into a new table (Table 2)
Repeat steps 1-5 for each incoming event
Does anyone have a reference implementation that has delivered against a similar requirement? I have looked at using Azure Data Factory to pick up and trigger a Notebook whenever an event lands in ADLS (note there is very low throughput of events, roughly one every 10 seconds); however, that solution would be too costly.
I am considering the following options:
Using Stream Analytics to stream the data into SQL (however, the joining part is quite complex and requires multiple tables)
Streaming from the Event Hub into Databricks (however this solution would require a new Event Hub, and to my knowledge would not make use of the existing data capture architecture)
Use Event Grid to trigger a Databricks Notebook for each Event that lands in ADLS (this could be the best solution, but I am not sure if it is feasible)
Any suggestions and working examples would be greatly appreciated.
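For illustration, a minimal Databricks-side sketch of the flow above, assuming Auto Loader picks up the Avro capture files as they land in ADLS and a foreachBatch sink performs the SQL load and join. The paths, schema, table names and JDBC details are placeholders rather than a reference implementation.

```python
# Sketch only: Auto Loader (cloudFiles) streams new capture files from ADLS,
# foreachBatch loads Table 1 and builds Table 2 via a join. All names are
# placeholders; `spark` is the session Databricks provides in a notebook.
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

expected_schema = StructType([                      # step 2: schema enforcement
    StructField("device_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
jdbc_props = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

def process_batch(batch_df, batch_id):
    # step 3: append the validated events to Table 1
    batch_df.write.jdbc(jdbc_url, table="Table1", mode="append", properties=jdbc_props)
    # steps 4-5: join with reference data already in the database and load Table 2
    ref = spark.read.jdbc(jdbc_url, table="ReferenceTable", properties=jdbc_props)
    joined = batch_df.join(ref, on="device_id", how="inner")
    joined.write.jdbc(jdbc_url, table="Table2", mode="append", properties=jdbc_props)

events = (spark.readStream
          .format("cloudFiles")                     # Databricks Auto Loader
          .option("cloudFiles.format", "avro")      # capture files are Avro
          .schema(expected_schema)
          .load("abfss://capture@<account>.dfs.core.windows.net/events/"))  # step 1

(events.writeStream
       .foreachBatch(process_batch)
       .option("checkpointLocation",
               "abfss://capture@<account>.dfs.core.windows.net/_checkpoints/events")
       .start())
```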

How to ingest blobs created by Azure Diagnostics into Azure Data Explorer by subscribing to Event Grid notifications

I want to send Azure Diagnostics to Kusto tables.
The idea is to get logs and metrics from various Azure resources by sending them to a storage account.
I'm following both Ingest blobs into Azure Data Explorer by subscribing to Event Grid notifications and Tutorial: Ingest and query monitoring data in Azure Data Explorer,
trying to use the best of all worlds - cheap intermediate storage for logs, and using EventHub only for notifications about the new blobs.
The problem is that only part of the data is being ingested.
I think the problem is with the append blobs that monitoring creates. When Kusto receives the "Created" notification, only part of the blob has been written, and the rest of the events are never ingested as the blob keeps being appended to.
My question is, how do I make this scenario work? Is it possible at all, or should I stick with sending logs to EventHub without using the blobs with Event Grid?
Append blobs do not work nicely with Event Grid ADX ingestion, as they generate multiple BlobCreated events.
If you are able to cause a blob rename on update completion, that would solve the problem.
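For example, here is a minimal sketch of that workaround with the Python storage SDK, assuming you can tell when a diagnostics append blob is complete; the container names and connection string are placeholders. Writing the finished content as a new block blob raises exactly one BlobCreated event for the Event Grid subscription that feeds the ADX data connection.

```python
# Sketch: copy a completed append blob to a block blob in a container watched
# by the Event Grid / ADX data connection. Names are placeholders.
from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string("<storage-connection-string>")

def publish_completed_blob(source_container: str, blob_name: str) -> None:
    source = svc.get_blob_client(source_container, blob_name)
    data = source.download_blob().readall()       # the append blob is no longer growing
    target = svc.get_blob_client("adx-ingest", blob_name)
    target.upload_blob(data, overwrite=True)      # one clean BlobCreated notification
```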

Stream Analytics: Dynamic output path based on message payload

I am working on an IoT analytics solution which consumes Avro formatted messages fired at an Azure IoT Hub and (hopefully) uses Stream Analytics to store messages in Data Lake and blob storage. A key requirement is the Avro containers must appear exactly the same in storage as they did when presented to the IoT Hub, for the benefit of downstream consumers.
I am running into a limitation in Stream Analytics regarding granular control over individual file creation. When setting up a new output stream path, I can only provide date/day and hour in the path prefix, resulting in one file for every hour instead of one file for every message received. The customer requires separate blob containers for each device and separate blobs for each event. Similarly, the Data Lake requirement dictates at least a sane naming convention delineated by device, with separate files for each event ingested.
Has anyone successfully configured Stream Analytics to create a new file every time it pops a message off of the input? Is this a hard product limitation?
Stream Analytics is indeed oriented toward efficient processing of large streams.
For your use case, you need an additional component to implement your custom logic.
Stream Analytics can output to Blob, Event Hub, Table Storage or Service Bus. Another option is to use the new IoT Hub Routes to route directly to an Event Hub or a Service Bus Queue or Topic.
From there you can write an Azure Function (or, from Blob or Table Storage, a custom Data Factory activity) and use the Data Lake Store SDK to write files with the logic that you need.
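As an illustration of that last step, here is a minimal sketch of an Event Hub-triggered Azure Function writing one file per event, partitioned by device. It uses the current Data Lake Storage Gen2 SDK rather than the original Data Lake Store SDK, and the filesystem name, path layout and the device-id metadata key are assumptions.

```python
# Sketch: one file per event in Data Lake, under a folder per device.
# Event hub / filesystem names and the metadata key are assumptions.
import uuid

import azure.functions as func
from azure.storage.filedatalake import DataLakeServiceClient

app = func.FunctionApp()
lake = DataLakeServiceClient.from_connection_string("<datalake-connection-string>")
fs = lake.get_file_system_client("telemetry")

@app.event_hub_message_trigger(arg_name="event", event_hub_name="devicemessages",
                               connection="EventHubConnection")
def write_event(event: func.EventHubEvent) -> None:
    # IoT Hub stamps the sending device on each message; the stripped key name
    # ("connection-device-id") is an assumption worth verifying.
    device_id = event.iothub_metadata.get("connection-device-id", "unknown")
    path = f"{device_id}/{uuid.uuid4()}.avro"
    # write the Avro body exactly as it was received, one file per event
    fs.get_file_client(path).upload_data(event.get_body(), overwrite=True)
```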

Where are Azure Event Hub messages stored?

I generated a SAS signature using this RedDog tool and successfully sent a message to Event Hub using the Event Hubs API reference. I know it was successful because I got a 201 Created response from the endpoint.
This tiny success brought about a question that I have not been able to find an answer to:
I went to the Azure portal and could not see the messages I created anywhere. Further reading revealed that I needed to create a storage account; I stumbled on some C# examples (EventProcessorHost) which require the storage account credentials, etc.
Question is, are there any APIs I can use to persist the data? I do not want to use the C# tool.
Please correct me if my approach is wrong, but my aim is to be able to post telemetries to EventHub, persist the data and perform some analytics operations on it. The telemetry data should be viewable on Azure.
You don't have direct access to the transient storage used for EventHub messages, but you could write a consumer that reads from the EventHub continuously and persist the messages to Azure Table or to Azure Blob.
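A minimal sketch of such a consumer with the Python SDK, assuming an "archive" container for the payloads and a separate "checkpoints" container; connection strings and names are placeholders.

```python
# Sketch: continuously read from Event Hub and persist each message to Blob storage.
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore
from azure.storage.blob import ContainerClient

STORAGE_CONN = "<storage-connection-string>"
archive = ContainerClient.from_connection_string(STORAGE_CONN, container_name="archive")
checkpoints = BlobCheckpointStore.from_connection_string(STORAGE_CONN, container_name="checkpoints")

consumer = EventHubConsumerClient.from_connection_string(
    "<eventhub-connection-string>",
    consumer_group="$Default",
    eventhub_name="telemetry",
    checkpoint_store=checkpoints,  # same role the storage account plays for EventProcessorHost
)

def on_event(partition_context, event):
    # persist each message body as its own blob, keyed by partition and offset
    name = f"{partition_context.partition_id}/{event.offset}.json"
    archive.upload_blob(name, event.body_as_str(), overwrite=True)
    partition_context.update_checkpoint(event)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")  # from the start if no checkpoint
```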
The closest thing you will find to a way to automatically persist messages (comparable to Amazon Kinesis Firehose versus plain Amazon Kinesis, which Event Hubs is roughly equivalent to) would be to use Azure Stream Analytics configured to write the output either to Azure Blob or to Azure Table. This example shows how to set up a Stream Analytics job that passes the data through and stores it in SQL, but you can see the UI where you can choose an output such as Azure Table. Or you can get an idea of the options from the output API.
Of course you should be aware of the serialization requirements that led to this question.
Event Hubs stores data for a maximum of 7 days, and that is in the Standard pricing tier. If you want to persist the data for longer in a storage account, you can use the Event Hubs Capture feature. You don't have to write a single line of code to achieve this. You can configure it through the Portal or an ARM template. This is described in this document - https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview
Event Hubs stores its transient data in Azure Storage; the documentation doesn't give any more detail about how that data is stored. This is evident from this documentation - https://learn.microsoft.com/en-us/azure/event-hubs/configure-customer-managed-key
The storage account you need for EventProcessorHost is only used for checkpointing or maintaining the offset of the last read event in a partition.
