Is there a way to find out which partition a message is written to when using EventHub.SendAsync(EventData)?

Is there a way to find out which partition a message is written to when using EventHub.SendAsync(EventData) from the Azure Event Hubs client SDK?
We intentionally do not provide a partition key so that the Event Hubs service can do its internal load balancing, but we want to find out which partition the data is eventually written to, in order to diagnose issues with the end-to-end data flow.

Ivan's answer is correct in the context of the legacy SDK (Microsoft.Azure.EventHubs), but the current generation (Azure.Messaging.EventHubs) is slightly different. You don't mention a specific language, but conceptually the answer is the same across them. I'll use .NET to illustrate.
If you're not using a call that requires specifying a partition directly when reading events, then you'll always have access to an object that represents the partition an event was read from. For example, if you're using the EventHubConsumerClient method ReadEventsAsync to explore, you'll see a PartitionEvent whose Partition property tells you the partition that the Data was read from.
When using the EventProcessorClient, your ProcessEventAsync handler will be invoked with a ProcessEventArgs whose Partition property tells you the partition that the Data was read from.
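A minimal sketch in C#, assuming the Azure.Messaging.EventHubs package (the connection string and hub name are placeholders):

using System;
using Azure.Messaging.EventHubs.Consumer;

var connectionString = "<event-hubs-connection-string>";
var eventHubName = "<hub-name>";

await using var consumer = new EventHubConsumerClient(
    EventHubConsumerClient.DefaultConsumerGroupName, connectionString, eventHubName);

// Each PartitionEvent carries the partition it was read from.
await foreach (PartitionEvent partitionEvent in consumer.ReadEventsAsync())
{
    Console.WriteLine(
        $"Partition {partitionEvent.Partition.PartitionId}: sequence {partitionEvent.Data.SequenceNumber}");
}

With the EventProcessorClient the equivalent information is available in the handler as args.Partition.PartitionId.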

There is no direct way to do this, but there are two workarounds.
1. Use Event Hubs Capture to store the incoming events, then check the events in the specified blob storage. When the events are stored in blob storage, the path contains the partition id, so you can read it from there.
2. Use code. Create a new consumer group and follow this article to read events. In that section there is a method public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages); you can use the PartitionContext parameter to get the event's partition id (via context.PartitionId).
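For illustration, a rough sketch of such a processor with the legacy Microsoft.Azure.EventHubs SDK (the class name here is hypothetical):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.EventHubs.Processor;

class PartitionLoggingProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;

    public Task ProcessErrorAsync(PartitionContext context, Exception error) => Task.CompletedTask;

    public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var message in messages)
        {
            // context.PartitionId is the partition this event was written to.
            Console.WriteLine($"Partition {context.PartitionId}, offset {message.SystemProperties.Offset}");
        }
        return context.CheckpointAsync();
    }
}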

Related

Slow down EventHubTrigger in Azure Function

I have a simple Azure function which:
as input uses EventHubTrigger
internally writes some events to Azure Storage
During some parts of the day the average batch size is around 250+, which is great because I write fewer blocks to Azure Storage, but for most of the time the batch size is less than 10.
Is there any way to force EventHubTrigger to wait until there are more than 50/100/200 messages to process, so I can reduce append blocks in Azure Storage?
Update--
The host.json file contains settings that control Event Hub trigger behavior. See the host.json settings section for details regarding available settings.
You can specify a timestamp such that you receive events enqueued only after the given timestamp.
initialOffsetOptions/enqueuedTimeUtc: Specifies the enqueued time of the event in the stream from which to start processing. When initialOffsetOptions/type is configured as fromEnqueuedTime, this setting is mandatory. Supports time in any format supported by DateTime.Parse(), such as 2020-10-26T20:31Z. For clarity, you should also specify a timezone. When timezone isn't specified, Functions assumes the local timezone of the machine running the function app, which is UTC when running on Azure. For more information, see the EventProcessorOptions documentation.
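For illustration, such a host.json fragment might look roughly like this (the exact property layout depends on the Event Hubs extension version, so treat this as an assumption to verify against the docs):

{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "initialOffsetOptions": {
        "type": "fromEnqueuedTime",
        "enqueuedTimeUtc": "2020-10-26T20:31:00Z"
      }
    }
  }
}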
---
If I understand your ask correctly, you want to hold all the received events until a certain threshold is reached and then process them at once in a single Azure Function run.
To receive events in a batch, make the string or EventData parameter an array. In the binding configuration properties that you set in the function.json file or on the EventHubTrigger attribute, set the "cardinality" property to "many" to enable batching. If omitted or set to "one", a single message is passed to the function. In C#, this property is automatically assigned whenever the trigger has an array for the type.
Note: when receiving in a batch you cannot bind to method parameters with event metadata; you must read it from each EventData object instead, as in the sketch below.
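For instance, a batched C# trigger might look roughly like this (the hub and connection names are placeholders):

using Microsoft.Azure.EventHubs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BatchedHandler
{
    [FunctionName("BatchedHandler")]
    public static void Run(
        [EventHubTrigger("my-hub", Connection = "EventHubConnection")] EventData[] events,
        ILogger log)
    {
        // The array parameter type gives the trigger "many" cardinality,
        // so the whole batch is handed to a single invocation.
        log.LogInformation($"Received a batch of {events.Length} events");

        foreach (var eventData in events)
        {
            // Event metadata must be read from each EventData, e.g. its enqueued time.
            log.LogInformation($"Event enqueued at {eventData.SystemProperties.EnqueuedTimeUtc}");
        }
    }
}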
Further, I could not find an explicit way to do this. You would have to modify your function, for example using a timer trigger, or use the Blob or Queue storage API to read all messages into a buffer and then write to a blob via a blob output binding.

From IOT hub to multiple tables in an Azure SQL database

I have an IOT hub with devices that push their sensor data to it, to be stored in a SQL database. This seems to be quite easy to do by means of a Stream Analytics job.
However, the tricky part is as follows. The data I'm pushing is not normalized, and since I'm using a SQL database I would like to structure it among multiple tables. This does not seem to be an easy task with Stream Analytics.
This is an example of the payload I'm pushing to the IOT hub:
{
  "timestamp": "2019-01-10 12:00",
  "section": 1,
  "measurements": {
    "temperature": 28.7,
    "height": 280,
    "ec": 6.8
  },
  "pictures": [
    "101_a.jpg",
    "102_b.jpg",
    "103_c.jpg"
  ]
}
My database has a table Measurement, MeasurementItem and Picture. I would like to store the timestamp and section in a Measurement record, the temperature, height and ec in a MeasurementItem record and the pictures in the Picture table.
Filling one table is easy, but to fill the second table I need the generated auto-increment ID of the previous record to keep the relation intact.
Is that actually possible with Stream Analytics, and if not, how should I do it?
You shouldn't try this with Stream Analytics (SA), for several reasons. It's not designed for workloads like this; otherwise SA would not be able to perform its work so efficiently. It just sends data to one or more sinks depending on the input data.
I would suggest passing the data to a component that is able to perform logic on the output side. There are a few options for this; two examples might be:
Azure Function (via service-bus-trigger pointing to the IoT hub built-in endpoint as described here)
Event-Grid-based trigger on a storage account you write the IoT data to (so again you could use an Azure Function, but let it be triggered by an event from a storage account)
These solutions also come at the price that each incoming data package invokes a logic unit that you have to pay for additionally. Be aware that there are billing options on Azure Functions that do not depend on the number of calls but provide the logic in a more App Service-like model.
If you have huge amounts of data to process you might consider an architecture using Data Lake Storage Account in combination with Data Lake Analytics instead. The latter can collect, aggregate and distribute your incoming data into different data stores too.
I ended up with an Azure Function with an IoT hub trigger. The function uses EF Core to store the JSON messages in the SQL database, spread over multiple tables. I was a bit reluctant about this approach, as it introduces extra logic and I expected to pay extra for that.
The opposite appeared to be true. For Azure Functions, the first 400,000 GB-s of execution and 1,000,000 executions are free. Moreover, this solution gives extra flexibility and control because the single-table limitation no longer applies.
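A rough sketch of that approach, focusing on the EF Core part (the entity, context, and connection names are hypothetical, and the function trigger plumbing is omitted):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Measurement
{
    public int Id { get; set; }                 // auto-increment key
    public DateTime Timestamp { get; set; }
    public int Section { get; set; }
    public List<MeasurementItem> Items { get; } = new List<MeasurementItem>();
    public List<Picture> Pictures { get; } = new List<Picture>();
}

public class MeasurementItem
{
    public int Id { get; set; }
    public string Name { get; set; }
    public double Value { get; set; }
}

public class Picture
{
    public int Id { get; set; }
    public string FileName { get; set; }
}

public class SensorContext : DbContext
{
    public DbSet<Measurement> Measurements { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options) =>
        options.UseSqlServer("<sql-connection-string>");   // placeholder
}

public static class MeasurementWriter
{
    // Called with the fields deserialized from one IoT hub message.
    public static async Task SaveAsync(DateTime timestamp, int section,
        double temperature, double height, double ec, IEnumerable<string> pictures)
    {
        var measurement = new Measurement { Timestamp = timestamp, Section = section };
        measurement.Items.Add(new MeasurementItem { Name = "temperature", Value = temperature });
        measurement.Items.Add(new MeasurementItem { Name = "height", Value = height });
        measurement.Items.Add(new MeasurementItem { Name = "ec", Value = ec });
        foreach (var picture in pictures)
            measurement.Pictures.Add(new Picture { FileName = picture });

        using (var db = new SensorContext())
        {
            // EF Core inserts the Measurement row first and reuses its generated
            // Id as the foreign key for the MeasurementItem and Picture rows.
            db.Measurements.Add(measurement);
            await db.SaveChangesAsync();
        }
    }
}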

Is it possible to have external trigger in Cassandra?

I need a worker to subscribe to new data entries in a column family.
Currently I either have to invoke the services that consume the data from the producer side, or poll the column family for new data, which wastes resources and also adds latency.
I want some external service to be invoked when new data is written to the column family. Is it possible to invoke an external service, such as a REST endpoint, upon new data arrival?
There are two features, triggers and CDC (change data capture), that may work. You can create a trigger to receive updates and execute the HTTP request, or you can use CDC to get a per-replica copy of the mutations as a log to walk through.
CDC is better for consistency: since a trigger fires before the mutation is applied, your API endpoint may be notified and then the mutation may fail to apply, leaving you in an inconsistent state. But triggers are easier, since you don't need to worry about deduplication; it's one invocation per query rather than one per replica. Or you can use both: triggers that update a cached state, and then CDC with a map-reduce job to fix any inconsistencies.

Azure ServiceBus. QueueClient. ReceiveBatch issue

I have an Azure Service Bus setup with a bunch of queues on it, some of which are partitioned. When I tried to read dead-letter messages from one of those queues, I deferred the messages, did some massaging, and then tried to complete the deferred messages. And this is where the trouble came in. On the call to QueueClient.ReceiveBatch() I'm getting an InvalidOperationException with the following message:
ReceiveBatch of sequence numbers from different partitions is not supported
for an entity with partitioning enabled.
Inner exception contains following justification:
BR0012ReceiveBatch of sequence numbers from different partitions is not
supported for an entity with partitioning enabled.
Here is actual line of code, which produces the error:
var deferredMessages = queueClient?.ReceiveBatch(lstSequenceNums);
where lstSequenceNums is of type List<long> and contains the sequence numbers of the deferred messages; queueClient is of type QueueClient.
So I wonder how this situation should be handled? I don't quite understand why that exception has been thrown in the first place. If that is expected behavior, how can I figure out the relation between the partition and the Service Bus message sequence number?
Any help would be greatly appreciated.
From Azure Service Bus partitioning documentation, for a single message requested
When a client wants to receive a message from a partitioned queue, or from a subscription to a partitioned topic, Service Bus queries all fragments for messages, then returns the first message that is obtained from any of the messaging stores to the receiver.
and
Each partitioned queue or topic consists of multiple fragments. Each fragment is stored in a different messaging store and handled by a different message broker.
I suspect that by requesting a batch using sequence numbers from different partitions you're asking for messages from multiple partitions at once; that would force querying too many brokers, and the Azure Service Bus service doesn't allow it.
The SequenceNumber can help determine which partition your messages are coming from: the top 16 bits are used to encode the partition ID. You could use that to break your batch into separate calls.
ReceiveBatch(IEnumerable<Int64>) doesn't say anything about this. I suggest raising an issue with the team to clarify the documentation.
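If you do want to keep batching, a sketch of that workaround, reusing lstSequenceNums and queueClient from the question and assuming the top-16-bit encoding above (System.Linq is needed for GroupBy):

// Group the deferred sequence numbers by the partition encoded in the top 16 bits,
// then issue one ReceiveBatch call per partition so no call spans partitions.
var byPartition = lstSequenceNums.GroupBy(seq => (int)((ulong)seq >> 48));

var deferredMessages = new List<BrokeredMessage>();
foreach (var partitionGroup in byPartition)
{
    deferredMessages.AddRange(queueClient.ReceiveBatch(partitionGroup));
}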
I made it work by reading messages one by one (getting all messages in a loop) and abandoning the batch-reading approach completely, because it's too fragile and issue-prone. It's the only option one has at the moment, imho, if there are partitioned queues on one's Azure Service Bus. Sean Feldman provided some insights (thank you again), but his answer doesn't provide any acceptable way of making ReceiveBatch a viable solution for the task at hand.

Azure event hubs and multiple consumer groups

Need help on using Azure event hubs in the following scenario. I think consumer groups might be the right option for this scenario, but I was not able to find a concrete example online.
Here is the rough description of the problem and the proposed solution using the event hubs (I am not sure if this is the optimal solution. Will appreciate your feedback)
I have multiple event-sources that generate a lot of event data (telemetry data from sensors) which needs to be saved to our database and some analysis like running average, min-max should be performed in parallel.
The sender can only send data to a single endpoint, but the event-hub should make this data available to both the data handlers.
I am thinking about using two consumer groups, first one will be a cluster of worker role instances that take care of saving the data to our key-value store and the second consumer group will be an analysis engine (likely to go with Azure Stream Analysis).
Firstly, how do I set up the consumer groups, and is there something that I need to do on the sender/receiver side so that copies of events appear in all consumer groups?
I did read many examples online, but they either use client.GetDefaultConsumerGroup(); and/or have all partitions processed by multiple instances of the same worker role.
For my scenario, when an event is triggered, it needs to be processed by two different worker roles in parallel (one that saves the data and a second one that does some analysis).
Thank You!
TLDR: Looks reasonable, just make two Consumer Groups by using different names with CreateConsumerGroupIfNotExists.
Consumer Groups are primarily a concept so exactly how they work depends on how your subscribers are implemented. As you know, conceptually they are a group of subscribers working together so that each group receives all the messages and under ideal (won't happen) circumstances probably consumes each message once. This means that each Consumer Group will "have all partitions processed by multiple instances of the same worker role." You want this.
This can be implemented in different ways. Microsoft has provided two ways to consume messages from Event Hubs directly plus the option to use things like Streaming Analytics which are probably built on top of the two direct ways. The first way is the Event Hub Receiver, the second which is higher level is the Event Processor Host.
I have not used the Event Hub Receiver directly, so this particular comment is based on the theory of how these sorts of systems work and speculation from the documentation: while they are created from EventHubConsumerGroups, this serves little purpose, as these receivers do not coordinate with one another. If you use these you will need to (and can!) do all the coordination and committing of offsets yourself, which has advantages in some scenarios, such as writing the offset to a transactional DB in the same transaction as computed aggregates. Using these low-level receivers, having different logical consumer groups use the same Azure consumer group probably shouldn't (normative, not practical, advice) be particularly problematic, but you should use different names in case it either does matter or you change to EventProcessorHosts.
Now onto more useful information, EventProcessorHosts are probably built on top of EventHubReceivers. They are a higher level thing and there is support to enable multiple machines to work together as a logical consumer group. Below I've included a lightly edited snippet from my code that makes an EventProcessorHost with a bunch of comments left in explaining some choices.
//We need an identifier for the lease. It must be unique across concurrently
//running instances of the program. There are three main options for this. The
//first is a static value from a config file. The second is the machine's NETBIOS
//name ie System.Environment.MachineName. The third is a random value unique per run which
//we have chosen here, if our VMs have very weak randomness bad things may happen.
string hostName = Guid.NewGuid().ToString();
//It's not clear if we want this here long term or if we prefer that the Consumer
//Groups be created out of band. Nor are there necessarily good tools to discover
//existing consumer groups.
NamespaceManager namespaceManager =
NamespaceManager.CreateFromConnectionString(eventHubConnectionString);
EventHubDescription ehd = namespaceManager.GetEventHub(eventHubPath);
namespaceManager.CreateConsumerGroupIfNotExists(ehd.Path, consumerGroupName);
host = new EventProcessorHost(hostName, eventHubPath, consumerGroupName,
eventHubConnectionString, storageConnectionString, leaseContainerName);
//Call something like this when you want it to start
await host.RegisterEventProcessorFactoryAsync(factory);
You'll notice that I told Azure to make a new Consumer Group if it doesn't exist; you'll get a lovely error message if it doesn't. I honestly don't know what the purpose of this is, because it doesn't include the Storage connection string, which needs to be the same across instances in order for the EventProcessorHost's coordination (and presumably commits) to work properly.
Here I've provided a picture from Azure Storage Explorer of the leases, and presumably the offsets, from a Consumer Group I was experimenting with in November. Note that while I have a testhub and a testhub-testcg container, this is due to manually naming them; if they were in the same container it would be things like "$Default/0" vs "testcg/0".
As you can see there is one blob per partition. My assumption is that these blobs are used for two things. The first of these is the Blob leases for distributing partitions amongst instances see here, the second is storing the offsets within the partition that have been committed.
Rather than the data getting pushed to the Consumer Groups, the consuming instances ask the storage system for data at some offset in one partition. EventProcessorHosts are a nice high-level way of having a logical consumer group where each partition is only read by one consumer at a time, and where the progress the logical consumer group has made in each partition is not forgotten.
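For completeness, the factory handed to RegisterEventProcessorFactoryAsync above might be as minimal as this sketch (hypothetical class names, old Microsoft.ServiceBus.Messaging SDK):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class SimpleProcessorFactory : IEventProcessorFactory
{
    public IEventProcessor CreateEventProcessor(PartitionContext context) => new SimpleProcessor();
}

class SimpleProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var message in messages)
        {
            // Per-event work for this logical consumer group goes here.
        }
        // Checkpointing is what writes the committed offset into the lease blobs shown above.
        await context.CheckpointAsync();
    }
}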
Remember that the throughput per partition is measured so that if you're maxing out ingress you can only have two logical consumers that are all up to speed. As such you'll want to make sure you have enough partitions, and throughput units, that you can:
Read all the data you send.
Catch up within the 24 hour retention period if you fall behind for a few hours due to issues.
In conclusion: consumer groups are what you need. The examples you read that use a specific consumer group are good; within each logical consumer group use the same name for the Azure Consumer Group, and have different logical consumer groups use different ones.
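Concretely, for the save-to-store and analysis consumers in your scenario that could be as simple as two calls like the one above (the group names are placeholders):

namespaceManager.CreateConsumerGroupIfNotExists(ehd.Path, "persistence");
namespaceManager.CreateConsumerGroupIfNotExists(ehd.Path, "analytics");

Each logical consumer group then registers its own EventProcessorHost with its own consumer group name.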
I haven't yet used Azure Stream Analytics, but at least during the preview release you are limited to the default consumer group. So don't use the default consumer group for something else, and if you need two separate lots of Azure Stream Analytics you may need to do something nasty. But it's easy to configure!

Resources