I'm working on a NestJS project that receives data from SAP MII and then send it to EventHub. Unfortunately, EventHub supports a maximum of 1MB (https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas), and in my case, SAP MII sometimes returns 4MB+ and I still need to send it to EventHub.
I have a few ideas on my mind, but I'm not sure if there's a better approach to it or even if there's a way to change EventHub size limit.
Related
Is there any message receiving limit per device on Azure IoTHub?
If any, can I remove or raise the upper limit without registering additional devices?
I tested 2 things to make sure if I can place enough load (ideally, 18000 message/s)on Azure IoT Hub in the future load tests.
① Send a certain amount of mqtt messages from a VM.
② Send a certain amount of mqtt messages from two VMs.
I expected that the traffic of ② would be twice as large as that of ①. But it wasn't. Maximum messages per minute on IoTHub of ② is not so different from that of ①. Both of them are around 3.6k [message/min]. At that time, I registered only one device on IoT Hub. So I added another device and tested ② again to see if the second device could increase the traffic. As a result, it increased the traffic and IoT Hub had bigger messages per minute.
Judging from this result, I thought IoTHub has some kind of limit on receiving message per device. But I am not sure. So if anyone know about the limit, could you tell me what kind of limit it is and how to raise the upper limit without registering additional devices because in production we use only one device.
For your information, I know there is a "unit" to increase the throughput in IoTHub. To increase the load I changed the number of unit from 2 to 20 in both ① and ②. However, it did not make messages/min in IotHub bigger. I'd also like to know why the "unit" did not work as expected.
Thank you for reading, in advance. Any comment would be my help.
Every basic (B1,B2, B3) or standard unit of IoT Hub SKU (S1, S2, S3) has specific daily message quota as per https://azure.microsoft.com/en-us/pricing/details/iot-hub/. A single IoTHub can support 1 million devices and there is no per device cost associated, only the msg/day quota as above.
e.g. S1 SKU has 400,000 msg/day quota and you can add multiple units of S1 to increase capacity. S2 has 6000,000 msg/day and S3 has 300,000,000 msg/day quota per unit and more units can be added.
Before this limit is reached IoTHub will raise alert which can be used to automatically add more units or jump to higher SKU.
Regarding your test, there are specific throttling limits to avoid misuse of the service here -
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-quotas-throttling
As an example, for 18000 msg/sec you will need 3 units of S3 SKU (each with 6000 msg/sec rate limit). In addition there are other limits like how quickly connections can be attempted, if using Azure IoT SDK's the built-in retry logic helps overcome this otherwise you need to have retry policy. Basically you dont want million device trying to connect at the same time, IoTHub will only accept connections at a certain rate. This is not concurrent connection limit but a rate at which new connnections are accepted.
For every service which read/write from/to topic in Kafka/Redis, there are some basic metrics which we want to have in Prometheus:
How "fast" the writes are for every topic
How "fast" the reads are for every topic
In Kafka, I may want to determine how "fast" each group-id reads.
To determine the "speed" of reading from a topic, one can think of a mechanism where someone publish the same message in intervals of 10 seconds and the consumer sends to Prometheus when it fully processed that message. If the graph show that the message was read every 12 seconds, it means that we have a lag of 2 seconds when reading any messages.
It looks like a lot of repeated manual work on every topic there is on the system.
Question
Is my proposal makes any sense? Are there any best-practice/tools on how to determine "lags"/"speed" of reading/writing from every topic in redis/kafka/... in Prometheus?
I had the exact same issue once.
Maintaining the each topic metrics manually is very tiring and not at all scalable.
I switched over to using kafka_consumergroup_lag metric from the kafka_exporter
This along with the consumergroup,topic labels were enough to let us know to know which topic was not being read/lagging behind and by which consumer group.
Also has other metrics like the rate of meassages being read.
As for converting this lag in terms of time, either attach an produce time to kafka message and read it at the other end of the kafka pipeline and export the difference in time via micrometer from the application to Prometheus.
Or better still:- use tracing to track each message in the piepline using OpenTracing tools like Jaeger
Use this for Redis monitoring.
All these exporters send the data in the Prometheus format and can be directly integrated.
I have this problem where I am reading messages from SQL server (data could vary some fields are varchar(max)) and batch sending them by serializing them to JSON Azure Event Hubs Batch sending mechanism has a limit of 250Kb, is there a easy to check the size of the list of eventdata that gets trasmitted before the submission? Most of the times the data is tiny so I would want to batch up several messages. How can I know what size will the payload is going to be. I really don't want to count characters and calculate the size and thinking there is a better way you guys can help with?
If you are on .NET SDK WindowsAzure.ServiceBus, you can use SerializedSizeInBytes property (see docs). Then sum them up for the batch.
Unfortunately, I don't see it in the new Azure Event Hubs Client for .NET.
You can use TryAdd function available in EventHub SDK.
Event Hub Data batch - TryAdd (EventData data)
I am working on the POC for Azure Event hubs to implement the same into our application.
Quick Brief on flow.
Created tool to read the CSV data from local folder and send it to event hub.
We are sending Event Data in Batch to event hub.
With 12 instance of tool (Parallel), I can send a total of 600 000 lines of messages to Event hub within 1 min.
But, On receiver side, to receive the 600 000 lines of data, it takes more than 10 mins.
Need to achieve
I would like to Match/double my egress speed on the receiver to
process the data. Existing Configuration
The configuration I have made user are
TU - 10 One Event hub with 32 Partition.
Coding logic goes same as mentioned in MSDN
Only difference is, I am sending line of data in a batch.
EventProcessorhost with options {MaxBatchSize= 1000000,
PrefetchCount=1000000
To achieve higher egress rate (aka faster processing pipeline) in eventhubs:
Create a Scaled-out pipeline - each partition in EventHub is the unit-of-scale for processing events out of EventHub. With the Scale you described (6Lakh events per min --> 10K events per sec - with 32 partitions - you already got this right). Make sure you create as many partitions as you envision your pipeline need in near future. Imagine analyzing traffic on a Highway and no. of lanes is the only limitation for the amount of traffic.
Equal load distribution across partitions: if you are using SendToASpecificPartition or SendUsingPartitionKey - you will need to take care of equal load distribution. If you use EventHubClient.Send(EventDataWithOutPartitionKey) - EventHubs service will make sure all of your partitions are equally loaded. If a single EventHub Partition is heavily loaded - the amount of time you can process all events on EventHub will be bound by no. of events on this Partition.
Scale-out physical resources on the Receiver/EventProcessorHost: most importantly Network (Sockets & bandwidth) & after-a-point, CPU & Memory. Use PartitionManagerOptions.MaxReceiveClients to increase the maximum number of EventHubClients (which has a dedicated MessagingFactory, which maps to 1 socket) created per EventProcessorHost instance. By default it is 16.
Let me know how it went... :)
More on Event Hubs.
We are experiencing lots of these exceptions sending events to EventHubs during peak traffic:
"Failed to send event to EventHub. Exception : Microsoft.ServiceBus.Messaging.MessagingException: The server was unable to process the request; please retry the operation. If the problem persists, please contact your Service Bus administrator and provide the tracking id."
or
"Failed to send event to EventHub. Exception : System.TimeoutException: The operation did not complete within the allocated time "
You can see it clearly here:
As you can see, we got lots of Internal Errors, Server Busy Errors, Failed Request when Incoming messages are over 400K events/hour (or ~270 MB/hour). This is not just a transient issue. It's clearly related to throughput.
Our EH has 32 partitions, message retention of 7 days, and 5 throughput units assigned. OperationTimeout is set to 5 mins, and we are using the default RetryPolicy.
Is it anything we still need to tweak here? We are really concerned about the scalability of EH.
Thanks
Send throughput tuning can be achieved using efficient partition distribution strategies. There isn't any single knob which can do this. Below is the basic information you will need to be able to design for High-Thruput Scenarios.
1) Lets start from the Namespace: Throughput Units(aka TUs) are configured at Namespace level. Pls. bear in mind, that, TUs configured is applied - aggregate of all EventHubs under that Namespace. If you have 5 TUs on your Namespace and 5 eventhubs under it - it will be divided among all 5 eventhubs.
2) Now lets look at EventHub level: If the EventHub is allocated with 5 TUs and it has 32 partitions - No single partition can use all 5 TUs. For ex. if you are trying to send 5TU of data to 1 partition and 'Zero' to all other 31 partitions - this is not possible. Maximum you should plan per Partition is 1 TU. In general, you will need to ensure that the data is distributed evenly across all partitions. EventHubs support 3 types of sends - which gives users different level of control on Partition distribution:
EventHubClient.Send(EventDataWithoutPartitionKey) -> if you are using this API to send - eventhub will take care of evenly distributing the data across all partitions. EventHubs service gateway will round-robin the data to all partitions. When a specific partition is down - the Gateways auto-detect and ensure Clients doesn't see any impact. This is the most recommended way to Send to EventHubs.
EventHubClient.Send(EventDataWithPartitionKey) -> if you are using this API to send to EventHubs - the partitionKey will determine the distribution of your data. PartitionKey is used to Hash the EventData to the appropriate partition (algo. to hash is Microsoft Proprietary and not Shared). Typically users who require correlation of a group of messages will use this variant of Send.
EventHubSender.Send(EventData) -> In this variant, the Sender is already attached to the Partition. So - this gives complete control of Distribution across partitions to the Client.
To measure your present distribution of Data - use EventHubClient.GetPartitionRuntimeInfo Api to estimate which Partition is overloaded. The difference b/w BeginSequenceNumber and LastEnqueuedSequenceNumber is supposed to give an estimate of that partitions load compared to others.
3) Last but not the least - you can tune performance (not Throughput) at send operation level - using the SendBatch API.
1 TU can buy a Max of 1000 msgs/sec or 1MBPS - you will be throttled with whichever limit hits first - this cannot be changed.
If your messages are small - lets say 100 bytes and you can send only 1000 msgs/sec (as per the TU limit) - you will first hit the 1000 events/sec limit. However, overall using SendBatch API - you can batch lets say 10 of 100byte msgs and push at the same rate - 1000 msgs/sec with just 100 API calls and improve the end-to-end latency of the system (as it helps service also to persist messages efficiently). Remember, the only limitation here is the Max. Msg Size that can be sent - which is 256 kb (this limit will apply on your BatchSize if you use SendBatch API).
Given that background, in your case:
- Having 32 partitions and 5 TUs - I would really double-check the Partition distribution strategy.
here's some more general reading on Event Hubs...
After a lot of digging we decided to stop setting the PK for posted messages, and the issue simply went away!. We were using GUID as PK. We start to get very few erros on the Azure Portal, and no more exceptions. Hope this helps someone else