High throughput send to EventHubs resulting in MessagingException / TimeoutException / Server was unable to process the request errors

We are experiencing lots of these exceptions sending events to EventHubs during peak traffic:
"Failed to send event to EventHub. Exception : Microsoft.ServiceBus.Messaging.MessagingException: The server was unable to process the request; please retry the operation. If the problem persists, please contact your Service Bus administrator and provide the tracking id."
or
"Failed to send event to EventHub. Exception : System.TimeoutException: The operation did not complete within the allocated time "
You can see it clearly here:
As you can see, we got lots of Internal Errors, Server Busy Errors, Failed Request when Incoming messages are over 400K events/hour (or ~270 MB/hour). This is not just a transient issue. It's clearly related to throughput.
Our EH has 32 partitions, message retention of 7 days, and 5 throughput units assigned. OperationTimeout is set to 5 mins, and we are using the default RetryPolicy.
Is there anything we still need to tweak here? We are really concerned about the scalability of EH.
Thanks

Send throughput is tuned mainly through an efficient partition distribution strategy; there isn't any single knob that does this. Below is the basic information you will need to design for high-throughput scenarios.
1) Let's start from the namespace: Throughput Units (aka TUs) are configured at the namespace level. Please bear in mind that the configured TUs apply to the aggregate of all EventHubs under that namespace. If you have 5 TUs on your namespace and 5 EventHubs under it, those 5 TUs are shared among all 5 EventHubs.
2) Now let's look at the EventHub level: if the EventHub is allocated 5 TUs and it has 32 partitions, no single partition can use all 5 TUs. For example, sending 5 TUs of data to 1 partition and zero to the other 31 partitions is not possible. The maximum you should plan per partition is 1 TU. In general, you will need to ensure that the data is distributed evenly across all partitions. EventHubs supports 3 types of sends, which give users different levels of control over partition distribution:
- EventHubClient.Send(EventDataWithoutPartitionKey) -> if you use this API to send, EventHubs takes care of evenly distributing the data across all partitions. The EventHubs service gateway will round-robin the data to all partitions. When a specific partition is down, the gateways auto-detect it and ensure clients don't see any impact. This is the most recommended way to send to EventHubs.
- EventHubClient.Send(EventDataWithPartitionKey) -> if you use this API to send to EventHubs, the partitionKey determines the distribution of your data. The PartitionKey is used to hash the EventData to the appropriate partition (the hashing algorithm is Microsoft proprietary and not shared). Typically, users who require correlation of a group of messages use this variant of Send.
- EventHubSender.Send(EventData) -> in this variant, the sender is already attached to a partition, so this gives the client complete control of the distribution across partitions.
To measure your present distribution of data, use the EventHubClient.GetPartitionRuntimeInformation API to estimate which partition is overloaded. The difference between BeginSequenceNumber and LastEnqueuedSequenceNumber gives an estimate of that partition's load compared to the others.
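As a rough illustration of the measurement described above, here is a minimal Python sketch that compares per-partition sequence number ranges and flags partitions carrying much more than the average. The sample numbers, the function names and the "2x the mean" threshold are assumptions made up for this example; in practice the begin/last values would come from the partition runtime information of each partition.

```python
from typing import Dict, Tuple


def partition_loads(seq_ranges: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Estimate retained events per partition as (last enqueued - begin) sequence number."""
    return {pid: max(last - begin, 0) for pid, (begin, last) in seq_ranges.items()}


def overloaded_partitions(loads: Dict[str, int], factor: float = 2.0) -> Dict[str, float]:
    """Return {partition: load / mean load} for partitions above `factor` times the mean."""
    if not loads:
        return {}
    mean = sum(loads.values()) / len(loads)
    if mean == 0:
        return {}
    return {pid: load / mean for pid, load in loads.items() if load > factor * mean}


# Hypothetical (BeginSequenceNumber, LastEnqueuedSequenceNumber) pairs, not real data.
sample = {
    "0": (1_000, 11_000),
    "1": (500, 10_500),
    "2": (2_000, 92_000),
    "3": (0, 10_000),
}

loads = partition_loads(sample)
hot = overloaded_partitions(loads)
print(loads)
print(hot)
```

With these made-up values, partition "2" holds three times the mean load and is the only one flagged, which is the kind of skew worth investigating before adding TUs.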
3) Last but not least, you can tune performance (not throughput) at the send operation level using the SendBatch API.
1 TU buys a maximum of 1000 msgs/sec or 1 MB/sec; you will be throttled by whichever limit is hit first, and this cannot be changed.
If your messages are small, let's say 100 bytes, you will hit the 1000 events/sec limit first. However, using the SendBatch API you can batch, say, 10 of the 100-byte msgs and push at the same rate (1000 msgs/sec) with just 100 API calls per second, improving the end-to-end latency of the system (as it also helps the service persist messages efficiently). Remember, the only limitation here is the maximum message size that can be sent, which is 256 KB (this limit applies to your batch size if you use the SendBatch API).
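The batching idea can be sketched in a few lines of Python. This is only an illustration of grouping payloads so that each batch stays under a size cap and a count cap; `batch_by_size` and its parameters are hypothetical names, it counts payload bytes only, and a real batch also carries header overhead, so leave some margin below the limit.

```python
from typing import Iterable, Iterator, List

# Size limit quoted above; it applies to the whole batch, not to each event.
MAX_BATCH_BYTES = 256 * 1024


def batch_by_size(events: Iterable[bytes],
                  max_bytes: int = MAX_BATCH_BYTES,
                  max_count: int = 10) -> Iterator[List[bytes]]:
    """Yield lists of events whose total payload size and count stay within the caps."""
    batch: List[bytes] = []
    size = 0
    for event in events:
        if len(event) > max_bytes:
            raise ValueError("single event exceeds the batch size limit")
        # Close the current batch if adding this event would break either cap.
        if batch and (size + len(event) > max_bytes or len(batch) >= max_count):
            yield batch
            batch, size = [], 0
        batch.append(event)
        size += len(event)
    if batch:
        yield batch


# 1000 events of 100 bytes each, batched 10 at a time -> 100 send calls.
events = [b"x" * 100 for _ in range(1000)]
batches = list(batch_by_size(events))
print(len(batches))
```

Each yielded list would then be handed to one batch send call, so 1000 small events cost 100 calls instead of 1000.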
Given that background, in your case:
- Having 32 partitions and 5 TUs - I would really double-check the Partition distribution strategy.
here's some more general reading on Event Hubs...

After a lot of digging we decided to stop setting the PK for posted messages, and the issue simply went away! We were using a GUID as the PK. We now see very few errors on the Azure Portal, and no more exceptions. Hope this helps someone else.

Related

Is there any message receiving limit per device on Azure IoTHub?

Is there any message receiving limit per device on Azure IoTHub?
If any, can I remove or raise the upper limit without registering additional devices?
I tested 2 things to check whether I can place enough load (ideally, 18000 messages/s) on Azure IoT Hub in future load tests.
① Send a certain amount of mqtt messages from a VM.
② Send a certain amount of mqtt messages from two VMs.
I expected the traffic of ② to be twice as large as that of ①, but it wasn't. The maximum messages per minute on IoT Hub in ② was not much different from ①: both were around 3.6k [message/min]. At that time, I had registered only one device on IoT Hub. So I added another device and ran ② again to see whether the second device would increase the traffic. It did, and IoT Hub showed more messages per minute.
Judging from this result, I think IoT Hub has some kind of limit on received messages per device, but I am not sure. If anyone knows about the limit, could you tell me what kind of limit it is and how to raise it without registering additional devices? In production we use only one device.
For your information, I know there is a "unit" setting to increase the throughput in IoT Hub. To increase the load I changed the number of units from 2 to 20 in both ① and ②. However, it did not increase the messages/min in IoT Hub. I'd also like to know why the "unit" did not work as expected.
Thank you in advance for reading. Any comment would help.

Every basic (B1, B2, B3) or standard (S1, S2, S3) unit of the IoT Hub SKU has a specific daily message quota as per https://azure.microsoft.com/en-us/pricing/details/iot-hub/. A single IoT Hub can support 1 million devices and there is no per-device cost, only the msg/day quota above.
E.g. the S1 SKU has a 400,000 msg/day quota and you can add multiple units of S1 to increase capacity. S2 has 6,000,000 msg/day and S3 has 300,000,000 msg/day quota per unit, and more units can be added.
Before this limit is reached, IoT Hub will raise an alert which can be used to automatically add more units or jump to a higher SKU.
Regarding your test, there are specific throttling limits to avoid misuse of the service, listed here:
https://learn.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-quotas-throttling
As an example, for 18000 msg/sec you will need 3 units of the S3 SKU (each with a 6000 msg/sec rate limit). In addition, there are other limits, such as how quickly connections can be attempted. If you use the Azure IoT SDKs, the built-in retry logic helps overcome this; otherwise you need your own retry policy. Basically you don't want a million devices trying to connect at the same time, so IoT Hub will only accept connections at a certain rate. This is not a concurrent connection limit but a rate at which new connections are accepted.
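The unit arithmetic above can be checked with a short Python sketch. The per-unit figures are copied from this answer and may have changed since, so treat them as assumptions and confirm them against the linked pages. The "sustained for a full day" scenario is also an assumption added here, to show that the daily quota is a separate constraint from the per-second rate limit.

```python
import math

# Per-unit figures as quoted in the answer above (verify against current docs).
DAILY_QUOTA = {"S1": 400_000, "S2": 6_000_000, "S3": 300_000_000}  # msg/day per unit
S3_RATE_PER_UNIT = 6_000  # msg/sec per S3 unit

SECONDS_PER_DAY = 86_400


def units_for_rate(target_per_sec: float, rate_per_unit: float) -> int:
    """Units needed so the per-second throttle is not exceeded."""
    return math.ceil(target_per_sec / rate_per_unit)


def units_for_daily_quota(msgs_per_day: float, sku: str) -> int:
    """Units needed so the daily message quota is not exceeded."""
    return math.ceil(msgs_per_day / DAILY_QUOTA[sku])


rate_units = units_for_rate(18_000, S3_RATE_PER_UNIT)

# Assumption: the 18000 msg/sec load is sustained for a full 24 hours.
sustained_per_day = 18_000 * SECONDS_PER_DAY
quota_units = units_for_daily_quota(sustained_per_day, "S3")

print(rate_units, sustained_per_day, quota_units)
```

For a short load test the rate limit is what matters; for a load held all day, the larger of the two unit counts would apply.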

Calculate incoming bytes per second in Azure Event hub

How do I calculate the incoming bytes per second for an event hub namespace?
I do not control the data producer and so cannot predict the incoming bytes upfront.
I am interested in adjusting the maximum throughput units I need, without using the auto-inflate feature.
1 TU provides 1 MB/s ingress & 2 MB/s egress, but the metrics are reported per minute, not per second.
Can I make a decision based on the sum/avg/max incoming bytes reported in the Azure portal?

I believe you'll need to use Stream Analytics to query your stream and based on the query output change your TU on Event Hub.
You can also try to use Azure Monitor, but I believe it won't group per second as you need, so you'd better try the first option.

Per-second metrics cannot be reliable, because inbound and outbound traffic can spike intermittently. 1-minute averages are good to monitor, and you can easily take action on them via a Logic App.
Check messaging metrics to monitor here - https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-metrics-azure-monitor#message-metrics
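To turn the per-minute figures into a TU estimate, a sketch like the following may help. It assumes you read the incoming bytes and incoming messages for your busiest minute from the metrics, treats 1 MB as 10^6 bytes (the conservative reading), and uses the per-TU ingress limits of 1 MB/s and 1000 events/s. The `headroom` factor is a hypothetical safety margin for bursts that a one-minute average hides; pick it from your own traffic pattern.

```python
import math

# Per-TU ingress limits; 1 MB taken as 10**6 bytes (assumption, conservative).
TU_INGRESS_BYTES_PER_SEC = 1_000_000
TU_INGRESS_EVENTS_PER_SEC = 1_000


def throughput_units_needed(bytes_per_minute: float,
                            messages_per_minute: float,
                            headroom: float = 1.0) -> int:
    """Estimate TUs from one-minute totals; whichever limit is hit first decides."""
    bytes_per_sec = bytes_per_minute / 60
    msgs_per_sec = messages_per_minute / 60
    by_bytes = bytes_per_sec / TU_INGRESS_BYTES_PER_SEC
    by_msgs = msgs_per_sec / TU_INGRESS_EVENTS_PER_SEC
    return max(1, math.ceil(max(by_bytes, by_msgs) * headroom))


# Hypothetical busiest minute: 150 MB and 120,000 messages.
print(throughput_units_needed(150_000_000, 120_000))                # bytes-bound
print(throughput_units_needed(150_000_000, 120_000, headroom=1.5))  # with burst margin
print(throughput_units_needed(6_000_000, 180_000))                  # message-count-bound
```

Feeding it the maximum one-minute sum over a representative period, rather than the average, gives a safer manual setting when auto-inflate is off.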

How to achieve high speed processing receiving from Azure Event Hub?

I am working on a POC for Azure Event Hubs to implement the same in our application.
Quick brief on the flow:
I created a tool that reads CSV data from a local folder and sends it to the event hub.
We are sending EventData to the event hub in batches.
With 12 instances of the tool running in parallel, I can send a total of 600 000 lines of messages to the event hub within 1 min.
But on the receiver side, receiving the 600 000 lines of data takes more than 10 mins.
Need to achieve
I would like to match or double my egress speed on the receiver to process the data.
Existing configuration
The configuration I have set up is:
TU - 10, one event hub with 32 partitions.
The coding logic is the same as described on MSDN.
The only difference is that I am sending lines of data in a batch.
EventProcessorHost with options {MaxBatchSize = 1000000, PrefetchCount = 1000000

To achieve a higher egress rate (aka a faster processing pipeline) in EventHubs:
- Create a scaled-out pipeline: each partition in an EventHub is the unit of scale for processing events out of the EventHub. With the scale you described (600,000 events per min --> 10K events per sec) and 32 partitions, you already got this right. Make sure you create as many partitions as you envision your pipeline needing in the near future. Imagine analyzing traffic on a highway where the number of lanes is the only limit on the amount of traffic.
- Equal load distribution across partitions: if you are using SendToASpecificPartition or SendUsingPartitionKey, you will need to take care of equal load distribution yourself. If you use EventHubClient.Send(EventDataWithOutPartitionKey), the EventHubs service will make sure all of your partitions are equally loaded. If a single EventHub partition is heavily loaded, the time it takes to process all events on the EventHub will be bound by the number of events on that partition.
- Scale out physical resources on the receiver/EventProcessorHost: most importantly network (sockets and bandwidth) and, after a point, CPU and memory. Use PartitionManagerOptions.MaxReceiveClients to increase the maximum number of EventHubClients (each of which has a dedicated MessagingFactory, which maps to 1 socket) created per EventProcessorHost instance. By default it is 16.
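The point about a single hot partition bounding the total processing time can be illustrated with a small Python sketch. The per-partition processing rate of 1000 events/sec and the skewed distribution are made-up numbers for this example, and it assumes all partitions are read in parallel at the same rate.

```python
from typing import Sequence


def drain_time_seconds(backlog_per_partition: Sequence[int],
                       events_per_sec_per_partition: float) -> float:
    """With partitions read in parallel, the largest backlog decides the total time."""
    if events_per_sec_per_partition <= 0:
        raise ValueError("processing rate must be positive")
    return max(backlog_per_partition, default=0) / events_per_sec_per_partition


TOTAL_EVENTS = 600_000
PARTITIONS = 32
RATE = 1_000  # hypothetical events/sec a reader handles per partition

# Evenly loaded: every partition holds the same share.
balanced = [TOTAL_EVENTS // PARTITIONS] * PARTITIONS

# Skewed: half of all events land on one partition, the rest are spread out.
skewed = [TOTAL_EVENTS // 2] + [(TOTAL_EVENTS // 2) // (PARTITIONS - 1)] * (PARTITIONS - 1)

print(drain_time_seconds(balanced, RATE))
print(drain_time_seconds(skewed, RATE))
```

Under these assumptions the same 600,000 events take 16 times longer to drain when half of them sit on one partition, even though the number of readers is unchanged.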
Let me know how it went... :)
More on Event Hubs.

What does the 500 telemetry data points per second limit actually mean in Application Insights?

On this documentation page there is the following limitation of Application Insights documented:
Up to 500 telemetry data points per second per instrumentation key (that is, per application). This includes both the standard telemetry sent by the SDK modules, and custom events, metrics and other telemetry sent by your code.
However it doesn't explain what the implications of that limit are?
a) Does it buffer and throttle, but still persist all data eventually? So say - 1000 data points get pushed within a second - it will persist the first 500, then wait for a bit and push the other 500?
or
b) Does it just drop/not log data? So say - 1000 data points get pushed within a second and only the first 500 will be persisted and the other 500 not (ever)?

It is the latter (b) with the caveat that ALL data will start to be throttled in this case, i.e. once RPC is > 500 (100 for free apps, please see https://azure.microsoft.com/en-us/documentation/articles/app-insights-data-retention-privacy/ for details) is detected, it will start rejecting all data from this instrumentation key on data collection endpoint, until RPC rate is back to under 500.
EDIT: Further information from Bret Grinslade:
The current implementation averages over one minute -- so if you send 30K in 1 minute (500*60) it will throttle your application. The HTTP response will tell the SDK to retry later. If the incoming rate never comes down, the response will tell the SDK to drop the data. We are working on other features to improve this experience -- pre-aggregation on the client, improved burst data rates, etc.

A bit more detail on top of Alex's response. The current implementation averages over one minute -- so if you send 30K in 1 minute (500*60) it will throttle your application. The HTTP response will tell the SDK to retry later. If the incoming rate never comes down, the response will tell the SDK to drop the data. We are working on other features to improve this experience -- pre-aggregation on the client, improved burst data rates, etc.
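To make the one-minute averaging concrete, here is a small Python model of the arithmetic. It is not the service's actual implementation: the class name, the sliding 60-second window and the strict "greater than 500*60" comparison are assumptions chosen for illustration.

```python
from collections import deque


class MinuteAverageThrottle:
    """Toy model: throttle when the last 60 seconds total more than limit * 60."""

    def __init__(self, limit_per_sec: int = 500, window_sec: int = 60) -> None:
        self.budget = limit_per_sec * window_sec      # 30,000 for the defaults
        self.window = deque(maxlen=window_sec)        # one count per second

    def record_second(self, count: int) -> bool:
        """Add one second's data point count; return True if now over budget."""
        self.window.append(count)
        return sum(self.window) > self.budget


# A one-second burst of 1000 followed by quiet traffic stays under the minute budget.
burst = MinuteAverageThrottle()
burst_results = [burst.record_second(c) for c in [1000] + [100] * 59]

# A sustained 600/sec exceeds the budget once more than 30,000 points accumulate.
sustained = MinuteAverageThrottle()
sustained_results = [sustained.record_second(600) for _ in range(60)]

print(any(burst_results))
print(sustained_results.index(True))
```

In this model a short spike above 500 in a single second is not throttled by itself; what matters is the total over the minute.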

AI now has the ingestion throttling limit of 16K EPS: https://learn.microsoft.com/en-us/azure/application-insights/app-insights-pricing

I am not sure which NoSQL is suitable for my scenario

I am trying to design a cloud-based system (IaaS) that will gather data from sensors (water pollution related activity) and, upon certain events, decide to process the data for a specific sensor.
Data characteristics are:
1. For each sensor data is being sent once every couple of days (up to 6 times a month)
2. each sensor reading contains about 5000 events that are encapsulated in 50-100 messages that are sent to the server (such "session" takes about 20 minutes where messages are sent every 5 seconds)
3. I am building the system to handle rate of 30,000 messages per second.
4. Processing of the data does not have to be real time; I have about 10 minutes once the "session" is finished to do the processing.
5. 90% of the sessions are not interesting and can be thrown away once they are finished. The other 10% contain one or more events in their messages, based on which I need to decide whether to process the entire session's data and send an alert to the sensor that there is pollution.
I created a tool that generates 5000 messages per second and I am trying to figure out which database would be the most optimal for my scenario.
These are the databases I am thinking to try:
Cassandra - For each session I will keep an in-memory collection of keys. The keys are for the messages that are stored in Cassandra. Once I detect a message that contains bad readings, I will need to pull all of the other messages in the "session" and process them (that means 50-100 requests to Cassandra). My concern here is about write performance (since I have many read and write operations), and I don't have a good strategy for deleting the 90% of sessions that are not needed.
Couchbase - I will save a document for each "session" according to sensorID and will append each message to the document. Once I detect a message that contains bad readings, I will only need to send one request for the document. My concern here is about the read performance.
Redis - Use it like Cassandra. I assume performance will be the best, but I will need to handle the sharding and replication of data myself in order not to reach the memory limit.
I would love to hear which option would be the most appropriate
thanks

Reg. Redis – You may consider using a DAAS (Data as a Service). The service will manage for you all the instances, clusters, scaling, data persistence and high availability settings.
One example, is Redis Cloud by Redis Labs

This is an interesting one. Let's go back to the basics of the CAP theorem and choose a DB based on the need for consistency, availability, and partition tolerance.
For high consistency and availability, choose MySQL, PostgreSQL, Greenplum, Vertica, Neo4J.
For high availability and partition tolerance, use Cassandra, Voldemort, Dynamo, CouchDB, Riak.
For high consistency and partition tolerance, use HBase, Redis, MongoDB, BerkeleyDB, BigTable.
So my vote is for Cassandra here.
