Tradeoffs involved in count of partitions of event hub with azure functions - azure

I realize this may be a duplicate of Why not always configure for max number of event hub partitions?. However, the product has evolved and the defaults have changed, and the original issues may no longer be a factor.
Consider the scenario where the primary consumers of an event hub will be eventHubTrigger Azure Functions. The trigger function is backed by an EventProcessorHost which will automatically scale up or down to consume from all available partitions as needed without any administration effort.
As I understand it, the monetary cost of the Azure Function is based on the execution duration and count of invocations, which would be driven by only by the count of events consumed and not affected by the degree parallelism due to the count of partitions.
In this case, would there be any higher costs or complexity from creating a hub with the max of 32 partitions, compared to one with a small number like 4?

Your thought process makes sense: I found myself creating 32 partitions by default in this exact scenario and had no issues with that so far.
You pay per provisioned in-/egress, partition count doesn't add cost.
The only requirement is that your partition key has enough unique values to load partitions more or less evenly.

Related

In-order processing in Azure event hubs with Partitions and multiple "event processor" clients

I plan to utilize all 32 partitions in Azure event hubs.
Requirement: "Ordered" processing per partition is critical..
Question: If I increase the TU's (Throughput Units) to max available of 20 across all 32 partitions, I get 40 MB of egress. Let's say I calculated that I need 500 parallel client threads processing in parallel (EventProcessorClient) to achieve my throughput needs. How do I achieve this level of parallelism with EventProcessorClient while honoring my "Ordering" requirement?
Btw, In Kafka, I can create 500 partitions in a topic and Kafka allows only 1 thread per partition guaranteeing event order.
In short, you really can't do what you're looking to do in the way that you're describing.
The EventProcessorClient is bound to a given Event Hub and consumer group combination and will collaborate with other processors using the same Event Hub/consumer group to evenly distribute the load. Adding more processors than the number of partitions would result in them being idle. You could work around this by using additional consumer groups, but the EventProcessorClient instances will only coordinate with others in the same consumer group; the processors for each consumer group would act independently and you'd end up processing the same events multiple times.
There are also quotas on the service side that you may not be taking into account.
Assuming that you're using the Standard tier, the maximum number of concurrent reads that you could have for one Event Hub, across all partitions, with the standard tier is 100. For a given Event Hub, you can create a maximum of 20 consumer groups. Each consumer group may have a maximum of 5 active readers at a time. The Event Hubs Quotas page discusses these limits. That said, a dedicated instance allows higher limits but you would still have a gap with the strict ordering that you're looking to achieve.
Without knowing more about your specific application scenarios, how long it takes for an event to be processed, the relative size of the event body, and what your throughput target is, its difficult to offer alternative suggestions that may better fit your needs.

How do I decide how many partitions to use in Auzre Event Hub

Or phrased differently: what reason do I have to not take the max number of partitions (currently 32 without contacting Microsoft directly).
As far as I can tell more partitions means (potential) larger egress throughput, at no added monetary or computational cost. What's the catch? When would I not want to use as many partitions as I am possibly allowed to provision?
You are right in the observation that having a larger number of partitions won't cost you an extra dime when provisioning the event hub. But when the data comes in at scale you will have to allocate more TUs, so it will cost you extra based on the amount of data flowing in and out.
from the docs
Throughput in Event Hubs defines the amount of data in mega bytes or the number (in thousands) of 1-KB events that ingress and egress through Event Hubs. This throughput is measured in throughput units (TUs). Purchase TUs before you can start using the Event Hubs service. You can explicitly select Event Hubs TUs either by using portal or Event Hubs Resource Manager templates.
Another thing is that if you are using for example the Event Processor Host to process the data it has to spin up listeners for all partitions. If the incoming data is not that much and the data is divided over all those partitions you will have a lot of partitions dealing with small amount of data flowing in making it possible that there is not an optimal processing of this data.
From the docs:
The partition count on an event hub cannot be modified after setup. With that in mind, it is important to think about how many partitions you need before getting started.
Event Hubs is designed to allow a single partition reader per consumer group. In most use cases, the default setting of four partitions is sufficient. If you are looking to scale your event processing, you may want to consider adding additional partitions. There is no specific throughput limit on a partition, however the aggregate throughput in your namespace is limited by the number of throughput units. As you increase the number of throughput units in your namespace, you may want additional partitions to allow concurrent readers to achieve their own maximum throughput.
However, if you have a model in which your application has an affinity to a particular partition, increasing the number of partitions may not be of any benefit to you. For more information, see availability and consistency.
Your data processing pipeline has to deal with those partitions. If you have just one process/machine that has to process the insane amount of data that can theoretically can be send to an event hub.

High scale message processing in eventhub

As per my understanding, eventhub can process/ingest millions of messages per seconds. And to tune the ingesting, we can use throughput.
More throughput= more ingesting power.
But on receiving/consuming side, You can create upto 32 receivers(since we can create 32 partitions and one partition can be consumed by one receiver).
Based on above, if one single message takes 100 milisencond to process, one consumer can process 10 message per second and 32 consumer can process 32*10= 320 message per second.
How can I make my receiver consume more messages (for ex. 5-10k per seond).
1) Either I have to process message asynchronously inside ProcessEventsAsync. But in this case I would not be able to maintain ordering.
2) Or I have to ask Microsoft to allow me to create more partitions.
Please advice
TLDR: You will need to ask Microsoft to increase the number of partitions you are allowed, and remember that there is currently no way to increase the number on an already extant Event Hub.
You are correct that your unit of consumption parallelism is the partition. If your consumers can only do 10/seconds in order or even 100/second in order, then you will need more partitions to consume millions of events. While 100ms/event certainly seems slow to me and I think you should look for optimizations there (ie farm out work you don't need to wait for, commit less often etc), you will reach the point of needing more partitions at scale.
Some things to keep in mind: 32 partitions gives you only 32 Mb/s of ingress and 64Mb/s of egress. Both of these factors matter since that egress throughput is shared by all the consumer groups you use. So if you have 4 consumer groups reading the data (16Mb/s each) you'll need twice as many partitions (or at least throughput units) for input as you would based solely on your data ingress (because otherwise you would fall behind).
With regards to your comment about multitenancy, you will have one 'database consumer' group that handles all your tenants all of whose data will be flowing through the same hub? If so that sounds like a sensible use, what would not be so sensible is having one consumer group per tenant each consuming the entire stream.

Azure Storage Table transfer rate exceeded - how?

According to the new scalability targets for Azure, each partition in table storage is limited to 2000 entities/second.
I have been able to reach batch inserts of up to 16000 entities/second through parallelism.
For example, on my XtraLarge web role (8 CPUs, 8 cores), I am inserting 6400 entities (which is 64 separate batch inserts of 100 each) over 64 simultaneous parallel tasks.
How is this possible? Is 2000 entities/second just the minimum performance expected from a partition?
They are scalability targets, not limits. As you point out, it is the minimum expected performance, not the maximum. I imagine that you finding that at that particular time, on that particular network and hardware, that there is little contention for resources from other Azure customers. Also, note the size of your entities — Azure seems to perform well with small entities, and the targets will be for the maximum (1MB). Be warned — although the unexpected performance may be a useful, don't plan on it being there. Also make sure that you are coding for failure (500s and 503s) regardless of how often you hit the limits, otherwise when the storage service is underperforming, you app may begin to fail.

Azure Table Storage transaction limitations

I'm running performance tests against ATS and its behaving a bit weird when using multiple virtual machines against the same table / storage account.
The entire pipeline is non blocking (await/async) and using TPL for concurrent and parallel execution.
First of all its very strange that with this setup i'm only getting about 1200 insertions. This is running on a L VM box, that is 4 cores + 800mbps.
I'm inserting 100.000 rows with unique PK and unique RK, that should leverage the ultimate distribution.
Even more deterministic behavior is the following.
When I run 1 VM i get about 1200 insertions per second.
When I run 3 VM i get about 730 on each insertions per second.
Its quite humors to read the blog post where they are specifying their targets.
https://azure.microsoft.com/en-gb/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Single Table Partition– a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the 20,000 entities/second, which is the overall account target described above.
What shall I do to be able to utilize the 20k per second, and how would it be possible to execute more than 1,2k per VM?
--
Update:
I've now also tried using 3 storage accounts for each individual node and is still getting the performance / throttling behavior. Which i can't find a logical reason for.
--
Update 2:
I've optimized the code further and now i'm possible to execute about 1550.
--
Update 3:
I've now also tried in US West. The performance is worse there. About 33% lower.
--
Update 4:
I tried executing the code from a XL machine. Which is 8 cores instead of 4 and the double amount of memory and bandwidth and got a 2% increase in performance so clearly this problem is not on my side..
A few comments:
You mention that you are using unique PK/RK to get ultimate
distribution, but you have to keep in mind that the PK balancing is
not immediate. When you first create a table, the entire table will
be served by 1 partition server. So if you are doing inserts across
several different PKs, they will still be going to one partition
server and be bottlenecked by the scalability target for a single
partition. The partition master will only start splitting your
partitions among multiple partition servers after it has identified hot
partition servers. In your <2 minute test you will not see the
benefit of multiple partiton servers or PKs. The throughput in the
article is targeted towards a well distributed PK scheme with
frequently accessed data, causing the data to be divided amongst
multiple partition servers.
The size of your VM is not the issue as
you are not blocked on CPU, Memory, or Bandwidth. You can achieve
full storage performance from a small VM size.
Check out
http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx.
I just now did a quick test using that tool from a WebRole VM in the
same datacenter as my storage account and I acheived, from a single
instance of the tool on a single VM, ~2800 items per second upload
and ~7300 items per second download. This is using 1024 byte
entities, 10 threads, and 100 batch size. I don't know how efficient this tool is or if it disables Nagles Algorithm as I was unable to get great results (I got ~1000/second) using a batch size of 1, but at least with the 100 batch size it shows that you can achieve high items/second. This was done in US West.
Are you using Storage client library 1.7 (Microsoft.Azure.StorageClient.dll) or 2.0 (Microsoft.Azure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.
I suspect this may have to do with TCP Nagle.
See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests this is likely to negatively affect your performance.
You can disable TCP Nagle by executing this code when starting your application
ServicePointManager.UseNagleAlgorithm = false;
Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.
I would tend to believe that the maximum throughput is for an optimized load. For example, I bet you that you can achieve higher performance using Batch requests than individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't Batch in your current test.
So what if you changed your test to batch insert entities in groups of 100 (maximum per batch), still using GUIDs, but for which 100 entities would have the same PK?

Resources