Why not always configure for max number of event hub partitions? - azure

The Azure Event Hubs overview article states the following:
The number of partitions is specified at the Event Hub creation time
and must be between 8 and 32. Partitions are a data organization
mechanism and are more related to the degree of downstream parallelism
required in consuming applications than to Event Hubs throughput. This
makes the choice of the number of partitions in an Event Hub directly
related to the number of concurrent readers you expect to have. After
Event Hub creation, the partition count is not changeable; you should
consider this number in terms of long-term expected scale. You can
increase the 32 partition limit by contacting the Azure Service Bus
team.
Since you cannot change the number of partitions on your event hub after initial creation, why not just always configure it to the maximum number of partitions, 32? I do not see any pricing implications in doing this. Is there some performance trade off?
Also, as a side note, I appear to be able to create an event hub with fewer than 8 partitions. The article says it must be between 8 and 32, so I'm not sure why it says that...

It's my understanding that each partition requires its own consumer. You could do this via multi-threading on a single process, multiple processes, or even multiple machines each running a process. But this comes with a degree of complexity: either managing all the processes to ensure that every partition is being consumed, or synchronizing items/events that span partitions.
So the implications are less about pricing than they are about scalability/complexity. :)
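To make the management burden concrete, here is a minimal sketch (not the Event Hubs SDK) of the bookkeeping you inherit with 32 partitions: every partition must be assigned to some consumer, or its events simply go unread. The round-robin scheme and worker counts are assumptions for illustration.

```python
# Sketch: round-robin assignment of partition IDs to a fixed pool of
# worker processes. Real deployments use EventProcessorHost / checkpointing,
# but the coverage problem is the same: no partition may be left unowned.

def assign_partitions(partition_count: int, worker_count: int) -> dict:
    """Map each worker index to the list of partitions it must consume."""
    assignment = {w: [] for w in range(worker_count)}
    for p in range(partition_count):
        assignment[p % worker_count].append(p)
    return assignment

# With 32 partitions and 4 workers, each worker owns 8 partitions.
plan = assign_partitions(32, 4)
print({w: len(ps) for w, ps in plan.items()})
```

If a worker dies, its partitions must be reassigned to the survivors, which is exactly the coordination cost the answer above alludes to.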

Related

Can I limit concurrency on a *specific* activity function?

I have a durable functions app, running on a premium elastic service plan in Azure, wherein I
(a) perform a one off task that returns a potentially large number of results
(b) run some independent processing on each result from part (a)
Part (a) relies on an external database, which starts rejecting requests when I hit a certain number of concurrent requests.
Part (b) doesn't have such a third party dependency, and should theoretically be able to scale indefinitely.
I'm aware of the ability to place limits on:
The maximum number of instances my service plan will scale out to
The number of concurrent requests per-instance
The number of concurrent activity functions
However, using any of these options to limit (a) would also limit (b), which I'd like to leave as concurrent as possible.
Is there a way I can limit the number of concurrent invocations of activity function (a), without placing restrictions on the number of invocations of (b)?
(If all else fails I can track the number of current executions myself in storage as part of running activity (a), but I'd much prefer to either configure this, or be able to drive it from the durable functions framework, if possible - as it is already tracking the number of queued activity functions of each type.)
Is there a way I can limit the number of concurrent invocations of activity function (a), without placing restrictions on the number of invocations of (b)?
Yes, there are plenty of tools in Azure which will allow you to build publish / subscribe segregation of (a) and (b). Perhaps the mistake here is to think that the results from (a) need to be processed in-process / synchronously with the consumer which sinks / processes these results.
i.e. If there is a good chance that (b) cannot keep up with the messages retrieved from (a), then I would consider separating the task of obtaining data from (a) from the task of processing the data in (b) via a queue or log technology.
Concentrating on (b):
If (b) requires command or transaction semantics (i.e. exactly once, guaranteed), then Azure Service Bus can be used to queue commands until they can be processed, and consumers of messages can be scaled independently of the production of messages in (a), using subscriptions. Think RabbitMQ.
If (b) can handle less reliable guarantees, e.g. at-least-once semantics, then Azure Event Hubs will allow you to partition messages across multiple concurrent consumers. Think Kafka.
Other alternatives exist too, e.g. Storage queues (low cost) and Event Grid (a wide range of subscriber protocols).
So, TL;DR: decouple the accumulation of data from its processing if you fear there are throughput disparities between your ability to acquire the data and your ability to process it.
Interestingly enough, if the process delivering to (a) IS a queue in itself, then you need only concern yourself with the performance of (b). The golden rule of queues is to leave data on the queue if you do not have the capacity to process it (since you will just need to buffer it, again).
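If you do want to keep everything in one orchestration rather than split it across queues, the per-activity limit can also be sketched with an in-process semaphore. This is a hedged illustration in plain asyncio, not the Durable Functions framework API; the limit of 5, the sleep times, and the activity bodies are all assumptions standing in for the real database call and processing step.

```python
# Sketch: cap concurrent invocations of activity (a) with a semaphore
# while activity (b) fans out without any throttle.
import asyncio

DB_LIMIT = asyncio.Semaphore(5)  # assumed cap the external database tolerates

async def activity_a(item):
    async with DB_LIMIT:           # at most 5 database calls in flight
        await asyncio.sleep(0.01)  # stand-in for the external database call
        return [item, item + 1]    # stand-in for a batch of results

async def activity_b(result):
    await asyncio.sleep(0.01)      # independent processing, deliberately uncapped
    return result * 2

async def orchestrate(items):
    batches = await asyncio.gather(*(activity_a(i) for i in items))
    results = [r for batch in batches for r in batch]
    return await asyncio.gather(*(activity_b(r) for r in results))

print(asyncio.run(orchestrate(range(3))))
```

The catch, and why the queue-based decoupling above is usually preferable, is that a semaphore only limits concurrency within one host instance; once the platform scales you out, each instance gets its own semaphore.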

How to get better performance with Azure ServiceBus Standard plan

I can't manage to get over 14 msg/second with the Azure Service Bus Standard plan. I'm running some benchmark tests with the Azure sample tool that I found in this question:
The test is done with a ServiceBus resource with a single Queue and all default configurations:
If I read this correctly, you've got the maximum concurrency of one (MaxInflightReceives) with 5 receivers (ReceiverCount). Increasing concurrency and enabling prefetch on the clients will increase the overall throughput. But,
Testing should be done within the same Azure data centre. If you're testing from a local machine, you're introducing a substantial latency that cannot be avoided.
The receive mode used is PeekLock. It is slower than ReceiveAndDelete. Not suggesting to switch, but this needs to be taken into consideration as you're trading throughput for safety by using PeekLock.
The standard tier has a cap on the number of operations per second. In addition to that, your namespace is deployed in a shared environment with entities scattered in various deployment containers. Performance will vary and cannot be guaranteed. If you want to have a guaranteed throughput, use Premium SKU.
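A rough way to see why concurrency and prefetch matter is Little's law: throughput is approximately in-flight concurrency divided by per-message round-trip latency. The 350 ms figure below is an assumption chosen to be consistent with the observed ~14 msg/s from a remote machine, not a measured Azure number.

```python
# Back-of-envelope sketch: estimated sustained throughput of a set of
# receivers, given per-message receive+complete latency (Little's law).

def estimated_throughput(concurrency: int, latency_s: float) -> float:
    """Messages per second the receivers can complete in steady state."""
    return concurrency / latency_s

# 5 receivers with MaxInflightReceives = 1 and an assumed ~350 ms round trip
# (plausible for PeekLock from a machine outside the data centre):
print(round(estimated_throughput(5 * 1, 0.35)))    # ~14 msg/s, matching the observed rate
# Raising in-flight receives to 20 per receiver (plus prefetch):
print(round(estimated_throughput(5 * 20, 0.35)))   # ~286 msg/s
```

The same arithmetic shows why moving the test into the same data centre helps: cutting latency by an order of magnitude raises the estimate by the same factor, before any code changes.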

Scale CosmosDB binding for Azure Functions per logical partition

I would like my Azure function to scale per logical partition instead of per physical partition. I've tested the Azure Function binding and it does scale out when I have multiple physical partitions (in my test I needed to increase our RUs from 2000 to 20000). But I don't need that many RUs, since I'm using it as an event store. I'm not querying the data, just processing each message through my Azure function. So I'm wondering if there is a way to let Azure Functions scale out per logical partition. I see that in the new v3 lib there is a ChangeFeedOptions.PartitionKey property, but that class is internal and I'm not sure it does what I want.
I basically want to have as many Azure Functions running as there are new messages grouped per logical partition. What would be the best way to achieve that?
As of today this is not possible. It's not up to the user of the CF SDK to do the lease management; the CF SDK does that for us, and there is nothing we can do to change it.
The only way to theoretically have one lease per logical partition is to have a logical partition big enough to occupy an entire physical partition. However, this means you are about to hit 10 GB of data in a single partition, which would be your main concern at that point.
I wouldn't worry about the scaling though. The CF will spawn as many leases as needed to scale seamlessly, and this scaling depends solely on the volume of data in the database and the number of RUs allocated.
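What you can do today is fan out per logical partition inside a single trigger invocation, by grouping the delivered batch on the partition key yourself. This is a hedged sketch in plain Python, not the Change Feed SDK; `tenantId` is an assumed partition key path and the documents are made up.

```python
# Sketch: group a change-feed batch by logical partition key so each
# group can be processed independently (e.g. fanned out to sub-tasks).
from collections import defaultdict

def group_by_logical_partition(documents):
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["tenantId"]].append(doc)  # assumed partition key property
    return dict(groups)

batch = [
    {"tenantId": "a", "event": 1},
    {"tenantId": "b", "event": 2},
    {"tenantId": "a", "event": 3},
]
print(group_by_logical_partition(batch))
```

Within one logical partition the change feed preserves order, so processing each group sequentially while running groups in parallel keeps per-partition ordering intact.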

Dynamic Service Creation to Distribute Load

Background
The problem we're facing is that we are doing video encoding and want to distribute the load to multiple nodes in the cluster.
We would like to constrain the number of video encoding jobs on a particular node to some maximum value. We would also like to have small video encoding jobs sent to a certain grouping of nodes in the cluster, and long video encoding jobs sent to another grouping of nodes in the cluster.
The idea behind this is to help maintain fairness amongst clients by partitioning the large jobs into a separate pool of nodes. This helps ensure that the small video encoding jobs are not blocked / throttled by a single tenant running a long encoding job.
Using Service Fabric
We plan on using an ASF service for the video encoding. With this in mind we had an idea of dynamically creating a service for each job that comes in. Placement constraints could then be used to determine which pool of nodes a job would run in. Custom metrics based on memory usage, CPU usage ... could be used to limit the number of active jobs on a node.
With this method the node distributing the jobs would have to poll whether a new service could currently be created that satisfies the placement constraints and metrics.
Questions
What happens when a service can't be placed on a node? (Using CreateServiceAsync I assume?)
Will this polling be prohibitively expensive?
Our video encoding executable is packaged along with the service which is approximately 80MB. Will this make the spinning up of a new service take a long time? (Minutes vs seconds)
As an alternative to this we could use a reliable queue based system, where the large jobs pool pulls from one queue and the small jobs pool pulls from another queue. This seems like the simpler way, but I want to explore all options to make sure I'm not missing out on some of the features of Service Fabric. Is there another better way you would suggest?
I have no experience with placement constraints and dynamic services, so I can't speak to that.
The polling of the perf counters isn't terribly expensive, that being said it's not a free operation. A one second poll interval shouldn't cause any huge perf impact while still providing a decent degree of resolution.
The service packages get copied to each node at deployment time rather than when services get spun up, so it'll make the deployment a bit slower but not affect service creation.
You're going to want to put the job data in reliable collections any way you structure it, but the question is how. One idea that might be worth considering is making the job-processing service a partitioned service, and basing your partitioning strategy on encoding job size and/or tenant, so that large jobs from the same tenant get stuck in the same queue and smaller jobs for others go elsewhere.
As an aside, one thing I've dealt with in the past is that SF remoting limits the size of the messages sent and throws if they're too big, so if your video files are being passed from service to service, you're going to want to consider a paging strategy for inter-service communication.
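The partitioning idea above can be sketched as a key function that routes jobs into a "small" or "large" key range while keeping a tenant's jobs together. Everything here is an assumption for illustration: the 600-second threshold, the pool size of 4, and the use of CRC32 as a cheap stable hash.

```python
# Sketch: derive a ranged partition key from tenant and estimated job
# size, so large jobs from one tenant land in the same partition and
# never crowd out the small-job key range.
import zlib

LARGE_JOB_SECONDS = 600  # assumed threshold separating small and large jobs

def partition_key(tenant_id: str, est_duration_s: int, pool_size: int = 4) -> int:
    bucket = zlib.crc32(tenant_id.encode()) % pool_size
    # Keys [0, pool_size) serve small jobs; [pool_size, 2*pool_size) large ones.
    return bucket + (pool_size if est_duration_s >= LARGE_JOB_SECONDS else 0)

print(partition_key("contoso", 30))    # small-job key range
print(partition_key("contoso", 3600))  # large-job key range, same tenant bucket
```

A ranged (Int64) partitioned service with low key 0 and high key `2 * pool_size - 1` would then give the placement behaviour described above without dynamically creating services per job.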

Deleting items from Azure queue painfully slow

My application relies heavily on a queue in Windows Azure Storage (not Service Bus). Until two days ago it worked like a charm, but all of a sudden my worker role is no longer able to process all the items in the queue. I've added several counters, and from that data deduced that deleting items from the queue is the bottleneck. For example, deleting a single item from the queue can take up to 1 second!
On the SO post How to achieve more than 10 inserts per second with Azure storage tables and on the MSDN blog
http://blogs.msdn.com/b/jnak/archive/2010/01/22/windows-azure-instances-storage-limits.aspx I found some info on how to speed up communication with the queue, but those posts only look at the insertion of new items. So far, I haven't been able to find anything on why the deletion of queue items should be slow. So the questions are:
(1) Does anyone have a general idea why deletion suddenly may be slow?
(2) On Azure's status pages (https://azure.microsoft.com/en-us/status/#history) there is no mentioning of any service disruption in West Europe (which is where my stuff is situated); can I rely on the service pages?
(3) In the same storage, I have a lot of data in blobs and tables. Could that amount of data interfere with the ability to delete items from the queue? Also, does anyone know what happens if you're pushing the data limit of 2TB?
1) Sorry, no. Not a general one.
2) Can you rely on the service pages? They certainly will give you information, but there is always a lag from the time an issue occurs and when it shows up on the status board. They are getting better at automating the updates and in the management portal you are starting to see where they will notify you if your particular deployments might be affected. With that said, it is not unheard of that small issues crop up from time to time that may never be shown on the board as they don't break SLA or are resolved extremely quickly. It's good you checked this though, it's usually a good first step.
3) In general, no, the amount of data you have within a storage account should NOT affect your throughput; however, there is a limit to the throughput you'll get on a storage account (regardless of the amount of data stored). You can read about the Storage Scalability and Performance targets, but the throughput target is up to 20,000 entities or messages per second for all access to a storage account. If you have a LOT of applications or systems attempting to access data in this same storage account, you might see some throttling or failures as you approach that limit. Note that, as you saw with the posts on improving insert throughput, these are performance targets, and how your code is written and which configurations you use have a drastic effect on this. The data limit for a storage account (everything in it) is 500 TB, not 2 TB. I believe once you hit the actual storage limit all writes will simply fail until more space is available (I've never even got close to it, so I'm not 100% sure about that).
Throughput is also limited at the partition level, and for a queue that's a target of up to 2,000 messages per second, which you clearly aren't getting at all. Since you have only a single worker role, I'll guess that you don't have that many producers of messages either, at least not enough to get near 2,000 msgs per second.
I'd turn on storage analytics to see if you are getting throttled as well as check out the AverageE2ELatency and AverageServerLatency values (as Thomas also suggested in his answer) being recorded in the $MetricsMinutePrimaryTransactionQueue table that the analytics turns on. This will help give you an idea of trends over time as well as possibly help determine if it is a latency issue between the worker roles and the storage system.
The reason I asked about the size of the VM for the worker role is that there is an (unpublished) amount of throughput per VM based on its size. An XS VM gets much less of the total throughput on the NIC than larger sizes. You can sometimes get more than you expect across the NIC, but only if the other deployments on the physical machine aren't using their portion of that bandwidth at the time. This can often lead to varying performance for network-bound work when testing. I'd still expect much better throughput than what you are seeing, though.
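One practical mitigation while the latency is high: overlap the deletes instead of issuing them serially, since each delete is an independent round trip. This is a hedged sketch; `delete_message` is a stand-in stub with a simulated delay, not the real Azure Storage client call, and the worker count is an assumption to tune.

```python
# Sketch: parallelize queue-message deletes with a thread pool so that
# per-call latency (up to ~1 s, per the question) is hidden by overlap.
import time
from concurrent.futures import ThreadPoolExecutor

def delete_message(msg_id: str) -> str:
    time.sleep(0.05)  # simulated slow round trip to the queue service
    return msg_id

def delete_all(msg_ids, workers: int = 16):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(delete_message, msg_ids))

start = time.time()
deleted = delete_all([f"m{i}" for i in range(32)])
# 32 deletes at 50 ms each complete in ~2 pool "waves", not 1.6 s serially.
print(len(deleted), round(time.time() - start, 2))
```

This doesn't fix the underlying latency, but it stops one slow delete from stalling the whole worker while you investigate with storage analytics as suggested above.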
There is a network in between you and the Azure storage, which might degrade the latency.
Sudden peaks (e.g. from 20ms to 2s) can happen often, so you need to deal with this in your code.
To pinpoint the problem further (e.g. client issues, network errors, etc.), you can turn on storage analytics to see where the problem lies. There you can also see whether the end-to-end latency is too big or the server latency is the limiting factor: the former usually points to network issues, the latter to something being wrong with the queue itself.
Usually these latency issues are transient (just temporary), and there is no need to announce them as a service disruption, because they aren't one. If you see constantly bad performance, you should open a support ticket.
