Is it guaranteed that sequence numbers will increment by 1 EventHub? - azure

I'm consuming from a single Azure EventHub partition, using the PartitionReceiver class in Java. I'm trying to understand the x-opt-sequence-number SystemProperty that is available in each EventData object that I receive.
Is it guaranteed that the sequence numbers increment by 1 per EventData consumed and that each possible sequence number exists in the partition? For example if I've consumed an Event with sequence number 5 and another with 10, based on this must there exist events with sequence numbers 6-9?
Thanks!

Corrected Context
My understanding was incorrect. It has been confirmed by the Event Hubs team that sequence numbers are contiguous within the scope of a single partition.
Original Answer
As I understand it, sequence numbers are not guaranteed to be sequential in the scope of a partition, though they often are. The only strict guarantee is that they will always be in increasing order with respect to when an event was enqueued to a specific partition. Sequence numbers are specific to a partition and shouldn't be compared across partitions.
If you need to ensure that you're sequencing with a strict and specific increment, your best bet is likely to be adding a custom bit of metadata for your application domain's sequence number in the user properties collection. Of course, that does get tricky when coordinating across publishers on multiple machines and like likely to surface performance trade-offs.

Related

Azure Service Bus: Ordered Processing of Session Sequences

Are there any recommended architectural patterns with Service Bus for ensuring ordered processing of nested groups of messages which are sent out of order? We are using Sessions, but when it comes down to ensuring that a set of Sessions must be processed sequentially in a certain order before moving onto another set of Sessions, the architecture becomes cumbersome very quickly. This question might best be illustrated with an example.
We are using Service Bus to integrate changes in real-time from a database to a third-party API. Every N minutes, we get notified of a new 'batch' of changes from the database which consists of individual records of data across different entities. We then transform/map each record and send it along to an API. For example, a 'batch' of changes might include 5 new/changed 'Person' records, 3 new/changed 'Membership' records, etc.
At the outer-most level, we must always process one entire batch before we can move on to another batch of data, but we also have a requirement to process each type of entity in a certain order. For example, all 'Person' changes must be processed for a given batch before we can move on to any other objects.
There is no guarantee that these records will be queued up in any order which is relevant to how they will need to be processed, particularly within a 'batch' of changes (e.g. the data from different entity types will be interleaved).
We actually do not necessarily need to send the individual records of entity data in any order to the API (e.g. it does not matter in which order I send those 5 Person records for that batch, as long as they are all sent before the 3 Membership records for that batch). However, we do group the messages into Sessions by entity type so that we can guarantee homogeneous records in a given session and target all records for that entity type (this also helps us support a separate requirement we have when calling the API to send a batch of records when possible instead of an individual call per record to avoid API rate limiting issues). Currently, our actual Topic Subscription containing the record data is broken up into Sessions which are unique to the entity type and the batch.
"SessionId": "Batch1234\Person"
We are finding that it is cumbersome to manage the requirement that all changes for a given batch must be processed before we move on to the next batch, because there is no Session which reliably groups those "groups of entities" together (let alone processing those groups of entities themselves in a certain order). There is, of course, no concept of a 'session of sessions', and we are currently handling this by having a separate 'Sync' queue to represent an entire batch of changes which needs to be processed what sessions of data are contained in that batch:
"SessionId": "Batch1234",
"Body":
{
"targets": ["Batch1234\Person", "Batch1234\Membership", ...]
}
This is quite cumbersome, because something (e.g. a Durable Azure Function) now has to orchestrate the entire process by watching the Sync queue and then spinning off separate processors that it oversees to ensure correct ordering at each level (which makes concurrency management and scalability much more complicated to deal with). If this is indeed a good pattern, then I do not mind implementing the extra orchestration architecture to ensure a robust, scalable implementation. However, I cannot help from feeling that I am missing something or not thinking about the architecture the right way.
Is anyone aware of any other recommended pattern(s) in Service Bus for handling ordered processing of groups of data which themselves contain groups of data which must be processed in a certain order?
For the record I'm not a service bus expert, specifically.
The entire batch construct sounds painful - can you do away with it? Often if you have a painful input, you'll have a painful solution - the old "crap in, crap out" maxim. Sometimes it's just hard to find an elegant solution.
Do the 'sets of sessions' need to be processed in a specific order?
Is a 'batch' of changes = a session?
I can't think of a specific pattern, but a "divide and conquer" approach seems reasonable (which is roughly what you have already?):
Watch for new batches, when one occurs hand it off to a BatchProcessor.
BatchProcessor applies all the rules to the batch, as you outlined.
Consider having the BatchProcessor dump it's results on a queue of some kind which is the source for the API - that way you have some kind of isolation between the batch processing and the API.

Data storage parallelization in .NET Core

I am a little lost in this task. There is a requirement for our caching solution to split a large data dictionary into partitions and perform operations on them in separate threads.
The scenario is: We have a large pool of data that is to be kept in memory (40m rows), the chosen strategy is first to have a Dictionary with int key. This dictionary contains a subset of 16 dictionaries that are keyed by guid and contain a data class.
The number 16 is calculated on startup and indicates CPU core count * 4.
The data class contains a byte[] which is basically a translated set of properties and their values, int pointer to metadata dictionary and checksum.
Then there is a set of control functions that takes care of locking and assigns/retrieves Guid keyed data based on a division of the first segment of guid (8 hex numbers) by divider. This divider is just FFFFFFFF / 16. This way each key will have a corresponding partition assigned.
Now I need to figure out how to perform operations (key lookup, iterations and writes) on these dictionaries in separate threads in parallel? Will I just wrap these operations using Tasks? Or will it be better to load these behemoth dictionaries into separate threads whole?
I have a rough idea how to implement data collectors, that will be the easy part I guess.
Also, is using Dictionaries a good approach? Their size is limited to 3mil rows per partition and if one is full, the control mechanism tries to insert on another server that is using the exact same mechanism.
Is .NET actually a good language to implement this solution?
Any help will be extremely appreciated.
Okay, so I implemented ReaderWriterLockSlim and implemented concurrent access through System.Threading.Tasks. I also managed to exclude any dataClass object from the storage, now it is only a dictionary of byte[]s.
It's able to store all 40 million rows taking just under 4GB of RAM and through some careful SIMD optimized manipulations performs EQUALS, <, > and SUM operation iterations in under 20ms, so I guess this issue is solved.
Also the concurrency throughput is quite good.
I just wanted to post this in case anybody faces similar issue in the future.

Tradeoffs involved in count of partitions of event hub with azure functions

I realize this may be a duplicate of Why not always configure for max number of event hub partitions?. However, the product has evolved and the defaults have changed, and the original issues may no longer be a factor.
Consider the scenario where the primary consumers of an event hub will be eventHubTrigger Azure Functions. The trigger function is backed by an EventProcessorHost which will automatically scale up or down to consume from all available partitions as needed without any administration effort.
As I understand it, the monetary cost of the Azure Function is based on the execution duration and count of invocations, which would be driven by only by the count of events consumed and not affected by the degree parallelism due to the count of partitions.
In this case, would there be any higher costs or complexity from creating a hub with the max of 32 partitions, compared to one with a small number like 4?
Your thought process makes sense: I found myself creating 32 partitions by default in this exact scenario and had no issues with that so far.
You pay per provisioned in-/egress, partition count doesn't add cost.
The only requirement is that your partition key has enough unique values to load partitions more or less evenly.

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are bounded for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform and the counter update together? Is the counter type really needed if there's basically no race condition on the column, being that associated to each individual user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at time then you could rely on plain ints to perform the job.
The problem however is that you will need to perform a read-before-write anti-pattern. You could solve this as well, eg skipping the read part by caching your ints and performing in-memory updates followed by writes only. This is viable by coupling your system with a caching server (e.g. Redis).
And thinking about it, you should still need to read these counters at some point, because if the number of messages a free user can send is bound to some value then you need to perform a check when they login/try to send a new message/look at the dashboard/etc and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to directly count them with a SELECT COUNT... type query, even if this could be become pretty inefficient very soon in the Cassandra world.

High scale message processing in eventhub

As per my understanding, eventhub can process/ingest millions of messages per seconds. And to tune the ingesting, we can use throughput.
More throughput= more ingesting power.
But on receiving/consuming side, You can create upto 32 receivers(since we can create 32 partitions and one partition can be consumed by one receiver).
Based on above, if one single message takes 100 milisencond to process, one consumer can process 10 message per second and 32 consumer can process 32*10= 320 message per second.
How can I make my receiver consume more messages (for ex. 5-10k per seond).
1) Either I have to process message asynchronously inside ProcessEventsAsync. But in this case I would not be able to maintain ordering.
2) Or I have to ask Microsoft to allow me to create more partitions.
Please advice
TLDR: You will need to ask Microsoft to increase the number of partitions you are allowed, and remember that there is currently no way to increase the number on an already extant Event Hub.
You are correct that your unit of consumption parallelism is the partition. If your consumers can only do 10/seconds in order or even 100/second in order, then you will need more partitions to consume millions of events. While 100ms/event certainly seems slow to me and I think you should look for optimizations there (ie farm out work you don't need to wait for, commit less often etc), you will reach the point of needing more partitions at scale.
Some things to keep in mind: 32 partitions gives you only 32 Mb/s of ingress and 64Mb/s of egress. Both of these factors matter since that egress throughput is shared by all the consumer groups you use. So if you have 4 consumer groups reading the data (16Mb/s each) you'll need twice as many partitions (or at least throughput units) for input as you would based solely on your data ingress (because otherwise you would fall behind).
With regards to your comment about multitenancy, you will have one 'database consumer' group that handles all your tenants all of whose data will be flowing through the same hub? If so that sounds like a sensible use, what would not be so sensible is having one consumer group per tenant each consuming the entire stream.

Resources