Azure Event Hub - Stream Analytics architecture

I need some help figuring out how to build/optimize my Azure architecture for the future.
I currently have a test running that looks like this:
I am currently sending one kind of data, x1 (700k a day), as described in the picture above; the Stream Analytics job does nothing other than ingest the data into the database, without any aggregations or other processing.
The test is currently running without any problems but I am afraid that I might run into difficulties in the future because I want to connect more data (x2, x3, ...), which will of course increase the amount of data sent.
Now my question:
I'm having a hard time figuring out how to set up the "Event Hub" and "Stream Analytics" service to handle the increasing amount of new data.
Currently I have an "Event hub" with one partition. Would this be sufficient in the future with increasing data volume and would the Stream Analytics service still be able to keep up with the processing?
Should I rather create a separate "Event Hub" for each different data type (x1, x2, ...) or should I rather create an "Event Hub" with several partitions?
For each data type a separate "Event Hub" with multiple partitions?
I'm having difficulty understanding the concept of partitions and how to use them.
Does anyone have a similar architecture and can give me some advice?
Thank you in advance

You can think of Event Hub partitions as a multi-lane highway: a 4-lane highway has more throughput than a 1-lane highway. The only benefit of a single-lane highway is that processing happens in sequence (FIFO). If that is not a mandate/requirement, you should set the partition count to the maximum (32) to use the full power of Event Hubs streaming ingestion. Event Hubs will automatically distribute messages across the partitions, provided the publisher is not directing messages to a particular partition. You can find more information on partitions in the Event Hubs documentation.
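For illustration, here is a minimal sketch using the Python azure-eventhub SDK (the connection string and the hub name "x1" are placeholders). Omitting the partition key lets Event Hubs spread events across partitions (the multi-lane case), while setting one pins all events with that key to a single partition and preserves their relative order:

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details -- substitute your own namespace and hub.
CONN_STR = "<event-hubs-namespace-connection-string>"

producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name="x1")

with producer:
    # No partition key: Event Hubs distributes the events across partitions.
    batch = producer.create_batch()
    batch.add(EventData('{"source": "x1", "value": 1}'))
    producer.send_batch(batch)

    # Partition key: every event with the same key lands on the same partition,
    # so ordering is preserved per key (the single-lane case).
    keyed_batch = producer.create_batch(partition_key="sensor-42")
    keyed_batch.add(EventData('{"source": "x1", "value": 2}'))
    producer.send_batch(keyed_batch)
```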
Another option to allow for future scalability is to enable auto-inflate on the Event Hubs namespace, which scales the throughput between a minimum and maximum value, for example 1 TU to 4 TU.
Similarly, you can configure the Stream Analytics job to autoscale.
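For reference, a rough sketch of enabling auto-inflate with the Python management SDK (azure-mgmt-eventhub); the subscription, resource group, namespace name and region are placeholders, and the exact model fields may differ slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import EHNamespace, Sku

# Placeholders -- substitute your own subscription, resource group and namespace.
client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Standard-tier namespace starting at 1 TU, auto-inflating up to 4 TUs.
poller = client.namespaces.begin_create_or_update(
    "<resource-group>",
    "<namespace-name>",
    EHNamespace(
        location="westeurope",
        sku=Sku(name="Standard", tier="Standard", capacity=1),
        is_auto_inflate_enabled=True,
        maximum_throughput_units=4,
    ),
)
poller.result()
```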
Stream Analytics can process each Event Hub partition in parallel, and more partitions increase parallelism. The number of streaming units a job can use also depends on the maximum possible parallelism: as an example, a 1-partition event hub allows a maximum of 6 streaming units, while 2 partitions allow 12. Doing a capacity estimation and starting with a reasonable partition count is the better way to handle future scaling requirements.
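As a back-of-the-envelope example of such a capacity estimation (the event size, growth and peak factors below are made-up assumptions, not figures from the question):

```python
import math

# Assumed figures for illustration only.
events_per_day = 700_000 * 5      # x1 today plus a few similar sources later
avg_event_size_bytes = 1_024      # assume roughly 1 KB per event
peak_factor = 4                   # assume peaks at 4x the daily average rate

avg_events_per_sec = events_per_day / 86_400
peak_events_per_sec = avg_events_per_sec * peak_factor
peak_mb_per_sec = peak_events_per_sec * avg_event_size_bytes / 1_000_000

# 1 TU covers 1 MB/s or 1000 events/s of ingress, whichever is hit first.
required_tus = math.ceil(max(peak_mb_per_sec, peak_events_per_sec / 1_000))

print(f"peak ingress ~{peak_mb_per_sec:.2f} MB/s -> ~{required_tus} TU(s)")
print(f"with 4 partitions, the Stream Analytics job could scale to {4 * 6} streaming units")
```

Numbers like these make it easier to argue for, say, 4-8 partitions now rather than repartitioning later, since the partition count cannot be changed on an existing hub.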

Related

In-order processing in Azure event hubs with Partitions and multiple "event processor" clients

I plan to utilize all 32 partitions in Azure event hubs.
Requirement: "ordered" processing per partition is critical.
Question: If I increase the TUs (throughput units) to the maximum available of 20 across all 32 partitions, I get 40 MB/s of egress. Let's say I calculated that I need 500 client threads processing in parallel (EventProcessorClient) to achieve my throughput needs. How do I achieve this level of parallelism with EventProcessorClient while honoring my ordering requirement?
By the way, in Kafka I can create 500 partitions in a topic, and Kafka allows only one consumer thread per partition, guaranteeing event order.
In short, you really can't do what you're looking to do in the way that you're describing.
The EventProcessorClient is bound to a given Event Hub and consumer group combination and will collaborate with other processors using the same Event Hub/consumer group to evenly distribute the load. Adding more processors than the number of partitions would result in them being idle. You could work around this by using additional consumer groups, but the EventProcessorClient instances will only coordinate with others in the same consumer group; the processors for each consumer group would act independently and you'd end up processing the same events multiple times.
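To make that concrete, here is a minimal sketch of the equivalent in the Python SDK, where EventHubConsumerClient plays the role of EventProcessorClient (connection strings and names are placeholders):

```python
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# Placeholders -- substitute your own connection strings and names.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<storage-connection-string>", "<checkpoint-container>"
)

client = EventHubConsumerClient.from_connection_string(
    "<event-hubs-connection-string>",
    consumer_group="$Default",          # processors only coordinate within this group
    eventhub_name="<hub-name>",
    checkpoint_store=checkpoint_store,  # shared store used for ownership/load balancing
)

def on_event(partition_context, event):
    # Events within one partition arrive in order; ordering across partitions
    # is not guaranteed.
    print(partition_context.partition_id, event.body_as_str())
    partition_context.update_checkpoint(event)

with client:
    # Run N copies of this process against the same hub + consumer group and they
    # split the partitions among themselves; copies beyond the partition count idle.
    client.receive(on_event=on_event, starting_position="-1")
```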
There are also quotas on the service side that you may not be taking into account.
Assuming that you're using the Standard tier, the maximum number of concurrent reads you can have for one Event Hub, across all partitions, is 100: you can create a maximum of 20 consumer groups per Event Hub, and each consumer group may have at most 5 active readers at a time. The Event Hubs Quotas page discusses these limits. That said, a Dedicated-tier instance allows higher limits, but you would still have a gap with the strict ordering that you're looking to achieve.
Without knowing more about your specific application scenarios, how long it takes for an event to be processed, the relative size of the event body, and what your throughput target is, it's difficult to offer alternative suggestions that may better fit your needs.

How do I decide how many partitions to use in Azure Event Hub

Or phrased differently: what reason do I have to not take the max number of partitions (currently 32 without contacting Microsoft directly)?
As far as I can tell more partitions means (potential) larger egress throughput, at no added monetary or computational cost. What's the catch? When would I not want to use as many partitions as I am possibly allowed to provision?
You are right in the observation that having a larger number of partitions won't cost you an extra dime when provisioning the event hub. But when the data comes in at scale you will have to allocate more TUs, so it will cost you extra based on the amount of data flowing in and out.
From the docs:
Throughput in Event Hubs defines the amount of data in megabytes or the number (in thousands) of 1-KB events that ingress and egress through Event Hubs. This throughput is measured in throughput units (TUs). Purchase TUs before you can start using the Event Hubs service. You can explicitly select Event Hubs TUs either by using the portal or Event Hubs Resource Manager templates.
Another thing to consider: if you are using, for example, the Event Processor Host to process the data, it has to spin up listeners for all partitions. If there isn't much incoming data and it is spread over all those partitions, you end up with a lot of partitions each handling only a small amount of data, which can make the processing less than optimal.
From the docs:
The partition count on an event hub cannot be modified after setup. With that in mind, it is important to think about how many partitions you need before getting started.
Event Hubs is designed to allow a single partition reader per consumer group. In most use cases, the default setting of four partitions is sufficient. If you are looking to scale your event processing, you may want to consider adding additional partitions. There is no specific throughput limit on a partition, however the aggregate throughput in your namespace is limited by the number of throughput units. As you increase the number of throughput units in your namespace, you may want additional partitions to allow concurrent readers to achieve their own maximum throughput.
However, if you have a model in which your application has an affinity to a particular partition, increasing the number of partitions may not be of any benefit to you. For more information, see availability and consistency.
Your data processing pipeline also has to deal with all of those partitions, which matters if you have just one process/machine that has to handle the amount of data that can theoretically be sent to an event hub.

Tradeoffs involved in count of partitions of event hub with azure functions

I realize this may be a duplicate of Why not always configure for max number of event hub partitions?. However, the product has evolved and the defaults have changed, and the original issues may no longer be a factor.
Consider the scenario where the primary consumers of an event hub will be eventHubTrigger Azure Functions. The trigger function is backed by an EventProcessorHost which will automatically scale up or down to consume from all available partitions as needed without any administration effort.
As I understand it, the monetary cost of the Azure Function is based on the execution duration and count of invocations, which would be driven only by the count of events consumed and not affected by the degree of parallelism due to the count of partitions.
In this case, would there be any higher costs or complexity from creating a hub with the max of 32 partitions, compared to one with a small number like 4?
Your thought process makes sense: I found myself creating 32 partitions by default in this exact scenario and had no issues with that so far.
You pay per provisioned ingress/egress (throughput units); the partition count doesn't add cost.
The only requirement is that your partition key has enough unique values to load partitions more or less evenly.
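For completeness, a minimal sketch of the eventHubTrigger scenario in a Python function (the hub name, connection setting name and binding details are placeholders; the binding itself lives in function.json):

```python
# __init__.py -- assumes a function.json binding of type "eventHubTrigger" with
# "cardinality": "many", "eventHubName": "<hub-name>" and a connection app
# setting named "EventHubConnection" (all placeholder names).
import logging
from typing import List

import azure.functions as func


def main(events: List[func.EventHubEvent]):
    # With cardinality "many" the runtime delivers a batch of events per
    # invocation; the processor host underneath scales out per partition.
    for event in events:
        logging.info("event body: %s", event.get_body().decode("utf-8"))
```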

Understanding Cassandra message latency metric

I'm trying to understand how to use the org.apache.cassandra.metrics:type=Messaging metric. I setup 3 datacenters with 1 node each. When I measure the metric, for each node I get 2 cross-datacenter metrics and 1 cross-node latency metric as follows (for node in DC-2)
org.apache.cassandra.metrics:type=Messaging,name=dc3-Latency
5.3387457013878636E7
org.apache.cassandra.metrics:type=Messaging,name=CrossNodeLatency
1.1471964354991291E8
org.apache.cassandra.metrics:type=Messaging,name=dc1-Latency
1.6108579786605054E8
However, I have no processes using the cluster currently. Is Cassandra doing a dummy write to measure this metric? Also, what does the cross-node latency metric mean here, given that each DC contains only one node?
The metric records incoming latency from everything that uses the messaging service. The messaging service is used for reads/writes, but it's also used for streaming and gossip. Gossip fires every second between all the nodes, so that is probably dominating the metric in your situation. Some tables may also be written to (system_distributed, system_traces, and some DSE tables if using DSE) even on a fairly idle system.
Whenever a message is sent from one node to another, it carries a timestamp along with some versioning information. The first thing the receiving node does (ignoring the obvious OS/socket/etc. work) is, more or less, compare that timestamp to "now". This is what drives the metric. It then looks at the datacenter the source is from to determine which metrics to increment and by how much.
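A conceptual sketch of that mechanism (not Cassandra code; the metric names just mirror the ones in the question):

```python
import time
from collections import defaultdict

# Running latency totals per source datacenter, mirroring dcX-Latency,
# plus one bucket for all cross-node traffic.
latency_totals_ns = defaultdict(int)

def on_message_received(source_dc: str, sent_at_ns: int) -> None:
    # The receiver compares the sender's timestamp to "now" and attributes the
    # difference to the sender's datacenter as well as the cross-node bucket.
    delta_ns = time.time_ns() - sent_at_ns
    latency_totals_ns[f"{source_dc}-Latency"] += delta_ns
    latency_totals_ns["CrossNodeLatency"] += delta_ns
```

The real metric is a timer/histogram rather than a plain sum, but the timestamp comparison is the essence of it.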

High scale message processing in eventhub

As per my understanding, Event Hubs can ingest millions of messages per second, and ingestion can be tuned with throughput units.
More throughput units = more ingestion power.
But on the receiving/consuming side, you can create up to 32 receivers (since we can create 32 partitions and one partition can be consumed by one receiver).
Based on the above, if a single message takes 100 milliseconds to process, one consumer can process 10 messages per second and 32 consumers can process 32 * 10 = 320 messages per second.
How can I make my receivers consume more messages (for example, 5-10k per second)?
1) Either I have to process messages asynchronously inside ProcessEventsAsync, but in this case I would not be able to maintain ordering.
2) Or I have to ask Microsoft to allow me to create more partitions.
Please advise
TL;DR: You will need to ask Microsoft to increase the number of partitions you are allowed, and remember that there is currently no way to increase the partition count on an existing Event Hub.
You are correct that your unit of consumption parallelism is the partition. If your consumers can only do 10/second in order, or even 100/second in order, then you will need more partitions to consume millions of events. While 100 ms/event certainly seems slow to me and I think you should look for optimizations there (i.e. farm out work you don't need to wait for, commit less often, etc.; see the sketch below), you will reach the point of needing more partitions at scale.
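As a sketch of that kind of optimization (Python async SDK; apply_to_state and send_notifications are made-up stand-ins for your own steps, and the connection details are placeholders), you can keep the order-sensitive part sequential per partition while farming out the rest and checkpointing per batch:

```python
import asyncio
from azure.eventhub.aio import EventHubConsumerClient

CONN_STR = "<event-hubs-connection-string>"   # placeholder
HUB_NAME = "<hub-name>"                       # placeholder

def apply_to_state(event):
    # Hypothetical order-sensitive step (e.g. updating a per-key aggregate).
    print("applying", event.sequence_number)

async def send_notifications(events):
    # Hypothetical side effect that does not depend on strict ordering.
    await asyncio.sleep(0)

async def handle_batch(partition_context, events):
    if not events:
        return
    # Order-sensitive work stays sequential within the partition...
    for event in events:
        apply_to_state(event)
    # ...work that doesn't need ordering is farmed out without waiting on it,
    # and we checkpoint once per batch instead of once per event.
    asyncio.create_task(send_notifications(events))
    await partition_context.update_checkpoint(events[-1])

async def main():
    client = EventHubConsumerClient.from_connection_string(
        CONN_STR, consumer_group="$Default", eventhub_name=HUB_NAME
    )
    async with client:
        await client.receive_batch(on_event_batch=handle_batch, max_batch_size=100)

if __name__ == "__main__":
    asyncio.run(main())
```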
Some things to keep in mind: 32 partitions gives you only 32 MB/s of ingress and 64 MB/s of egress. Both of these factors matter, since that egress throughput is shared by all the consumer groups you use. So if you have 4 consumer groups reading the data (16 MB/s each), you'll need twice as many partitions (or at least throughput units) for input as you would based solely on your data ingress (because otherwise you would fall behind).
With regards to your comment about multitenancy: you will have one 'database consumer' group that handles all your tenants, all of whose data will be flowing through the same hub? If so, that sounds like a sensible use; what would not be so sensible is having one consumer group per tenant, each consuming the entire stream.
