Recently Azure released a feature called Azure Event Hubs for Kafka that allows you to use Event Hubs as if it were a Kafka cluster, with the same Kafka libraries. That would allow us to migrate from our current IaaS Kafka solution to a PaaS solution, with all the advantages of a fully managed service, and with only minimal changes to our code base (at least that's the promise).
However, while analyzing the migration we are finding it hard to fit our infrastructure inside the Azure Event Hubs limits. We have hundreds of topics in Kafka, and we know we will scale to thousands in the future, but that can't easily be made to fit inside Event Hubs.
In Azure, the match for the Kafka concept of a topic is the Event Hub, and then you also have namespaces, which match a Kafka cluster. In fact, each namespace has a different DNS name, making it a completely different system. The limitations are the following: you can have up to 10 event hubs per namespace, and up to 100 namespaces per subscription. Translated into Kafka jargon, that is up to 1,000 topics. Let's suppose that's enough for our purposes; however, I would need different parts of my application to connect to a different Kafka cluster (namespace) for every 10 topics I have, adding unneeded complexity to the whole story.
It seems like in the end I am trading the difficulty of managing the infrastructure of my own cluster for the difficulty of re-architecting my application so that it fits inside that strange 10-topics-per-cluster limit. With Kafka I can have 100 topics in one cluster. With Event Hubs I need 10 clusters of 10 topics each, which adds the complexity of knowing which cluster each of your consumers and producers needs to connect to. That completely changes the architecture of your application (making it much more complex).
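To make the routing problem concrete, this is roughly the kind of lookup layer we'd have to introduce (a minimal sketch: the namespace names and the topic-to-namespace mapping are made up, and the connection settings follow the pattern Azure documents for Kafka clients against Event Hubs):

```python
# Hypothetical topic -> namespace routing that a 10-topics-per-namespace
# limit forces on every producer and consumer.
from kafka import KafkaProducer  # pip install kafka-python

TOPIC_TO_NAMESPACE = {           # made-up mapping for illustration
    "orders": "mycompany-ns1",
    "payments": "mycompany-ns1",
    "audit": "mycompany-ns2",
}

def producer_for(topic: str, connection_strings: dict) -> KafkaProducer:
    """Create a producer wired to whichever namespace hosts the topic."""
    namespace = TOPIC_TO_NAMESPACE[topic]  # each namespace = its own "cluster"
    return KafkaProducer(
        bootstrap_servers=f"{namespace}.servicebus.windows.net:9093",
        security_protocol="SASL_SSL",
        sasl_mechanism="PLAIN",
        sasl_plain_username="$ConnectionString",
        sasl_plain_password=connection_strings[namespace],
    )
```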
I've looked through the Internet for an answer to this with no luck; everyone seems to see a lot of advantages in using Event Hubs, so I am starting to think maybe I am missing something. What would be an efficient way of fitting lots of topics inside that 10-topic limit without changing my architecture a lot?
Azure Event Hubs offers Kafka/Event Hubs data streaming under two different umbrellas: single tenancy and multi-tenancy. While multi-tenancy gives you the flexibility to reserve small and use small capacity, it is enforced with quotas and limits. These are stringent and cannot be flexed. The reason: you can think of multi-tenancy, by analogy, as one huge Kafka cluster whose CPU and memory are shared, with strict boundaries, among different tenants. To honor multi-tenancy on this infrastructure we define boundaries, and those boundaries are enforced by the quotas and limits.

Event Hubs is the only PaaS service that charges you for reserving your bandwidth and for the ingress of events; there is no egress charge. We let you ingress x MB/s and egress 2x MB/s, and the quotas keep us within this boundary.

Our single-tenant clusters can be thought of as mimicking an exact Kafka cluster, with no quotas attached. The limits enforced there are actual physical limits. The limits of 1,000 topics per namespace and 50 namespaces per capacity unit are soft limits which can be relaxed, as they merely enforce best practices. The cost story when you compare Standard and Dedicated is not that different; in fact, when you do more than 50 MB/s you gain an advantage, as with Dedicated the whole capacity is dedicated to one tenant. Also, a single capacity unit (the unit in which Dedicated clusters are sold) lets you achieve anywhere between 100 MB/s and 250 MB/s based on your send/receive pattern, payload size, frequency, and more.

For comparison purposes, although we do not do 0 TUs on Standard and there is no direct relation/mapping between Dedicated CUs and Standard TUs, below is a pricing example:
50 TUs = $0.03/hr × 50 = $1.50 per hour
50,000 events per second = 180,000,000 events per hour
180,000,000 / 1,000,000 = 180 units of 1,000,000 messages; 180 × $0.028 = $5.04
So, a grand total of $6.54 per hour.
Note that the above does not include Capture pricing. For a grand total of $6.85 per hour you get Dedicated with Capture included.
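As a quick sanity check on that arithmetic, here is the same calculation worked through in Python, using only the rates quoted above:

```python
# Standard-tier pricing example from above, step by step.
tu_rate_per_hour = 0.03           # $ per throughput unit per hour
throughput_units = 50
events_per_second = 50_000
price_per_million_events = 0.028  # $

capacity_cost = tu_rate_per_hour * throughput_units       # $1.50/hour
events_per_hour = events_per_second * 3600                # 180,000,000
ingress_cost = events_per_hour / 1_000_000 * price_per_million_events  # $5.04

print(f"Total: ${capacity_cost + ingress_cost:.2f} per hour")  # $6.54 per hour
```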
I was looking into the limitation; it seems that the Dedicated tier allows 1,000 event hubs per namespace, although there will be some additional cost due to the Dedicated tier.
I am currently working on understanding Event Hubs along with Azure Functions. I have read about event-driven scaling, which mentions the scale controller, but none of the Azure documents I referred to explain the logic behind the scale controller: on what basis does it dynamically scale in or out, and by what mechanism does it work?
Can anyone help me understand the logic behind the scale controller?
The exact algorithm used by the scale controller is not publicly available but, at a high level, it involves considering metrics over a period of time to understand whether the incoming rate of events is too quick, too slow, or just about right for the rate at which events are being processed.
That information is used as part of the computation of the ideal number of instances, which is then weighed against other factors from configuration and internal to the Functions runtime to vote on whether to add or remove instances.
The metrics themselves, and the associated calculations, are public and can be found in the EventHubScaleMonitor.
In a nutshell, it reads the last enqueued sequence number for a given partition and compares that to the last recorded checkpoint for that partition. The difference between these values is considered the number of events that remain to be processed for that partition (also known as the "event backlog"). There are some corner cases here, such as a sequence number rolling over to 0 once it hits Int64.MaxValue. Generally, though, it is fairly straightforward.
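As a rough illustration (not the actual EventHubScaleMonitor code, which is C# and handles more cases), the per-partition comparison amounts to something like this:

```python
INT64_MAX = 2**63 - 1

def partition_backlog(last_enqueued_seq: int, checkpointed_seq: int) -> int:
    """Approximate number of unprocessed events on one partition: the gap
    between the last enqueued sequence number and the last checkpoint."""
    if last_enqueued_seq >= checkpointed_seq:
        return last_enqueued_seq - checkpointed_seq
    # Corner case: the sequence number rolled over to 0 at Int64.MaxValue.
    return (INT64_MAX - checkpointed_seq) + last_enqueued_seq + 1

# 1,500 events enqueued, checkpoint at 1,200 -> backlog of 300.
print(partition_backlog(1_500, 1_200))
```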
The "Consuming Events" section of Integrate Event Hubs with serverless functions on Azure also provides some high-level context around scaling, with a focus on partitions for the Event Hub. It references some concepts from the legacy Event Hubs SDK package which is no longer used by the extensions, but the high-level details are still accurate.
You can find some of the settings that influence how Azure Functions scales in the host.json reference:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-host-json-v1?tabs=2x-durable-functions#http
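For instance, the v1 host.json has an eventHub section with settings like these (values here are illustrative, not recommendations; see the reference above for defaults):

```json
{
  "eventHub": {
    "maxBatchSize": 64,
    "prefetchCount": 256,
    "batchCheckpointFrequency": 1
  }
}
```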
Not an expert answer, but I'll try: for each trigger there are certain defaults (which you can of course override), and if those limits are exceeded a new instance is spawned.
We're considering using Service Fabric on-premises, fully or partially replacing our old solution built on NServiceBus, though our knowledge of SF is still a bit limited. What we like about NServiceBus is the out-of-the-box ability to declaratively throttle any service to a maximum number of threads. If we have multiple services and one of them starts hiccuping due to some external factor, we do not want the other services affected by that. The "problem" service would just take the maximum number of threads we allocate to it in its configuration, and its queue would start growing, but the other services would keep working fine as computer resources are still available. In Service Fabric, if we let our application create as many "problem" actors as it wants, it will lead to uncontrollable growth of the "problem" actors that will consume all server resources.
Any ideas on how we can protect our resources with SF in the situation I described? My first impression is that nothing like queuing or an actor-throttling mechanism is implemented in Service Fabric, and everything must be built manually.
P.S. I would think the capability to somehow balance resources between different types of actors inside one application, making them less dependent on each other in terms of resource consumption, is not a rare demand. I just can't believe there is nothing offered for that in SF.
Thanks
I am not sure how you would compare NServiceBus (which is a messaging solution) with Service Fabric, which is a platform for building microservices. Service Fabric supports many different types of workload, so it makes sense that it does not provide out-of-the-box throttling of threads and the like.
Also, what would you expect from Service Fabric when it comes to resource consumption by actors or services? It is up to you to decide what you want to do and how to react. I wouldn't want SF to kill my actors or throttle service requests automatically; I would expect mechanisms to notify me when that happens, and those are available.
That said, SF does have a mechanism to react to load using metrics. See the docs:
Metrics are the resources that your services care about and which are provided by the nodes in the cluster. A metric is anything that you want to manage in order to improve or monitor the performance of your services. For example, you might watch memory consumption to know if your service is overloaded. Another use is to figure out whether the service could move elsewhere where memory is less constrained in order to get better performance.
Things like Memory, Disk, and CPU usage are examples of metrics. These metrics are physical metrics, resources that correspond to physical resources on the node that need to be managed. Metrics can also be (and commonly are) logical metrics. Logical metrics are things like "MyWorkQueueDepth" or "MessagesToProcess" or "TotalRecords". Logical metrics are application-defined and indirectly correspond to some physical resource consumption. Logical metrics are common because it can be hard to measure and report consumption of physical resources on a per-service basis. The complexity of measuring and reporting your own physical metrics is also why Service Fabric provides some default metrics.
You can define your own custom metrics and have the cluster react to those by moving services to other nodes. Or you could use the health reporting system to issue a health event and have your application or an outside process act on that.
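If you do want NServiceBus-style capping inside your own service code, you would build it yourself. Here is a minimal, platform-agnostic sketch of the pattern (shown in Python for brevity, since Reliable Actors themselves are C#/Java; all names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Cap the "problem" workload at a fixed number of concurrent workers.
# Excess messages wait in the executor's internal queue instead of
# consuming every thread on the node, so other services stay responsive.
MAX_CONCURRENT = 4
problem_pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENT)

def process_slowly(message: str) -> None:
    ...  # the hiccuping external call goes here (hypothetical)

def on_message(message: str) -> None:
    # Returns immediately; at most MAX_CONCURRENT messages are being
    # processed at any time, the rest queue up harmlessly.
    problem_pool.submit(process_slowly, message)
```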
I have written an implementation of Azure Service Bus in our application using topics, which are subscribed to by a number of applications. One of the discussions in our team is whether we stick with a single topic and filter via the properties of the message, or alternatively create a topic per particular need.
Our scenario is that we wish to filter by a priority and an environment variable (the test and UAT environments share a connection).
So do we have Topics (something like):
TestHigh
TestMedium
TestLow
UatHigh
UatMedium
UatLow
OR, just a single topic with these values set as two properties?
My preference is to create separate topics, as we'd be utilising the functionality available, and I would imagine this would scale better under high load; I've read that peeking large queues can be inefficient. It also seems cleaner to subscribe to a single topic.
Any advice would be appreciated.
I would go with separate topics for each environment. It's cleaner. Message counts in topics can be monitored separately for each environment. It's marginally more scalable (e.g. topic size limits won't be shared), but the limits are generous and won't matter much in testing.
But my main argument: that's how production will (hopefully) go. That is, production will have its own connection (and namespace) in ASB, and will have separate topics. Thus you would not be filtering messages via properties in production, so why do it differently in testing?
Last tip: to make topic provisioning easier, I'd recommend having your app auto-create the topics on startup. It's easy to do: check if they exist, and create them if they don't.
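With the current Python SDK, that startup check might look roughly like this (a sketch; the topic names are taken from the question and the connection string is a placeholder):

```python
from azure.core.exceptions import ResourceNotFoundError
from azure.servicebus.management import ServiceBusAdministrationClient

TOPICS = ["TestHigh", "TestMedium", "TestLow"]  # from the list above

def ensure_topics(connection_string: str) -> None:
    """On startup, create any topic that does not exist yet."""
    admin = ServiceBusAdministrationClient.from_connection_string(connection_string)
    for name in TOPICS:
        try:
            admin.get_topic(name)       # raises if the topic is missing
        except ResourceNotFoundError:
            admin.create_topic(name)
```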
Either approach works. More topics and subscriptions mean that you have more entities to manage at deployment time. If High/Medium/Low reflect priorities, then multiple topics may be a better choice, since you can pull from the highest-priority subscription first.
From a scalability perspective there really isn't much of a difference that you would notice, since Service Bus already spreads the load across multiple logs internally; whether you use six topics or two topics will not make a material difference.
What does impact performance predictability is the choice of service class. If you choose "Standard", throughput and latency are best effort over a shared multi-tenant infrastructure. Other tenants on the same cluster may impact your throughput. If you choose "Premium", you get ringfenced resources that give you predictable performance, and your two or six Topics get processed out of that resource pool.
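For completeness, if you go the single-topic route, the property filtering lives in a subscription rule. A sketch with the Python SDK (the topic, subscription, and property names are made up, and I'm assuming the azure-servicebus management API here):

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage
from azure.servicebus.management import ServiceBusAdministrationClient, SqlRuleFilter

CONN = "<connection string>"  # placeholder

# Give the subscription a rule that only matches high-priority UAT messages.
# (In a real setup you would also remove the subscription's $Default
# match-all rule, otherwise every message still gets through.)
admin = ServiceBusAdministrationClient.from_connection_string(CONN)
admin.create_rule(
    "orders",       # hypothetical shared topic
    "uat-high",     # hypothetical subscription
    "UatHighOnly",
    filter=SqlRuleFilter("env = 'uat' AND priority = 'high'"),
)

# Publishers stamp the two properties instead of choosing among six topics.
with ServiceBusClient.from_connection_string(CONN) as client:
    sender = client.get_topic_sender(topic_name="orders")
    sender.send_messages(ServiceBusMessage(
        b"payload",
        application_properties={"env": "uat", "priority": "high"},
    ))
```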
In other words, if I create a messaging layout which uses a rather large number of messaging entities (several thousand) instead of a smaller number, is there something in Azure Service Bus that gets irritated by that and performs less than ideally, or that generates significantly different costs? Let us assume that the number of messages remains roughly the same in both scenarios.
So, to be clear, I am not asking whether a messaging layout with many entities is sound from the application's point of view, but rather whether there is something in Azure that performs badly in such situations. If there are advantages to it (perhaps Azure can scale it more easily), that would also be interesting.
I am aware of the 10,000-entity limit in a single Service Bus namespace.
I think it is more a matter of the programming and architecture of the solution. For example, we saw problems with ACS (the authentication mechanism): SB sometimes started to throttle the client when there were many requests. Take a look at the guidance about SB high availability; there are some issues listed that should be considered when you have a lot of load.
And you always have other options that can be more suitable for high-load scenarios, for example Azure Event Hubs, a more lightweight queue mechanism intended as the service for extremely high volumes of messages.
I am trying to understand how I can make an Azure Service Bus topic scale to handle >10,000 requests/second from more than 50 different clients. I found this article from Microsoft - http://msdn.microsoft.com/en-us/library/windowsazure/hh528527.aspx - which provides a lot of good input on scaling Azure Service Bus, like creating multiple message factories, sending and receiving asynchronously, and doing batched sends/receives.
But all of this input is from the perspective of the publisher and subscriber clients. What if the node running the topic cannot handle the huge number of transactions? How do I monitor that? How do I have the topic running on multiple nodes? Any input on that would be helpful.
I'm also wondering if anyone has done any capacity testing with topics/queues; I am eager to see those results...
Thanks,
Prasanna
If you need 10K or 100K or 1M or more requests per second, take a look at what's done on the highway. More traffic, more lanes.
You can get effectively arbitrary flow rates out of Service Bus by partitioning your traffic across multiple entities. Service Bus gives a number of assurances about reliability, e.g. that we don't lose messages once we have taken them from you, and that we assign gapless sequence numbers, and that has a throughput impact on an individual entity like a single topic. That's exactly like a single highway lane only being able to deal with X cars/hour. Make more lanes.
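A sketch of the "more lanes" idea, spreading sends across N topics by hashing a partition key (the entity names are hypothetical; shown with the Python Service Bus SDK):

```python
import zlib
from azure.servicebus import ServiceBusClient, ServiceBusMessage

LANES = ["orders-0", "orders-1", "orders-2", "orders-3"]  # hypothetical topics

def lane_for(key: str) -> str:
    # Stable hash so a given key always lands in the same lane,
    # preserving per-key ordering while multiplying total throughput.
    return LANES[zlib.crc32(key.encode()) % len(LANES)]

with ServiceBusClient.from_connection_string("<connection string>") as client:
    sender = client.get_topic_sender(topic_name=lane_for("customer-42"))
    sender.send_messages(ServiceBusMessage(b"payload"))
```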
Since these replies, Microsoft has released a ton of new capability.
Azure Auto-Scale can monitor the messages in a queue (or CPU load) and start or stop instances to maintain a target.
Service Bus introduced partitioned queues (and topics). This lets you send messages over multiple queues, while they look like a single queue to your API, dramatically increasing the throughput of a queue.
Before you do that, I'd recommend you try:
Async & Batched writes to the queue.
Change the Prefetch parameter on the reads.
Also look at Receive.OnMessage() to ensure you get messages the millisecond they are available.
This will improve your perf from ~5 messages/sec to many hundreds or thousands per sec; a sketch of the batching and prefetch pieces follows below.
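For example, with the current Python SDK (the advice above refers to the older .NET client, so treat this as the equivalent pattern rather than the same API; the queue name and connection string are placeholders):

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN = "<connection string>"  # placeholder

with ServiceBusClient.from_connection_string(CONN) as client:
    # Batched writes: pack many messages into a single send operation.
    sender = client.get_queue_sender(queue_name="myqueue")
    batch = sender.create_message_batch()
    for i in range(100):
        batch.add_message(ServiceBusMessage(f"event {i}"))
    sender.send_messages(batch)

    # Prefetch: pull messages ahead of processing to cut per-receive latency.
    receiver = client.get_queue_receiver(
        queue_name="myqueue", prefetch_count=100, max_wait_time=5
    )
    for msg in receiver:
        receiver.complete_message(msg)
```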
The Service Bus has its limitations ("Capacity and Quotas"); check out this article for a very good overview of these: https://learn.microsoft.com/en-gb/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted
I suggest you reach out to your local MSFT specialist if you have a use case that will push the boundaries of Azure Service Bus. MSFT has dedicated teams in Redmond (and around the world) that can help you design for and push these boundaries at massive scale: the Windows Azure CAT (Customer Advisory Team). Their goal is to solve real-world customer problems, and it sounds like you might have one...
You need to performance- and load-test to reach the answers to all the questions above for your specific scenario.
The Azure CAT team has a wealth of metrics from capacity and load testing with Service Bus (and Azure in general); these are not always publicly available, so again, reach out if you can...
If it can handle that many requests, you want to make sure that you receive the messages in such a way that you don't hit the maximum size of the topic. You can use multiple instances of a worker role in Azure to listen to specific subscriptions, so you would be able to process messages faster without getting near the maximum size.