Many ordered queues - how to auto-rebalance streams between app instances? - node.js

Problem description
I want to deploy a distributed, ordered queue solution for my project, but I have questions/problems:
Which tool/solution should I use? Which would be the easiest to implement/learn and cost the least in infrastructure? RabbitMQ, Kafka, Redis Streams?
How do I implement auto-rebalancing of topics/streams across consumers when an instance fails or when a new topic/stream is added to the system?
In other words, I want to realize something like this:
[diagram: distributed queues]
...but if one of my application instances fails, the other instances should take over its share of the traffic, properly distributed (equal load).
Note that my code is written in Node.js v10 (TypeScript) and my infrastructure is based on Azure, so besides self-hosted solutions (like RabbitMQ), Azure-based solutions (like Azure Service Bus) are also possible; but the less vendor lock-in, the better the solution is for me.
My current architecture
Here is a more detailed background of my system:
I have 100,000 vehicle tracker devices (different ones, from many manufacturers, speaking many protocols); each of them communicates with one of my custom apps called a decoder. This small microservice decodes and unifies the payload from the tracker and sends it to a distributed queue. Each tracker sends a message every 10-30 seconds.
Note that I must keep the order of messages coming from a single device; this is very important!
In the next step, I have a processing app microservice which I want to scale (forking/clustering) depending on the number of tracker devices. Each fork of this app should subscribe to some of the topics/consumer groups to process messages from devices while keeping their order. Processing each message takes about 1-3 seconds.
Note that at any moment I can add or remove tracker devices; this information should propagate automatically to the forks of the processing app, and those instances should be able to rebalance the traffic from the queue on their own.
The question is how to do that with as few lines of (Node.js) code as possible, while keeping the solution easy, clean and cheap? :)
As you can see in the picture above, if fork no. 3 fails, the system must decide which of the working forks should get the "blue" messages. Also, if fork no. 3 comes back, rebalancing is needed again.
My own research
I read about Apache Kafka with consumer groups, but Kafka is difficult for me to learn and implement.
I read about RabbitMQ with consumer groups / many topics, but I don't know how to write the auto-rebalancing feature, nor how exactly I should use RabbitMQ (which plugins? which settings/configurations? there are so many options...).
I read about Azure Service Bus with message sessions, but it implies vendor lock-in (Azure cloud), it costs a lot, and, like the other solutions, it doesn't provide full auto-rebalancing out of the box.
I read about Redis Streams (with consumer groups), but it's a new feature (there is a lack of libraries for Node.js) and it also doesn't provide auto-rebalancing.

1 Message Broker
For the first question, you should look for a mature M2M protocol broker that will give you the freedom to design your own intelligent data-switching algorithms.
2 Load Balancer
For the second question, you must employ a well-performing load balancer to handle such a huge number (100,000) of connected cars. My suggestion is to use Azure API Gateway or an Nginx load balancer.
Now let's look at some connected car solutions and analyze how AWS IoT or Azure IoT does the job nicely.
OpenSource IoT Solution
Nginx or an API gateway is used for load-balancing purposes, while the event processing is done on Kafka. Using Kafka you can implement your own rule engine for intelligent data switching. Similarly, any message broker acting as an IoT bridge would do well. If I were you, I would use VerneMQ to implement MQTT v5 features and data routing; in that case a queue is not required.
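To make the Kafka option concrete for the Node.js stack the asker mentioned: here is a minimal sketch with the kafkajs package (broker address, topic, and group names are made up). Keying messages by device ID preserves per-device order, and the consumer group gives the automatic rebalancing the question asks about:

```typescript
import { Kafka } from "kafkajs";

// Hypothetical broker address, topic, and group names.
const kafka = new Kafka({ clientId: "tracking", brokers: ["kafka:9092"] });

// Decoder side: key by device ID so every message from one tracker lands in
// the same partition; partitions are what preserve per-device ordering.
async function publishSample() {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "tracker-events",
    messages: [{ key: "device-42", value: JSON.stringify({ lat: 0, lon: 0 }) }],
  });
  await producer.disconnect();
}

// Processing fork: all forks share one groupId; Kafka reassigns partitions
// automatically whenever a fork joins, leaves, or crashes (auto-rebalancing).
async function startFork() {
  const consumer = kafka.consumer({ groupId: "processing-app" });
  await consumer.connect();
  await consumer.subscribe({ topic: "tracker-events" });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      // 1-3 s of work per message; messages within a partition arrive in order
      console.log(partition, message.key?.toString(), message.value?.toString());
    },
  });
}

publishSample().then(startFork).catch(console.error);
```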
Again, if you want to use an Azure queue, you have to concentrate on managing queue forking and preemption. To control the queue seamlessly you have to write an Azure Queue Trigger serverless Function. Thus your goal of avoiding vendor lock-in would be impossible to achieve.
In short, VerneMQ (an MQTT v5 implementation) together with Nginx would be great to implement, but as all of these are open-source products you must be strong in implementation and troubleshooting, otherwise your business operation would suffer from a lack of support.
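If you went the VerneMQ route, the Node.js side could look roughly like this; a sketch using the MQTT.js package (broker URL and topic layout are invented for illustration), with protocolVersion: 5 selecting MQTT v5:

```typescript
import mqtt from "mqtt";

// Hypothetical broker URL and topic layout.
const client = mqtt.connect("mqtt://vernemq.example.com:1883", {
  protocolVersion: 5, // MQTT v5
});

client.on("connect", () => {
  // One topic per tracker; '+' matches every device ID. The $share prefix is
  // an MQTT v5 shared subscription: the broker spreads messages across all
  // subscribers in the "processing" group, its analog of a consumer group.
  // Note: shared subscriptions balance load, but do not by themselves
  // guarantee per-device ordering across group members.
  client.subscribe("$share/processing/trackers/+/telemetry", { qos: 1 });
});

client.on("message", (topic, payload) => {
  const deviceId = topic.split("/")[1];
  // decode & route this device's payload...
  console.log(deviceId, payload.toString());
});
```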
It is better to use professional IoT cloud services for a solution with thousands of connected cars. This pays off, as the SLA of such services is of a very high standard and little effort is needed for system operation management.
Azure IoT Solution
If you are using an Azure solution, you would be using IoT Hub, where you don't have to worry about load balancing. Using the Azure device SDK you can connect all the cars (via mobile LTE SIM, OBD plugin, etc.) to the cloud. Azure Functions can then handle the event processing, and so on.
AWS IoT Solution
Unlike the Azure IoT device SDK, AWS IoT has its own SDK for devices. But in this architecture we wanted to complete the connected car project a little differently. For the sake of synchronizing the thing shadow with the actual device status, we used the AWS Greengrass core solution on the edge side. Together with serverless IoT event processing, this settled the whole connected car solution.
Similarly, Azure IoT Edge could be used to feed all the CAN bus information to the device twin and synchronize between the actual car and its twin.
I hope this gives you a clear idea of how to implement this, and of the cost/benefit trade-off between the vendor-locked and vendor-unlocked situations.
Thank you.

Related

Azure EventHub Push/Pull?

When it comes to Apache Kafka, I know that on the consumer side it's a pull model. What about Azure Event Hubs? Are they pull or push?
From what I've gathered so far, unlike Kafka, Event Hubs "push" events to the listeners. Can someone confirm? Any additional details or references would be helpful.
A simple Google search landed me on this page to back up my claim.
Is there a simple way to test this theory out?
Yes, Azure Event Hubs pushes events to event consumers; there is no need to "poll" for events. The event processor defines event handlers which are invoked as new events are ingested into the event stream.
The event consumer can also perform a checkpoint, which marks the point up to which events have been consumed.
See the doc for more details.
The short answer to this is that the model for consuming events depends on the type of client that your application has chosen to use. The official Azure SDK packages offer consumer types that are push-based and those that are pull-based.
You don't mention the specific language that you're using but, since you're comparing to Kafka, I'll assume that you're interested in Java. The azure-messaging-eventhubs family of packages is the current generation of the Azure SDK and has the following clients for reading events:
EventProcessorClient: This is a push-based client intended to serve as the primary consumer of events in production scenarios for the majority of workloads. It is responsible for reading and processing events for all partitions of an Event Hub and collaborates with other EventProcessorClient instances using the same Event Hub and consumer group to balance work between them. A high degree of fault tolerance is built-in, allowing the processor to be resilient in the face of errors.
EventHubConsumerAsyncClient: This is a push-based client focused on reading events from a single partition using a Flux-based subscription via the Reactor library. This client requires applications to own responsibility for resilience and processing state persistence.
EventHubConsumerClient: This is a pull-based client focused on reading events from a single partition using an iterator pattern. This client requires applications to own responsibility for resilience and processing state persistence.
More information about the package, its types, and basic usage can be found in the Azure Event Hubs client library for Java overview. More detailed samples can be found in the samples overview, including those for consuming events and using the EventProcessorClient.
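Since the opening question in this thread is Node.js-based, it may help to note that the @azure/event-hubs package exposes the same push model via EventHubConsumerClient.subscribe(). A minimal sketch (connection string and hub name are placeholders):

```typescript
import { EventHubConsumerClient } from "@azure/event-hubs";

// Placeholder connection details.
const client = new EventHubConsumerClient(
  "$Default",                       // consumer group
  "<event-hubs-connection-string>",
  "<event-hub-name>"
);

// subscribe() is push-based: the handlers are invoked as events arrive.
const subscription = client.subscribe({
  processEvents: async (events, context) => {
    for (const event of events) {
      console.log(`partition ${context.partitionId}:`, event.body);
    }
    // With a checkpoint store configured (e.g. a BlobCheckpointStore), you
    // would call context.updateCheckpoint(events[events.length - 1]) here.
  },
  processError: async (err, context) => {
    console.error(`error on partition ${context.partitionId}:`, err);
  },
});
// Later: await subscription.close(); await client.close();
```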

Microservices, how to notify backend when task complete

For example, suppose I have a main application (backend) and some microservice, e.g. for image cropping.
A user uploads an image, making a request to the backend; the backend, using RabbitMQ, posts a new task to the queue; then the image cropping service picks up the task and completes it, and I need to somehow notify the backend.
What are the options for this? Do I need another microservice for such notifications?
so... there are reaaaaaaly many ways to do that.
On the high level, what you want to achieve is to produce an event that 1 or more services can react to. Now depending on what you have available, you can produce the event in a number of different ways.
if you want to be completely platform independent, you can use Apache Kafka. It's a popular service specifically for what we need -> publishing events and processing them at mass-scale. Kafka can be clustered, partitioned, have multiple parallel consumers of the same type (like multiple instances of your main backend service) or different types (3 different microservices that happen to be interested in a specific event). This bad boy just has it all and is famous for that. You can set up a cluster yourself or use one that comes out-of-the-box with some of the cloud platforms (like AWS for instance), but this might be more expensive and difficult to use compared to some cloud-specific fully-managed solutions.
if you're running your stuff on the google cloud, you can make it easier and cheaper by using the PubSub service. PubSub is a fully managed service that is scaled out-of-the-box (welcome to the cloud! you don't need to scale or cluster anything by yourself!).
if you're running on AWS, you can use SNS, or a more recent alternative - EventBridge (kinda like SNS, but booooooy what can it not do?). Yeah... I would recommend EventBridge. It can just do more... with the target filtering rules, payload transformations, it can automatically trigger more things...
Azure... ehm... Event Hub... but I haven't worked with this one yet... I'm not much of an Azurer... because you know... nobody uses azure for this kind of stuff...
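Since the question already has RabbitMQ in place, a low-ceremony option worth mentioning is to have the cropping service publish a "task finished" event to a fanout exchange that the backend binds a queue to. A rough sketch with the amqplib package (connection URL, exchange, and queue names are made up):

```typescript
import amqp from "amqplib";

const URL = "amqp://localhost";  // made-up connection details
const EXCHANGE = "image-events"; // made-up exchange name

// Cropping service: announce completion instead of calling the backend directly.
async function publishDone(taskId: string): Promise<void> {
  const conn = await amqp.connect(URL);
  const ch = await conn.createConfirmChannel();
  await ch.assertExchange(EXCHANGE, "fanout", { durable: true });
  ch.publish(EXCHANGE, "", Buffer.from(JSON.stringify({ taskId, status: "cropped" })));
  await ch.waitForConfirms(); // make sure the broker accepted it before exiting
  await conn.close();
}

// Backend: bind its own queue to the exchange and react to completion events.
async function listenForDone(): Promise<void> {
  const conn = await amqp.connect(URL);
  const ch = await conn.createChannel();
  await ch.assertExchange(EXCHANGE, "fanout", { durable: true });
  const { queue } = await ch.assertQueue("backend-notifications", { durable: true });
  await ch.bindQueue(queue, EXCHANGE, "");
  await ch.consume(queue, (msg) => {
    if (!msg) return;
    const event = JSON.parse(msg.content.toString());
    // mark the task as done, push a notification to the user, etc.
    console.log("task finished:", event.taskId);
    ch.ack(msg);
  });
}
```

A fanout exchange also means any future service interested in these events just binds its own queue, with no change to the cropping service.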

Can Azure EventHub be used for critical transactional data in production?

Reading the documentation, Azure Event Hubs is meant for:
Application instrumentation
User experience or workflow processing
Internet of Things (IoT) scenarios
Can this be used for transactional data, handling revenue or application-sensitive data?
Based on what I read, it looks like it is meant for handling data where one need not worry about data loss. Is this the case?
It is mainly designed for large-scale ingestion of data. That is why typical scenarios include IoT solutions, which consist of a multitude of devices sending mass amounts of telemetry data.
To allow for this kind of scale it does not include some features that other messaging services, like Azure Service Bus, do have. I think this blog does a good job of listing the differences. Especially the section Use Case explains things very well:
From a target use case perspective if we consider some of our typical enterprise integration patterns then if you are implementing a pattern which uses a Command Message, or a Request/Reply Message then you probably want to use Azure Service Bus Messaging.  RPC patterns can be implemented using Request/Reply messages on Azure Service Bus using a response queue.  These are really about ESB and EAI style messaging patterns where you want to send messages between applications and probably want to use other features such as property based routing.
Azure Event Hubs is more likely to be used if you’re implementing patterns with Event Messages and you want somewhere reliable to send them that is capable of dealing with a massive scale but will allow you to do stuff with the events out of process.
With these core target use cases in mind it is easy to see where the scale differences come into play.  For messaging it’s about one application telling one or more apps to DO SOMETHING or GIVE ME SOMETHING.  The alternative is that in eventing the applications are saying SOMETHING HAS HAPPENED.  When you consider this in typical application scenarios and you put events into the telemetry and logging space you can quickly see that the SOMETHING HAS HAPPENED scenario will produce a lot more traffic than the other.
Now I’m not saying that you can’t implement some messaging type functions using event hubs and that you can’t push events to a Service Bus topic as in integration there are always different requirements which result in different implementation scenarios, but I think if you follow the above as a general rule then you will usually be on the right path.
That does not mean, however, that it is only capable of handling data where one need not worry about data loss. Data is stored for a configurable amount of time and, if necessary, this data can be re-read from an earlier point in time.
Now, given your scenario, I do not think Event Hubs is the best fit. But truth be told, I am not sure, because you would have to elaborate more on what exactly you want to do.
Addition
The idea behind Event Hubs is that you will get at-least-once delivery at great scale (source). See also this question: Does Azure Event Hub guarantee at least once delivery?

Architecting multi-service enterprise applications using Azure cloud services

I have some questions regarding architecting enterprise applications using Azure cloud services.
Back Story
We have a system made up of about a dozen WCF Windows Services on a SQL backend. We currently have about 10 clients but expect that to grow to potentially a hundred, with perhaps a hundredfold increase in the throughput demands on the system. The current system is poorly engineered and is simply not capable of scaling. So now appears to be the appropriate juncture to reengineer on the Azure platform.
Process Flow
Let me briefly describe a simplified set of the services and the process flow, and then ask some questions I have regarding utilizing Azure cloud services to build the new system.
Service A is logged on to an external system and downloads data continuously.
Service B is logged on to a second external system and downloads data continuously.
There can only ever be one logged-in instance each of services A and B.
Both A and B hand off their data to Service C which reconciles the data from the two external sources.
Validated and reconciled data is then passed from C to Service D which performs some accounting functions and then passes the resulting data to Services E and F.
Service E is continually logged in to an external system and uploads data to it.
Service F generates reports and publishes them to clients via FTP etc
The system is actually far more complex than this but the above illustrates the processes involved. The system runs 24 hours a day 6 days a week. Queues will be used to buffer messaging between all the services.
We could just build this system using Azure persistent VMs and utilise the Service Bus, queues, etc., but that would tie us into a vertical scaling strategy. How could we utilise cloud services to implement it, given the following questions?
Questions
Given that Services A, B and E are permanently logged in to external systems, there can only ever be one active instance of each. If we implement these as single-instance worker roles, there is the issue of downtime and patching (which is unacceptable). If we created two instances of each, is there a standard way to implement active-passive load balancing with worker roles on Azure, or would we have to build our own load balancer? Is there another solution to this problem that I haven't thought of?
Services C and D are good candidates for scaling using multiple worker role instances. However, each instance would have to process related data. For example, we could have 4 instances, each processing data for 5 individual clients. How can we get messages to be processed in groups (client-centric) by each instance? Also, how would we redistribute load from one instance to the remaining instances when patching takes place, etc.? For example, if instance 1, which processes data for 5 clients, goes down for OS patching, the data for its clients would then have to be processed by the remaining instances until it came back up again. Similarly, how could we redistribute the load if we decide to spin up additional worker roles?
Any insights or suggestions you are able to offer would be greatly appreciated.
Mat
Question #1: you will have to implement your own load balancing. This shouldn't be terribly complex, as you could use the Blob storage lease functionality to keep a mutex on some blob in storage from one instance while holding the connection active to your external system. Every X period of time you could renew the lease if you know that the connection is still active and successful. Every other worker in the role would check on that lease to see if it expires. If it ever expires, the next worker would jump in, acquire the lease, and then open the connection to the external source.
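With today's SDKs, that lease-based mutex could be sketched like this using the @azure/storage-blob package (container/blob names and timings are arbitrary choices, not a prescribed recipe):

```typescript
import { BlobServiceClient } from "@azure/storage-blob";

// Hypothetical storage account and lock blob names.
const service = BlobServiceClient.fromConnectionString(process.env.AZURE_STORAGE_CONN!);
const container = service.getContainerClient("locks");
const blob = container.getBlockBlobClient("service-a-leader");

// Try to become the active node by acquiring a lease on a well-known blob.
export async function tryBecomeLeader(): Promise<boolean> {
  await container.createIfNotExists();
  try {
    // Create the lock blob only if it doesn't exist yet (ifNoneMatch: "*").
    await blob.upload("", 0, { conditions: { ifNoneMatch: "*" } });
  } catch {
    /* already exists; that's fine */
  }
  const lease = blob.getBlobLeaseClient();
  try {
    await lease.acquireLease(30); // durations of 15-60 s are allowed
  } catch {
    return false; // another instance holds the lease: stay passive
  }
  // Renew well inside the lease window while the external connection is healthy.
  setInterval(() => {
    lease.renewLease().catch(() => process.exit(1)); // lost the lease: let a passive node take over
  }, 10_000);
  return true;
}
```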
Question #2: Look into Azure Service Bus. It has a capability that allows clients to process related messages. More info here: http://geekswithblogs.net/asmith/archive/2012/04/02/149176.aspx
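For a rough idea of what that session-based grouping looks like in code today, here is a sketch with the @azure/service-bus package (queue name is illustrative; the queue must be created with sessions enabled). Setting sessionId to the client ID keeps each client's messages together and ordered, and each worker locks the next free session:

```typescript
import { ServiceBusClient } from "@azure/service-bus";

const sb = new ServiceBusClient("<service-bus-connection-string>"); // placeholder

// Sender: group messages by client so they stay together and in order.
export async function sendForClient(clientId: string, body: unknown): Promise<void> {
  const sender = sb.createSender("work");
  await sender.sendMessages({ body, sessionId: clientId });
  await sender.close();
}

// Worker: lock the next unowned session; only this instance sees its messages
// until the lock is released, which is how "groups per instance" holds.
export async function processNextClient(): Promise<void> {
  const receiver = await sb.acceptNextSession("work");
  const messages = await receiver.receiveMessages(20, { maxWaitTimeInMs: 5000 });
  for (const msg of messages) {
    // process in order for this client...
    await receiver.completeMessage(msg);
  }
  await receiver.close(); // releases the session for other workers
}
```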
All queuing methodologies imply that if a message gets picked up but is not processed within a configurable amount of time, it goes back onto the queue so that the next available instance can pick it up and process it.
You can use something like AzureWatch to monitor the depth of your queues (storage or Service Bus) and auto-scale the number of instances in your C and D roles to match, and to monitor instance statuses for roles A, B and E to make sure there is always a healthy instance there, auto-scaling if the quantity of ready instances drops to 0.
HTH
First, back up a step. One of the first things I do when looking at application architecture on Windows Azure is to qualify whether or not the app is a good candidate for migration to Windows Azure. I particularly look at how much integration is in the application — integration is always more difficult than expected, doubly so when doing it in the cloud. If most of your workload needs to be done through a single, always-on connection, then you are going to struggle to get the availability and scalability that we turn to the cloud for.
Without knowing the detail of your application, but by way of example, assume services A & B are feeds from a financial data provider. Providers of data feeds are really good at what they do, have high availability, and provide 'enterprise grade' (whatever that means) for enterprise-grade costs. Their architectures are also old-school and, in some cases, very rigid. So first off, consider asking your feed provider (which gives you a login/connection and expects you to pull data) to push data to you via a web service. Exposed web services are the solution to scaling and performance, and are used everywhere from table storage on Azure to high-throughput database services like DynamoDB. (I'll challenge any enterprise data provider to explain how a service like Amazon S3 is mickey-mouse.) If your data supplier pushed data to a web service via an agreed API, you could perform all sorts of scaling and availability work on the service for a low engineering cost.
Your alternative is, as you are discovering, to build a whole lot of stuff to make sure that your architecture fits in with the single-node model of your data supplier. While it can be done, you are going to spend a lot of engineering cash on hand-rolling a whole bunch of distributed computing principles. If you are going to have an active-passive architecture, you need to implement a leader election algorithm in order to determine when a passive node should become active. This is not as trivial as it sounds as an active node may look like it has disappeared, but is still processing — and you don't want to slot another one in its place. So then you will implement a heartbeat, or even a separate 'witness' node that does nothing other than keep an eye on which nodes are alive in order to do something about them. You mention that downtime and patching is unacceptable. So what is acceptable? A few minutes or a few seconds, or less than a second? Do you want the passive node to take over from where the other left off, or start again?
You will probably find that the development cost of implementing all of this is higher than the cost of building and hosting a highly available physical server. Perhaps you can separate the loads and run the data feed services in a co-lo on a physical box, and have the heavy lifting of the processing done on Windows Azure. I wouldn't even look at Azure VMs, because although they don't recycle as much as roles, they are subject to occasional problems — at least more than enterprise-grade hardware. Start off with discussions with your supplier of the data feeds — they may have a solution, or one that can be cobbled together (e.g. two logins for the price of one, where the 'second' account/instance mostly throws away its data).
Be very careful of traditional enterprise integration. They ask for things that seem odd in today's cloud-oriented world. I've had a request that my calling service have a fixed ip address, for example. You may find that the code that you have to write to work around someone else's architecture would be better spent buying physical servers. Push back on the data providers — it is time that they got out of the 90s.
[Disclaimer] 'Enterprises', particularly those that are in financial services, keep saying that their requirements are special — higher throughput, higher security, high regulations and higher availability. With the exception of a very few cases (e.g. high frequency trading), I tend to call 'bull' on most of this. They are influenced by large IT budgets and vendors of expensive kit taking them to fancy lunches, and are indoctrinated to their server-hugging beliefs. My individual view on the enterprise hardware/software/services business has influenced this answer. Your mileage may vary.

Azure Service Bus Scalability

I am trying to understand how I can make an Azure Service Bus topic scale to handle >10,000 requests/second from more than 50 different clients. I found this article at Microsoft - http://msdn.microsoft.com/en-us/library/windowsazure/hh528527.aspx. It provides a lot of good input on scaling Azure Service Bus, like creating multiple message factories, sending and receiving asynchronously, and doing batch send/receive.
But all of this input is from the publisher and subscriber client perspective. What if the node running the topic cannot handle the huge number of transactions? How do I monitor that? How do I have the topic running on multiple nodes? Any input on that would be helpful.
Also wondering if anyone has done any capacity testing with topics/queues; I am eager to see those results...
Thanks,
Prasanna
If you need 10K or 100K or 1M or more requests per second, take a look at what's done on the highway. More traffic, more lanes.
You can get effectively arbitrary flow rates out of Service Bus by partitioning your traffic across multiple entities. Service Bus gives a number of assurances about reliability, e.g. that we don't lose messages once we have taken them from you, or that we assign gapless sequence numbers, and that has a throughput impact on an individual entity like a single topic. That's exactly like a single highway lane only being able to deal with X cars/hour. Make more lanes.
Since these replies, Microsoft has released a ton of new capability.
Azure Auto-Scale can monitor the messages in a queue (or CPU load) and start or stop instances to maintain that target.
Service Bus introduced partitioned queues (& topics). This lets you send messages over multiple queues, but they look like a single queue to your API, dramatically increasing the throughput of a queue.
Before you do that, I'd recommend you try:
Async & batched writes to the queue.
Changing the Prefetch parameter on the reads.
Also look at Receive.OnMessage() to ensure you get the messages the millisecond they are available.
This will improve your perf from ~5 messages/sec to many hundreds or thousands per sec.
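For a rough idea, the "async & batched writes" tip above translates to something like this in the current Node.js SDK; a sketch using @azure/service-bus (connection string and queue name are illustrative):

```typescript
import { ServiceBusClient } from "@azure/service-bus";

const sb = new ServiceBusClient("<service-bus-connection-string>"); // placeholder
const sender = sb.createSender("ingest");

// Pack as many messages as fit under the size limit, then send each batch
// in a single round trip instead of one call per message.
export async function sendBatched(bodies: unknown[]): Promise<void> {
  let batch = await sender.createMessageBatch();
  for (const body of bodies) {
    if (!batch.tryAddMessage({ body })) {
      await sender.sendMessages(batch); // flush the full batch
      batch = await sender.createMessageBatch();
      if (!batch.tryAddMessage({ body })) {
        throw new Error("message too large for an empty batch");
      }
    }
  }
  if (batch.count > 0) {
    await sender.sendMessages(batch);
  }
}
```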
Service Bus has its limitations; see "Capacity and Quotas". This article gives a very good overview of these: https://learn.microsoft.com/en-gb/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted
I suggest you reach out to your local MSFT specialist if you have a use case that will push the boundaries of Azure Service Bus. MSFT has dedicated teams in Redmond (and around the world) that can help you design for and push these boundaries at massive scale: the Windows Azure CAT (Customer Advisory Team). Their goal is to solve real-world customer problems, and it sounds like you might have one...
You will need to performance- and load-test to get all the answers to your questions above for your specific scenario.
The Azure CAT team has a wealth of metrics on capacity and load testing with Service Bus (and Azure in general); these are not always publicly available, so again, reach out if you can...
If it needs to handle that many requests, you want to make sure that you receive messages in such a way that you don't hit the max size of the topic. You can use multiple instances of a worker role in Azure to listen to specific subscriptions, so you can process messages faster without getting near the max size.
