I am developing a consumer which consumes events from multiple Kinesis streams. I have some questions to understand the best practices.
Should I create one channel per stream? What factors should be considered to decide between "channel per stream" or "one channel for all streams"?
Which channel type would perform best in my case? There are several, such as PollableChannel, SubscribableChannel and DirectChannel.
Thank you
The KinesisMessageDrivenChannelAdapter is an active component: it consumes records and sends messages on its own task executor. There is therefore no need to hand messages off to a QueueChannel or an ExecutorChannel; the logic is already asynchronous and uses enough threads on the machine. It is much better not to shift processing to a separate thread. Keeping the consumption thread busy with processing stops it from polling more Kinesis records into memory.
One KinesisMessageDrivenChannelAdapter can do essentially the same work for several streams as several separate adapters, one per stream. Either way, the machine's thread capacity is what gets used.
You need separate channel adapters only when the streams require different processing logic, different data types, or different Kinesis client options. In all other cases a single instance is sufficient.
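The point about not handing messages off to a queue can be illustrated with a small in-process model. This is only a sketch: it does not use Spring Integration, and the `run` function and its `direct` flag are hypothetical names. The loop stands in for the adapter's consumer thread, and the deque stands in for a QueueChannel.

```python
from collections import deque


def run(records, direct):
    """Model one consumer loop; return (processed, max_buffered)."""
    buffer = deque()      # stands in for a QueueChannel (assumption)
    processed = []
    max_buffered = 0
    for record in records:                 # the consumer's polling loop
        if direct:
            # Direct handoff: the record is fully handled before the
            # loop moves on, so nothing accumulates in memory.
            processed.append(record.upper())
        else:
            # Queued handoff: the loop keeps pulling records while
            # unprocessed ones pile up in the buffer.
            buffer.append(record)
            max_buffered = max(max_buffered, len(buffer))
    while buffer:                          # a separate worker drains later
        processed.append(buffer.popleft().upper())
    return processed, max_buffered
```

Both modes process the same records, but only the queued mode holds unprocessed records in memory while the consumer keeps reading.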
I have been struggling to find the key pros and cons of using one over the other when it comes to sharing data between two microservices, especially at scale.
My assumption, and my question, is this: if we use a combination of CDC feeding a queue and a subscriber on that queue, we can more or less eliminate the need to publish to the message queue from our application layer (which might be more prone to human error).
I arrived at this line of thought while evaluating MongoDB "changestreams" and have been curious ever since.
When using CDC in this way, you're basically turning your microservice's database into a message broker. That has the advantage of not requiring a separate message broker. It has the disadvantages of deeply coupling the consuming microservices to the producing microservice, especially since every new consuming microservice will effectively impose some extra load on the source microservice's database.
CDC can, however, be a reliable way to feed a pubsub topic on a message broker. It is best to recognize that this still couples the source microservice's internal data model to the data model used for interservice communication, which tends to mean that a change to one requires changes to all. Since one of the primary (and arguably the only always-valid-in-general) reasons to adopt microservices is to allow changes with minimal coordination, it may be advisable to have the CDC feed a single service that is responsible for translating the CDC records into the wire model (e.g. domain events with an agreed-upon schema).
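A minimal sketch of such a translating service follows. The shape of the change record (`table`, `op`, `before`, `after`) and the `OrderShipped` event are hypothetical; real CDC tools each have their own record format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrderShipped:
    """Wire-model event with an agreed-upon schema (hypothetical)."""
    order_id: str
    shipped_at: str


def translate(change: dict) -> Optional[OrderShipped]:
    """Map a raw change record to a domain event, or None if irrelevant."""
    if change.get("table") != "orders" or change.get("op") != "update":
        return None
    before = change.get("before") or {}
    after = change.get("after") or {}
    # Emit the event only on the transition into SHIPPED, so that
    # consumers never see the internal column layout.
    if before.get("status") != "SHIPPED" and after.get("status") == "SHIPPED":
        return OrderShipped(order_id=str(after["id"]),
                            shipped_at=after["updated_at"])
    return None
```

With this in place, a change to the internal table layout only requires updating `translate`, while consumers keep depending on `OrderShipped`.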
I'm trying to design a robust architecture, but I'm having trouble solving the message delivery problem.
Let me try to explain
The API would be clustered on ECS receiving a bunch of requests.
The Workers would be clustered too, all subscribing to the same channels. (That is the problem; with only one worker there would be no issue.)
How can I deal with multiple workers while avoiding duplicate messages?
What would be a good, simple approach that keeps many workers occupied?
Thank you.
This sounds like a very fundamental problem for a message broker: one channel has multiple workers subscribed to it, and all of them receive the same message. Processing the same message multiple times would not be useful.
This problem has been addressed in most message brokers (I believe). For example, when you consume a message from an Amazon SQS queue, that message is hidden from other consumers for a set period (the visibility timeout).
Once the worker has processed the message, it has to delete it from the queue. Otherwise, when the timeout expires, other workers will see the message and process it again.
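The receive, hide, delete cycle can be modeled in a few lines. This is an in-memory sketch of the idea only, not the SQS API; the `VisibilityQueue` class and its explicit `now` parameter are hypothetical and exist to make the timing easy to follow.

```python
class VisibilityQueue:
    """In-memory model of a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}   # message id -> [body, invisible_until]
        self._next_id = 0

    def send(self, body):
        self._messages[self._next_id] = [body, 0.0]
        self._next_id += 1

    def receive(self, now):
        """Return the first visible message and hide it from others."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Called by a worker after it has finished processing."""
        self._messages.pop(msg_id, None)
```

A worker that receives a message at time 0 with a timeout of 30 hides it from everyone else until time 30. If the worker has not deleted it by then, the next `receive` hands the same message to another worker.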
SQS in particular has a distributed architecture and sometimes you get duplicate messages in the queue, which are processed by different workers. That's the effect of the at-least-once delivery guarantee that SQS provides.
If your system has to be strict about duplicate messages, then you need to build a de-duplication mechanism around it.
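One common form of such a mechanism is an idempotent consumer that remembers which message IDs it has already handled. The sketch below keeps that record in a local set, which is an assumption made for brevity; with several workers it would need to live in a shared store that supports an atomic "insert if absent" operation. The class and method names are hypothetical.

```python
class IdempotentWorker:
    """Skips messages whose ID has already been processed."""

    def __init__(self, handler):
        self.handler = handler
        self._seen = set()   # assumption: a shared store in a real cluster

    def handle(self, message_id, body):
        """Process the message once; return False for a duplicate."""
        if message_id in self._seen:
            return False
        self.handler(body)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(message_id)
        return True
```

Recording the ID after processing means a crash between the two steps can still cause a repeat, so the handler itself should be safe to run twice where that matters.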
The keywords you are looking for are "exactly once guarantee in a distributed system". With that you can do some research on your own, but here are some pointers.
You could use an event queue system that supports "exactly once" guarantees, for example Apache Pulsar (see this link) or Kafka, or you can use their approach as inspiration for your own implementation (which may be somewhat hard to do).
For your own implementation, you could write a special consumer that is the only consumer and acts as a distributor of worker tasks, and whose job is to guarantee "exactly once". It would be a tradeoff and could prove a bottleneck, depending on your scalability requirements. This article explains why it is a difficult problem in distributed systems.
I have a quick question. Is the KCL able to consume from multiple streams? Should you ever set up multiple streams for your application, or is an individual stream supposed to be tied to an individual application? My particular use case is that I need to consume data being produced from the backend and also from the frontend. One of these produces data at a much greater rate than the other, and for that reason I think they should produce into separate streams for processing. Is there a way to consume both streams from the same KCL process, or do I need to set up two? Thanks for your help!
KCL is an open source project that you can modify to consume events from multiple streams, but this is not recommended. It is better to keep things simpler.
If you have two different event streams, you are better off with two different Kinesis streams, one for each. This allows you to scale each stream independently, as each has a different rate and possibly different peaks.
If you need to share information between the streams, you can share state between them using a database such as DynamoDB or Redis.
Please note that if you have a set of servers sending out these events, you should expect that some back-end events might be processed before the corresponding front-end events. The KCL (or Lambda) code that processes these events can have different processing rates, different failure points and other out-of-sync behavior. Take note of such potential dependencies and exceptions.
I have a Java application that uses an Oracle queue to store messages for later processing by multiple consumer threads. The messages in this queue can be related to each other, and must therefore be processed in a specific order based on the business logic of my application. Basically, I want the dequeuing of one message A to be held back as long as another message B in the queue has not been completely processed. The only tools I see in Oracle AQ for this are the Delay and Priority parameters. These, however, cannot achieve the scenario outlined above, since there are situations where two related messages can still be dequeued and processed at the same time. Are there any tools that can help establish an advanced processing order for messages?
I came to the conclusion that it is not a good idea to order these messages using the queue, because it would need a custom and very specialized dequeue strategy, which smells bad to me in terms of both complexity and, most likely, performance. It also tries to use the queue to fix communication protocol issues that are application specific and should therefore be handled in the application itself. Instead, the application or communication protocol should be tolerant enough to handle ordering issues.
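One way to read "tolerant enough" is that the application parks a message whose prerequisite has not yet been processed and releases it afterwards. The sketch below assumes each message can name the message it depends on through a hypothetical `depends_on` field, and that `receive` is called from a single thread or under a lock.

```python
class OrderTolerantProcessor:
    """Defers a message until the message it depends on is done."""

    def __init__(self, handler):
        self.handler = handler
        self.done = set()     # IDs of fully processed messages
        self.waiting = {}     # prerequisite ID -> [(msg_id, payload), ...]

    def receive(self, msg_id, payload, depends_on=None):
        if depends_on is not None and depends_on not in self.done:
            # Prerequisite not finished yet: park the message.
            self.waiting.setdefault(depends_on, []).append((msg_id, payload))
            return
        self._process(msg_id, payload)

    def _process(self, msg_id, payload):
        self.handler(payload)
        self.done.add(msg_id)
        # Release anything that was waiting on this message.
        for parked_id, parked_payload in self.waiting.pop(msg_id, []):
            self._process(parked_id, parked_payload)
```

If message A (which depends on B) is dequeued first, it is held in `waiting` and handled immediately after B completes, regardless of the order in which the queue delivered them.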
Has anybody thought about implementing strategies for Azure storage queues that would allow dequeuing messages in an arbitrary order (other than first-in, first-out)? For example, some people might be interested in LIFO, some people might want to dequeue "important" messages ahead of less important ones, etc.
Personally, I am interested in implementing a strategy that would allow messages in a multi-tenant system to be dequeued in a way that ensures a large number of messages related to one tenant will not delay messages for other tenants.
I am also interested in other queuing systems that may have implemented similar strategies.
Are there other queuing systems that allow this kind of
What you are looking for is referred to as the Priority Queue pattern, which you can read more about here.
There are a couple of strategies for achieving this. One is to use separate queues for the higher-priority messages or, in your case, a queue for each customer.
Another approach, and the one I would prefer for your scenario, is to use Service Bus Topics and Subscriptions (pub/sub basically).
Both of these are discussed in more detail in the link provided above.
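The queue-per-customer idea can be sketched in memory as a round-robin over per-tenant queues. This is an illustration of the scheduling logic only, not of the Azure storage queue or Service Bus APIs, and the `TenantFairQueue` class is a hypothetical name.

```python
from collections import OrderedDict, deque


class TenantFairQueue:
    """Round-robin dequeue across one FIFO queue per tenant."""

    def __init__(self):
        self._queues = OrderedDict()   # tenant -> deque of messages

    def enqueue(self, tenant, message):
        self._queues.setdefault(tenant, deque()).append(message)

    def dequeue(self):
        """Take one message from the tenant at the front of the rotation."""
        if not self._queues:
            return None
        tenant, q = next(iter(self._queues.items()))
        message = q.popleft()
        if q:
            # Tenant still has work: send it to the back of the rotation.
            self._queues.move_to_end(tenant)
        else:
            del self._queues[tenant]
        return tenant, message
```

A tenant with a large backlog gets one message per rotation, so a single message from another tenant waits behind at most one message from each busy tenant rather than behind the whole backlog.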
The priority queue pattern is the way to go. Use different queues for different message priorities. You can also assign appropriate numbers of workers to each queue to drain at an appropriate rate.