I have a quick question. Is the KCL able to consume from multiple streams? Should you ever set up multiple streams for your application, or is an individual stream supposed to be tied to an individual application? My particular use case is that I need to consume data produced by both the backend and the frontend. One of these produces data at a much greater rate than the other, and for that reason I think they should write to separate streams for processing. Is there a way to consume both streams from the same KCL process, or do I need to set up two? Thanks for your help!
KCL is an open source project that you can modify to consume events from multiple streams, but this is not recommended. It is better to keep things simpler.
If you have two different event streams, you are better off with two separate Kinesis streams, one for each. This allows you to scale each stream independently, as each has a different rate and possibly different peaks.
If you need to share information between the streams, you can share state variables between them using a database such as DynamoDB or Redis.
Please note that if you have a set of servers sending out these events, you should expect that some of the back-end events might be processed before the front-end events. The KCL (or Lambda) code that processes these events can have different processing rates, different failure points, and other out-of-sync behavior. Take note of such potential dependencies and exceptions.
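To make the shared-state idea concrete, here is a minimal sketch. It is not KCL code: SharedState is a hypothetical in-memory stand-in for DynamoDB or Redis, and the record fields (order_id, status, page) are made up for illustration. The point is only that each stream has its own processor, and the front-end processor tolerates the back-end record not having arrived yet.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical in-memory stand-in for an external store (DynamoDB, Redis)."""
    data: dict = field(default_factory=dict)

    def put(self, key, value):
        self.data[key] = value

    def get(self, key, default=None):
        return self.data.get(key, default)


def process_backend(record, state):
    # The back-end consumer stores the order status under the order id.
    state.put(("order", record["order_id"]), record["status"])


def process_frontend(record, state):
    # The front-end consumer may run before the matching back-end record
    # has been processed, so a missing entry is treated as "unknown".
    status = state.get(("order", record["order_id"]), "unknown")
    return {"order_id": record["order_id"], "page": record["page"], "status": status}


state = SharedState()
early = process_frontend({"order_id": 7, "page": "/cart"}, state)
process_backend({"order_id": 7, "status": "paid"}, state)
late = process_frontend({"order_id": 7, "page": "/receipt"}, state)
```

Here early carries the status "unknown" because the back-end event had not been processed yet, while late sees "paid". With a real store you would also have to decide what to do with the early record (retry, park it, or enrich it later).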
I'm trying to design a robust architecture, but I'm having trouble solving message delivery.
Let me try to explain
The API would be clustered on ECS receiving a bunch of requests.
The workers would be clustered too, all subscribing to the same channels. (That's the problem: with only one worker there wouldn't be any issue.)
How do I deal with multiple workers while avoiding duplicated messages?
What would be a good, simple approach that keeps many workers occupied?
Thank you.
This sounds like a very fundamental problem for a message broker: one channel has multiple workers subscribed to it, and all of them receive the same message. It wouldn't really be useful to process the same message multiple times.
This problem has been addressed in most message brokers (I believe). For example, when you consume a message from an Amazon SQS queue, that message is not visible to other consumers for a particular timeframe (the visibility timeout).
When the worker has processed the message, it has to delete it from the queue. Otherwise, once the timeout expires, other workers will see the message and process it again.
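The receive / hide / delete cycle can be sketched with a toy in-memory queue. This is not the SQS API; VisibilityQueue and its methods are hypothetical names, and time is passed in explicitly as a number so the behavior is easy to follow.

```python
class VisibilityQueue:
    """Toy queue that hides a received message for `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.messages = {}      # message id -> body
        self.hidden_until = {}  # message id -> time at which it becomes visible again

    def send(self, msg_id, body):
        self.messages[msg_id] = body
        self.hidden_until[msg_id] = 0

    def receive(self, now):
        # Return the first visible message and hide it from other consumers.
        for msg_id, body in self.messages.items():
            if self.hidden_until[msg_id] <= now:
                self.hidden_until[msg_id] = now + self.timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        # A worker calls this once it has finished processing the message.
        self.messages.pop(msg_id, None)
        self.hidden_until.pop(msg_id, None)


q = VisibilityQueue(timeout=30)
q.send("m1", "resize image")
first = q.receive(now=0)     # worker A receives m1
second = q.receive(now=10)   # worker B sees nothing: m1 is hidden
third = q.receive(now=30)    # A never deleted m1, so it is visible again
q.delete("m1")
fourth = q.receive(now=100)  # nothing left after the delete
```

The third receive is exactly the duplicate-processing case: a worker that is slower than the timeout, or that crashes before deleting, causes the message to be handed out again.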
SQS in particular has a distributed architecture and sometimes you get duplicate messages in the queue, which are processed by different workers. That's the effect of the at-least-once delivery guarantee that SQS provides.
If your system has to be strict about duplicate messages, then you need to build a de-duplication mechanism around it.
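One common shape for such a de-duplication mechanism is an idempotent consumer that remembers the ids it has already handled. The sketch below is only an illustration: DedupingWorker is a hypothetical name, and the in-memory set would have to be replaced by a shared, durable store with an atomic check-and-set once more than one worker process is involved.

```python
class DedupingWorker:
    """Skips messages whose id has already been processed."""

    def __init__(self, handler):
        self.handler = handler
        # Assumption: a plain set is enough for a single process.
        # Multiple workers need a shared store with atomic insert.
        self.seen_ids = set()

    def handle(self, msg_id, body):
        if msg_id in self.seen_ids:
            return False  # duplicate delivery, ignored
        self.handler(body)
        self.seen_ids.add(msg_id)
        return True


processed = []
worker = DedupingWorker(processed.append)
results = [
    worker.handle("m1", "charge card"),
    worker.handle("m2", "send email"),
    worker.handle("m1", "charge card"),  # redelivered by the queue
]
```

After this runs, processed contains each body once, and the third call returns False. Note that the id is recorded only after the handler succeeds, so a handler failure leaves the message eligible for a retry.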
The keywords you are looking for are "exactly once guarantee in a distributed system". With that you can do some research on your own, but here are some pointers.
You could use an event queue system that supports "exactly once" guarantees, for example Apache Pulsar (see this link) or Kafka, or you can use their approach as inspiration for your own implementation (which may be somewhat hard to do).
For your own implementation, you could write a special consumer that is the only consumer and acts as a distributor of tasks to the workers, and whose job it is to guarantee "exactly once". It would be a tradeoff and could prove to be a bottleneck, depending on your scalability requirements. This article explains why it is a difficult problem in distributed systems.
I have developed a Node SDK that exposes certain REST APIs. These APIs interact with a blockchain framework for read and write operations.
There could be situations when many requests hit the Node SDK at once.
So for load balancing I have used NGINX, with one more replica of the SDK on another instance. This all works well.
It has been suggested that I use RabbitMQ for load balancing as well. But my API has only a few straightforward read and write operations, and no heavy processing is done.
I read that RabbitMQ should be used for the purposes below.
Integrating multiple microservices
Executing heavy task such as image processing,image uploading etc.
So how and when should I use RabbitMQ?
I think your design is OK. Simply put, your system had to manage more load, so you added more replicas of your services, with a load balancer in front that distributes incoming load between the replicas. If your "sdk" is purely stateless (it doesn't remember client data collected from previous requests, but delegates all state to a DB/BC), you've done your job. A message queuing technology can help in other scenarios:
when your application does things in a pure asynchronous fashion
when you have to manage big spikes of load
when some of your architecture components react to events (e.g. receiving an alarm from a device, sending an email when you reach your one millionth click, etc.)
when you're into event sourcing
when in some way there are stateful services that consume data from the same batch of requests (eg all data from user with id 1sw023)
and various other possible cases
Adopting MQs has a big impact and takes some effort to integrate and manage. Don't do it unless you are sure you will take full advantage of the benefits.
RabbitMQ is a message queue. It's useful when your application is receiving more requests than it can handle simultaneously.
The way it works is that the queue stores the incoming messages until they are processed by worker nodes (for example, your SDK). The worker nodes typically do some work (usually heavy processing), and when they are done, they pull a new message from the queue, process it, and so on.
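The pull-when-free pattern described above can be sketched in a single process with the standard library. This is not RabbitMQ client code: queue.Queue stands in for the broker, threads stand in for worker nodes, and run_workers and handler are hypothetical names.

```python
import queue
import threading


def run_workers(tasks, worker_count, handler):
    """Process every task exactly once using `worker_count` worker threads."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Each task is handed to exactly one worker.
                task = task_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = handler(task)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


squares = run_workers(range(10), worker_count=3, handler=lambda n: n * n)
```

The order of squares depends on thread scheduling, but every task appears exactly once. A faster worker simply pulls more tasks, which is how a queue evens out load without a separate balancer.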
In your case, you might need it if you see that your blockchain is rejecting a lot of messages (for example, because there were too many requests at once and the blockchain couldn't reach consensus quickly enough).
I am using hazelcast jet 0.6.1 for real time analysis. There are multiple streams (mostly from remote journal) coming from different sources.
I would like to know whether a full join between multiple streams is supported.
If so, could you please point me to some links or examples of a full join between multiple streams?
I think you need to elaborate a bit more on what you are trying to do. Streams are theoretically infinite, so the term "full join" has to mean something different than it does in a database.
There are several types of joins available in Jet. As Can said above, there is a merge operator, but you might be thinking more of a windowed join, where you bound the join to a time period.
Merge Streams is here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
Window Concepts are here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#unbounded-stream-processing
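To illustrate what time-bounding a join means, here is a small sketch of an inner join over tumbling windows. It is plain Python over finite lists, not the Jet API: events are assumed to be (timestamp, key, value) tuples, and tumbling_window_join is a hypothetical helper. Two events are joined only if they share a key and fall into the same window.

```python
from collections import defaultdict


def tumbling_window_join(left, right, window_size):
    """Inner-join (timestamp, key, value) events sharing a key and a window."""
    # (window index, key) -> (values from left, values from right)
    buckets = defaultdict(lambda: ([], []))
    for ts, key, value in left:
        buckets[(ts // window_size, key)][0].append(value)
    for ts, key, value in right:
        buckets[(ts // window_size, key)][1].append(value)

    joined = []
    for (window, key), (left_values, right_values) in sorted(buckets.items()):
        for lv in left_values:
            for rv in right_values:
                # Emit the window start time together with the joined pair.
                joined.append((window * window_size, key, lv, rv))
    return joined


clicks = [(1, "u1", "click"), (12, "u1", "click2")]
orders = [(3, "u1", "purchase"), (25, "u1", "refund")]
pairs = tumbling_window_join(clicks, orders, window_size=10)
```

With a window of 10, only the click at time 1 and the purchase at time 3 land in the same window, so pairs holds a single joined row. A streaming engine does the same bucketing incrementally and discards a window's state once the window closes, which is what keeps the join bounded on an infinite stream.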
*This is in response to the comment on the first answer. It's too large for another comment, and I think the first answer is still relevant.
Is this the same data and data type, just from different nodes? Like app servers for a microservices architecture? It seems to me that you have a few options here that really come down to preferred overall architecture, especially about how you want to transport the events. A couple thoughts:
You can simply merge streams from different data sources if that fits the use case:
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
If this is homogeneous data, just distributed across app servers, it might be a case where you use the Hazelcast client on each app server to put events into an IMap (which is shared by all the app servers) with an Event Journal on a Hazelcast cluster. Then Jet just receives all the events from the Event Journal.
See: https://docs.hazelcast.org/docs/latest/manual/html-single/#event-journal
If you have Kafka available, perhaps you create a topic for the events from the servers and Jet receives the events from Kafka. Either way they are already merged when Jet gets them, so they are processed as one stream.
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#kafka
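All three options above end with the processing side seeing a single stream. As a rough picture of what merging means, here is a sketch in plain Python (not Jet code). It assumes each source yields (timestamp, payload) tuples already sorted by timestamp; merge_streams is a hypothetical helper built on heapq.merge, which interleaves the inputs lazily. A real engine's merge operator does not necessarily order events this way, so treat the ordering as an assumption of this sketch.

```python
import heapq


def merge_streams(*streams):
    """Lazily interleave timestamp-sorted streams into one sorted stream."""
    return heapq.merge(*streams, key=lambda event: event[0])


server_a = [(1, "login"), (4, "logout")]
server_b = [(2, "view"), (3, "click")]
merged = list(merge_streams(server_a, server_b))
```

Because heapq.merge is lazy, the same function also works with generators that produce events over time, as long as each individual source stays in timestamp order.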
I am developing a consumer which consumes events from multiple Kinesis streams. I have some questions to understand the best practices.
Should I create one channel per stream? What factors should be considered to decide between "channel per stream" or "one channel for all streams"?
Which channel fits my case better in terms of performance? There are different channel types, such as PollableChannel, SubscribableChannel and DirectChannel.
Thank you
The KinesisMessageDrivenChannelAdapter is an active component: it performs consumption and message sending in the task executor. Therefore there is no need to shift messages to a QueueChannel or an ExecutorChannel; the logic is already asynchronous and involves enough threads on the machine. It is much better not to shift the processing to a separate thread, so that the consumption thread stays busy and does not poll more records from Kinesis into memory.
One KinesisMessageDrivenChannelAdapter can do essentially the same work for several streams as several separate adapters for different streams; the thread capacity of the machine is going to be used either way.
We need different channel adapters in the case of different processing logic, different data types, or different Kinesis Client options. In all other cases a single instance is sufficient.
Can Hazelcast Jet be used for processing of millions of records using multiple clients accessing an event journal and each client would process a portion of the records?
Furthermore, is it possible to accumulate the results processed by different clients?
This is also an architectural question. To fit your aggregation need, you might have clients begin as individual streams, accumulate your aggregates there, then join the streams for common processing. Just as an example.
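The accumulate-then-combine idea can be sketched without any Jet code. In the snippet below, each client is assumed to count the records in its own portion, and a final step sums the partial results; accumulate and combine are hypothetical helper names, and the record values are made up.

```python
from collections import Counter


def accumulate(records):
    """Partial aggregate computed by one client over its portion."""
    return Counter(records)


def combine(partials):
    """Merge the partial aggregates from all clients into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


portions = [
    ["error", "ok", "ok"],   # records handled by client 1
    ["ok", "error"],         # records handled by client 2
    ["ok"],                  # records handled by client 3
]
totals = combine(accumulate(portion) for portion in portions)
```

This works because counting is associative: it does not matter how the records are split between clients. Aggregates such as averages need a little more care, for example carrying a (sum, count) pair per client and dividing only at the end.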
Also, you do have access to the underlying IMDG technology that you can leverage. You have a free hand at how you want to build the overall architecture.