Kafka Consumer not consuming messages from all partitions - node.js

I am noticing something weird happening with my system. So, I am using Kafka to send and receive messages between different systems. I have around 6 or 7 topics each with 10 partitions.
I have an external system that is sending messages on my Kafka topics. This external system will initially send messages to a topic, e.g. "XYZ", and will wait for a response from the Server. Only once the Server reads the message and responds back will the external system continue further.
Now in our scenario, when the external system sends messages to topic "XYZ", it always sends them to partition 6. This happens even after restarting the entire system multiple times. Messages on the XYZ topic are always sent to partition 6.
Now on the Server side, I am using kafka-node to create the clients, consumers and producers that consume messages from and produce messages to Kafka. But in this case, it is not consuming from the topic "XYZ".
As a workaround, I tried to test everything by deleting the topics and creating them again but only with a single partition, and this time it worked fine. The entire system worked without any problem.

and creating them again but only with a single partition, and this time it worked fine
Unclear what this scenario is testing... If you wanted 1 partition, then why did the topics get created with 10?
The only reason, in theory, why this would happen is if you are closing and re-creating the producer instance and it is not correctly seeding the round-robin distribution of the sent events, so it always picks the same value. Or, you have defined a key for your records and it always hashes to partition 6.
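For reference, if the producing side also happens to use kafka-node, this is a minimal sketch of forcing a cyclic (round-robin) partitioner rather than a keyed one; the broker address and payload are placeholders, not the asker's actual setup:

```js
// Sketch only: assumes kafka-node and a broker at localhost:9092.
const kafka = require('kafka-node');

const client = new kafka.KafkaClient({ kafkaHost: 'localhost:9092' });

// partitionerType 2 = cyclic (round-robin). Using the keyed partitioner (3)
// with the same key on every record would instead pin all messages to one
// partition, which would explain everything landing on partition 6.
const producer = new kafka.Producer(client, { partitionerType: 2 });

producer.on('ready', () => {
  producer.send([{ topic: 'XYZ', messages: ['some payload'] }], (err, result) => {
    if (err) console.error(err);
    else console.log(result); // result maps topic -> partition -> offset, so you can see where each batch landed
  });
});

producer.on('error', console.error);
```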
in this case, it is not consuming from the topic "XYZ".
Only one consumer in a consumer group can be active on any partition at a time. If all data ends up in partition 6, then effectively you can only have one consumer... So it sounds like something is reading it, just not the consumer you expect.

Related

How multiple Kafka Consumers in the same consumer group read messages from one partition in the topic?

I would like to know how consumers in the same consumer group read messages from one topic that has only one partition.
For example, I have 3 consumers in one consumer group, and that group is polling messages from Topic A, which has a single partition. If 1000 messages arrive one by one on Topic A, how will they be delivered to the 3 consumers?
Will 3 messages be delivered to the 3 consumers in parallel, and once each consumer has processed its message, another one delivered? Basically, will they receive messages in parallel?
Or will just one consumer fetch all the messages, since there is only one partition?
Please also suggest the best architectural approach for the above scenario.
Thanks,
I want to process multiple messages in parallel, from one topic that has one partition, across 4 consumers.
I am using Kafka with NodeJS microservices and the kafkajs package.
In your scenario, only one consumer of that consumer group will read the data, most probably the first one you started. I'm not 100% sure as I never tried it out, but I assume the additional consumers will just idle without workload.
This question is essentially the same as yours.
If you want to achieve parallelism across consumers, you cannot avoid having multiple partitions; that is the main purpose of the whole partitioning concept.
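As an illustration of that behaviour, here is a minimal kafkajs sketch; the broker address, topic and group names are made up for the example. If 'topic-a' has a single partition, only one of the three instances will ever receive messages and the others stay idle:

```js
// Sketch only: assumes kafkajs and a broker at localhost:9092.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ brokers: ['localhost:9092'] });

async function startConsumer(instanceId) {
  // All instances share the same groupId, so Kafka assigns each one a
  // disjoint subset of the topic's partitions. With one partition, only
  // one instance gets an assignment.
  const consumer = kafka.consumer({ groupId: 'topic-a-workers' });
  await consumer.connect();
  await consumer.subscribe({ topics: ['topic-a'] });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(`instance ${instanceId}, partition ${partition}:`, message.value.toString());
    },
  });
}

// Start 3 instances; useful parallelism is capped by the partition count.
Promise.all([1, 2, 3].map(startConsumer)).catch(console.error);
```

Re-creating the topic with 3 or more partitions is what actually lets all three instances share the load.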

Replaying Messages in Order

I am implementing a consumer that processes messages from a queue where the order of messages is important. I would like to implement a mechanism using NodeJS where:
the consumer function is consuming messages m1, m2, ..., mN from the queue
doing an IO-intensive operation to process each message: m -> m'
storing the result m' in a Redis cache
acknowledging the queue after each message is processed (step 2)
In a different function, I am listening for messages from the cache
sending the processed messages m' to an external system
if the external system was able to process the message, then delete the processed message from the cache
If the external system rejects the processed message, then stop sending messages, discard the unsent processed messages in the cache and reset the offset to the last accepted message in the queue. For example if m12' was the last message accepted by the system, and I have acknowledged m23 from the queue, then I have to discard m13' to m23' and reset the offset so that the consumer can read and start processing from m13 again.
A few assumptions:
The processing of m to m' is intensive, and I am processing the messages optimistically, knowing that most of the time there won't be a failure.
With the current assumptions and goals, is there any way I can achieve this with RabbitMQ or any Azure equivalent? My client would prefer not to use Kafka or any Azure equivalent of Kafka (Azure Event Hubs).
In scenarios where the messages will always be generated in sequence, a simple queue is probably all you need.
Azure Queues are pretty simple to get into, but the general mode of operation for queues is to remove the messages as they are processed successfully.
If you can avoid the scenario where you must "roll back" or re-process from an earlier time, that is, if you can avoid the orchestration aspect, then this would be a much simpler option.
It's the "go back and replay" that you will struggle with. If you can implement two queues in a sequential pattern, where processing messages from one queue successfully pushes the message into the next queue, then we never need to go back, because the secondary consumer can never process ahead of the primary.
With Azure Event Hubs it is much easier to reset the offset for processing, because the messages stay in the bucket regardless of their read state (in fact any given message does not have such a state) and the consumer maintains the offset pointer itself. It also has support for multiple consumer groups, which will make a copy of the message stream available to each consumer group.
You can up your plan to maintain the data for up to 7 days without blowing the budget.
There are two problems with large-scale telemetry ingestion services like Azure Event Hubs for your use case:
The order of receipt is less reliable for messages that arrive extremely close together. The Hub is designed to receive many messages from many sources concurrently, so its internal architecture cares a lot less about trying to preserve the precise order. It records a precise receipt timestamp on each message, but it does not guarantee that the overall sequence of records will match exactly what you would get by sorting on that receipt timestamp (it's a subtle but important distinction).
Event Hubs (and many client processing code examples) are designed around at-least-once delivery across multiple concurrent consuming threads. Consumers are encouraged to be asynchronous, and the service will try to ensure that failed processing attempts are retried by the next available thread.
So you could use Event Hubs, but you would have to bypass or disable a lot of its features, which is generally a strong signal that it is not the correct fit for your purpose. If you want to explore it anyway, you would want to limit the concurrency aspects (see the sketch after this list):
minimise the partition count
You probably want 1 partition for each message producer, or at least for each sequential set; maintaining sequence is simpler inside a single partition
make sure your message sender (producer) only sends to a specific partition
Each producer MUST use a unique partition key
create a consumer group for each of your consumers
process messages one at a time, not in batches
process with a single thread
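Putting those constraints together, a minimal sketch with the @azure/event-hubs package could look like this; the consumer group, connection string and hub name are placeholders, and checkpoint storage is left out for brevity:

```js
// Sketch only: one dedicated consumer group, one known partition, one event
// at a time, so ordering is scoped to a single sequential stream.
const { EventHubConsumerClient, earliestEventPosition } = require('@azure/event-hubs');

const client = new EventHubConsumerClient(
  'sequential-processing-group',   // dedicated consumer group (assumed name)
  'Endpoint=sb://...',             // placeholder connection string
  'my-event-hub'                   // placeholder hub name
);

// Subscribing to an explicit partition id avoids concurrent readers.
const subscription = client.subscribe('0', {
  processEvents: async (events, context) => {
    for (const event of events) {
      // Process strictly in order; record event.sequenceNumber yourself so
      // you can rewind by re-subscribing from that position later.
      console.log(context.partitionId, event.sequenceNumber, event.body);
    }
  },
  processError: async (err) => console.error(err),
}, { startPosition: earliestEventPosition, maxBatchSize: 1 });
```

Rewinding to m13 in the asker's scenario should then be a matter of closing this subscription and re-subscribing with a startPosition based on the sequence number of the last accepted message.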
I have a lot of experience in designing MS Azure based solutions for Industrial IoT (telemetry from PLCs) and Agricultural IoT (Raspberry Pi) device implementations. In almost all cases we think that the order of messaging is important, but unless you are maintaining real-time 2-way command and control, you can usually get away with an optimistic approach where each message and any derivatives are or were correct at the time of transmission.
If there is the remote possibility that a device can be offline for any period of time, then dealing with the stale data flushing through the system when a device comes back online can really play havoc with sequential logic programming.
Take a step back to analyse your solution. Event Hubs does offer a convenient way to roll back the processing to a previous offset, as long as that record is still in the bucket, but can you re-design your logic flow so that you do not have to re-process old data?
What is the requirement that drives this sequence? If it is so important to maintain the sequence, then you should probably process the data with a single consumer that does everything, or look at chaining the queues in a sequential manner.

Understanding Azure Event Hubs partitioned consumer pattern

Azure Event Hub uses the partitioned consumer pattern described in the docs.
I have some problems understanding the consumer side of this model when it comes to a real world scenario.
So let's say I have 1000 messages sent to the event hub, which has 4 partitions, without defining any partition Id. This means the messages will go to all partitions using the round-robin method.
Now I want to have two applications distributing the messages to two different databases. My questions are:
Let's say for the first application, I want to store all messages in Database 1. This means, for maximum speed, in my consumer application I need to have 4 threads (consumers), each listening to one partition of the event hub, right? Each of them also has to store its own offset for the partition it is reading (checkpoint).
Let's say my second application wants to filter the messages and only store a subset of them in Database 2. There I also need 4 consumers, since I don't know which message goes to which partition, right?
Also, for the two applications I need to have two consumer groups, but why? Is the filtering of the messages defined in the consumer group? I don't really get why I need this one, since the applications' consumers store the partition checkpoints by themselves and I can do the filtering within the applications themselves.
I know there is the EventProcessorHost class but I want to understand the concept of the EventHub on a lower level.
Let's say for the first application, I want to store all messages in Database 1. This means, for maximum speed, in my consumer application I need to have 4 threads (consumers), each listening to one partition of the event hub, right? Each of them also has to store its own offset for the partition it is reading (checkpoint).
Correct, you should have a process per provisioned partition. So, if you have 4 partitions you should have 4 processes, each processing the messages of a specific partition. If you process the messages using an EventProcessorHost it will take care of spinning up the processes for you.
Let's say my second application wants to filter the messages and only store a subset of them in Database 2. There I also need 4 consumers, since I don't know which message goes to which partition, right?
What do you mean by a consumer? You need another 4 processes to process the messages, but they should be configured to read using a different consumer group. Otherwise they will compete with the processes of the first application.
Also, for the two applications I need to have two consumer groups, but why? Is the filtering of the messages defined in the consumer group? I don't really get why I need this one, since the applications' consumers store the partition checkpoints by themselves and I can do the filtering within the applications themselves.
Let us define a consumer group:
Consumer groups enable multiple consuming applications to each have a separate view of the incoming message stream, and to read the stream independently at its own pace with its own offset
So yes, you need 2 different consumer groups.
Each consumer group will get all messages sent to the event hub partitions. Each consumer group tracks its own progress in the stream of messages. That is why you need two for your scenario.
Say you define an additional consumer group called "App2-Consumer-Group"; its reader processes will receive all messages but should take no action on messages they are not interested in.
If you did not create an additional consumer group, the reader processes for the default consumer group would process the messages for the first application and mark them as processed using the check-pointing mechanism. The reader processes for the second application wouldn't get any messages since they are already marked as processed. (In real life, when using one consumer group, some messages might be picked up by the reader processes for the first application and some by the reader processes for the second application, as the processes will try to get a lock on a specific partition.)
A picture of each consumer group tracking its own offset in the stream of messages makes it clear why you need two of them when you have two different processing pipelines for the two different applications.
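To make that concrete, here is a hedged sketch using the @azure/event-hubs package; the consumer group names, connection details and the database helpers are all invented for the example:

```js
// Sketch only: two independent applications, each with its own consumer group,
// reading the same event hub without interfering with each other's offsets.
const { EventHubConsumerClient } = require('@azure/event-hubs');

const connectionString = 'Endpoint=sb://...'; // placeholder
const eventHubName = 'my-hub';                // placeholder

// Application 1: stores every message.
const app1 = new EventHubConsumerClient('app1-consumer-group', connectionString, eventHubName);
app1.subscribe({
  processEvents: async (events) => {
    for (const e of events) await saveToDatabase1(e.body); // hypothetical helper
  },
  processError: async (err) => console.error('app1', err),
});

// Application 2: same stream, separate consumer group, so it keeps its own
// position in the stream. It filters and stores only a subset.
const app2 = new EventHubConsumerClient('app2-consumer-group', connectionString, eventHubName);
app2.subscribe({
  processEvents: async (events) => {
    for (const e of events) {
      if (isInteresting(e.body)) await saveToDatabase2(e.body); // hypothetical helpers
    }
  },
  processError: async (err) => console.error('app2', err),
});
```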

what is best practice to consume messages from multiple kafka topics?

I need to consume messages from different Kafka topics.
Should I create a separate consumer instance per topic and then start a processing thread per partition,
or
should I subscribe to all topics from a single consumer instance and then start different processing threads?
Thanks & regards,
Megha
The only rule is that you have to account for what Kafka does and does not guarantee:
Kafka only guarantees message order within a single topic/partition. Edit: this also means you can get messages out of order if your single-topic Consumer switches partitions for some reason.
When you subscribe to multiple topics with a single Consumer, that Consumer is assigned a topic/partition pair for each requested topic.
That means the order of incoming messages for any one topic will be correct, but you cannot guarantee that ordering between topics will be chronological.
You also can't guarantee that you will get messages from any particular subscribed topic in any given period of time.
I recently had a bug because my application subscribed to many topics with a single Consumer. Each topic was a live feed of images at one image per message. Since all the topics always had new images, each poll() was only returning images from the first topic to register.
If processing all messages is important, you'll need to be certain that each Consumer can process messages from all of its subscribed topics faster than the messages are created. If it can't, you'll either need more Consumers committing reads in the same group, or you'll have to be OK with the fact that some messages may never be processed.
Obviously one Consumer/topic is the simplest, but it does add some overhead to have the additional Consumers. You'll have to determine whether that's important based on your needs.
The only way to correctly answer your question is to evaluate your application's specific requirements and capabilities, and build something that works within those and within Kafka's limitations.
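For a Node.js consumer, the two options might look roughly like this with kafkajs (the question is library-agnostic, so the broker address, topic names and the handle() dispatcher are assumptions for the sketch):

```js
// Sketch only: assumes kafkajs v2 and a broker at localhost:9092.
const { Kafka } = require('kafkajs');
const kafka = new Kafka({ brokers: ['localhost:9092'] });

// Option A: one consumer subscribed to several topics. Ordering is only
// guaranteed per topic-partition, and a busy topic can dominate each fetch.
async function singleConsumerForAllTopics() {
  const consumer = kafka.consumer({ groupId: 'all-topics-group' });
  await consumer.connect();
  await consumer.subscribe({ topics: ['topic-a', 'topic-b', 'topic-c'] });
  await consumer.run({
    eachMessage: async ({ topic, message }) => handle(topic, message), // hypothetical dispatcher
  });
}

// Option B: one consumer (and group) per topic, so a slow or noisy topic
// cannot starve the others, at the cost of extra connections and overhead.
async function consumerPerTopic(topic) {
  const consumer = kafka.consumer({ groupId: `${topic}-group` });
  await consumer.connect();
  await consumer.subscribe({ topics: [topic] });
  await consumer.run({
    eachMessage: async ({ message }) => handle(topic, message),
  });
}
```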
This really depends on the logic of your application - does it need to see all messages together in one place, or not? Sometimes, consumption from a single topic can be easier to implement in terms of the business logic of your application.

Azure service bus - Topic full

I have a process (Process A) that keeps sending events to an ASB topic. There are multiple consumers of the topic and therefore multiple subscriptions. So let's say that one of the consumers' processes is down. Due to this, the topic gets full as the messages are not consumed. Does this mean Process A also fails, since it is not able to send messages to the ASB topic because it is full?
Two more things to check:
Make sure that your dead-letter queue is not full; it counts towards the size of the entity.
Make sure that you have at least one subscription that works for each message. For example, if you send a message with ID=1, but you only have a subscription with ID=2, the messages will get backed up.
I think you are correct; once the limit is reached, the topic stops accepting new messages.
However, with partitioning (using all 16 partitions * 5 GB), you can store up to 80 GB:
https://azure.microsoft.com/en-us/blog/partitioned-service-bus-queues-and-topics/
Another solution is to use auto-forwarding, so the topic forwards all messages to another queue/topic:
https://azure.microsoft.com/en-us/documentation/articles/service-bus-auto-forwarding/
This way each subscriber can have its own queue of 5 GB (or 80 GB if you use partitioning).
Some more info:
https://azure.microsoft.com/nl-nl/documentation/articles/service-bus-azure-and-service-bus-queues-compared-contrasted/
https://azure.microsoft.com/en-us/documentation/articles/service-bus-quotas/
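As a hedged sketch of that auto-forwarding setup with the @azure/service-bus package (the entity names and connection string are placeholders):

```js
// Sketch only: give a subscription its own forward-to queue so one slow
// consumer's backlog does not fill the shared topic.
const { ServiceBusAdministrationClient } = require('@azure/service-bus');

async function setUpForwarding() {
  const admin = new ServiceBusAdministrationClient('Endpoint=sb://...'); // placeholder

  // A dedicated queue for the slow consumer, with partitioning enabled to
  // raise its size quota.
  await admin.createQueue('consumer-b-buffer', { enablePartitioning: true });

  // Messages delivered to this subscription are immediately forwarded into
  // the dedicated queue instead of accumulating on the topic.
  await admin.createSubscription('events-topic', 'consumer-b', {
    forwardTo: 'consumer-b-buffer',
  });
}

setUpForwarding().catch(console.error);
```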
