Is there any way to associate a time threshold with a message in Kafka?
For example:
A consumer pulls a message out of Kafka, but the system does not yet have enough information to process it. So I put the message back in a "resolver" queue, but I do not want to pull it out of the "resolver" queue for the next 15 minutes. Is there any way I can achieve that?
No, there is no way to achieve that with just Kafka. Kafka is designed to store messages for a certain period of time or until the log reaches a certain size. Consumers read messages from Kafka by offset. The offset marks how far a consumer has read: messages before the offset have already been read, and messages after it are yet to be read.
Kafka records have carried a timestamp since version 0.10, and a consumer can look up the offset that corresponds to a given timestamp, but that only tells you where to start reading. In the documentation and in several months of working with Kafka, I have never encountered a way to make the broker hold a message back until a certain time has passed.
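Since Kafka itself will not hold a message back, one application-level workaround is to store a "not before" time in the message and let the resolver consumer check it. The following is only a minimal sketch in plain Python, with no Kafka client involved; `ResolverMessage`, `not_before`, `requeue`, and `seconds_until_due` are hypothetical names for illustration, not part of any Kafka API.

```python
from dataclasses import dataclass

DELAY_SECONDS = 15 * 60  # the 15-minute hold from the question


@dataclass
class ResolverMessage:
    payload: str
    not_before: float  # epoch seconds; earliest time the message may be processed


def requeue(payload: str, now: float, delay: float = DELAY_SECONDS) -> ResolverMessage:
    """Wrap a payload for the resolver topic, stamping when it becomes eligible."""
    return ResolverMessage(payload=payload, not_before=now + delay)


def seconds_until_due(message: ResolverMessage, now: float) -> float:
    """Return how long the consumer should still wait; 0.0 means process now."""
    return max(0.0, message.not_before - now)
```

A resolver consumer could call `seconds_until_due` on each message it reads and wait that long before processing it. This assumes every message in the resolver topic uses the same delay, so that messages become due in roughly the order they were written.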
Is there a way to get the count of events in Azure Event Hubs for a particular time period? I need the count of events that arrive within an hour.
No, there is no way to get it as of now. According to the docs, Event Hubs is a high-throughput, low-latency, durable stream of events on Azure, so a count may not be correct at any given point in time. Unlike queues, there is no concept of queue length in Azure Event Hubs, because the data is processed as a stream.
I'm not sure that the context of this question is correct, as a consumer group is just a logical grouping of those reading from an Event Hub and nothing gets published to it. For the remainder of my thoughts, I'm assuming that the nomenclature was a mistake, and what you're interested in is understanding what events were published to a partition of the Event Hub.
There's no available metric or report that I'm aware of that would surface the requested information. However, assuming that you know the time range that you're interested in, you can write a utility to compute this:
1. Connect one of the consumer types from the Event Hubs SDK of your choice to the partition, using FromEnqueuedTime or the equivalent as your starting position.
2. For each event read, inspect the EnqueuedTime or equivalent and compare it to the window that you're interested in.
3. If the event was enqueued within your window, increase your count and repeat, starting at #2. If the event was later than your window, you're done.
The count that you've accumulated would be the number of events that the Event Hubs broker received during that time interval.
Note: If you're doing this across partitions, you'll want to read each partition individually and count. The ReadEvents method on the consumer does not guarantee ordering or fairness; it would be difficult to have a deterministic spot to decide that you're done reading.
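The counting steps above can be sketched as follows. This is only an illustration of the loop in plain Python, not Event Hubs SDK code: `Event` and `count_in_window` are hypothetical names, and the events are assumed to come from a single partition in enqueue order, starting at or before the beginning of the window.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    body: str
    enqueued_time: datetime


def count_in_window(events: Iterable[Event], start: datetime, end: datetime) -> int:
    """Count events with start <= enqueued_time < end.

    Assumes `events` comes from one partition, in enqueue order,
    so reading can stop at the first event at or past `end`.
    """
    count = 0
    for event in events:
        if event.enqueued_time >= end:
            break  # later than the window: done
        if event.enqueued_time >= start:
            count += 1
    return count
```

For several partitions, one would run this once per partition and add up the results, which matches the advice to read each partition individually.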
Assuming you really just need to know "how many events did my Event Hub receive in a particular hour", you can take a look at the metrics published by the Event Hub. Be aware that, as mentioned in the other answers, the count might not be 100% accurate given the nature of the system.
Take a look at the Incoming Messages metric. If you take this for any given hour, it will give you the count of messages that were received during this time period. You can split the namespace metric by EventHub, and every consumer group will receive every single message, so you should be fine.
This is an example of how it can look in the UI, though you should also be able to export it to a log analytics workspace.
I am using Spark 2.3 Structured Streaming to read messages from Kafka and write them to Postgres (using Scala). My application is supposed to be long-lived, and it should be able to handle any failure without human intervention.
I have been looking for ways to catch unexpected errors in Structured Streaming, and I found this example here:
Spark Structured Streaming exception handling
This way it is possible to catch all errors that are thrown in the stream, but the problem is that when the application tries again, it gets stuck on the same exception.
Is there a way in Structured Streaming to handle the error and tell Spark to increment the offset in the "checkpointLocation" programmatically, so that it proceeds to consume the next message without getting stuck?
In the streaming event processing world, this is called handling a "poison pill".
Please have a look at the following link:
https://www.waitingforcode.com/apache-spark-structured-streaming/corrupted-records-poison-pill-records-apache-spark-structured-streaming/read
It suggests several ways to handle this type of scenario:
Strategy 1: let it crash
The Streaming application will log a poison pill message and stop the processing. It's not a big deal because thanks to the checkpointed offsets we'll be able to reprocess the data and handle it accordingly, maybe with a try-catch block.
However, as you already saw in your question, it's not a good practice in streaming systems because the consumer stops and during that idle period it accumulates the lag (the producer continues to generate data).
Strategy 2: ignore errors
If you don't want downtime of your consumer, you can simply skip the corrupted events. In Structured Streaming this comes down to filtering out null records, or records that raise an error, and optionally logging the unparseable messages for further investigation.
Strategy 3: Dead Letter Queue
We ignore the errors, but instead of logging them, we dispatch them to another data store.
Strategy 4: sentinel value
You can use a pattern called Sentinel Value, and it can be freely combined with a Dead Letter Queue.
A Sentinel Value is a unique value that is returned every time something goes wrong.
So in your case, whenever a record cannot be converted to the structure we're processing, you will emit a common object.
For code samples, look inside the link.
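To make strategies 3 and 4 concrete, here is a minimal sketch in plain Python rather than Spark code. It is not taken from the linked article: `INVALID`, `parse`, and `split_batch` are hypothetical names, and the records are assumed to be JSON strings that should decode to objects.

```python
import json
from typing import Any, Dict, List, Tuple

# Sentinel: one shared object returned whenever a record cannot be parsed.
INVALID: Dict[str, Any] = {"__invalid__": True}


def parse(raw: str) -> Dict[str, Any]:
    """Parse a raw JSON record, returning the sentinel instead of raising."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return INVALID
    # Only JSON objects count as valid records in this sketch.
    return value if isinstance(value, dict) else INVALID


def split_batch(raw_records: List[str]) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Return (valid records, dead letters); dead letters keep the raw input."""
    valid: List[Dict[str, Any]] = []
    dead_letters: List[str] = []
    for raw in raw_records:
        parsed = parse(raw)
        if parsed is INVALID:  # identity check against the shared sentinel
            dead_letters.append(raw)
        else:
            valid.append(parsed)
    return valid, dead_letters
```

The valid records continue through the pipeline, while the dead letters would be written to a separate store for later inspection, so a single bad record never stops the stream.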
Is there a way in Spark to handle data that arrives past the watermark?
Consider a use case of devices that send messages, and those messages need to be processed with Kafka + Spark. While 99% of messages are delivered to the Spark server within, let us say, 10 minutes, occasionally a device may leave the connectivity zone for a day or a week, buffer messages internally, and then deliver them a week later once the connection is restored.
The watermark interval necessarily has to be fairly limited, because (1) results in the mainline case have to be produced in a timely manner, and (2) buffering space inside Spark is limited too, so Spark cannot keep a week's worth of messages for all the devices buffered in a week-long watermark window.
In a regular Spark streaming construct, messages past the watermark are discarded.
Is there a way to intercept those "very late" messages and route them to a handler or a separate stream, meaning only those "rejected" messages that do not fall within the watermark?
No, there is not. Apache Flink can handle such things I seem to remember. Spark has no feed for dropped data.
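To make the notion of "past the watermark" concrete, here is a simplified sketch in plain Python. It is not Spark API and does not reproduce Spark's exact behavior (Spark advances the watermark between micro-batches, not per record): `split_by_watermark` is a hypothetical helper that treats the watermark as the largest event time seen so far minus the allowed delay. An application could apply similar logic itself before the data reaches the stateful operator, but that is an assumption, not a documented Spark feature.

```python
from typing import Iterable, List, Tuple


def split_by_watermark(
    event_times: Iterable[float], delay: float
) -> Tuple[List[float], List[float]]:
    """Split event times, in arrival order, into (on_time, late).

    A record is late when its event time is older than
    (largest event time seen so far - delay).
    """
    on_time: List[float] = []
    late: List[float] = []
    max_seen = float("-inf")
    for t in event_times:
        watermark = max_seen - delay
        if t < watermark:
            late.append(t)  # would be dropped by a watermark-based operator
        else:
            on_time.append(t)
            max_seen = max(max_seen, t)
    return on_time, late
```

The `late` list corresponds to the "very late" messages from the question, which could then be routed to a separate handler.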
Hi, I have an Event Hub with two consumer groups.
Many devices are sending data to my Event Hub, and I want to save all messages to my database.
Because data is sent by multiple devices, data ingress is very high, so in order to process those messages I have written an EventHub Trigger WebJob that processes each message and saves it to the database.
But saving these messages to my database is a time-consuming task; in other words, the receiver is slower than the sender.
So is there any way to process these messages faster, for example by creating multiple receivers?
I have created two event receivers with different consumer groups, but I found that the same message is processed by both trigger functions, so duplicate data is now being saved in my database.
So please help me understand how I can create multiple receivers that process unique messages in parallel.
Please help me, guys...
Creating multiple consumer groups won't help you out as you found out yourself. Different consumer groups all read the same data, but they can have their own speed.
In order to increase the speed of processing there are just 2 options:
1. Make the process/code itself faster, so try to optimize the code that is saving the data to the database.
2. Increase the number of partitions so that more consumers can read the data in parallel, each from its own partition. This means, however, that you will have to recreate the Event Hub, as you cannot increase/decrease the partition count after the Event Hub is created. See the docs for guidance.
About 2: The number of concurrent data consumers is equal to the number of partitions created. For example, if you have 4 partitions, you can have up to 4 concurrent data readers processing the data.
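The relationship between partitions and concurrent readers can be illustrated with a small sketch in plain Python. This is not how the Event Hubs SDK balances load internally; `assign_partitions` is a hypothetical helper that simply spreads partitions round-robin over consumers to show why extra consumers beyond the partition count stay idle.

```python
from typing import Dict, List


def assign_partitions(partition_count: int, consumer_count: int) -> Dict[int, List[int]]:
    """Spread partitions round-robin over consumers; keys are consumer indexes."""
    if consumer_count < 1:
        raise ValueError("consumer_count must be at least 1")
    assignment: Dict[int, List[int]] = {c: [] for c in range(consumer_count)}
    for partition in range(partition_count):
        # Each partition is read by exactly one consumer.
        assignment[partition % consumer_count].append(partition)
    return assignment
```

With 4 partitions and 2 consumers, each consumer reads 2 partitions. With 4 partitions and 6 consumers, two consumers end up with an empty list, so adding them does not increase throughput.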
I do not know your situation, but if you have certain peaks in which the processing is too slow and it catches up during quieter hours, you might be able to live with the current situation. If not, you have to do something like what I outlined.
Today I noticed the following warning under High Level Producer in the kafka-node documentation:
⚠️WARNING: Batch multiple messages of the same topic/partition together as an array on the messages attribute otherwise you may lose messages!
What does it mean? Does it mean that messages may get lost if I frequently write data to the same topic/partition without batching?