Check messages in each 10 min intervals kafka - Nodejs - node.js

Consumer should check messages at each 10 min intervals this time response message should contains from uncommitted offset,
Currently messages getting once producer send message

That's not really how a Kafka Consumer works. Usually, you have an infinite loop and just take whatever messages are given to you. Unless you're changing the group.id and not committing offsets between requests, you'll always get the next batch of messages.
If you want to add some max consumption limit, followed by a 10 minutes to sleep a thread within that loop, then that's an implementation detail of your application, but not specific to Kafka

Related

Kafka consumer that is spun-up and torn down misses message

I have an application using kafka and taking advantage of two separate consumer groups listening to one topic where one consumer group (C1) is always listening for messages and the other consumer group (C2) comes online and starts listening for messages then goes offline again for some time.
More specifically, the code that is always listening on consumer group C1 responds to a message by creating a virtual machine that starts listening on C2 and does some work using costly hardware.
The problem I'm running into is that after the virtual machine is spun up and listening on consumer group C2 commences it will sometimes receive nothing, despite the fact that it should be receiving the same message that C1 received causing C2 to be listened on in the first place.
I'm using the following topic, producer, and consumer configs:
topic config:
partitions: 6
compression.type: producer
leader.replication.throttled.replicas: --
message.downconversion.enable: true
min.insync.replicas: 2
segment.jitter.ms: 0
cleanup.policy: delete
flush.ms: 9223372036854775807
follower.replication.throttled.replicas: --
segment.bytes: 104857600
retention.ms: 604800000
flush.messages: 9223372036854775807
message.format.version: 3.0-IV1
max.compaction.lag.ms: 9223372036854775807
file.delete.delay.ms: 60000
max.message.bytes: 8388608
min.compaction.lag.ms: 0
message.timestamp.type: CreateTime
preallocate: false
min.cleanable.dirty.ratio: 0.5
index.interval.bytes: 4096
unclean.leader.election.enable: false
retention.bytes: -1
delete.retention.ms: 86400000
segment.ms: 604800000
message.timestamp.difference.max.ms: 9223372036854775807
segment.index.bytes: 10485760
producer config:
("message.max.bytes", "20971520")
("queue.buffering.max.ms", "0")
consumer config:
("enable.partition.eof", "false")
("session.timeout.ms", "6000")
("enable.auto.commit", "true")
("auto.commit.interval.ms", "5000")
("enable.auto.of.store", "true")
The bug is intermittent. Sometimes it occurs, sometimes it doesn't and resending the exact same message after the consumer is up and listening on C2 always succeeds, so it isn't some issue like the message size being too large for the topic or anything like that.
I suspect it's related to offsets being committed/stored improperly. My topic configuration uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow and thus the new message that triggered C2's listening is being missed since it isn't the "latest" by kafka's accounting. The work done by the code listening on consumer group C2 is quite long-running and the consumer often reports a timeout, so maybe that's contributing?
EDIT: The timeout error I get is exactly:
WARN - librdkafka - librdkafka: MAXPOLL [thrd:main]: Application maximum poll interval (300000ms) exceeded by 424ms (adjust max.poll.interval.ms for long-running message processing): l
eaving group
I am using the Rust rdkafka library for both the producer and consumer with confluent cloud's hosted kafka.
uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow
That has nothing to do with committed values. Only where you start reading for a unique group id.
You have auto commits enabled, but you're getting errors. Therefore offsets are getting committed, but you're not successfully processing data. That's why there's skips.
Your error,
maximum poll interval (300000ms) exceeded by 424ms
Without seeing your consumer code, you'll need to do "slightly less" within your poll function. For example, removing a log line could reduce half a second, easily, assuming log statement takes 1ms and you're pooling 500 records each time.
Otherwise, increasing max.poll.interval.ms (allow consumer heartbeat to wait longer) or reducing max.poll.records (process less data per heartbeat, but poll more frequently) is the correct response to this error.

Would SQS batch size max limit result in slower processing through Lambdas?

I'm aware that AWS has allowed SQS to be one of the event source mappings for Lambdas. I'm glad that this is possible now as I would then not have to poll from the queue every few seconds through a cron job. However, it appears that the maximum possible value for batchSize is limited to 10. From my understanding, the batchSize is the number of messages a single Lambda invocation will receive from the queue.
This sounds like it could be an issue for me because, in my case, I may have a few hundreds of thousands of messages at a time in the queue. Those messages don't need any heavy processing; they just need to be parsed and saved to the database as a record. It's pretty simple.
If the batchSize is limited to only 10 messages per retrieval, I foresee a few issues that I may have:
It may actually take a long time to finish processing the messages on the queue.
Not only is 10 messages per retrieval slow, since the messages are very simple to process, processing only 10 messages in a single Lambda invocation sounds a little wasteful because, given the simplicity of what is needed to be done to process the messages, I'm pretty sure it can process at least a few thousands messages in a single Lambda invocation.
Having only 10 messages per retrieval may also mean that I need to make more write operations to my database because each of these messages need to inserted as a record on the database.
Are my concerns valid in this case? If so, is there anything else I can do with SQS and Lambdas to overcome those concerns?
Your assumption about a limit of 10 is correct.
Lambda will spin up more instances to run in parallel, if there are more messages available. See Scaling and Processing. This means that if there are 1000 messages available, Lambda might spin up 100 concurrent executions to quickly process all the messages.
Once a lambda function has processed the 10 messages of a batch, it continues with processing other batches. As lambda bills in 100ms intervals, the wasted time is minimal.
As for the database writes you could pre-process the messages before inserting them into the queue.
In that case you need to let you lambda function fetch the messages from the queue and process them rather than lambda getting triggered via SQS. Probably have a cloud watch event which can trigger lambda for you depending upon what your use case is.
Please note that SQS has a limit of max 10 messages in one go but you could write the code to make it much more efficient.
One of the package which is very efficient at is squiss-ts
In this case you could let your lambda function run for 15 mins (max time) and let it process as many messages possible. Idempotency is the key when you are desinging these kind of applications so in case if message wasn't processed in this run, it will be processed in the next run.
Downside of using this approach is that you need to scale your lambda's manually depending on how many messages you are anticipating.
You're right that a larger batch size seems appropriate for your use case.
As of late 2020, if you specify a batch window in seconds, you can then specify a batch size of up to 10,000 messages.
So with this new option you can now configure your lambda to wait and receive much larger batches per invocation.

How to make the consumer know that the Producer has finished sending all the messages to the Broker?

1: We are working on a near real time processing or Batch Processing using Spark Streaming. Our current design has Kafka included.
2: Every 15 minutes the Producer will send the messages.
3: We plan to use Spark Streaming to consume messages from Kafka topic.
That a very broad question:
Basically, there is no such thing as "all messages" because it's stream processing (but I still understand your question).
One way would be to inject a control message at last message that "ends a burst of data"
You could also use some "side communication channel" via an RPC such that the producer send the last offset it did write to the consumer
You could put an heuristic -- if poll() does return nothing for 1 minute, you just assume that all data got consumed
And there might be other methods... But it's all hand coded -- there is no support in Kafka (cf. (1.)).

Spark Streaming Kafka backpressure

We have a Spark Streaming application, it reads data from a Kafka queue in receiver and does some transformation and output to HDFS. The batch interval is 1min, we have already tuned the backpressure and spark.streaming.receiver.maxRate parameters, so it works fine most of the time.
But we still have one problem. When HDFS is totally down, the batch job will hang for a long time (let us say the HDFS is not working for 4 hours, and the job will hang for 4 hours), but the receiver does not know that the job is not finished, so it is still receiving data for the next 4 hours. This causes OOM exception, and the whole application is down, we lost a lot of data.
So, my question is: is it possible to let the receiver know the job is not finishing so it will receive less (or even no) data, and when the job finished, it will start receiving more data to catch up. In the above condition, when HDFS is down, the receiver will read less data from Kafka and block generated in the next 4 hours is really small, the receiver and the whole application is not down, after the HDFS is ok, the receiver will read more data and start catching up.
You can enable back pressure by setting the property spark.streaming.backpressure.enabled=true. This will dynamically modify your batch sizes and will avoid situations where you get an OOM from queue build up. It has a few parameters:
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good but I simulated the response of the algorithm to various parameters here

What is spark.streaming.receiver.maxRate? How does it work with batch interval

I am working with spark 1.5.2. I understand what a batch interval is, essentially the interval after which the processing part should start on the data received from the receiver.
But I do not understand what is spark.streaming.receiver.maxRate. From some research it is apparently an important parameter.
Lets consider a scenario. my batch interval is set to 60s. And spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load. What would happen? Will the additional 60*1000 records be dropped? Or would the processing happen twice during that batch interval?
Property spark.streaming.receiver.maxRate applies to number of records per second.
The receiver max rate is applied when receiving data from the stream - that means even before batch interval applies. In other words you will never get more records per second than set in spark.streaming.receiver.maxRate. The additional records will just "stay" in the stream (e.g. Kafka, network buffer, ...) and get processed in the next batch.

Resources