I have an application using Kafka that takes advantage of two separate consumer groups listening to one topic: one consumer group (C1) is always listening for messages, while the other consumer group (C2) comes online, listens for a while, and then goes offline again for some time.
More specifically, the code that is always listening on consumer group C1 responds to a message by creating a virtual machine that starts listening on C2 and does some work using costly hardware.
The problem I'm running into is that after the virtual machine is spun up and starts listening on consumer group C2, it will sometimes receive nothing, despite the fact that it should be receiving the same message that C1 received, the one that caused C2 to be listened on in the first place.
I'm using the following topic, producer, and consumer configs:
topic config:
partitions: 6
compression.type: producer
leader.replication.throttled.replicas: --
message.downconversion.enable: true
min.insync.replicas: 2
segment.jitter.ms: 0
cleanup.policy: delete
flush.ms: 9223372036854775807
follower.replication.throttled.replicas: --
segment.bytes: 104857600
retention.ms: 604800000
flush.messages: 9223372036854775807
message.format.version: 3.0-IV1
max.compaction.lag.ms: 9223372036854775807
file.delete.delay.ms: 60000
max.message.bytes: 8388608
min.compaction.lag.ms: 0
message.timestamp.type: CreateTime
preallocate: false
min.cleanable.dirty.ratio: 0.5
index.interval.bytes: 4096
unclean.leader.election.enable: false
retention.bytes: -1
delete.retention.ms: 86400000
segment.ms: 604800000
message.timestamp.difference.max.ms: 9223372036854775807
segment.index.bytes: 10485760
producer config:
("message.max.bytes", "20971520")
("queue.buffering.max.ms", "0")
consumer config:
("enable.partition.eof", "false")
("session.timeout.ms", "6000")
("enable.auto.commit", "true")
("auto.commit.interval.ms", "5000")
("enable.auto.of.store", "true")
The bug is intermittent. Sometimes it occurs, sometimes it doesn't, and resending the exact same message after the consumer is up and listening on C2 always succeeds, so it isn't an issue like the message size being too large for the topic or anything like that.
I suspect it's related to offsets being committed/stored improperly. My consumer configuration uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow, and thus the new message that triggered C2's listening is being missed since it isn't the "latest" by Kafka's accounting. The work done by the code listening on consumer group C2 is quite long-running, and the consumer often reports a timeout, so maybe that's contributing?
EDIT: The timeout error I get is exactly:
WARN - librdkafka - librdkafka: MAXPOLL [thrd:main]: Application maximum poll interval (300000ms) exceeded by 424ms (adjust max.poll.interval.ms for long-running message processing): leaving group
I am using the Rust rdkafka library for both the producer and consumer, with Confluent Cloud's hosted Kafka.
uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow
That has nothing to do with committed values; auto.offset.reset only controls where a group id with no committed offsets starts reading.
You have auto commits enabled, but you're getting errors. Therefore offsets are getting committed even though you're not successfully processing the data, and that's why there are skips.
Your error,
maximum poll interval (300000ms) exceeded by 424ms
Without seeing your consumer code, you'll need to do "slightly less" within your poll function. For example, removing a log line could easily save half a second, assuming a log statement takes 1ms and you're polling 500 records each time.
Otherwise, increasing max.poll.interval.ms (allow more time between calls to poll()) or reducing max.poll.records (process less data per poll, but poll more frequently) is the correct response to this error.
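With the Rust rdkafka client the asker is using, max.poll.interval.ms can be raised on the consumer's ClientConfig. A hedged sketch, with the broker address and group id as placeholders and the new interval chosen arbitrarily:

use rdkafka::config::ClientConfig;
use rdkafka::consumer::BaseConsumer;

fn build_c2_consumer() -> BaseConsumer {
    ClientConfig::new()
        .set("bootstrap.servers", "broker.example:9092") // placeholder
        .set("group.id", "C2")                           // placeholder
        .set("enable.auto.commit", "true")
        // Default is 300000 (5 minutes); raise it so a single long-running unit
        // of work does not make the client leave the group between poll() calls.
        .set("max.poll.interval.ms", "1800000") // 30 minutes, pick to fit your workload
        .create()
        .expect("consumer creation failed")
}

Note that max.poll.records is a Java-client setting; with librdkafka-based clients the equivalent lever is simply to take on less work per iteration of your own consume loop, or to hand the expensive work to another thread so the consumer keeps polling.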
Related
The consumer should check for messages at 10 minute intervals, and each response should contain messages starting from the uncommitted offset. Currently, messages are received as soon as the producer sends them.
That's not really how a Kafka Consumer works. Usually, you have an infinite loop and just take whatever messages are given to you. Unless you're changing the group.id and not committing offsets between requests, you'll always get the next batch of messages.
If you want to add some maximum consumption limit, followed by sleeping the thread for 10 minutes within that loop, then that's an implementation detail of your application, but not specific to Kafka.
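A minimal sketch of such a loop with the rdkafka BaseConsumer; the broker, group id, topic, batch cap, and sleep interval are all illustrative values, not taken from the question.

use std::time::Duration;
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::Message;

fn main() {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "broker.example:9092") // illustrative
        .set("group.id", "batch-every-10-min")           // illustrative
        .set("enable.auto.commit", "true")
        .create()
        .expect("consumer creation failed");
    consumer.subscribe(&["my-topic"]).expect("subscribe failed");

    loop {
        // Drain whatever is available right now, up to a self-imposed cap.
        let mut consumed = 0;
        while let Some(result) = consumer.poll(Duration::from_millis(500)) {
            let msg = result.expect("kafka error");
            println!("{} [{}] @ {}", msg.topic(), msg.partition(), msg.offset());
            consumed += 1;
            if consumed >= 1_000 {
                break; // application-level "max consumption limit"
            }
        }
        // The 10-minute pause is purely an application decision. Beware that
        // sleeping longer than max.poll.interval.ms makes the consumer leave
        // the group, so raise that setting if you really do this.
        std::thread::sleep(Duration::from_secs(600));
    }
}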
I'm running a Kafka Streams application with three sub-topologies. The stages of activity are roughly as follows:
stream Topic A
selectKey and repartition Topic A to Topic B
stream Topic B
foreach Topic B to Topic C Producer
stream Topic C
Topic C to Topic D
Topics A, B, and C are each materialized, which means that if each topic has 40 partitions, my maximum parallelism is 120.
At first I was running 5 streams applications with 8 threads apiece. With this setup I was experiencing inconsistent performance. It seems like some sub-topologies sharing the same thread were hungrier for CPU than others, and after a while I'd get this error: Member [client_id] in group [consumer_group] has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator). Everything would get rebalanced, which could lead to decreased performance until the next failure and rebalance.
My questions are as follows:
How is it that multiple sub-topologies are able to be run on a single thread? A poll queue?
How does each thread decide how to allocate compute resources to each of its sub-topologies?
How do you optimize your thread to topic-partition ratio in such cases to avoid periodic consumer failures? e.g., will a 1:1 ratio ensure more consistent performance?
If you use a 1:1 ratio, how do you ensure that every thread gets assigned its own topic-partition and some threads aren't left idle?
The thread will poll() for all topics of the different sub-topologies and check each record's topic metadata to feed it into the correct task.
Each sub-topology is treated the same, i.e., available resources are evenly distributed, if you wish.
A 1:1 ratio is only useful if you have enough cores. I would recommend monitoring your CPU utilization. If it's too high (larger than 80%) you should add more cores/threads.
Kafka Streams handles this for you automatically.
A couple of general comments:
you might consider increasing the max.poll.interval.ms config to avoid a consumer dropping out of the group
you might consider decreasing max.poll.records to get fewer records per poll() call, and thus decrease the time between two consecutive calls to poll()
note that max.poll.records does not imply increased network/broker communication -- if a single fetch request returns more records than the max.poll.records config, the data is just buffered within the consumer and the next poll() will be served from that buffer, avoiding a broker round trip
I am consuming a Kinesis stream with Spark Streaming 2.2.0 and using spark-streaming-kinesis-asl_2.11.
Kinesis Stream has 150 shards and I am monitoring the GetRecords.IteratorAgeMilliseconds CloudWatch metric to see whether the consumer is keeping up with the stream.
Kinesis Stream has a default data retention of 86400 seconds (1 day).
I am debugging a case where a few Kinesis shards reached the maximum GetRecords.IteratorAgeMilliseconds of 86400000 (== retention period).
This is only true for some shards (let's call them outdated shards), not all of them.
I have identified shardIds for outdated shards. One of them is shardId-000000000518 and I can see in DynamoDB table that holds checkpointing information the following:
leaseKey: shardId-000000000518
checkpoint: 49578988488125109498392734939028905131283484648820187234
checkpointSubSequenceNumber: 0
leaseCounter: 11058
leaseOwner: 10.0.165.44:52af1b14-3ed0-4b04-90b1-94e4d178ed6e
ownerSwitchesSinceCheckpoint: 37
parentShardId: { "shardId-000000000269" }
I can see the following in the logs of worker on 10.0.165.44:
17/11/22 01:04:14 INFO Worker: Current stream shard assignments: shardId-000000000339, ..., shardId-000000000280, shardId-000000000518
... which should mean that shardId-000000000518 was assigned to this worker. However, I never see anything else in the logs for this shardId. If the worker is not consuming from this shardId (but it should be), this could explain why GetRecords.IteratorAgeMilliseconds never decreases. For some other (non-outdated) shardIds, I can see in the logs:
17/11/22 01:31:28 INFO SequenceNumberValidator: Validated sequence number 49578988151227751784190049362310810844771023726275728690 with shard id shardId-00000000033
I did verify that outdated shards have data flowing into them by looking at the IncomingRecords CloudWatch metric.
How can I debug/resolve this? Why would these shardIds never get picked up by the Spark worker?
1: We are working on near real time processing or batch processing using Spark Streaming. Our current design includes Kafka.
2: Every 15 minutes the producer will send the messages.
3: We plan to use Spark Streaming to consume messages from the Kafka topic.
That's a very broad question:
Basically, there is no such thing as "all messages" because it's stream processing (but I still understand your question).
One way would be to inject a control message as the last message that "ends a burst of data".
You could also use some "side communication channel" via an RPC, such that the producer sends the last offset it wrote to the consumer.
You could use a heuristic -- if poll() returns nothing for 1 minute, you just assume that all data got consumed (a sketch of this appears after this list).
And there might be other methods... But it's all hand-coded -- there is no support in Kafka (cf. (1.)).
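A sketch of that last heuristic, written here with the rdkafka BaseConsumer from the first question on this page; the one-minute idle window, broker, group id, and topic are placeholders, and nothing about the heuristic itself is Spark-specific.

use std::time::{Duration, Instant};
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::Message;

fn main() {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "broker.example:9092") // placeholder
        .set("group.id", "burst-reader")                 // placeholder
        .set("enable.auto.commit", "true")
        .create()
        .expect("consumer creation failed");
    consumer.subscribe(&["my-topic"]).expect("subscribe failed");

    let idle_limit = Duration::from_secs(60); // "nothing for 1 minute" heuristic
    let mut last_seen = Instant::now();

    loop {
        match consumer.poll(Duration::from_secs(1)) {
            Some(Ok(msg)) => {
                last_seen = Instant::now();
                // Process the payload here; we just touch it to show the accessor.
                let _bytes = msg.payload();
            }
            Some(Err(e)) => eprintln!("kafka error: {}", e),
            None => {
                if last_seen.elapsed() > idle_limit {
                    // No data for a full minute: assume the burst has been consumed.
                    break;
                }
            }
        }
    }
}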
We have a Spark Streaming application that reads data from a Kafka queue in a receiver, does some transformation, and outputs to HDFS. The batch interval is 1 min, and we have already tuned the backpressure and spark.streaming.receiver.maxRate parameters, so it works fine most of the time.
But we still have one problem. When HDFS is totally down, the batch job will hang for a long time (let's say HDFS is not working for 4 hours, then the job will hang for 4 hours), but the receiver does not know that the job is not finished, so it keeps receiving data for the next 4 hours. This causes an OOM exception, the whole application goes down, and we lose a lot of data.
So, my question is: is it possible to let the receiver know that the job is not finishing, so that it receives less (or even no) data, and when the job finishes, it starts receiving more data to catch up? In the scenario above, when HDFS is down, the receiver would read less data from Kafka, the blocks generated in the next 4 hours would stay small, and the receiver and the whole application would not go down; after HDFS is OK again, the receiver would read more data and start catching up.
You can enable backpressure by setting the property spark.streaming.backpressure.enabled=true. This will dynamically modify your batch sizes and will avoid situations where you get an OOM from queue build-up. It has a few parameters:
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good, but I simulated the response of the algorithm to various parameters here.