Kinesis shard GetRecords.IteratorAgeMilliseconds reached the maximum of 86.4M (1 day) and does not decrease even though the stream is being consumed - apache-spark

I am consuming a Kinesis stream with Spark Streaming 2.2.0 and using spark-streaming-kinesis-asl_2.11.
The Kinesis stream has 150 shards, and I am monitoring the GetRecords.IteratorAgeMilliseconds CloudWatch metric to see whether the consumer is keeping up with the stream.
The Kinesis stream has the default data retention of 86400 seconds (1 day).
I am debugging a case where a few Kinesis shards have reached the maximum GetRecords.IteratorAgeMilliseconds of 86400000 (== the retention period).
This is only true for some shards (let's call them outdated shards), not all of them.
I have identified the shardIds of the outdated shards. One of them is shardId-000000000518, and in the DynamoDB table that holds the checkpointing information I can see the following:
leaseKey: shardId-000000000518
checkpoint: 49578988488125109498392734939028905131283484648820187234
checkpointSubSequenceNumber: 0
leaseCounter: 11058
leaseOwner: 10.0.165.44:52af1b14-3ed0-4b04-90b1-94e4d178ed6e
ownerSwitchesSinceCheckpoint: 37
parentShardId: { "shardId-000000000269" }
I can see the following in the logs of worker on 10.0.165.44:
17/11/22 01:04:14 INFO Worker: Current stream shard assignments: shardId-000000000339, ..., shardId-000000000280, shardId-000000000518
... which should mean that shardId-000000000518 was assigned to this worker. However, I never see anything else in the logs for this shardId. If the worker is not consuming from this shardId (but it should be), that would explain why GetRecords.IteratorAgeMilliseconds never decreases. For some other (non-outdated) shardIds, I can see in the logs
17/11/22 01:31:28 INFO SequenceNumberValidator: Validated sequence number 49578988151227751784190049362310810844771023726275728690 with shard id shardId-00000000033
I did verify that outdated shards have data flowing into them by looking at the IncomingRecords CloudWatch metric.
How can I debug/resolve this? Why would these shardIds never get picked up by the Spark worker?
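The question does not show how the DStream is created; a typical spark-streaming-kinesis-asl 2.2.0 setup looks roughly like the sketch below (the application name, stream name, region, receiver count, and intervals are illustrative assumptions, not taken from the question). Each receiver's KCL worker is assigned a subset of the shards, and the KCL application name doubles as the name of the DynamoDB lease table shown above.

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-consumer"), Seconds(60))

// One receiver per created stream; each receiver's KCL worker handles a subset of the 150 shards.
val numReceivers = 10                                   // illustrative
val streams = (1 to numReceivers).map { _ =>
  KinesisUtils.createStream(
    ssc,
    "my-kcl-app",                                       // KCL application name == DynamoDB lease table name
    "my-stream",                                        // the 150-shard Kinesis stream
    "https://kinesis.us-east-1.amazonaws.com",          // endpoint (illustrative region)
    "us-east-1",
    InitialPositionInStream.TRIM_HORIZON,
    Seconds(10),                                        // how often the KCL checkpoints to DynamoDB
    StorageLevel.MEMORY_AND_DISK_2)
}
val unified = ssc.union(streams)                        // process all shards as one DStream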

Related

Kafka consumer that is spun-up and torn down misses message

I have an application using Kafka with two separate consumer groups listening to one topic: one consumer group (C1) is always listening for messages, while the other (C2) comes online, listens for messages for a while, and then goes offline again for some time.
More specifically, the code that is always listening on consumer group C1 responds to a message by creating a virtual machine that starts listening on C2 and does some work using costly hardware.
The problem I'm running into is that after the virtual machine is spun up and starts listening on consumer group C2, it sometimes receives nothing, even though it should receive the same message that C1 received, which triggered C2's consumer in the first place.
I'm using the following topic, producer, and consumer configs:
topic config:
partitions: 6
compression.type: producer
leader.replication.throttled.replicas: --
message.downconversion.enable: true
min.insync.replicas: 2
segment.jitter.ms: 0
cleanup.policy: delete
flush.ms: 9223372036854775807
follower.replication.throttled.replicas: --
segment.bytes: 104857600
retention.ms: 604800000
flush.messages: 9223372036854775807
message.format.version: 3.0-IV1
max.compaction.lag.ms: 9223372036854775807
file.delete.delay.ms: 60000
max.message.bytes: 8388608
min.compaction.lag.ms: 0
message.timestamp.type: CreateTime
preallocate: false
min.cleanable.dirty.ratio: 0.5
index.interval.bytes: 4096
unclean.leader.election.enable: false
retention.bytes: -1
delete.retention.ms: 86400000
segment.ms: 604800000
message.timestamp.difference.max.ms: 9223372036854775807
segment.index.bytes: 10485760
producer config:
("message.max.bytes", "20971520")
("queue.buffering.max.ms", "0")
consumer config:
("enable.partition.eof", "false")
("session.timeout.ms", "6000")
("enable.auto.commit", "true")
("auto.commit.interval.ms", "5000")
("enable.auto.of.store", "true")
The bug is intermittent. Sometimes it occurs, sometimes it doesn't, and resending the exact same message after the consumer is up and listening on C2 always succeeds, so it isn't an issue like the message size being too large for the topic or anything like that.
I suspect it's related to offsets being committed/stored improperly. My consumer configuration uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow, and thus the new message that triggered C2's listening is being missed since it isn't the "latest" by Kafka's accounting. The work done by the code listening on consumer group C2 is quite long-running and the consumer often reports a timeout, so maybe that's contributing?
EDIT: The timeout error I get is exactly:
WARN - librdkafka - librdkafka: MAXPOLL [thrd:main]: Application maximum poll interval (300000ms) exceeded by 424ms (adjust max.poll.interval.ms for long-running message processing): leaving group
I am using the Rust rdkafka library for both the producer and consumer with confluent cloud's hosted kafka.
uses the default of "latest" for "auto.offset.reset", so I suspect that the offsets are getting dropped or not properly committed somehow
That has nothing to do with committed values; auto.offset.reset only controls where a consumer starts reading when its group id has no committed offset.
You have auto commits enabled, but you're getting errors. Offsets are therefore getting committed even though you're not successfully processing the data, and that's why messages appear to be skipped.
Your error,
maximum poll interval (300000ms) exceeded by 424ms
Without seeing your consumer code, the general advice is to do "slightly less" work within your poll loop. For example, removing a log line could easily save half a second, assuming each log statement takes 1ms and you're polling 500 records at a time.
Otherwise, increasing max.poll.interval.ms (give the consumer more time between polls) or reducing max.poll.records (process fewer records per poll, but poll more frequently) is the correct response to this error.
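For illustration only, here is what those two settings look like on the JVM consumer in Scala; the question uses Rust rdkafka, where max.poll.interval.ms goes into the consumer config the same way (max.poll.records is a Java-client setting). The broker address, group id, topic, and values are assumptions, not taken from the question.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")                 // illustrative
props.put(ConsumerConfig.GROUP_ID_CONFIG, "C2")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
// Give long-running processing more head-room between poll() calls...
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "900000")                   // 15 minutes (illustrative)
// ...and/or hand back fewer records per poll so each loop iteration finishes sooner.
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50")                           // illustrative

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("my-topic"))                          // illustrative topic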

Elasticsearch could not write all entries: Maybe ES was overloaded

I have an application where I read CSV files, do some transformations, and then push the results to Elasticsearch from Spark itself, like this:
input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + type)
  .save()
I have several nodes, and on each node I run 5-6 spark-submit commands that push to Elasticsearch.
I frequently get errors like:
Could not write all entries [13/128] (Maybe ES was overloaded?). Error sample (first [5] error messages):
rejected execution of org.elasticsearch.transport.TransportService$7#32e6f8f8 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#4448a084[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 451515]]
My Elasticsearch cluster has the following stats:
Nodes - 9 (1TB disk space, RAM >= 15GB, more than 8 cores per node)
I have modified the following parameters for Elasticsearch:
spark.es.batch.size.bytes=5000000
spark.es.batch.size.entries=5000
spark.es.batch.write.refresh=false
Could anyone suggest what I can fix to get rid of these errors?
This occurs because bulk requests are arriving at a rate greater than the Elasticsearch cluster can process, and the bulk request queue is full.
The default bulk queue size is 200.
Ideally you should handle this on the client side:
1) by reducing the number of spark-submit commands running concurrently
2) by retrying in case of rejections, tweaking es.batch.write.retry.count and es.batch.write.retry.wait (see the sketch after this answer)
Example:
es.batch.write.retry.wait = "60s"
es.batch.write.retry.count = 6
On the Elasticsearch cluster side:
1) check whether there are too many shards per index and try reducing that number.
This blog has a good discussion of the criteria for tuning the number of shards.
2) as a last resort, increase thread_pool.bulk.queue_size.
Check this blog with an extensive discussion on bulk rejections.
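To make the retry suggestion concrete, here is a minimal sketch of the same write as in the question with the elasticsearch-hadoop retry options applied. The DataFrame input, index pattern, and retry values are taken from the question and the example above; docType stands in for the OP's type variable.

import org.apache.spark.sql.SaveMode

// Same write as in the question, but retrying rejected bulk requests instead of failing.
input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + docType)
  .option("es.batch.write.retry.count", "6")     // retry a rejected bulk request up to 6 times
  .option("es.batch.write.retry.wait", "60s")    // wait 60s between retries so the queue can drain
  .save()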
The bulk queue in your ES cluster is hitting its capacity (200). Try increasing it. See this page for how to change the bulk queue capacity:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
Also check this other SO answer, where the OP had a very similar issue that was fixed by increasing the bulk queue size.
Rejected Execution of org.elasticsearch.transport.TransportService Error

Kafka - get lag

I am using "node-rdkafka" npm module for our distributed service architecture written in Nodejs. We have a use case for metering where we allow only a certain amount of messages to be consumed and processed every n seconds. For example, a "main" topic has 100 messages pushed by a producer and "worker" consumes from main topic every 30 seconds. There is a lot more to the story of the use case.
The problem I am having is that I need to progamatically get the lag of a given topic(all partitions).
Is there a way for me to do that?
I know that I can use "bin/kafka-consumer-groups.sh" to access some of the data I need but is there another way?
Thank you in advance
You can retrieve that information directly from your node-rdkafka client via several methods:
Client metrics:
The client can emit metrics at a defined interval that contain the current and committed offsets as well as the end offset, so you can easily calculate the lag.
You first need to enable the metrics events by setting, for example, 'statistics.interval.ms': 5000 in your client configuration. Then set a listener on the event.stats event:
consumer.on('event.stats', function(stats) {
  console.log(stats);
});
The full stats are documented on https://github.com/edenhill/librdkafka/wiki/Statistics but you probably are mostly interested in the partition stats: https://github.com/edenhill/librdkafka/wiki/Statistics#partitions
Query the cluster for offsets:
You can use queryWatermarkOffsets() to retrieve the first and last offsets for a partition.
consumer.queryWatermarkOffsets(topicName, partition, timeout, function(err, offsets) {
  var high = offsets.highOffset;
  var low = offsets.lowOffset;
});
Then use the consumer's current position (position()) or committed (committed()) offsets to calculate the lag.
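If it helps to see the arithmetic spelled out, here is the same calculation sketched with the JVM KafkaConsumer in Scala (purely illustrative; with node-rdkafka you would combine queryWatermarkOffsets() with committed() or position() in the same way). The broker address, group id, and topic name are assumptions.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")              // illustrative
props.put("group.id", "worker")                            // the group whose lag we want
props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val partitions = consumer.partitionsFor("main").asScala
  .map(p => new TopicPartition(p.topic, p.partition))

val endOffsets = consumer.endOffsets(partitions.asJava).asScala
val totalLag = partitions.map { tp =>
  val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
  endOffsets(tp).longValue - committed                     // per-partition lag = high watermark - committed offset
}.sum
println(s"total lag for topic 'main': $totalLag")
consumer.close()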
Kafka exposes "records-lag-max" mbean which is the max records in lag for a partition via jmx, so you can get the lag querying this mbean
Refer to below doc for the exposed jmx mbean in detail .
https://docs.confluent.io/current/kafka/monitoring.html#consumer-group-metrics

Spark Streaming Kafka backpressure

We have a Spark Streaming application that reads data from a Kafka queue with a receiver, does some transformations, and writes the output to HDFS. The batch interval is 1 min, and we have already tuned the backpressure and spark.streaming.receiver.maxRate parameters, so it works fine most of the time.
But we still have one problem. When HDFS is totally down, the batch job hangs for a long time (say HDFS is not working for 4 hours, then the job hangs for 4 hours), but the receiver does not know that the job has not finished, so it keeps receiving data for those 4 hours. This causes an OOM exception, the whole application goes down, and we lose a lot of data.
So, my question is: is it possible to let the receiver know that the job is not finishing, so that it receives less (or even no) data, and then, once the job finishes, starts receiving more data to catch up? In the situation above, when HDFS is down, the receiver would read less data from Kafka, the blocks generated over those 4 hours would stay small, and the receiver and the whole application would not go down; after HDFS recovers, the receiver would read more data and start catching up.
You can enable backpressure by setting the property spark.streaming.backpressure.enabled=true. This dynamically adjusts the ingestion rate (and therefore the number of records per batch) and avoids situations where you get an OOM from queue build-up. It has a few parameters (a configuration sketch follows the list):
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good but I simulated the response of the algorithm to various parameters here
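As a minimal sketch of where these settings go (the values below are just the documented defaults plus an illustrative maxRate cap; the app name is made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-hdfs")                                    // illustrative
  .set("spark.streaming.backpressure.enabled", "true")            // turn the PID rate estimator on
  .set("spark.streaming.receiver.maxRate", "10000")               // hard ceiling in records/sec per receiver (illustrative)
  .set("spark.streaming.backpressure.pid.proportional", "1.0")    // default
  .set("spark.streaming.backpressure.pid.integral", "0.2")        // default
  .set("spark.streaming.backpressure.pid.derived", "0.0")         // default
  .set("spark.streaming.backpressure.pid.minRate", "100")         // default

val ssc = new StreamingContext(conf, Seconds(60))                 // 1-minute batches, as in the question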

What is spark.streaming.receiver.maxRate? How does it work with batch interval

I am working with spark 1.5.2. I understand what a batch interval is, essentially the interval after which the processing part should start on the data received from the receiver.
But I do not understand what is spark.streaming.receiver.maxRate. From some research it is apparently an important parameter.
Let's consider a scenario: my batch interval is set to 60s and spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load? What would happen? Will the additional 60*1000 records be dropped, or would the processing happen twice during that batch interval?
The spark.streaming.receiver.maxRate property is expressed as a number of records per second.
The receiver max rate is applied when receiving data from the stream - that means even before batch interval applies. In other words you will never get more records per second than set in spark.streaming.receiver.maxRate. The additional records will just "stay" in the stream (e.g. Kafka, network buffer, ...) and get processed in the next batch.
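A short sketch of the relationship, with illustrative values: with maxRate set to 1000 records/sec and a 60s batch interval, a single receiver contributes at most 1000 * 60 = 60,000 records to each batch; anything arriving faster simply waits in the source.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("maxrate-example")                          // illustrative
  .set("spark.streaming.receiver.maxRate", "1000")        // records per second, per receiver

// Per-batch ceiling = maxRate * batch interval = 1000 * 60 = 60,000 records per receiver;
// the surplus stays in the source (Kafka, network buffer, ...) and is picked up by later batches.
val ssc = new StreamingContext(conf, Seconds(60))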
