Spark Structured Streaming OutOfMemoryError caused by thousands of KafkaMbean instances - apache-spark

Spark Structured Streaming executor fails with OutOfMemoryError
Checking the heap allocation with VisualVM indicates that the JMX MBean server's memory usage grows linearly with time.
After further investigation, it seems the MBean server is filled with thousands of KafkaMbean instances carrying metrics for consumer-(\d+), where the consumer number runs into the thousands (equal to the number of tasks created on the executor).
Running the Kafka consumer with DEBUG logs on the executor shows that the executor registers thousands of metrics sensors and often does not remove them at all, or removes only some of them.
I am running HDP Spark 2.3.0.2.6.5.0-292 with HDP Kafka 1.0.0.2.6.5.0-292.
Here is how I initialise structured streaming:
sparkSession
.readStream
.format("kafka")
.options(Map("kafka.bootstrap.servers" -> KAFKA_BROKERS,
"subscribePattern" -> INPUT_TOPIC,
"startingOffsets" -> "earliest",
"failOnDataLoss" -> "false"))
.load()
.mapPartitions(processData)
.writeStream
.format("kafka")
.options(Map("kafka.bootstrap.servers" -> KAFKA_BROKERS,
"checkpointLocation" -> CHECKPOINT_LOCATION))
.queryName("Process Data")
.outputMode("update")
.trigger(Trigger.ProcessingTime(1000))
.start()
.awaitTermination()
I was expecting Spark/Kafka to properly clean the MBeans on task completion, but that seems not to be the case.

Your HDP version is probably using Spark 2.3.1, which has a known bug when reading data from Kafka (the issue appears when you are reading from a topic that doesn't have new data in every micro-batch):
https://issues.apache.org/jira/browse/SPARK-24987
https://issues.apache.org/jira/browse/SPARK-25106
This bug is the result of a change made in version 2.3.1 (it doesn't exist in version 2.3.0), so you can either upgrade your Spark version or get a patch for your HDP version.
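To confirm the diagnosis (and to verify that a patch actually helps), one option is to watch how many Kafka consumer MBeans are registered in the executor JVM. Below is a minimal sketch using the standard JMX API; the "kafka.consumer:*" domain pattern is an assumption based on the KafkaMbean metrics described above.
// Minimal sketch: count the Kafka consumer MBeans registered in this JVM's
// platform MBean server. Logging this periodically from executor code shows
// whether the count keeps growing with the number of tasks.
import java.lang.management.ManagementFactory
import javax.management.ObjectName

object KafkaMBeanCounter {
  def count(): Int = {
    val server = ManagementFactory.getPlatformMBeanServer
    // "kafka.consumer:*" matches the consumer-(\d+) metrics seen in the heap dump
    server.queryNames(new ObjectName("kafka.consumer:*"), null).size
  }
}

// e.g. inside processData: println(s"kafka.consumer MBeans: ${KafkaMBeanCounter.count()}")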

Related

Spark structured streaming job failing after running for 3-4 days with java.io.FileNotFoundException: File does not exist

We have a Spark Structured Streaming job written in Scala running in production, which reads from a Kafka topic and writes to an HDFS sink. The trigger time is 60 seconds. The job was deployed 4 months ago and, after running well for a month, we started getting the error below, and the job now fails instantly:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/XYZ/hive/prod/landing_dir/abc/_spark_metadata/.edfac8fb-aa6d-4c0e-aa19-48f44cc29774.tmp (inode 6736258272) Holder DFSClient_NONMAPREDUCE_841796220_49 does not have any open files
Earlier this issue was not regular, i.e. it happened once every 2-3 weeks. Over the last month the error has become very frequent, occurring every 3-4 days and failing the job. We restart this job once a week as part of regular maintenance. The Spark version is 2.3.2 and we run on the YARN cluster manager. From the error it is evident that something is going wrong in the write-ahead log (WAL) directory, since the path points to _spark_metadata. I would like to understand what is causing this exception and how we can handle it. Is this something we can handle in our application, or is it an environment issue that needs to be addressed at the infra level?
Below is the code snippet:
val spark = SparkSession
.builder
.master(StreamerConfig.sparkMaster)
.appName(StreamerConfig.sparkAppName)
.getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
.readStream
.format("kafka") .option("kafka.bootstrap.servers",StreamerConfig.kafkaBootstrapServer)
.option("subscribe",StreamerConfig.topicName)
.option("failOnDataLoss", false)
.option("startingOffsets",StreamerConfig.kafkaStartingOffset) .option("maxOffsetsPerTrigger",StreamerConfig.maxOffsetsPerTrigger)
.load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
.queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
.outputMode("append")
.format("orc")
.option("path",StreamerConfig.hdfsLandingPath)
.option("checkpointLocation",StreamerConfig.checkpointLocation)
.partitionBy("date","hour")
.option("truncate","false")
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()

Spark Structured Streaming Batch Read Checkpointing

I am fairly new to Spark and am still learning. One of the more difficult concepts I have come across is checkpointing and how Spark uses it to recover from failures. I am doing batch reads from Kafka using Structured Streaming and writing them to S3 as Parquet files:
dataset
.write()
.mode(SaveMode.Append)
.option("checkpointLocation", checkpointLocation)
.partitionBy("date_hour")
.parquet(getS3PathForTopic(topicName));
The checkpoint location is a S3 filesystem path. However, as the job runs, I see no checkpointing files. In subsequent runs, I see the following log:
21/10/14 12:20:51 INFO ConsumerCoordinator: [Consumer clientId=consumer-spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0-5, groupId=spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0] Found no committed offset for partition topic-1
This indicates that the previous run did not checkpoint any offsets for this run to pick them up from. So it keeps consuming from the earliest offset.
How can I make my job pick up new offsets? Note that this is a batch query as described here.
This is how I read:
sparkSession
.read()
.format("kafka")
.option("kafka.bootstrap.servers", kafkaProperties.bootstrapServers())
.option("subscribe", topic)
.option("kafka.security.protocol", "SSL")
.option("kafka.ssl.truststore.location", sslConfig.truststoreLocation())
.option("kakfa.ssl.truststore.password", sslConfig.truststorePassword())
.option("kafka.ssl.keystore.location", sslConfig.keystoreLocation())
.option("kafka.ssl.keystore.password", sslConfig.keystorePassword())
.option("kafka.ssl.endpoint.identification.algorithm", "")
.option("failOnDataLoss", "true");
I am not sure why batch mode for Spark Structured Streaming with Kafka still exists. If you wish to use it, then you must code your own offset management. See the guide, but it is badly explained there.
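For a batch query, "your own offset management" means passing explicit startingOffsets/endingOffsets and persisting the last consumed offsets yourself between runs. A minimal sketch of the read side in Scala (the topic name and offset numbers are placeholders):
// Minimal sketch: a Kafka *batch* read over an explicit offset range.
// The JSON maps topic -> partition -> offset; -2 means "earliest", -1 "latest".
// Persist the end offsets of each run yourself (e.g. to S3) and feed them back
// in as startingOffsets on the next run.
val batchDf = sparkSession.read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaProperties.bootstrapServers())
  .option("subscribe", "my-topic")
  .option("startingOffsets", """{"my-topic":{"0":23,"1":-2}}""")
  .option("endingOffsets", "latest")
  .load()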
I would say Trigger.Once is a better fit for your use case; offset management is then provided by Spark, since it is no longer a batch query.
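A sketch of that alternative in Scala: run the same source as a streaming query with Trigger.Once, so that checkpointLocation actually records the consumed offsets and the next run resumes from them (paths and transformations are placeholders taken from the question):
import org.apache.spark.sql.streaming.Trigger

// Minimal sketch: process everything currently available in the topic once,
// then stop. Offsets are tracked in the checkpoint, so the next run resumes
// where this one finished instead of re-reading from the earliest offset.
val query = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaProperties.bootstrapServers())
  .option("subscribe", topic)
  .load()
  // ... transformations producing the date_hour column go here ...
  .writeStream
  .format("parquet")
  .option("path", getS3PathForTopic(topicName))
  .option("checkpointLocation", checkpointLocation)
  .partitionBy("date_hour")
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()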

Structured Streaming OOM

I deploy a Structured Streaming job on the k8s operator, which simply reads from Kafka, deserializes, adds two columns and stores the results in the data lake (I tried both Delta and Parquet). After a few days the executor memory grows and eventually I get an OOM. The input volume is really low, on the order of KBs.
P.S. I use exactly the same code but with Cassandra as a sink, and that has been running for almost a month now without any issues. Any ideas?
My code
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", MetisStreamsConfig.bootstrapServers)
.option("subscribe", MetisStreamsConfig.topics.head)
.option("startingOffsets", startingOffsets)
.option("maxOffsetsPerTrigger", MetisStreamsConfig.maxOffsetsPerTrigger)
.load()
.selectExpr("CAST(value AS STRING)")
.as[String]
.withColumn("payload", from_json($"value", schema))
// selection + filtering
.select("payload.*")
.select($"vesselQuantity.qid" as "qid", $"vesselQuantity.vesselId" as "vessel_id", explode($"measurements"))
.select($"qid", $"vessel_id", $"col.*")
.filter($"timestamp".isNotNull)
.filter($"qid".isNotNull and !($"qid"===""))
.withColumn("ingestion_time", current_timestamp())
.withColumn("mapping", MappingUDF($"qid"))
.writeStream
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
log.info(s"Storing batch with id: `$batchId`")
val calendarInstance = Calendar.getInstance()
val year = calendarInstance.get(Calendar.YEAR)
val month = calendarInstance.get(Calendar.MONTH) + 1
val day = calendarInstance.get(Calendar.DAY_OF_MONTH)
batchDF.write
.mode("append")
.parquet(streamOutputDir + s"/$year/$month/$day")
}
.option("checkpointLocation", checkpointDir)
.start()
I changed to foreachBatch because using Delta or Parquet with partitionBy caused the issue to appear faster.
There is a bug that is resolved in Spark 3.1.0.
See https://github.com/apache/spark/pull/28904
Other ways of overcoming the issue, and credit for the debugging:
https://www.waitingforcode.com/apache-spark-structured-streaming/file-sink-out-of-memory-risk/read
You may find this helpful even though you are using foreachBatch ...
I had the same issue for some Structured Streaming Spark 2.4.4 applications writing some Delta lake (or parquet) output with partitionBy.
It seems to be related to JVM memory allocation within a container, as thoroughly explained here:
https://merikan.com/2019/04/jvm-in-a-container/
My solution (though it depends on your JVM version) was to add some options to the YAML definition of my Spark application:
spec:
javaOptions: >-
-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
This way my streaming app is functioning properly, with a normal amount of memory (1 GB for the driver, 2 GB for executors).
EDIT: while it seems that the first issue is solved (the controller killing pods for memory consumption), there is still an issue with slowly growing non-heap memory; after a few hours, the driver/executors are killed...
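For the remaining non-heap growth, one thing worth checking (my assumption, not part of the original fix) is the off-heap headroom Spark reserves on top of the JVM heap in each container, for example via the sparkConf section of the SparkApplication spec:
spec:
  sparkConf:
    # Standard Spark configs; the values here are illustrative only.
    "spark.executor.memoryOverhead": "1g"
    "spark.driver.memoryOverhead": "512m"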

What is the meaning of "OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions"?

I use Apache Spark 2.4.1 and the Kafka data source.
Dataset<Row> df = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", SERVERS)
.option("subscribe", TOPIC)
.option("startingOffsets", "latest")
.option("auto.offset.reset", "earliest")
.load();
I have two sinks: the raw data is stored in an HDFS location and, after a few transformations, the final data is stored in a Cassandra table. The checkpointLocation is an HDFS directory.
When starting the streaming query it gives the warning below:
2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - Some data may be lost. Recovering from the earliest offset: 470021
2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - The current available offset range is AvailableOffsetRange(470021,470021). Offset 62687 is out of range, and records in [62687, 62727) will be skipped (GroupId: spark-kafka-source-1fba9e33-165f-42b4-a220-6697072f7172-1781964857-executor, TopicPartition: INBOUND-19). Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "true".
I also tried auto.offset.reset set to latest and startingOffsets set to latest.
2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,_INBOUND-19)
What is this telling me? How to get rid of the warning (if possible)?
Some data may be lost. Recovering from the earliest offset: 470021
The above warning happens when your streaming query starts from checkpointed offsets that are outside what is currently available in the topic.
In other words, the streaming query uses a checkpointLocation whose state no longer matches what Kafka retains, hence the warning (not an error).
It means that your query is too slow compared to the topic's cleanup.policy (retention or compaction): the records were removed from Kafka before the query got to them.
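If skipping the aged-out records is acceptable, the warning itself names the relevant switch: failOnDataLoss. A minimal sketch in Scala (the question's reader, with the choice made explicit):
// Minimal sketch: decide explicitly what happens when checkpointed offsets have
// already been removed by Kafka retention. "false" logs the warning above and
// skips the missing range; "true" fails the query instead.
val df = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", SERVERS)
  .option("subscribe", TOPIC)
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false") // or "true" to fail on data loss
  .load()
Beyond that, the real fix is to make the query keep up with retention: process faster (for example by scaling out) or increase the topic's retention so records survive until they are read.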

How to manually set group.id and commit kafka offsets in spark structured streaming?

I was going through the Spark structured streaming - Kafka integration guide here.
The guide states at that link that
enable.auto.commit: Kafka source doesn’t commit any offset.
So how do I manually commit offsets once my spark application has successfully processed each record?
tl;dr
It is not possible to commit offsets back to Kafka. Starting with Spark version 3.x you can define the name of the Kafka consumer group; however, this still does not allow you to commit any offsets.
Since Spark 3.0.0
According to the Structured Kafka Integration Guide you can provide the ConsumerGroup as an option kafka.group.id:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("kafka.group.id", "myConsumerGroup")
.load()
However, Spark still will not commit any offsets back, so you will not be able to "manually" commit offsets to Kafka. This option is meant to deal with Kafka's Authorization using Role-Based Access Control, for which your ConsumerGroup usually needs to follow naming conventions.
A full example of a Spark 3.x application is discussed and solved here.
Until Spark 2.4.x
The Spark Structured Streaming + Kafka Integration Guide clearly states how it manages Kafka offsets. Spark does not commit any offsets back to Kafka, as it relies on its internal offset management for fault tolerance.
The most important Kafka configurations for managing offsets are:
group.id: Kafka source will create a unique group id for each query automatically. According to the code the group.id will be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
auto.offset.reset: Set the source option startingOffsets to specify where to start instead.
Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it.
enable.auto.commit: Kafka source doesn’t commit any offset.
Therefore, in Structured Streaming up to 2.4.x it is not possible to define your own group.id for the Kafka consumer; Structured Streaming manages the offsets internally and does not commit them back to Kafka (not even automatically).
2.4.x in Action
Let's say you have a simple Spark Structured Streaming application that reads and writes to Kafka, like this:
// create SparkSession
val spark = SparkSession.builder()
.appName("ListenerTester")
.master("local[*]")
.getOrCreate()
// read from Kafka topic
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "testingKafkaProducer")
.option("failOnDataLoss", "false")
.load()
// write to Kafka topic and set checkpoint directory for this stream
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "testingKafkaProducerOut")
.option("checkpointLocation", "/home/.../sparkCheckpoint/")
.start()
Offset Management by Spark
Once this application is submitted and data is being processed, the corresponding offset can be found in the checkpoint directory:
myCheckpointDir/offsets/
{"testingKafkaProducer":{"0":1}}
Here the entry in the checkpoint file confirms that the next offset to be consumed from partition 0 is 1. It implies that the application has already processed offset 0 of partition 0 of the topic named testingKafkaProducer.
More on the fault-tolerance-semantics are given in the Spark Documentation.
Offset Management by Kafka
However, as stated in the documentation, the offset is not committed back to Kafka.
This can be checked by executing the kafka-consumer-groups.sh of the Kafka installation.
./kafka/current/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "spark-kafka-source-92ea6f85-[...]-driver-0"
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
testingKafkaProducer 0 - 1 - consumer-1-[...] /127.0.0.1 consumer-1
The current offset for this application is unknown to Kafka as it has never been committed.
Possible Workaround
Please carefully read the comments below from Spark committer @JungtaekLim about this workaround: "Spark's fault tolerance guarantee is based on the fact Spark has a full control of offset management, and they're voiding the guarantee if they're trying to modify it. (e.g. If they change to commit offset to Kafka, then there's no batch information and if Spark needs to move back to the specific batch "behind" guarantee is no longer valid.)"
What I have seen doing some research on the web is that you could commit offsets in the callback function of the onQueryProgress method in a customized StreamingQueryListener of Spark. That way, you could have a consumer group that keeps track of the current progress. However, its progress is not necessarily aligned with the actual consumer group.
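A minimal sketch of that idea in Scala (my own illustration, not code from the linked example): a StreamingQueryListener that parses the endOffset JSON reported in each progress event and commits it to Kafka under a tracking consumer group. The group name and bootstrap servers are placeholders and, per the warning above, this group is for monitoring only.
import java.util.Properties
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

class OffsetCommitListener(bootstrapServers: String, groupId: String) extends StreamingQueryListener {

  // Plain KafkaConsumer used only to send offset commits for the tracking group.
  private lazy val consumer: KafkaConsumer[Array[Byte], Array[Byte]] = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("enable.auto.commit", "false")
    new KafkaConsumer[Array[Byte], Array[Byte]](props)
  }

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = consumer.close()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    implicit val formats: Formats = DefaultFormats
    event.progress.sources.foreach { source =>
      // endOffset is a JSON string like {"testingKafkaProducer":{"0":1}}
      Option(source.endOffset).filter(_.trim.startsWith("{")).foreach { json =>
        val offsets = parse(json).extract[Map[String, Map[String, Long]]]
        val toCommit = new java.util.HashMap[TopicPartition, OffsetAndMetadata]()
        for ((topic, partitions) <- offsets; (partition, offset) <- partitions)
          toCommit.put(new TopicPartition(topic, partition.toInt), new OffsetAndMetadata(offset))
        consumer.commitSync(toCommit)
      }
    }
  }
}

// Register it before starting the query, e.g.:
// spark.streams.addListener(new OffsetCommitListener("localhost:9092", "my-tracking-group"))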
Here are some links you may find helpful:
Code Example for Listener
Discussion on SO around offset management
General description on the StreamingQueryListener
