spark structured streaming unable to start from checkpoint location - apache-spark

I'm writing a simple Spark program that uses the Structured Streaming feature with Kafka as the source. There are three sinks:
Sink 1 - Console sink -- works fine in all cases
Sinks 2 & 3 - H2 and Ignite Foreach sinks
On the first run the code works fine, but when I kill the program and restart it with the same checkpoint location, I get the error below:
17/07/12 07:11:48 ERROR StreamExecution: Query h2Out [id = 22ce7168-6f12-4220-8f28-f9eaaaba9c6a, runId = 39ecb40a-5b54-4b36-a0da-6e3057d66b2e] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.KafkaSource$$anon$1.parseVersion(Ljava/lang/String;I)I
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:116)
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:99)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:129)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:97)
at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:222)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:452)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:448)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:447)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:239)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
I checked the KafkaSource source code; the parseVersion method should be available through org.apache.spark.sql.execution.streaming.HDFSMetadataLog, and the jar that provides it (spark-sql_2.11-2.1.1.jar) is on the classpath.
For reference, I'm using the Kafka 0.10.2.1 Maven dependencies.

This error means your Spark runtime is older than 2.1.1. HDFSMetadataLog.parseVersion was added in Spark 2.1.1 and spark-sql-kafka-0-10_2.11-2.1.1.jar calls it, so running that connector against a Spark installation older than 2.1.1 produces this NoSuchMethodError.
You can check your Spark version with SparkSession.version (e.g., just type spark.version in the Spark shell).
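For example, a quick way to confirm the mismatch (assuming you can open a Spark shell against the same installation that runs the job):
// If this prints a version below 2.1.1 while spark-sql-kafka-0-10_2.11-2.1.1.jar
// is on the classpath, the connector and the runtime are mismatched.
spark.version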

Related

AWS Glue - User did not initialize spark context

I am working in AWS Glue. I am reading a Hive-metastore-based table from the AWS Glue Catalog in a Spark Scala job running on Glue, using my own custom Spark code (please note I'm intentionally writing my own code; that is a requirement for us). The code works as expected: it reads the source table and loads the target table, but the job still goes to an error state every time.
Here is my SparkSession:
val spark = SparkSession.builder().appName("SPARK-Dev")
.master("local[*]")
.enableHiveSupport()
.getOrCreate
Job throws this error
2020-03-27 17:07:53,282 ERROR [main] yarn.ApplicationMaster (Logging.scala:logError(91)) - Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
You're using master("local[*]"). I'm not sure that is correct when running on a cluster; the master should come from the environment. Try removing the master(...) call.
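For instance, a minimal sketch of the suggested change (the same builder from the question, just without the hard-coded master, so spark-submit/Glue can supply it):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-Dev")
  .enableHiveSupport()
  .getOrCreate()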

Spark Structured Streaming OutOfMemoryError caused by thousands of KafkaMbean instances

Spark Structured Streaming executor fails with OutOfMemoryError
Checking the heap allocation with VisualVM indicates that the JMX MBean server's memory usage grows linearly over time.
Further investigation shows that the MBean server fills up with thousands of KafkaMbean instances holding metrics for consumer-(\d+), where the consumer number runs into the thousands (equal to the number of tasks created on the executor).
Running the Kafka consumer with DEBUG logging on the executor shows that the executor registers thousands of metrics sensors and often removes only some of them, or none at all.
I am running HDP Spark 2.3.0.2.6.5.0-292 with HDP Kafka 1.0.0.2.6.5.0-292.
Here is how I initialise structured streaming:
sparkSession
  .readStream
  .format("kafka")
  .options(Map(
    "kafka.bootstrap.servers" -> KAFKA_BROKERS,
    "subscribePattern" -> INPUT_TOPIC,
    "startingOffsets" -> "earliest",
    "failOnDataLoss" -> "false"))
  .load()
  .mapPartitions(processData)
  .writeStream
  .format("kafka")
  .options(Map(
    "kafka.bootstrap.servers" -> KAFKA_BROKERS,
    "checkpointLocation" -> CHECKPOINT_LOCATION))
  .queryName("Process Data")
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(1000))
  .start()
  .awaitTermination()
I was expecting Spark/Kafka to properly clean the MBeans on task completion, but that seems not to be the case.
Your HDP version is probably running Spark 2.3.1, which has a known bug when reading data from Kafka (the issue appears when you read from a topic that does not receive new data in every micro-batch):
https://issues.apache.org/jira/browse/SPARK-24987
https://issues.apache.org/jira/browse/SPARK-25106
The bug was introduced by a change made in version 2.3.1 (it does not exist in version 2.3.0), so you can either upgrade your Spark version or get a patch for your HDP version.

Spark 2 application failed with Couldn't find leader offsets for Error

I have a Spark application that reads data from Kafka and ingests it into Kudu. It ran successfully for almost 25 hours and ingested data into Kudu. After that, the Kafka logs show that a new leader was elected for the Kafka partitions, and my application went into the FINISHED state with the following error:
org.apache.spark.SparkException: ArrayBuffer(kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([test,0]))
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:133)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:158)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Does this mean that whenever a new leader is elected, the Spark application will fail?
I have seen a lot of posts on Stack Overflow where people said they were not able to launch the application at all with this error. In my case, however, it ran for 25 hours and then FINISHED.
Any thoughts on what might have gone wrong? I searched Kafka issues but found nothing related to this.
The exception is thrown when the set of topic partitions expected by Spark Streaming doesn't match the topic partitions reported by the simple Kafka client (which Spark uses to fetch the leader offsets for each topic partition).
So by the time Spark Streaming requested the offsets, the leader for topic partition [test, 0] was not available, and Spark threw this exception. Which versions of Spark and Kafka are you using?
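If the goal is to ride out a brief leader election rather than fail, one knob that may help (an assumption on my part, based on the 0.8 direct stream DirectKafkaInputDStream shown in your stack trace) is spark.streaming.kafka.maxRetries, which controls how many times the leader-offset lookup is retried:
import org.apache.spark.SparkConf

// Sketch: allow a few retries of the leader-offset lookup instead of the
// default single retry, so a short leader election has time to complete.
val conf = new SparkConf()
  .setAppName("kafka-to-kudu")   // hypothetical app name
  .set("spark.streaming.kafka.maxRetries", "5")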

java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager

I am getting the following error message while connecting to a Kinesis stream.
java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager.get(Lorg/apache/spark/storage/BlockId;)Lscala/Option;
at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD.getBlockFromBlockManager$1(KinesisBackedBlockRDD.scala:104)
My spark streaming code is,
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

def classify(ele):
    if ele != "":
        print(ele)

def stream_rdd(rdd):
    if not rdd.isEmpty():
        return rdd.foreach(classify)

sc = SparkContext(appName="PythonStreamingTest")
ssc = StreamingContext(sc, 10)
dstream = KinesisUtils.createStream(
    ssc, "PythonStreamingTest", "questions", "https://kinesis.us-west-2.amazonaws.com",
    "us-west-2", InitialPositionInStream.TRIM_HORIZON, 1)
dstream.foreachRDD(stream_rdd)
Initially the stream comes up empty, since it takes a while to connect to the Kinesis stream, but then the code suddenly breaks with this error.
The rest of the trace is,
17/04/02 17:52:00 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager.get(Lorg/apache/spark/storage/BlockId;)Lscala/Option;
at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD.getBlockFromBlockManager$1(KinesisBackedBlockRDD.scala:104)
at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD.compute(KinesisBackedBlockRDD.scala:117)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
I submit my job using the following command,
spark-submit --jars spark-streaming-kinesis-asl-assembly_2.11-2.0.0.jar --driver-memory 5g Question_Type_Classification_testing_purpose/classifier_streaming.py
I am running the code on a local machine, so with 5g of driver memory the executor should have enough to work with.
The same code works with Spark 1.6. I recently moved to Spark 2.1 and can no longer run it. I updated my Kinesis jar and Py4j as well.
I tested the stream by writing a separate Kinesis consumer, and it receives the data perfectly fine.
Can anyone tell me what the possible issue might be? Is the empty stream causing it? If so, why am I getting an empty stream when using Spark Streaming? Any help is really appreciated.
spark-streaming-kinesis-asl is Spark's own internal library and uses Spark internal APIs (e.g., BlockManager.get). The method signature of BlockManager.get was changed in https://github.com/apache/spark/commit/29cfab3f1524c5690be675d24dda0a9a1806d6ff#diff-2b643ea78c1add0381754b1f47eec132L605, so you will see this NoSuchMethodError if the Spark version is >= 2.0.1 but the spark-streaming-kinesis-asl version is < 2.0.1.
Generally, because Spark doesn't promise not to break internal APIs between releases, you must use spark-streaming-kinesis-asl built for the same version as your Spark installation.
For the latest Spark releases, the Kinesis ASL assembly jar is no longer published because of a potential license issue [1], so you may not be able to find the assembly jar. However, you can use --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0 to add spark-streaming-kinesis-asl and its dependencies to the classpath automatically, rather than building the assembly jar yourself.
[1] https://issues.apache.org/jira/browse/SPARK-17418
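To illustrate the version-matching rule (a sketch in sbt; it does not apply directly to this PySpark job, but it shows the idea of pinning the connector to the Spark version in one place):
val sparkVersion = "2.1.0"  // must match the Spark installation you run against

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % sparkVersion
)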

Spark Listener EventLoggingListener threw an exception / ConcurrentModificationException

In our application (Spark 2.0.1) we have this exception popping up frequently.
I can't find anything about this.
What could be the cause?
16/10/27 11:18:24 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at scala.Option.map(Option.scala:146)
at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1249)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
EDIT: One more piece of information: our application is long-running, and to recover from a potentially failed Spark context we call SparkSession.builder.getOrCreate() between two "jobs". Could this mess with the listeners?
It's a known problem in Spark 2.0.1 (SPARK-17816) and will be fixed in Spark 2.0.2/2.1.0 (see the related pull request).
To get rid of the exception without waiting for Spark 2.0.2/2.1.0, clone the latest (still unstable) Spark branch and build Spark manually.
Update: They released Spark 2.0.2!
We also just upgraded to Spark 2.0.1 and started seeing the same exception. We narrowed the cause down to a section of Python code containing the following idiom:
a = spark_context.textFile('..')
a = a.map(stuff)
b = a.filter(stuff).map(stuff)
I've had issues in the past with variable self-assignment in Spark, but after upgrading to 2.0.1 the problem became really acute and we started seeing ConcurrentModificationExceptions.
The fix for us was simply changing the code to not do any self-assignments.
A similar issue surfaced in Spark 3.1.0, related to an EventLoggingListener race condition, and is described in the following bug reports:
https://issues.apache.org/jira/browse/SPARK-34731
https://issues.apache.org/jira/browse/SPARK-32027
The issue was fixed in Spark 3.1.2, so upgrading Spark from 3.1.0/3.1.1 to 3.1.2 would solve it. Alternatively, it is possible to avoid the error by disabling event logging altogether:
spark.eventLog.enabled=false
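For example, a minimal sketch of switching event logging off when building the session (equivalent to passing --conf spark.eventLog.enabled=false to spark-submit):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("long-running-app")                // hypothetical name
  .config("spark.eventLog.enabled", "false")  // EventLoggingListener is never registered
  .getOrCreate()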
