java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager

java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager - apache-spark

I am getting the following error message while I am connecting to a kinesis stream.
java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager.get(Lorg/apache/spark/storage/BlockId;)Lscala/Option;
at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD.getBlockFromBlockManager$1(KinesisBackedBlockRDD.scala:104)
My spark streaming code is,
sc = SparkContext(appName="PythonStreamingTest")
ssc = StreamingContext(sc, 10)
dstream = KinesisUtils.createStream(
ssc, "PythonStreamingTest", "questions", "https://kinesis.us-west-2.amazonaws.com", "us-west-2", InitialPositionInStream.TRIM_HORIZON, 1)
dstream.foreachRDD(stream_rdd)
def stream_rdd(rdd):
if not rdd.isEmpty():
return rdd.foreach(classify)
def classify(ele):
if ele!="":
print ele
Initially, the stream comes blank as it takes a while to connect to the Kinesis stream. But then all of a sudden, it breaks down the code.
The rest of the trace is,
17/04/02 17:52:00 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager.get(Lorg/apache/spark/storage/BlockId;)Lscala/Option;
at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD.getBlockFromBlockManager$1(KinesisBackedBlockRDD.scala:104)
at org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD.compute(KinesisBackedBlockRDD.scala:117)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
I submit my job using the following command,
spark-submit --jars spark-streaming-kinesis-asl-assembly_2.11-2.0.0.jar --driver-memory 5g Question_Type_Classification_testing_purpose/classifier_streaming.py
I am running the code on a local machine. So if I am giving 5g of memory, the executor should work fine.
The same code works for Spark 1.6. Recently I changed to Spark 2.1 and I am not able to run this code. I updated my kinesis jar and Py4j as well.
I tested my code by writing a Kinesis consumer, and it gets the stream perfectly fine.
Can anyone please let me know what can be the possible issue? Is the empty stream creating an issue? If yes, why am I getting an empty stream while using Spark streaming? Any help is really appreciated.

spark-streaming-kinesis-asl is Spark’s own internal library and is using Spark internal APIs (e.g., BlockManager.get). The method signature of BlockManager.get was changed in https://github.com/apache/spark/commit/29cfab3f1524c5690be675d24dda0a9a1806d6ff#diff-2b643ea78c1add0381754b1f47eec132L605 so you will see NoSuchMethodError if the Spark version is >= 2.0.1 but spark-streaming-kinesis-asl version is < 2.0.1.
Generally, because Spark doesn’t promise not breaking internal APIs between releases, you must use spark-streaming-kinesis-asl with the same version of Spark.
For latest Spark releases, the kinesis asl assembly jar was removed because of the potential license issue [1], hence you may not be able to find the assembly jar. However, you can use --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0 to add spark-streaming-kinesis-asl and its dependencies into the classpath automatically, rather than building the assembly jar by yourself.
[1] https://issues.apache.org/jira/browse/SPARK-17418

Related

Spark 2 application failed with Couldn't find leader offsets for Error

I have my spark applciation reading data from Kafka and ingesting into Kudu. It has run successfully for almost 25hrs and ingested data into Kudu. After that, I see new leader was elected for kafka partitions from kafka logs. My application went into FINISHED state with the following error,
org.apache.spark.SparkException: ArrayBuffer(kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([test,0]))
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:133)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:158)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Does it mean that, whenever a new leader is elected Spark application will fail?
I have seen a lot of posts on Stackoverflow where everyone said that they are not able to launch the application with this error. But, in my case, it ran for 25hrs and then FINISHED.
Any thoughts on what might have gone wrong? I searched on Kafka issues with no luck related to this.

The exception is thrown when the count of the topicPartition expected by the spark streaming doesn't match with the topic partition provided by simple Kafka client [Used by spark to get the topicPartition-Leader offsets].
So by the time Spark Streaming requested for the offsets, the leader for topicPartition [test, 0] was not available. Hence Spark threw the exception message. What version of Spark and Kafka are you using?

spark structured streaming unable to start from checkpoint location

I'm doing a simple Spark program using structured streaming feature and Kafka. As Kafka is source, there are 2 sinks:
Sink 1- Console sink -- works fine in all cases
Sink 2 & 3 -H2 and Ignite Foreach sink
For the first run code runs fine but when I kill and restart the program with checkpoint location I'm getting the below error
17/07/12 07:11:48 ERROR StreamExecution: Query h2Out [id = 22ce7168-6f12-4220-8f28-f9eaaaba9c6a, runId = 39ecb40a-5b54-4b36-a0da-6e3057d66b2e] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.KafkaSource$$anon$1.parseVersion(Ljava/lang/String;I)I
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:116)
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:99)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:129)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:97)
at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:222)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:452)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:448)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:447)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:239)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
I checked KafkaSource source code, the parseFunction method is available through org.apache.spark.sql.execution.streaming.HDFSMetadataLog I hope, for which the jar (spark-sql_2.11-2.1.1.jar) is available in classpath.
For info I'm using Kafka 0.10.2.1 maven dependencies.

This error means your Spark version is older than 2.1.1. HDFSMetadataLog.parseVersion adds in Spark 2.1.1, and spark-sql-kafka-0-10_2.11-2.1.1.jar calls it. If your Spark version is older than 2.1.1, you will see this NoSuchMethodError.
You can check your Spark version by calling SparkSession.version. (e.g., just type spark.version in Spark shell).

spark streaming assertion failed: Failed to get records for spark-executor-a-group a-topic 7 244723248 after polling for 4096

Spark Streaming Problem with Kafka DirectStream:
spark streaming assertion failed: Failed to get records for
spark-executor-a-group a-topic 7 244723248 after polling for 4096
Tried:
1) Adjust increasing spark.streaming.kafka.consumer.poll.ms
-- from 512 to 4096, less failed, but even 10s the failed still exists
2) Adjust executor memory from 1G to 2G
-- partly work, much less failed
3) https://issues.apache.org/jira/browse/SPARK-19275
-- still got failed when streaming durations all less than 8s ("session.timeout.ms" -> "30000")
4) Try Spark 2.1
-- problem still there
with Scala 2.11.8, Kafka version : 0.10.0.0, Spark version : 2.0.2
Spark configs
.config("spark.cores.max", "4")
.config("spark.default.parallelism", "2")
.config("spark.streaming.backpressure.enabled", "true")
.config("spark.streaming.receiver.maxRate", "1024")
.config("spark.streaming.kafka.maxRatePerPartition", "256")
.config("spark.streaming.kafka.consumer.poll.ms", "4096")
.config("spark.streaming.concurrentJobs", "2")
using spark-streaming-kafka-0-10-assembly_2.11-2.1.0.jar
Error stacks:
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:194)
...
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:108)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:142)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:108)
...
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Losing 1%+ blocks datum from Kafka with this failure :( pls help!

Current solution:
Increase num.network.threads in kafka/config/server.properties, default is 3
Increase spark.streaming.kafka.consumer.poll.ms value ~! a large one ...
without config spark.streaming.kafka.consumer.poll.ms, it's using the spark.network.timeout, which is 120s -- causing some problem
Optional step: Decrease the "max.poll.records", default is 500
Optional step: use Future{} to run time-cost-task in parallel

A solution that worked for me:
Decrease the Kafka Consumer max.partition.fetch.bytes.
Reasoning: When the consumer polls for records from Kafka, it prefetches data up to this amount. If the network is slow, it may not be able to prefetch the full amount and the poll in the CachedKafakConsumer will timeout.

Spark Listener EventLoggingListener threw an exception / ConcurrentModificationException

In our application (Spark 2.0.1) we have this exception popping up frequently.
I can't find anything about this.
What could be the cause ?
16/10/27 11:18:24 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at scala.Option.map(Option.scala:146)
at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1249)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
EDIT: One more information, our application is long-running, and to resume from potentially failed spark context, we use the SparkBuilder.getOrCreate() method between two "jobs". Could this mess-up with the listeners ?

It's a known problem in Spark 2.0.1 (SPARK-17816) and will be fixed with Spark 2.0.2/2.1.0 (related pull request).
To get rid of the exception without waiting for Spark 2.0.2/2.1.0, clone the latest, unstable spark version and build apache-spark manually.
Update: They released Spark 2.0.2!

We also just upgraded to Spark 2.0.1 and started seeing the same exception. We narrowed the cause down to a section of Python code containing the following idiom:
a = spark_context.textFile('..')
a = a.map(stuff)
b = a.filter(stuff).map(stuff)
I've had issues in the past with variable self-assignment in Spark, but after upgrading to 2.0.1 the problem got really acute and we started seeing ConcurrentModification exceptions.
The fix for us was simply changing the code to not do any self-assignments.

A similar issue has surfaced in Spark 3.1.0, related to EventLoggingListener race condition and is described in the following bug reports:
https://issues.apache.org/jira/browse/SPARK-34731
https://issues.apache.org/jira/browse/SPARK-32027
The issue was fixed in Spark 3.1.2, so upgrading Spark from 3.1.0/3.1.1 to 3.1.2 would solve it. Alternatively, it is possible to avoid the error by disabling event logging altogether:
spark.eventLog.enabled=false

Hive on Spark CDH5.7 Execution Error

I've updated my cluster to CDH 5.7 recently and I am trying to run a Hive query processing on Spark.
I have configured the Hive client to use the Spark execution engine and Hive Dependency on a Spark Service from Cloudera Manager.
Via HUE, i'm simply running a simple select query but seem to get this error at all times: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Following are the logs for the same:
ERROR operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:374)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:180)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any help to solve this would be great!

This problem is due to a open JIRA: https://issues.apache.org/jira/browse/HIVE-11519. You should use another serialization tool..

Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
is not the real error message, you'd better turn on the DEBUG info by using hive cli, like
bin/hive --hiveconf hive.root.logger=DEBUG,console
and you will get more detailed logs, such as, those are something i got before:
16/03/17 13:55:43 [fxxxxxxxxxxxxxxxx4 main]: INFO exec.SerializationUtilities: Serializing MapWork using kryo
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
this is caused by some dependency conflicts, see https://issues.apache.org/jira/browse/HIVE-13301 for detail.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

java.lang.NoSuchMethodError: org.apache.spark.storage.BlockManager - apache-spark

Related

Spark 2 application failed with Couldn't find leader offsets for Error

spark structured streaming unable to start from checkpoint location

spark streaming assertion failed: Failed to get records for spark-executor-a-group a-topic 7 244723248 after polling for 4096

Spark Listener EventLoggingListener threw an exception / ConcurrentModificationException

Hive on Spark CDH5.7 Execution Error

Categories

Resources