Hive-Spark error - java.lang.IllegalStateException: unread block data

I have been trying to run a Hive query from the Hive CLI after configuring Hive to work with Spark.
When spark.master is local it works fine, but when I set it to my Spark master, spark://spark-master:7077, I get the following error in the Spark logs:
15/11/03 16:37:10 INFO util.Utils: Copying /tmp/spark-5e39df85-d3d7-446f-86e9-d2699501f97e/executor-70d24a32-6913-479d-85b8-32e535dd3dbf/-11208827301446565026180_cache to /usr/local/spark/work/app-20151103163705-0000/0/./hive-exec-1.2.1.jar
15/11/03 16:37:11 INFO executor.Executor: Adding file:/usr/local/spark/work/app-20151103163705-0000/0/./hive-exec-1.2.1.jar to class loader
15/11/03 16:37:11 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2428)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I am using Spark 1.4.1 and Hive 1.2.1.

For others who might hit the same issue: I managed to fix this. I think the problem was with the HBase jars on the executors’ side (it occurred only when running queries that touched HBase through Hive, and only in Spark cluster mode).
My solution was to add one of the following to spark-env.sh:
export SPARK_CLASSPATH=$CLASSPATH
or
export SPARK_CLASSPATH=/usr/local/hbase-1.1.2/lib/hbase-protocol-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-common-1.1.2.jar:/usr/local/hbase-1.1.2/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hbase-1.1.2/lib/hbase-server-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-client-1.1.2.jar:/usr/local/hive-1.2.1/lib/hive-hbase-handler-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-common-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-exec-1.2.1.jar
Alternatively, add the following to hive-site.xml:
<property>
<name>spark.executor.extraClassPath</name>
<value>/usr/local/hbase-1.1.2/lib/hbase-protocol-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-common-1.1.2.jar:/usr/local/hbase-1.1.2/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hbase-1.1.2/lib/hbase-server-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-client-1.1.2.jar:/usr/local/hive-1.2.1/lib/hive-hbase-handler-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-common-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-exec-1.2.1.jar</value>
</property>
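Since the error appeared only in cluster mode, one thing worth checking is that every jar named in the classpath exists at the same path on each worker node. The following is a minimal sketch, not part of the original answer: the helper names build_classpath and missing_jars are hypothetical, and the jar list is just an abbreviated copy of the paths quoted above.

```python
import os

# Abbreviated copy of the jar paths from the answer above; extend as needed.
JARS = [
    "/usr/local/hbase-1.1.2/lib/hbase-protocol-1.1.2.jar",
    "/usr/local/hbase-1.1.2/lib/hbase-common-1.1.2.jar",
    "/usr/local/hive-1.2.1/lib/hive-hbase-handler-1.2.1.jar",
]


def build_classpath(jars):
    """Join jar paths with ':' to form a value for SPARK_CLASSPATH or
    spark.executor.extraClassPath (hypothetical helper)."""
    return ":".join(jars)


def missing_jars(jars):
    """Return the jar paths that do not exist on this machine (hypothetical helper)."""
    return [jar for jar in jars if not os.path.isfile(jar)]


if __name__ == "__main__":
    print(build_classpath(JARS))
    for jar in missing_jars(JARS):
        print("missing on this node:", jar)
```

Running the script on each worker would show whether any of the listed jars is absent there.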


Unable to save RDD to HDFS in Apache Spark

I am getting the following error while trying to save an RDD to HDFS:
17/09/13 17:06:42 WARN TaskSetManager: Lost task 7340.0 in stage 16.0 (TID 100118, XXXXXX.com, executor 2358): java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:865)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:401)
Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1218)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1359)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[CIRCULAR REFERENCE:java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.]
The final operation in the stage is .saveAsTextFile(). In the Spark UI I can see that the tasks before .saveAsTextFile() finish successfully. I am using Spark 2.0.0 in YARN mode.
EDIT:
I have already seen the answer to "Spark: Self-suppression not permitted when writing big file to HDFS" and made sure that the issues mentioned in that answer do not apply here.

Spark Structured Streaming unable to start from checkpoint location

I'm writing a simple Spark program using Structured Streaming and Kafka. Kafka is the source, and the sinks are:
Sink 1 - Console sink (works fine in all cases)
Sink 2 & 3 - H2 and Ignite Foreach sinks
The first run works fine, but when I kill the program and restart it with the checkpoint location, I get the error below:
17/07/12 07:11:48 ERROR StreamExecution: Query h2Out [id = 22ce7168-6f12-4220-8f28-f9eaaaba9c6a, runId = 39ecb40a-5b54-4b36-a0da-6e3057d66b2e] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.KafkaSource$$anon$1.parseVersion(Ljava/lang/String;I)I
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:116)
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:99)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:129)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:97)
at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:222)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:452)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:448)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:447)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:239)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
I checked the KafkaSource source code. As far as I can tell, the parseVersion method comes from org.apache.spark.sql.execution.streaming.HDFSMetadataLog, whose jar (spark-sql_2.11-2.1.1.jar) is on the classpath.
For reference, I'm using the Kafka 0.10.2.1 Maven dependencies.
This error means your Spark version is older than 2.1.1. HDFSMetadataLog.parseVersion was added in Spark 2.1.1, and spark-sql-kafka-0-10_2.11-2.1.1.jar calls it. If the Spark runtime is older than 2.1.1, you will see this NoSuchMethodError.
You can check your Spark version by calling SparkSession.version (for example, type spark.version in the Spark shell).
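The version check described above can be sketched as a small comparison, assuming the version string has the usual dotted form. The helper names parse_version and has_parse_version are hypothetical; in a PySpark shell the input would be the value of spark.version.

```python
def parse_version(version):
    """Turn a string such as '2.1.1' or '2.1.1-SNAPSHOT' into a tuple of ints.

    Parsing stops at the first component that does not start with a digit,
    so vendor suffixes are ignored (hypothetical helper).
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_parse_version(spark_version, required="2.1.1"):
    """Return True if spark_version is at least `required`, the release in
    which the answer says HDFSMetadataLog.parseVersion was added."""
    return parse_version(spark_version) >= parse_version(required)


if __name__ == "__main__":
    # In a PySpark shell, pass spark.version instead of a literal.
    for candidate in ("2.1.0", "2.1.1", "2.2.0"):
        print(candidate, has_parse_version(candidate))
```

If the runtime reports a version below 2.1.1 while the Kafka connector jar is 2.1.1, the two are mismatched in the way the answer describes.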

Hive on Spark CDH5.7 Execution Error

I recently updated my cluster to CDH 5.7 and I am trying to run a Hive query on Spark.
In Cloudera Manager, I have configured the Hive client to use the Spark execution engine and set the Hive dependency on the Spark service.
From HUE, I'm running a simple select query, but I get this error every time: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
The logs are as follows:
ERROR operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:374)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:180)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any help to solve this would be great!
This problem is due to an open JIRA: https://issues.apache.org/jira/browse/HIVE-11519. You should use another serialization tool.
"Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask" is not the real error message. Turn on DEBUG logging in the Hive CLI, for example:
bin/hive --hiveconf hive.root.logger=DEBUG,console
This gives more detailed logs. For example, this is what I got:
16/03/17 13:55:43 [fxxxxxxxxxxxxxxxx4 main]: INFO exec.SerializationUtilities: Serializing MapWork using kryo
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
This is caused by dependency conflicts; see https://issues.apache.org/jira/browse/HIVE-13301 for details.

Failed to get broadcast_4_piece0 of broadcast_4 in Spark Streaming

I am running a Spark Streaming application with Kafka as the input source. The Spark version is 1.4.0.
My application normally runs fine, but when I enable checkpointing, run the job, and then restart it to check that checkpointing works, the following floods the logs and the job halts.
Could you help me resolve this issue? Please let me know if any other information is needed. Basically, I want to add checkpointing to my Spark Streaming application.
15/10/30 13:23:00 INFO TorrentBroadcast: Started reading broadcast variable 4
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of broadcast_4
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1257)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at com.toi.columbia.aggregate.util.CalendarUtil.isRecordCassandraInsertableV1(CalendarUtil.java:103)
at com.toi.columbia.aggregate.stream.v1.AdvPublisherV1$3.call(AdvPublisherV1.java:124)
at com.toi.columbia.aggregate.stream.v1.AdvPublisherV1$3.call(AdvPublisherV1.java:110)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:10)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at com.datastax.spark.connector.writer.TableWriter.measureMaxInsertSize(TableWriter.scala:89)
at com.datastax.spark.connector.writer.TableWriter.com$datastax$spark$connector$writer$TableWriter$$optimumBatchSize(TableWriter.scala:107)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:133)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:127)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:98)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:97)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:149)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:97)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:127)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:26)
at com.datastax.spark.connector.streaming.DStreamFunctions$$anonfun$saveToCassandra$1$$anonfun$apply$1.apply(DStreamFunctions.scala:26)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of broadcast_4
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.u
Maybe you forgot to increase spark.cleaner.ttl, so the task gets cleaned up.
See https://issues.apache.org/jira/browse/SPARK-5594.
I believe you are creating the broadcast variables inside
JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {}
Try creating the broadcast variables outside this overridden method.
As is clear from your exception, the broadcast variables are not being initialized when you restart your checkpointed application.
cheers!
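One way to keep broadcast creation out of the context factory is a lazily instantiated singleton, a pattern that later versions of the Spark Streaming programming guide describe for broadcast variables used with checkpointing. The sketch below swaps in that technique and is written in Python rather than the Java of the question. The names get_broadcast, _broadcast_cache and make_value are hypothetical, and whether this resolves the issue on Spark 1.4.0 is an assumption that has not been verified.

```python
# Driver-side cache of broadcast variables, keyed by a name chosen by the caller.
_broadcast_cache = {}


def get_broadcast(spark_context, name, make_value):
    """Return the broadcast variable registered under `name`.

    The broadcast is created on first use, so after a restart from a
    checkpoint it is rebuilt instead of being looked up from the old run.
    `make_value` is a zero-argument callable that produces the data to
    broadcast (hypothetical helper).
    """
    if name not in _broadcast_cache:
        _broadcast_cache[name] = spark_context.broadcast(make_value())
    return _broadcast_cache[name]


# Hypothetical usage inside a DStream operation, whose body runs on the driver:
#
#   def process(rdd):
#       lookup = get_broadcast(rdd.context, "lookup", load_lookup_table)
#       return rdd.filter(lambda rec: rec in lookup.value)
#
#   stream.transform(process)
```

The helper has to be called from code that runs on the driver for each batch (for example the function passed to transform or foreachRDD), not from inside the closures shipped to executors.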

Connection error while writing into Cassandra using pyspark

I am trying to write data into Cassandra from the PySpark shell with this command:
dataframe_name.write.format("org.apache.spark.sql.cassandra").options(table="table_name",keyspace="keyspace_name").save(mode="append")
but I am getting the following error:
15/09/15 06:37:18 ERROR DAGScheduler: Failed to update accumulators for ResultTask(2, 198)
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:208)
at org.apache.spark.api.python.PythonAccumulatorParam.openSocket(PythonRDD.scala:813)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:828)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:798)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:80)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:342)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:337)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:337)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:945)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1014)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I tried the same operation from a plain Python shell on the PySpark machine, and it works fine:
from cassandra.cluster import Cluster

# Connect to the Cassandra node and open a session on the target keyspace.
cluster = Cluster(['ip_of_cassandra_machine'])
session = cluster.connect('keyspace_name')
# Prepare and run a simple insert to confirm that the node is reachable.
insert = session.prepare("INSERT INTO table_name (t1, t2) VALUES ('value1', 'value2')")
session.execute(insert)
This looks like a networking issue within Spark. Without the exact versions of Spark and the Spark Cassandra Connector, it is hard to diagnose. My guess is that the driver is set up incorrectly for communication with the executors. Are you sure that your driver application is reachable from your executors and vice versa?
You can also try setting --master local to see whether the problem persists when networking is out of the picture.
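The reachability question above can be checked with a plain TCP connection attempt, run once from an executor host toward the driver and once in the other direction. This is a minimal sketch using only the standard library. The helper name can_connect is hypothetical, and the host and port in the usage lines are placeholders, not values from the question.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within
    `timeout` seconds, False otherwise (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host and port; substitute the driver or executor address to test.
    print(can_connect("driver-host.example", 7077))
```

A False result between the two machines would point to a firewall, binding or hostname problem rather than to the Cassandra connector itself.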
