Hive on Spark CDH5.7 Execution Error - apache-spark

I've updated my cluster to CDH 5.7 recently and I am trying to run a Hive query processing on Spark.
I have configured the Hive client to use the Spark execution engine and Hive Dependency on a Spark Service from Cloudera Manager.
Via HUE, i'm simply running a simple select query but seem to get this error at all times: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Following are the logs for the same:
ERROR operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:374)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:180)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any help to solve this would be great!

This problem is due to a open JIRA: https://issues.apache.org/jira/browse/HIVE-11519. You should use another serialization tool..

Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
is not the real error message, you'd better turn on the DEBUG info by using hive cli, like
bin/hive --hiveconf hive.root.logger=DEBUG,console
and you will get more detailed logs, such as, those are something i got before:
16/03/17 13:55:43 [fxxxxxxxxxxxxxxxx4 main]: INFO exec.SerializationUtilities: Serializing MapWork using kryo
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
this is caused by some dependency conflicts, see https://issues.apache.org/jira/browse/HIVE-13301 for detail.

Related

Access Spark Hive metastore within an UDF running in the workers (Databricks)

Context
I have an operation that should be performed on some tables using pyspark.
This operation includes accessing the Spark metastore (in Databricks) to get some metadata.
Since I have plenty of tables I'm parallelizing this operation among the cluster workers with an RDD, as you can see in the code below:
base_spark_context = SparkContext.getOrCreate()
rdd = base_spark_context.sc.parallelize(tables_list)
rdd.map(lambda table_name: sync_table(table_name)).collect()
The UDF sync_table() run queries on the metastore, similar to this code line:
spark_client.session.sql("select 1")
Problem
The problem is that this SQL execution not succeeds. Rather I get some metastore related error. Traceback:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
(suppressed lines)
Caused by: java.lang.reflect.InvocationTargetException
(suppressed lines)
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader#16c0663d, see the next exception for details.
(suppressed lines)
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /databricks/spark/work/app-20210413201900-0000/0/metastore_db.
Is there any limitation accessing the Databricks metastore within a worker, after parallelizing the operation in such a way? Or there is a possibility of performing such an operation?

Issue while loading an avro dataset into Teradata with spark-streaming

I am trying to load a dataset of avro files into a Teradata table through spark streaming (jdbc). The configuration is properly set and the load succeeds to certain extent (I can validate rows of data have been inserted into the table), but halfways through I start having exceptions and the load fails. The stacktrace is below. Any inkling as to what might causing this ?
18/02/08 17:27:42 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 0)
java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.02] [Error 1154] [SQLState HY000] A failure occurred while inserting the batch of rows destined for database table "database"."table". Details of the failure can be found in the exception chain that is accessible with getNextException.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:149)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:133)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2389)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:592)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.02] [Error 1147] [SQLState HY000] The next failure(s) in the exception chain occurred while beginning FastLoad of database table "database"."table"
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:95)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:70)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.beginFastLoad(FastLoadManagerPreparedStatement.java:966)
at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2210)
The issue stems from trying to load data in append mode to an existing table. FAST LOAD doesn't support that. The table should be empty (a.k.a truncated) each time the process runs. Which makes this useful for staging data prior to crunching it. But not for storing it.

Spark Listener EventLoggingListener threw an exception / ConcurrentModificationException

In our application (Spark 2.0.1) we have this exception popping up frequently.
I can't find anything about this.
What could be the cause ?
16/10/27 11:18:24 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
at scala.Option.map(Option.scala:146)
at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1249)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
EDIT: One more information, our application is long-running, and to resume from potentially failed spark context, we use the SparkBuilder.getOrCreate() method between two "jobs". Could this mess-up with the listeners ?
It's a known problem in Spark 2.0.1 (SPARK-17816) and will be fixed with Spark 2.0.2/2.1.0 (related pull request).
To get rid of the exception without waiting for Spark 2.0.2/2.1.0, clone the latest, unstable spark version and build apache-spark manually.
Update: They released Spark 2.0.2!
We also just upgraded to Spark 2.0.1 and started seeing the same exception. We narrowed the cause down to a section of Python code containing the following idiom:
a = spark_context.textFile('..')
a = a.map(stuff)
b = a.filter(stuff).map(stuff)
I've had issues in the past with variable self-assignment in Spark, but after upgrading to 2.0.1 the problem got really acute and we started seeing ConcurrentModification exceptions.
The fix for us was simply changing the code to not do any self-assignments.
A similar issue has surfaced in Spark 3.1.0, related to EventLoggingListener race condition and is described in the following bug reports:
https://issues.apache.org/jira/browse/SPARK-34731
https://issues.apache.org/jira/browse/SPARK-32027
The issue was fixed in Spark 3.1.2, so upgrading Spark from 3.1.0/3.1.1 to 3.1.2 would solve it. Alternatively, it is possible to avoid the error by disabling event logging altogether:
spark.eventLog.enabled=false

Hive-Spark error - java.lang.IllegalStateException: unread block data

I have been trying to run a hive query at the Hive CLI, after configuring Hive to work Spark.
When spark.master is local it works just fine, but when I set it to my spark master spark://spark-master:7077 I get the following error in the Spark logs:
15/11/03 16:37:10 INFO util.Utils: Copying /tmp/spark-5e39df85-d3d7-446f-86e9-d2699501f97e/executor-70d24a32-6913-479d-85b8-32e535dd3dbf/-11208827301446565026180_cache to /usr/local/spark/work/app-20151103163705-0000/0/./hive-exec-1.2.1.jar
15/11/03 16:37:11 INFO executor.Executor: Adding file:/usr/local/spark/work/app-20151103163705-0000/0/./hive-exec-1.2.1.jar to class loader
15/11/03 16:37:11 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2428)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I work with Spark 1.4.1 and Hive 1.2.1
Just for others that might be having the same issue, I managed to fix this and get past it, I think this was something with the HBase jars at the executors’ side (it was only occurring when running queries that were touching HBase through hive, and only in spark cluster mode).
My solution was to add to the spark-env.sh:
export SPARK_CLASSPATH=$CLASSPATH
or
export SPARK_CLASSPATH=/usr/local/hbase-1.1.2/lib/hbase-protocol-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-common-1.1.2.jar:/usr/local/hbase-1.1.2/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hbase-1.1.2/lib/hbase-server-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-client-1.1.2.jar:/usr/local/hive-1.2.1/lib/hive-hbase-handler-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-common-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-exec-1.2.1.jar
Alternatively, one can add to the hive-site.xml:
<property>
<name>spark.executor.extraClassPath</name>
<value>/usr/local/hbase-1.1.2/lib/hbase-protocol-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-common-1.1.2.jar:/usr/local/hbase-1.1.2/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hbase-1.1.2/lib/hbase-server-1.1.2.jar:/usr/local/hbase-1.1.2/lib/hbase-client-1.1.2.jar:/usr/local/hive-1.2.1/lib/hive-hbase-handler-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-common-1.2.1.jar:/usr/local/hive-1.2.1/lib/hive-exec-1.2.1.jar</value>
</property>

Connection error while writing into Cassandra using pyspark

I am trying to write data into Cassandra using Pyspark shell,command:
dataframe_name.write.format("org.apache.spark.sql.cassandra").options(table="table_name",keyspace="keyspace_name").save(mode="append")
but I am getting the following error:
15/09/15 06:37:18 ERROR DAGScheduler: Failed to update accumulators for ResultTask(2, 198)
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:208)
at org.apache.spark.api.python.PythonAccumulatorParam.openSocket(PythonRDD.scala:813)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:828)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:798)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:80)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:342)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:337)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:337)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:945)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1014)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I have tried executing the same operation using python shell on pyspark machine. It is working fine.
from cassandra.cluster import Cluster
cluster= Cluster(['ip_of_cassandra_machine'])
session = cluster.connect('keyspace_name');
a = session.prepare(""" insert into table_name(t1,t2) values('value1', 'value2') """)
session.execute(a)
This looks like a networking issue from within Spark. Without the exact versions of Spark and the Spark Cassandra Connector it would be hard to diagnose. My guess is that the driver is incorrectly setup for communication with the executors. Are you sure that your driver application is reachable by your executors and vice-versa?
You can always test setting --master local to see if the problem exists when networking is out of the picture.

Resources