Spark action stuck with EOFException - apache-spark

I'm trying to execute an action with Spark, which gets stuck. The corresponding executor throws the following exception:
2019-03-06 11:18:16 ERROR Inbox:91 - Ignoring error
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:131)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:130)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:130)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:96)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
My environment is a standalone Spark cluster on Docker, with Zeppelin as the Spark driver. The connection to the cluster is working fine.
My Spark action is a simple output of a database read, like:
spark.read.jdbc(jdbcString, "table", props).show()
I can print the schema of the table, so there shouldn't be a problem with the connection.

Please check your environment: the Java, Python, and PySpark versions (and paths) must be the same on the master and the workers.

Our driver machine had a different version of Java than the Spark standalone cluster. When we tried another machine with the same Java version, it worked.
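One way to compare versions from a running session is a quick spark-shell check like the sketch below. Note this assumes the executors can still run tasks at all; in the situation above they cannot even decode tasks, so in that case inspect the worker hosts or executor logs instead.
// Java version seen by the driver vs. the versions seen on the executors.
val driverJava = System.getProperty("java.version")
val executorJavas = spark.sparkContext
  .parallelize(1 to 1000, 10)
  .mapPartitions(_ => Iterator(System.getProperty("java.version")))
  .distinct()
  .collect()
println(s"driver: $driverJava, executors: ${executorJavas.mkString(", ")}")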

I had the same issue with one of the folders on S3. The data was stored as Parquet with Snappy compression. When I changed it to ORC with Snappy compression, it worked like a charm.
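For anyone wanting to try the same workaround, here is a minimal sketch of rewriting a Parquet dataset as ORC with Snappy; the S3 paths are placeholders and assume the S3A connector is configured.
// Re-encode an existing Parquet dataset as Snappy-compressed ORC (paths are made up).
spark.read.parquet("s3a://my-bucket/events-parquet/")
  .write
  .mode("overwrite")
  .option("compression", "snappy")
  .orc("s3a://my-bucket/events-orc/")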

Related

Spark 3.2 on Kubernetes keeps throwing okhttp3/okio EOFException

I'm using a Spark 3.2.1 image that was built from the official distribution via `docker-image-tool.sh`, on a Kubernetes 1.18 cluster. Everything works fine, except for this error message every 90 seconds:
WARN WatcherWebSocketListener: Exec Failure
java.io.EOFException
at okio.RealBufferedSource.require(RealBufferedSource.java:61)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This error message does not affect the application, but it's really annoying, especially for Jupyter users, and the lack of detail makes it very hard to debug.
It appears with any submit variation - spark-submit, pyspark, spark-shell - and regardless of whether dynamic allocation is enabled or disabled.
I've found traces of it on the internet, but all occurrences were from older versions of Spark and were resolved by using a "newer" version of fabric8 (4.x). Spark 3.2.1 already uses fabric8 5.4.1.
I wonder if anyone else still sees this error in Spark 3.x, and has a resolution.
Thanks.
Update:
This seems to be related to the Kubernetes cluster itself. After migrating to a new cluster this error was gone.
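If migrating the cluster is not an option, one possible stopgap (a sketch, not a fix for the underlying cause) is to raise the driver-side log level for the fabric8 Kubernetes client that emits the warning. This assumes Spark 3.2.x, which still uses log4j 1.x, and only hides the message.
import org.apache.log4j.{Level, Logger}
// Silence the recurring "Exec Failure" WARN from the fabric8 Kubernetes client.
Logger.getLogger("io.fabric8").setLevel(Level.ERROR)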

Error reading Hive table from Spark using JdbcStorageHandler

I've set up access to an external relational store (PostgreSQL) via my Spark/Hive deployment. I can read this table via Hive/Beeline, but reading it via SparkSQL from a pyspark3 Jupyter notebook fails because it's unable to find JdbcStorageHandler. I've tried to add the appropriate jars in a couple of ways but am hitting the same stack trace across the board - any advice on what jar and version I need, and where exactly I should put it, for this to work? Stack trace:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hive.storage.jdbc.JdbcStorageHandler
..
..
java.lang.ClassNotFoundException: org.apache.hive.storage.jdbc.JdbcStorageHandler
In terms of getting Hive/Beeline to work: I did as described in this JDBC Storage Handler document. I hit a few jar dependency problems while doing this, but resolved them by adding the hive-jdbc-2.0.0.jar and postgresql-42.2.12.jar jars after launching Beeline, and can now successfully read data directly from the relational store from Beeline.
Some things I've tried:
Add the jars listed above with spark.jars.packages in the notebook's sparkmagic conf. hive-jdbc 2.0.0 installs cleanly but yields the aforementioned error. I also tried hive-jdbc 3.1.0, but it errors out and does not install. I was a little confused as to how to assess compatibility here, so this might be a distraction.
Launch spark-sql on the cluster directly and add the hive-jdbc-2.0.0.jar (successfully). Same stack trace.
Add the Apache Hive libraries across the cluster during cluster creation (hive-jdbc and the Postgres driver).
Look around the rest of /usr/hdp for hive-jdbc, of which there are a variety of versions (beneath zeppelin, spark2, oozie, hive-hcatalog, hive, ranger-admin).
Environment details:
running on Azure HDInsight
Spark 2.4 (HDI 4.0)
Please copy hive-jdbc-handler.jar to the $SPARK_HOME/standalone-metastore directory on all nodes:
cp /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar /usr/hdp/current/spark2-client/standalone-metastore/
After that, launch the spark-shell and test the example:
sudo -u spark spark-shell --master yarn --jars /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar
scala> spark.sql("select * from table_name").show()
If you get the error below, there is an existing issue for it; you need to backport that fix from Spark 3.0.0 to your cluster's Spark version (for example, Spark 2.3.2):
Caused by: java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
https://issues.apache.org/jira/browse/SPARK-27259

NoSuchMethodError when trying to query data on HBase from Spark

I'm trying to query data loaded into an HBase table using SparkSQL/DataFrames. My cluster is based on Cloudera CDH 6.2.0 (Spark version 2.4.0 and HBase version 2.1.0).
Following this guide, I selected my HBase service in the HBase Service property of my Spark service. This operation added the following jars to my Spark classpath:
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/audience-annotations-0.5.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.1.0-cdh6.2.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/hbase-spark.jar
I then started the spark-shell. Following this example, which uses this Spark-HBase Connector, I managed to load and retrieve data from HBase and put it into a DataFrame. When I try to query this DataFrame, using SparkSQL or the DataFrame API, I get the following exception:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.util.ByteStringer.wrap([B)Lcom/google/protobuf/ByteString;
at org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter.toByteArray(SparkSQLPushDownFilter.java:256)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
at scala.Option.map(Option.scala:146)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.toSerializedTypedFilter(HBaseTableScanRDD.scala:267)
at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:88)
at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:80)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
...
I tried to start the spark-shell 'as is', without passing the above connector, and the result is the same. I read that this issue can be caused by different versions of protocol buffers, but I don't know how to resolve it.
We had the same issue with CDH 6.3.3 and ended up compiling the Hortonworks shc-core connector from source; so far it works with CDH 6.3.3 without any issues.
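For reference, here is a minimal sketch of reading from HBase through shc-core once the compiled jar is on the classpath. The catalog JSON (namespace, table, and column family names) is made up for illustration.
import org.apache.spark.sql.SparkSession

// Hypothetical catalog describing an HBase table "my_table" with one column family "cf1".
val catalog =
  """{
    |  "table":{"namespace":"default", "name":"my_table"},
    |  "rowkey":"key",
    |  "columns":{
    |    "col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |    "col1":{"cf":"cf1", "col":"value", "type":"string"}
    |  }
    |}""".stripMargin

val df = spark.read
  .option("catalog", catalog)                                  // shc-core expects the catalog JSON here
  .format("org.apache.spark.sql.execution.datasources.hbase")  // shc-core data source
  .load()

df.createOrReplaceTempView("hbase_table")
spark.sql("select col0, col1 from hbase_table where col1 = 'x'").show()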

Hive on Spark ERROR java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS

Running Hive on Spark with a simple select * from table query runs smoothly, but on joins and sums, the ApplicationMaster returns this stack trace for the associated spark container:
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
)
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:486)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:724)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Deleting staging directory hdfs://LOSLDAP01:9000/user/hdfs/.sparkStaging/application_1553880018684_0001
2019-03-29 17:23:43 INFO ShutdownHookManager:54 - Shutdown hook called
I have already tried to increase the YARN container memory allocation (and decrease the Spark memory) with no success.
Using:
Hadoop 2.9.2
Spark 2.3.0
Hive 2.3.4
Thank you for your help.
This was asked 6 months ago. Hope this helps others.
The reason for this error is that SPARK_RPC_SERVER_ADDRESS was added in Hive 2.x, while Spark by default supports Hive 1.2.1.
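To see which Hive metastore version your Spark build is defaulting to, you can check the property from a spark-shell; it falls back to the built-in default when it has not been set explicitly.
// Should print e.g. "1.2.1" on a stock Spark 2.3.0 build.
println(spark.conf.get("spark.sql.hive.metastore.version"))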
I was able to enable Hive-on-Spark using this manual on an EMR 5.25 cluster (Hadoop 2.8.5, Hive 2.3.5, Spark 2.4.3) running on YARN. However, the manual needs to be updated; it is missing some key items.
To run in YARN mode (either yarn-client or yarn-cluster), link the following jars into HIVE_HOME/lib. The manual doesn't mention linking the last library, spark-unsafe.jar:
ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar /usr/lib/hive/lib/scala-library.jar
ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.3.jar /usr/lib/hive/lib/spark-core.jar
ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.3.jar /usr/lib/hive/lib/spark-network-common.jar
ln -s /usr/lib/spark/jars/spark-unsafe_2.11-2.4.3.jar /usr/lib/hive/lib/spark-unsafe.jar
Allow YARN to cache the necessary Spark dependency jars on the nodes so that they do not need to be distributed each time an application runs.
For Hive 2.2.0 and later, upload all jars in $SPARK_HOME/jars to an HDFS folder and add the following to hive-site.xml:
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://xxxx:8020/spark-jars/*</value>
</property>
The manual is also missing the key point that you need to exclude the default Hive 1.2.1 jars. This is what I did:
hadoop fs -mkdir /spark-jars
hadoop fs -put /usr/lib/spark/jars/*.jar /spark-jars/
hadoop fs -rm /spark-jars/*hive*1.2.1*
Also, you need to add the following to the spark-defaults.conf file:
spark.sql.hive.metastore.version 2.3.0
spark.sql.hive.metastore.jars /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
For more information on interacting with different versions of hive metastore please check this link.
It turned out that Hive-on-Spark has a lot of implementation problems and essentially does not work at all unless you write your own custom Hive connector. In a nutshell, the Spark devs are struggling to keep up with Hive releases, and they have not yet decided how to handle backward compatibility for loading Hive versions ~< 2 while focusing on the newest branch.
Solutions
1) Go back to Hive 1.x
Not ideal, especially if you want more modern integration with file formats such as ORC.
2) Use Hive-on-Tez
This is the one we decided to adopt. *This solution does not break the open source stack* and works perfectly alongside Spark-on-Yarn. Third-party Hadoop ecosystems, like those for Azure, AWS and Hortonworks, all add proprietary code just for running Hive-on-Spark because of the mess it became.
By installing Tez, your Hadoop queries will work like this:
A direct Hive query (e.g. a JDBC connection from DBeaver) will run in a Tez container on the cluster
A Spark job will be able to access the Hive metastore as normal, and will use a Spark container on the cluster when you create the session via SparkSession.builder.enableHiveSupport().getOrCreate() (this is PySpark code; see the Scala sketch below)
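For completeness, a roughly equivalent Scala sketch; the app name is arbitrary and the table is the same example used later in this answer.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-from-spark")  // placeholder app name
  .enableHiveSupport()                   // picks up hive-site.xml from $SPARK_HOME/conf
  .getOrCreate()

spark.sql("select count(*) from myDb.myTable").show()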
Installing Hive-on-Tez with Spark-on-Yarn
Note: I'll keep it short since I do not see much interest on these boards. Ask for details and I'll be happy to help and expand.
Version matrix
Hadoop 2.9.2
Tez 0.9.2
Hive 2.3.4
Spark 2.4.2
Hadoop is installed in cluster mode.
This is what worked for us. I would not expect it to work seamlessly when switching to Hadoop 3.x, which we will do at some point in the future, but it should work fine as long as you do not change the major release version of each component.
Basic guide
Compile Tez from source as described in the official install guide, with Mode A for sharing Hadoop jars. Do not use any pre-compiled Tez distro. Test it from the hive shell with a query that is not just a simple data access (i.e. not just a select). For example, use: select count(*) from myDb.myTable. You should see the Tez bars in the hive console.
Compile Spark from source. To do so, follow the official guide (important: download the archive labeled without-hadoop!), but before compiling, edit the source code at ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala and comment out the following line: ConfVars.HIVE_STATS_JDBC_TIMEOUT -> TimeUnit.SECONDS,
Share $HIVE_HOME/conf/hive-site.xml into your $SPARK_HOME/conf/ dir. You must make a hard copy of this config file, not a symlink. The reason is that you must remove all Tez-related Hive config values from it to guarantee that Spark co-exists independently with Tez, as explained above. This includes the hive.execution.engine=tez property, which must not be set: just remove it completely from Spark's hive-site.xml, while leaving it in Hive's hive-site.xml.
In $HADOOP_HOME/etc/hadoop/mapred-site.xml, set the property mapreduce.framework.name=yarn. This will be picked up correctly by both environments even though it is not set to yarn-tez. It just means that raw MapReduce jobs will not run on Tez, while Hive jobs will. This is a problem only for legacy jobs, since raw mapred is obsolete.
Good luck!

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I am trying to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
this error is thrown:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only
start-dfs.sh
in Hadoop, and have not really configured anything in Spark. Do I need to run Spark using the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can access HDFS files?
I have tried to configure yarn-site.xml in Hadoop following the tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 built against Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still exists.
The question here suggests this is a Java syntax issue rather than a Spark/HDFS issue. I will try this approach and see if it solves the error here.
Inspired by the post here, I solved the problem myself.
This map-reduce job depends on a Serializable class, so when running in Spark local mode this class can be found on the classpath and the map-reduce job executes correctly.
When running in Spark standalone cluster mode, the best approach is to submit the application through spark-submit rather than running it in an IDE. I packaged everything into a jar and submitted it with spark-submit - it works like a charm!
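If you do want to keep launching from an IDE against the standalone master, one thing that may help (a sketch, not part of the original answer) is to point spark.jars at the jar containing your application classes so the executors can deserialize your lambdas. The master URL and jar path below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://spark-master-host:7077")            // placeholder master URL
  .config("spark.jars", "target/my-app-assembly.jar")  // jar containing your application classes
  .getOrCreate()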
