Spark streaming and Kafka intergration - apache-spark

I'm new to Apache Spark and I've been doing a project related to sentiment analysis on twitter data which involves spark streaming and kafka integration. I have been following the github code (link provided below)
https://github.com/sridharswamy/Twitter-Sentiment-Analysis-Using-Spark-Streaming-And-Kafka
However, in the last stage, that is during the integration of Kafka with Apache Spark, the following errors were obtained
py4j.protocol.Py4JError: An error occurred while calling o24.createDirectStreamWithoutMessageHandler. Trace:
py4j.Py4JException: Method createDirectStreamWithoutMessageHandler([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Command used: bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.1 twitterStream.py
Apache Spark version: spark-2.1.0-bin-hadoop2.4
Kafka version: kafka_2.11-0.10.1.1
I haven't been able to debug this and any help would be much appreciated.

The example you are trying to run is desinged for running in spark 1.5. You should either download spark 1.5 or run the spark-submit from spark 2.1.0 but with kafka package related to 2.1.0, for example:
./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0.

Related

Cannot connect to Cassandra with Spark connector 3.1.0 and Spark 3.1.2

I'm trying to connect cassandra, using spark-cassandra-connector, but the following message appears:
spark.version: 3.1.2
cassandra.connector.version: 3.1.0
Caused by: java.io.IOException: Failed to open native connection to Cassandra at {10.99.249.84:9042} :: org/apache/tinkerpop/gremlin/structure/io/BufferFactory
at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:173)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$sessionCache$1(CassandraConnector.scala:161)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
at com.datastax.spark.connector.datasource.CassandraCatalog$.com$datastax$spark$connector$datasource$CassandraCatalog$$getMetadata(CassandraCatalog.scala:455)
at com.datastax.spark.connector.datasource.CassandraCatalog$.getTableMetaData(CassandraCatalog.scala:421)
at org.apache.spark.sql.cassandra.DefaultSource.getTable(DefaultSource.scala:68
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:334)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:317)
... 1 more
Caused by: java.lang.NoClassDefFoundError: org/apache/tinkerpop/gremlin/structure/io/BufferFactory
at com.datastax.dse.driver.internal.core.graph.GraphRequestAsyncProcessor.<init>(GraphRequestAsyncProcessor.java:48)
The stack trace indicates to me that you're trying to run a Graph-Analytics query which suggests you're connecting to a DataStax Enterprise (DSE) cluster.
The first issue is that the latest version of DSE (which is 6.8) ships with Apache Spark 2.4. You cannot run Spark 3.1 jobs against DSE 6.8 since it only supports Spark 2.4.
The second issue is that the open-source Spark connector cannot access DSE-only features including Graph. DSE features are only accessible using the DSE version of the Spark connector which is included in DSE. You will not be able to access DSE features using the OSS version of the Spark connector. Cheers!

Error reading Hive table from Spark using JdbcStorageHandler

I've set up access to an external relational store (PostgreSQL) via my Spark/Hive deployment. I can read this table via Hive/Beeline, but it fails when I try to read via SparkSQL/pyspark3 jupyter notebook, because it's unable to find JdbcStorageHandler. I've tried to add the appropriate jars in a couple of ways but am hitting the same stack trace across the board - any advice on what jar and version I need, and where exactly I should put it, for this to work? Stack trace:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hive.storage.jdbc.JdbcStorageHandler
..
..
java.lang.ClassNotFoundException: org.apache.hive.storage.jdbc.JdbcStorageHandler
In terms of getting Hive/Beeline to work: I did as described in this JDBC Storage Handler document. I hit a few jar dependency problems while doing this, but resolved it by adding the hive-jdbc-2.0.0.jar, postgresql-42.2.12.jar jars after launching Beeline, and can now successfully read data directly from the relational store from Beeline.
Some things I've tried:
Add the jars listed above with spark.jars.packages in the notebook sparkmagic conf. hive-jdbc 2.0.0 installs cleanly but yields aforementioned error. I tried hive-jdbc 3.1.0 also, but it errors out and does not install. I was a little confused as to how to assess compatibility here, might be a distraction.
Launch spark-sql on the cluster directly, add hive-jdbc-2.0.0.jar jar (successfully). Same stack trace.
Add Apache Hive libraries across the cluster during cluster creation (the hive-jdbc, and postgres driver)
Look around the rest of /usr/hdp for hive-jdbc, of which there are a variety of versions (beneath zeppelin, spark2, oozie, hive-hcatalog, hive, ranger-admin).
Environment details:
running on Azure HDInsight
Spark 2.4 (HDI 4.0)
Please copy the hive-jdbc-handler.jar to $SPARK_HOME/standalone-metastore directory in all nodes.
cp /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar /usr/hdp/current/spark2-client/standalone-metastore/
After that launch the spark-shell and test the example
sudo -u spark spark-shell --master yarn --jars /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar
scala> spark.sql("select * from table_name").show()
If you get below error, then there is an issue created for this one, you need to back port that code Spark 3.0.0 to your cluster spark version for example Spark 2.3.2.
Caused by: java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
https://issues.apache.org/jira/browse/SPARK-27259

NoSuchMethodError when trying to query data on HBase from Spark

I'm trying to query data loaded into an HBase table using SparkSQL/DataFrames. My cluster is based on Cloudera CDH 6.2.0 (Spark version 2.4.0 and HBase version 2.1.0).
Following this guide I selected my HBase service in HBase Service property of my Spark Service. This operation added the following jars to my Spark classpath:
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/audience-annotations-0.5.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.1.0-cdh6.2.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/hbase-spark.jar
I started then the spark-shell. Following this example, which uses this Spark-HBase Connector, I managed to load and retrieve data from HBase and put them into a DataFrame. When I try to query this DataFrame, using SparkSQL or DataFrame API, I get the following exception:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.util.ByteStringer.wrap([B)Lcom/google/protobuf/ByteString;
at org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter.toByteArray(SparkSQLPushDownFilter.java:256)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
at scala.Option.map(Option.scala:146)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.toSerializedTypedFilter(HBaseTableScanRDD.scala:267)
at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:88)
at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:80)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
...
I tried to start the spark-shell 'as is' without passing the above connector and the result is the same. I read that this issues can be cause by different version of protocol buffer but I don't know how to resolve it.
We had the same issue with CDH 6.3.3 and ended up compiling Hortonworks shc-core from source and so far it seems to work with CDH 6.3.3 without any issues.

Hive on Spark ERROR java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS

Running Hive on Spark with a simple select * from table query runs smoothly, but on joins and sums, the ApplicationMaster returns this stack trace for the associated spark container:
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
)
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:486)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:724)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Deleting staging directory hdfs://LOSLDAP01:9000/user/hdfs/.sparkStaging/application_1553880018684_0001
2019-03-29 17:23:43 INFO ShutdownHookManager:54 - Shutdown hook called
I have already tried to increase yarn container memory allocation (and decrease spark memory) with no success.
Using:
Hadoop 2.9.2
Spark 2.3.0
Hive 2.3.4
Thank you for your help.
This was asked 6 months ago. Hope this helps others.
The reason for this error is SPARK_RPC_SERVER_ADDRESS added in hive version 2.x and Spark by default supports hive 1.2.1.
I was able to enable hive-on-spark using this manual on EMR 5.25 cluster (Hadoop 2.8.5, hive 2.3.5, Spark 2.4.3) for running on YARN. However, manual needs to be updated, it is missing some key items.
To run with YARN mode (either yarn-client or yarn-cluster), link the following jars to HIVE_HOME/lib. Manual didn't mention linking the last library spark-unsafe.jar
ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar /usr/lib/hive/lib/scala-library.jar
ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.3.jar /usr/lib/hive/lib/spark-core.jar
ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.3.jar /usr/lib/hive/lib/spark-network-common.jar
ln -s /usr/lib/spark/jars/spark-unsafe_2.11-2.4.3.jar /usr/lib/hive/lib/spark-unsafe.jar
Allow Yarn to cache necessary spark dependency jars on nodes so that it does not need to be distributed each time when an application runs.
Hive 2.2.0, upload all jars in $SPARK_HOME/jars to hdfs folder and add following in hive-site.xml
<property>
<name>spark.yarn.jars</name>
<value>hdfs://xxxx:8020/spark-jars/*</value>
</property>
Manual is missing key information that you need to exclude default hive 1.2.1 jars. This is what I did:
hadoop fs -mkdir /spark-jars
hadoop fs -put /usr/lib/spark/jars/*.jar /spark-jars/
hadoop fs -rm /spark-jars/*hive*1.2.1*
Also, you need to add the following to spark-defaults.conf file:
spark.sql.hive.metastore.version 2.3.0;
spark.sql.hive.metastore.jars /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
For more information on interacting with different versions of hive metastore please check this link.
It turned out that Hive-on-Spark has a lot of implementation problems and essentially does not work at all unless you write your own custom Hive connector. In a nutshell, Spark devs are struggling to keep up with Hive releases and they did not yet decided how to deal with backward compatibility on how to load Hive versions ~< 2 while focusing on the newest branch.
Solutions
1) Go back to Hive 1.x
Not ideal. Especially if you want some more modern integration with file formats such as ORC.
2) Use Hive-on-Tez
This is the one we decided to adopt. *This solution does not break the open source stack* and works perfectly along with Spark-on-Yarn. 3rd party Hadoop ecosystems, like those for Azure, AWS and Hortonworks all add proprietary code just for running Hive-On-Spark because of the mess that it became.
By installing Tez, your Hadoop queries will work like this:
A direct Hive query (e.g. jdbc connection from DBeaver) will run a Tez container on the cluster
A Spark job will be able to access the Hive metastore as normal and will use a Spark container on the cluster when creating the SparkSession.builder.enableHiveSupport().getOrCreate() (this is pyspark code)
Installing Hive-on-Tez with Spark-on-Yarn
Note: I'll keep it short since I do not see much interest on these boards. Ask for details and I'll be happy to help and expand.
Version matrix
Hadoop 2.9.2
Tez 0.9.2
Hive 2.3.4
Spark 2.4.2
Hadoop is installed in cluster mode.
This is what worked for us. I would not expect it to work seamlessly when switching to Hadoop 3.x, which we will be doing at some point in the future, but it should work fine if you do not change the main release version for each component.
Basic guide
Compile Tez from source as written in the official install guide, with Mode A for sharing hadoop jars. Do not use any pre-compiled Tez distro. Test it by the hive shell with a simple query which is not a simple data access (i.e. just a select). For example, use: select count(*) from myDb.myTable. You should see the Tez bars from the hive console.
Compile Spark from source. To do so, follow the official guide (Important: download the archive labeled without-hadoop !!), but before compiling it edit the source code at ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala and comment out the following line: ConfVars.HIVE_STATS_JDBC_TIMEOUT -> TimeUnit.SECONDS,
Share $HIVE_HOME/conf/hive-site.xml, in your $SPARK_HOME/conf/ dir. You must make a hard copy of this config file and not a symlink. The reason is that you must remove all Tez-related Hive config values from it to guarantee that Spark co-exist independently with Tez, as explained above. This does include the hive.execution.engine=tez property which must be left empty. Just remove it completely from the Spark's hive-site.xml, while leaving it in the Hive's hive-site.xml.
In $HADOOP_HOME/etc/hadoop/mapred-site.xml set property mapreduce.framework.name=yarn. This will be picked up correctly by both environments even if it is not set to yarn-tez. It just means that raw mapreduce jobs will not run on Tez, while Hive jobs will indeed use it. This is a problem only for legacy jobs, since raw mapred is obsolete.
Good luck!

Spark action stuck with EOFException

I'm trying to execute an action with Spark with gets stuck. The corresponding executor throws following exception:
2019-03-06 11:18:16 ERROR Inbox:91 - Ignoring error
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:131)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:130)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:130)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:96)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
My environment is a standalone Spark cluster on Docker with Zeppelin as Spark driver. The connection to the cluster is working fine.
My Spark action is a simple output of a database read like:
spark.read.jdbc(jdbcString, "table", props).show()
I can print the schema of the table, so there shouldn't be a problem with the connection.
Please check your environment JAVA, Python, Pysaprk must be same in MASTER and WORKER and path, version same too
Our driver machine had a different version of Java compared to spark standalone cluster. When we tried with another machine with the same java version, it worked.
I had the same issue in one of folder available on S3. Data was stored as Parquet with Snappy compression. When I changed it to ORC with Snappy compression, it worked like charm.

Resources