NoSuchMethodError when trying to query data on HBase from Spark

I'm trying to query data loaded into an HBase table using SparkSQL/DataFrames. My cluster is based on Cloudera CDH 6.2.0 (Spark version 2.4.0 and HBase version 2.1.0).
Following this guide, I selected my HBase service in the HBase Service property of my Spark service. This added the following jars to my Spark classpath:
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/audience-annotations-0.5.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.1.0-cdh6.2.0.jar
/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hbase/hbase-spark.jar
I then started the spark-shell. Following this example, which uses this Spark-HBase connector, I managed to load data from HBase and put it into a DataFrame. But when I try to query that DataFrame, through either SparkSQL or the DataFrame API, I get the following exception:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.util.ByteStringer.wrap([B)Lcom/google/protobuf/ByteString;
at org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter.toByteArray(SparkSQLPushDownFilter.java:256)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$$anonfun$toSerializedTypedFilter$1.apply(HBaseTableScanRDD.scala:267)
at scala.Option.map(Option.scala:146)
at org.apache.hadoop.hbase.spark.datasources.SerializedFilter$.toSerializedTypedFilter(HBaseTableScanRDD.scala:267)
at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:88)
at org.apache.hadoop.hbase.spark.datasources.HBaseTableScanRDD$$anonfun$1.apply(HBaseTableScanRDD.scala:80)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
...
I also tried starting the spark-shell 'as is', without passing the above connector, and the result is the same. I have read that this issue can be caused by mismatched protocol buffers versions, but I don't know how to resolve it.
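For reference, the load and the query that trigger the error look roughly like this (the table name and column mapping are placeholders, not my real schema):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// The connector picks up the HBase configuration through an HBaseContext
new HBaseContext(spark.sparkContext, HBaseConfiguration.create())

val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "id STRING :key, value STRING cf:value")
  .option("hbase.table", "my_table")
  .load()

// Loading and df.show() work fine; the NoSuchMethodError appears as soon as a filter is pushed down
df.createOrReplaceTempView("my_table")
spark.sql("SELECT * FROM my_table WHERE value = 'x'").show()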

We had the same issue with CDH 6.3.3 and ended up compiling Hortonworks shc-core from source; so far it works without any issues.
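For anyone taking the same route: once the rebuilt shc-core jar is on the classpath, a read looks roughly like this (the catalog below is illustrative, not tied to a real table):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import org.apache.spark.sql.functions.col

// JSON catalog mapping DataFrame columns to HBase column families/qualifiers (placeholder names)
val catalog =
  """{
    |  "table":{"namespace":"default", "name":"my_table"},
    |  "rowkey":"key",
    |  "columns":{
    |    "id":{"cf":"rowkey", "col":"key", "type":"string"},
    |    "value":{"cf":"cf1", "col":"value", "type":"string"}
    |  }
    |}""".stripMargin

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Filtered queries (including pushed-down filters) worked for us without the NoSuchMethodError
df.filter(col("value") === "x").show()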

Related

Can I use Spark 3.3.1 and Hive 3 together?

I'm new to Spark. I want to use Spark to read some data and write it to tables defined in Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2. Can I download Hive 3 and configure it to work with Spark 3? Some material I found on the internet says that Spark doesn't work with every version of Hive.
Thanks
According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If the Spark and Hive versions can be changed, I would suggest starting with that combination.
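If you stay on stock Spark 3, you can also point Spark's built-in Hive support at an existing Hive 3 metastore. A minimal sketch (the metastore version, jar path, and metastore URI below are assumptions for illustration; check them against your setup and the Spark documentation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark3-with-hive3-metastore")
  // Version of the Hive metastore Spark should talk to
  .config("spark.sql.hive.metastore.version", "3.1.2")
  // Load the Hive client jars from a local path instead of using the built-in client
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*")        // placeholder path
  .config("spark.hadoop.hive.metastore.uris", "thrift://metastore-host:9083")    // placeholder URI
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()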
I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, and Hive on Spark will work.
However, the Spark Thrift Server is incompatible with Hive 3. Apache Kyuubi is suggested as a replacement for the Spark Thrift Server and HiveServer2:
https://kyuubi.apache.org/
You can simply use the standard Hive 3.1.2 and Spark 3.2.1 packages with Kyuubi 1.6.0 to make them work together.
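Kyuubi speaks the same Thrift protocol as HiveServer2, so the standard Hive JDBC driver can talk to it. A minimal connection sketch (the host, credentials, and port 10009, Kyuubi's usual default, are assumptions):

import java.sql.DriverManager

// Standard Hive JDBC driver; the URL points at the Kyuubi frontend (placeholder host/port)
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://kyuubi-host:10009/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()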

Spark action stuck with EOFException

I'm trying to execute an action with Spark, which gets stuck. The corresponding executor throws the following exception:
2019-03-06 11:18:16 ERROR Inbox:91 - Ignoring error
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:131)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:130)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:130)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:96)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
My environment is a standalone Spark cluster on Docker, with Zeppelin as the Spark driver. The connection to the cluster works fine.
My Spark action is a simple output of a database read, like:
spark.read.jdbc(jdbcString, "table", props).show()
I can print the schema of the table, so there shouldn't be a problem with the connection.
Please check your environment: the Java, Python, and PySpark versions (and their paths) must be the same on the master and the workers.
Our driver machine had a different Java version than the Spark standalone cluster. When we tried another machine with the same Java version, it worked.
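A quick way to compare the JVM on the driver with the JVMs on the executors, straight from the Spark shell or a Zeppelin paragraph (just a diagnostic sketch, assuming spark is the active SparkSession):

// JVM version on the driver
println("driver:    " + System.getProperty("java.version"))

// Run a small job so every partition reports the JVM version of the executor it ran on
val executorVersions = spark.sparkContext
  .parallelize(1 to 100, 10)
  .mapPartitions(_ => Iterator(System.getProperty("java.version")))
  .distinct()
  .collect()
println("executors: " + executorVersions.mkString(", "))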
I had the same issue with one of the folders on S3. The data was stored as Parquet with Snappy compression. When I changed it to ORC with Snappy compression, it worked like a charm.

Hive on Tez doesn't work in Spark 2

When working with HDP 2.5 and Spark 1.6.2, we used Hive with Tez as its execution engine and it worked.
But when we moved to HDP 2.6 with Spark 2.1.0, Hive no longer worked with Tez as its execution engine, and the following exception was thrown when the DataFrame.saveAsTable API was called:
java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
After looking at the answer to this question, we switched the Hive execution engine to MR (MapReduce) instead of Tez and it worked.
However, we'd like to work with Hive on Tez. What is required to resolve the above exception so that Hive on Tez works?
I had the same issue when the Spark job was running in YARN cluster mode; it was resolved when the correct hive-site.xml was added to spark.yarn.dist.files (in the Spark default configuration).
Basically, there are two different hive-site.xml files:
one is the Hive configuration: /usr/hdp/current/hive-client/conf/hive-site.xml
the other is a lighter version for Spark (it contains only the details Spark needs to work with Hive): /etc/spark//0/hive-site.xml (please check the path for your setup)
We need to use the second file for spark.yarn.dist.files.

Pyspark and Cassandra Connection Error

I'm stuck on a problem. When I write sample Cassandra connection code, importing the Cassandra connector gives an error.
I am starting the script as below (both of them gave the error):
./spark-submit --jars spark-cassandra-connector_2.11-1.6.0-M1.jar /home/beyhan/sparkCassandra.py
./spark-submit --jars spark-cassandra-connector_2.10-1.6.0.jar /home/beyhan/sparkCassandra.py
But it gives the below error on:
import pyspark_cassandra
ImportError: No module named pyspark_cassandra
Which part did I do wrong?
Note: I have already installed the Cassandra database.
You are mixing up DataStax's Spark Cassandra Connector (the jar you add to spark-submit) and TargetHolding's PySpark Cassandra project (which has the pyspark_cassandra module). The latter is deprecated, so you should probably use the Spark Cassandra Connector. Documentation for this package can be found here.
To use it, you can add the following flags to spark-submit:
--conf spark.cassandra.connection.host=127.0.0.1 \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
Of course, use the IP address that Cassandra is listening on, and check which connector version you need: 2.0.0-M3 is the latest version and works with Spark 2.0 and most Cassandra versions. See the compatibility table if you are using a different version of Spark. The 2.10 or 2.11 is the Scala version your Spark build uses: Spark 2.x defaults to 2.11, while versions before 2.x used 2.10.
Then the nicest way to work with the connector is to use it to read DataFrames, which looks like this:
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
See the PySpark with DataFrames documentation for more details.

Spark driver not found

I am trying to write a DataFrame to SQL Server using Spark, via the DataFrameWriter's write method.
Using DriverManager.getConnection I am able to get a connection to SQL Server and write, but when I use the jdbc method and pass the URI, I get "No suitable driver found".
I have passed the jTDS jar via --jars in spark-shell.
Spark version : 1.4
The issue is that Spark is not finding the driver jar file. Download the jar, place it on all worker nodes of the Spark cluster at the same path, and add that path to SPARK_CLASSPATH in the spark-env.sh file, as follows:
SPARK_CLASSPATH=/home/mysql-connector-java-5.1.6.jar
Hope it will help.
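Another thing worth checking, beyond the classpath (a hedged sketch, not part of the original answer): naming the JDBC driver class explicitly in the connection properties, so Spark knows which driver to load when it opens connections on the executors. The URL, credentials, and table below are placeholders, df stands for the DataFrame being written, and the exact JDBC options are worth double-checking against your Spark version:

import java.util.Properties

val props = new Properties()
// Explicitly name the jTDS driver class so Spark can load it
props.setProperty("driver", "net.sourceforge.jtds.jdbc.Driver")
props.setProperty("user", "sa")            // placeholder credentials
props.setProperty("password", "secret")

// Placeholder host, database, and table
df.write.jdbc("jdbc:jtds:sqlserver://dbhost:1433/mydb", "dbo.target_table", props)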
