Error while writing PySpark dataframe into HBase - apache-spark

I am trying to write a PySpark dataframe into HBase and am hitting the error below.
The Spark and HBase versions on my cluster are:
Spark version: 2.4.0
HBase version: 1.4.8
Spark submit command:
spark-submit --jars /tmp/hbase-spark-1.0.0.jar --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml to_hbase.py
error:
Any help would be much appreciated!

This is a known problem when using the spark-hbase-connector (SHC) with Spark 2.4.
There is a fix that dhananjay_patka contributed.
Check: SHC with Spark 2.4 and his fix.
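For reference, SHC locates the HBase table through a JSON catalog passed as a write option. A minimal sketch of that catalog format, with table, column family, and column names that are purely illustrative assumptions:

```python
import json

# Sketch of an SHC catalog mapping DataFrame columns to HBase columns.
# The table name, namespace, and column names here are illustrative.
def build_catalog(table="mytable", namespace="default"):
    catalog = {
        "table": {"namespace": namespace, "name": table},
        "rowkey": "key",
        "columns": {
            # the row key column, plus one data column in family cf1
            "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
            "col1": {"cf": "cf1",    "col": "col1", "type": "string"},
        },
    }
    return json.dumps(catalog)

# Against a running cluster, the write would look roughly like:
# df.write \
#   .options(catalog=build_catalog(), newtable="5") \
#   .format("org.apache.spark.sql.execution.datasources.hbase") \
#   .save()
```

The catalog must list every DataFrame column being written, including the row key.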

Related

spark 3.x on HDP 3.1 in headless mode with hive - hive tables not found

How can I configure Spark 3.x on HDP 3.1 using the headless (https://spark.apache.org/docs/latest/hadoop-provided.html) version of Spark to interact with Hive?
First, I have downloaded and unzipped the headless spark 3.x:
cd ~/development/software/spark-3.0.0-bin-without-hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/hdp/current/spark2-client/conf classpath)
ls /usr/hdp # note the version and replace 3.1.x.x-xxx with it below
./bin/spark-shell --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
spark.sql("show databases").show
// only showing default namespace, existing hive tables are missing
+---------+
|namespace|
+---------+
| default|
+---------+
spark.conf.get("spark.sql.catalogImplementation")
res2: String = in-memory # I want to see hive here - how do I get the Hive jars onto the classpath?
NOTE
This is an updated version of How can I run spark in headless mode in my custom version on HDP? for Spark 3.x on HDP 3.1, and of custom spark does not find hive databases when running on yarn.
Furthermore: I am aware of the problems with ACID Hive tables in Spark. For now, I simply want to be able to see the existing databases.
Edit
We must get the Hive jars onto the classpath. Trying as follows:
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"
And now using spark-sql:
./bin/spark-sql --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
fails with:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
I.e. the line export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}" had no effect (the issue is the same when it is not set).
As noted above and in custom spark does not find hive databases when running on yarn, the Hive JARs are needed. They are not supplied in the headless version.
I was unable to retrofit these.
Solution: instead of worrying, simply use the Spark build with Hadoop 3.2 (on HDP 3.1).

List all additional jars loaded in pyspark

I want to see the jars my spark context is using.
I found the code in Scala:
$ spark-shell --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0
scala> spark.sparkContext.listJars.foreach(println)
spark://datasci:42661/jars/net.sf.saxon_Saxon-HE-9.6.0-7.jar
spark://datasci:42661/jars/elsevierlabs-os_spark-xml-utils-1.6.0.jar
spark://datasci:42661/jars/org.apache.commons_commons-lang3-3.4.jar
spark://datasci:42661/jars/commons-logging_commons-logging-1.2.jar
spark://datasci:42661/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar
spark://datasci:42661/jars/commons-io_commons-io-2.4.jar
Source: List All Additional Jars Loaded in Spark
But I could not find how to do it in PySpark.
Any suggestions?
Thanks
sparkContext._jsc.sc().listJars()
_jsc is the Java Spark context.
This indeed returned the extra jars for me:
print(spark.sparkContext._jsc.sc().listJars())
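The call above returns URLs of the form spark://host:port/jars/name.jar, as in the Scala output earlier. If only the jar file names are wanted, a small pure-Python helper can strip the prefix (the sample URLs below are taken from the answer above; against a live session, the Java object returned by listJars() may first need converting to a Python sequence):

```python
# Extract jar file names from listJars()-style URLs of the form
# spark://host:port/jars/name.jar
def jar_names(jar_urls):
    return [url.rsplit("/", 1)[-1] for url in jar_urls]

sample = [
    "spark://datasci:42661/jars/commons-io_commons-io-2.4.jar",
    "spark://datasci:42661/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar",
]
print(jar_names(sample))
```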

Adding dependency jars to the Spark classpath in standalone mode

I'm trying to execute my local Java program, which has dependencies, in Spark. I tried spark-submit as below:
spark-submit --class com.cerner.doc.DocumentExtractor /Users/sp054800/Downloads/Docs_lib_jar/Docs_RestAPI.jar
after setting the
spark.driver.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.driver.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
in spark-defaults.conf, but it still did not help. Could anyone help me figure out how to include the jars in Spark? I am using Spark 2.2.0.
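An alternative to the extraClassPath settings is to pass the jars explicitly: spark-submit's --jars flag (and the spark.jars property) expect a comma-separated list of concrete jar paths rather than a bare glob. A small sketch for assembling that list (the lib directory path is illustrative):

```python
import glob
import os

# Build a comma-separated --jars argument from all jars in a directory,
# since spark-submit does not accept a glob like lib/* for --jars.
def jars_arg(lib_dir):
    return ",".join(sorted(glob.glob(os.path.join(lib_dir, "*.jar"))))

# e.g. spark-submit --class com.cerner.doc.DocumentExtractor \
#        --jars "<output of jars_arg(...)>" Docs_RestAPI.jar
```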

Why does Structured Streaming fail with "java.lang.IncompatibleClassChangeError: Implementing class"?

I'd like to run a Spark application using Structured Streaming with PySpark.
I use Spark 2.2 with Kafka 0.10 version.
It fails with the following error:
java.lang.IncompatibleClassChangeError: Implementing class
The spark-submit command used:
/bin/spark-submit \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 \
--master local[*] \
/home/umar/structured_streaming.py localhost:2181 fortesting
structured_streaming.py code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming").config("spark.driver.memory", "2g").config("spark.executor.memory", "2g").getOrCreate()
raw_DF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:2181").option("subscribe", "fortesting").load()
values = raw_DF.selectExpr("CAST(value AS STRING)")
values.writeStream.trigger(processingTime="5 seconds").outputMode("append").format("console").start().awaitTermination()
You need spark-sql-kafka for structured streaming:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Also make sure that you use the same versions of Scala (2.11 above) and Spark (2.2.0) as you use on your cluster.
Please see this reference.
Also, you're using spark-streaming-kafka-0-10, which does not currently support Python.
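As the answer notes, the --packages coordinate encodes both the Scala binary version and the Spark version, and a mismatch on either is a common source of linkage errors like the IncompatibleClassChangeError above. A tiny helper to assemble the corrected coordinate (a sketch; the function name and defaults are my own):

```python
# Build the Maven coordinate for the structured streaming Kafka source,
# parameterized on the Scala binary version and Spark version in use.
def kafka_sql_package(scala_binary="2.11", spark_version="2.2.0"):
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_binary}:{spark_version}"

print(kafka_sql_package())
# -> org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
```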

Issue with Spark CSV reading on Spark 1.6.1

I am hitting the error below when trying to read a CSV file. I am using Spark 1.6.1; here is my code:
val reftable_df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/home/hadoop1/Reference_Currencyctoff.csv")
reftable_df.show()
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
at com.databricks.spark.csv.package$.<init>(package.scala:27)
at com.databricks.spark.csv.package$.<clinit>(package.scala)
at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:218)
at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:72)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:157)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:44)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at scb.HBaseBroadcast$.main(HBaseBroadcast.scala:138)
at scb.HBaseBroadcast.main(HBaseBroadcast.scala)
Note: I already tried the following CSV dependencies:
Spark Csv » 1.3.0
Spark Csv » 1.3.1
Spark Csv » 1.4.0
Spark Csv » 1.5.0
Thanks!
I faced same issue
--jars /path/to/spark-csv.jar,/path/to/commons-csv.jar
solved the issue.
commons-csv.jar has this class
you can see the class using jar -tvf commons-csv.jar | grep CSVFormat
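Since a jar is just a zip archive, the same check can be done from Python without the JDK tools (the jar path below is illustrative):

```python
import zipfile

# Check whether any entry in a jar's file listing contains the given
# fragment, mirroring: jar -tvf commons-csv.jar | grep CSVFormat
def jar_contains(jar_path, class_fragment):
    with zipfile.ZipFile(jar_path) as jar:
        return any(class_fragment in name for name in jar.namelist())

# e.g. jar_contains("commons-csv.jar", "CSVFormat")
```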
Try this when starting the Spark shell to include the package:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0