How do I add Hive support in Apache Spark? [duplicate] - apache-spark

This question already has answers here:
How to create SparkSession with Hive support (fails with "Hive classes are not found")?
(10 answers)
Closed 2 years ago.
I've got the following set-up:
- HDFS
- Hive
- Remote Hive Metastore (and a metastore db)
- Apache Spark (downloaded and installed from https://archive.apache.org/dist/spark/spark-2.4.3/)
I can use Hive as expected: create tables, read data from HDFS and all that. But I cannot get Spark to run with Hive support. Whenever I run val sparkSession = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
I get java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
Hive classes are in the path, and I have copied over hive-site.xml, core-site.xml and hdfs-site.xml
Do I need to build Spark with Hive support (as mentioned here: https://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support) to get Spark to work with Hive?
Is there a Spark with Hive support tar available which I can extract instead of building from source?
Thanks!

What environment are you running Spark in? The easy answer is to let whatever packaging tool is available do the heavy lifting. For example, if you're on macOS, use Homebrew to install everything. If you're in a Maven/sbt project, bring in the spark-hive package (see the sketch at the end of this answer), etc.
Do I need to build spark with hive support
If you're manually building Spark from source, yes you do. Here's an example command (but chances are you don't have to do this):
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support
If you're missing that class: Spark internally checks for the presence of "org.apache.hadoop.hive.conf.HiveConf", which is in hive-exec-1.2.1.spark.jar. Note this is a customized version of Hive designed to work nicely with Spark.
https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
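For the Maven/sbt route, here's a minimal sketch of what that looks like - the version numbers match the 2.4.3 download from the question and the coordinates are the standard org.apache.spark ones; adjust them for your setup:

// build.sbt - spark-sql plus the spark-hive module that provides the missing Hive classes
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.3",
  "org.apache.spark" %% "spark-hive" % "2.4.3"
)

// Application code - with spark-hive on the classpath, enableHiveSupport() no longer throws
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("MyApp")
  .enableHiveSupport() // requires org.apache.spark:spark-hive at runtime
  .getOrCreate()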

Related

Can I use spark3.3.1 and hive3 together?

I'm new to Spark. I want to use Spark to read some data and write it to tables defined in Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2; can I download Hive 3 and configure Spark 3 to work with it? Some materials I found on the internet told me Spark can't work with every version of Hive.
Thanks
According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If the versions of Spark and Hive can be changed, I would suggest you start with that combination.
I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, and Hive on Spark will work.
But the Spark Thrift Server is incompatible with Hive 3; Apache Kyuubi is suggested as a replacement for the Spark Thrift Server and HiveServer2.
https://kyuubi.apache.org/
You can just use the standard Hive 3.1.2 and Spark 3.2.1 packages together with Kyuubi 1.6.0 to make them work.
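For the common case described in the question (Spark 3 reading and writing tables defined in Hive, rather than running Hive on Spark), it's usually enough to put Hive's hive-site.xml on Spark's classpath (e.g. in $SPARK_HOME/conf) and enable Hive support. A minimal sketch, with the database and table names as placeholders:

import org.apache.spark.sql.SparkSession

// Assumes hive-site.xml (pointing at the Hive 3 metastore) is in $SPARK_HOME/conf
val spark = SparkSession.builder()
  .appName("SparkReadWriteHiveTables")
  .enableHiveSupport()
  .getOrCreate()

// Read a table that was defined in Hive
val src = spark.sql("SELECT * FROM mydb.source_table") // placeholder database/table

// Write the result back as another Hive-managed table
src.write.mode("overwrite").saveAsTable("mydb.target_table")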

Read data from Cassandra in spark-shell

I want to read data from a Cassandra node on my client node. This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar
val df = spark.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .option("spark.cassandra.connection.host", "Hostname of my Cassandra node")
  .option("spark.cassandra.connection.port", "9042")
  .option("spark.cassandra.auth.password", "mypassword")
  .option("spark.cassandra.auth.username", "myusername")
  .load()
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException.
Am I missing any properties? What is this error for? How would I resolve it?
Spark version: 2.3.2, DSE version: 6.7.8
The Spark Cassandra Connector itself depends on a number of other libraries that could be missing here - this happens because you're providing only one jar and not all of its required dependencies.
Basically, in your case you have the following choices:
- If you're running this on a DSE node and the cluster has Analytics enabled, you can use the built-in Spark - in this case all jars and properties are already provided, and you only need to supply the username and password when starting the Spark shell via dse -u user -p password spark
- If you're using external Spark, it's better to use the so-called BYOS (bring your own Spark) jar - a special build of the Spark Cassandra Connector with all dependencies bundled inside; you can download it from DataStax's Maven repo and use it with --jars
- You can still use the open source Spark Cassandra Connector, but in this case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so Spark will be able to fetch all dependencies automatically (see the sketch below)
P.S. For the open source Spark Cassandra Connector I would recommend version 2.5.1 or higher, although it requires Spark 2.4.x (2.3.x may work) - that version has improved support for DSE plus a lot of new functionality not available in earlier versions. There is also an assembly artifact of that version that bundles all required dependencies, which you can use with --jars if your machine doesn't have access to the internet.
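For the --packages route, a minimal sketch (the connector version matches Spark 2.3.2 / Scala 2.11 from the question; host, credentials, keyspace and table are placeholders):

// Launching with --packages lets Spark resolve the connector and all of its transitive
// dependencies, and connection/auth settings can be passed once at the session level:
//
//   spark-shell \
//     --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
//     --conf spark.cassandra.connection.host=cassandra-host \
//     --conf spark.cassandra.connection.port=9042 \
//     --conf spark.cassandra.auth.username=myusername \
//     --conf spark.cassandra.auth.password=mypassword

// Inside the shell, only the keyspace and table then need to be specified:
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .load()

df.show(10)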

How to configure Hive to use Spark execution engine on Google Dataproc?

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions here https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help; I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested / supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try to bring your own Spark in an initialization action compiled as described in the wiki.
If you just want to run Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.

Apache spark installation and db_metastore

I am beginner in Spark.
I installed Java and spark-1.6.1-bin-hadoop2.6.tgz (I have not installed Hadoop) and, without changing any configuration in the conf directory, ran spark-shell.
In the directory where Spark is installed, I see a metastore_db directory created, with a tmp folder inside it.
Why is this metastore_db created, and where is this configured?
Also, I see sqlContext being created after running spark-shell; what does this sqlContext represent?
When running spark-shell, a SparkContext and a SQLContext are created. SQLContext is an extension of SparkContext that enables support for Spark SQL. It has methods to execute SQL queries (the sql method) and to create DataFrames.
metastore_db is the Hive metastore path. Spark supports Apache Hive queries via HiveContext. If there is no hive-site.xml configured, Spark will use a local metastore_db path; see the documentation for details.
However, it would be good if you downloaded Spark 2.0. There you get a unified entry point to Spark, named SparkSession. This class allows you to read data from many sources, create Datasets, etc.
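A minimal sketch of both points, assuming a plain local install with no hive-site.xml on the classpath (app name and table name are only for illustration):

import org.apache.spark.sql.SparkSession

// In Spark 2.x+ SparkSession replaces SQLContext/HiveContext as the entry point.
// With no hive-site.xml configured, Hive support falls back to an embedded,
// Derby-backed metastore - that is what creates the metastore_db directory
// (and spark-warehouse) under the directory you launched Spark from.
// In the Spark 1.6 shell the pre-built sqlContext (a HiveContext when Spark is
// built with Hive) behaves the same way.
val spark = SparkSession.builder()
  .appName("LocalMetastoreDemo")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING)")
spark.sql("SHOW TABLES").show()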

Spark with custom hive bindings

How can I build spark with current (hive 2.1) bindings instead of 1.2?
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support
Does not mention how this works.
Does spark work well with hive 2.x?
I had the same question and this is what I've found so far. You can try to build Spark with a newer version of Hive:
mvn -Dhive.group=org.apache.hive -Dhive.version=2.1.0 clean package
This runs for a long time and fails in unit tests. If you skip the tests, you get a bit farther but then run into compilation errors. In summary, Spark does not work well with Hive 2.x!
I also searched through the ASF Jira for Spark and Hive and haven't found any mentions of upgrading. This is the closest ticket I was able to find: https://issues.apache.org/jira/browse/SPARK-15691
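If the goal is just to talk to an existing Hive 2.x metastore (rather than to change the Hive classes Spark itself bundles), there is an option that doesn't require rebuilding: the spark.sql.hive.metastore.version / spark.sql.hive.metastore.jars settings. A hedged sketch - the metastore URI and version values are examples, and you should check the supported-version list for your Spark release:

import org.apache.spark.sql.SparkSession

// Spark keeps its bundled Hive classes for execution, but uses a separate
// Hive 2.1 client (resolved from Maven here) to talk to the existing metastore.
val spark = SparkSession.builder()
  .appName("Hive21MetastoreClient")
  .config("spark.hadoop.hive.metastore.uris", "thrift://metastore-host:9083") // hypothetical host
  .config("spark.sql.hive.metastore.version", "2.1.0")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()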
