How can I build Spark with current (Hive 2.1) bindings instead of 1.2?
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support
It does not mention how to do this.
Does Spark work well with Hive 2.x?
I had the same question, and this is what I've found so far. You can try to build Spark with the newer version of Hive:
mvn -Dhive.group=org.apache.hive -Dhive.version=2.1.0 clean package
This runs for a long time and fails in the unit tests. If you skip the tests, you get a bit farther but then run into compilation errors. In summary, Spark does not work well with Hive 2.x!
I also searched through the ASF Jira for Spark and Hive and haven't found any mentions of upgrading. This is the closest ticket I was able to find: https://issues.apache.org/jira/browse/SPARK-15691
Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for Spark dependency overriding?
Suppose that the Spark app (CLUSTER MODE) already has a spark-hive (2.4.4) dependency (PROVIDED).
I compiled and assembled a "custom" spark-hive jar that I want to use in the Spark app.
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on some kind of server or container or pod (in k8s).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
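On the build side, nothing has to change if you keep compiling against the stock spark-hive artifact as provided. If you also want to compile against your custom jar, one option is a minimal build.sbt sketch like the one below (assumptions: Scala 2.11, Spark 2.4.4, and the custom spark-hive assembly dropped into the project's lib/ directory, which sbt treats as an unmanaged dependency):

// build.sbt - sketch only; versions and the lib/ layout are assumptions
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.4" % "provided"
  // no managed spark-hive here: the custom spark-hive assembly sits in lib/,
  // which sbt picks up automatically, and at runtime the cluster supplies the
  // replaced jar from $SPARK_HOME/jars
)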
Hope this helps!
I've got the following set-up:
- HDFS
- Hive
- Remote Hive Metastore (and a metastore db)
- Apache Spark (downloaded and installed from https://archive.apache.org/dist/spark/spark-2.4.3/)
I can use Hive as expected: create tables, read data from HDFS, and all that. But I cannot get Spark to run with Hive support. Whenever I run val sparkSession = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
I get java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
The Hive classes are on the classpath, and I have copied over hive-site.xml, core-site.xml, and hdfs-site.xml.
Do I need to build Spark with Hive support (as mentioned here: https://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support) to get Spark to work with Hive?
Is there a Spark with Hive support tar available which I can extract instead of building from source?
Thanks!
What environment are you running Spark in? The easy answer is to let whatever packaging tool is available do all the heavy lifting. For example, if you're on macOS, use Homebrew to install everything. If you're in a Maven/sbt project, bring in the spark-hive package, etc.
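For instance, in an sbt project the classes needed by enableHiveSupport() come from the spark-hive module; a one-line sketch (assuming Spark 2.4.3 to match the download mentioned in the question) would be:

// build.sbt - pulls in the Hive support classes for Spark SQL
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.3"

Mark it % "provided" instead if the cluster's Spark distribution already ships these jars.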
Do I need to build Spark with Hive support
If you're manually building Spark from source, yes, you do. Here's an example command (but chances are you don't have to do this):
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support
If you're missing the class: Spark internally checks for the presence of "org.apache.hadoop.hive.conf.HiveConf", which is in hive-exec-1.2.1.spark.jar. Note that this is a customized version of Hive designed to work nicely with Spark.
https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
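Roughly, the check that produces the "Hive classes are not found" error looks like this (a sketch of the idea in Scala, not Spark's exact code):

// sketch: if this class can't be loaded from the classpath,
// enableHiveSupport() fails with IllegalArgumentException
def hiveClassesArePresent: Boolean =
  try {
    Class.forName("org.apache.hadoop.hive.conf.HiveConf")
    true
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError => false
  }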
I'm trying to run a simple select count(*) from table query on Hive, but it fails with the following error:
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 5414a8a4-5252-4ccf-b63e-2ee563f7d772_0: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
This is happening since I've moved to CDH 6.2 and enabled Spark (version 2.4.0-cdh6.2.0) as the execution engine of Hive (version 2.1.1-cdh6.2.0).
My guess is that Hive is not correctly configured to launch Spark. I've tried setting the spark.home property of the hive-site.xml to /opt/cloudera/parcels/CDH/lib/spark/, and setting the SPARK_HOME environment variable to the same value, but it made no difference.
A similar issue was reported here, but the solution (i.e., putting the spark-assembly.jar file in Hive's lib directory) cannot be applied, as that file is no longer built in the latest Spark versions.
A previous question addressed a similar but different issue, related to memory limits on YARN.
Also, switching to MapReduce as the execution engine still fails, but with a different error:
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org/apache/hadoop/hdfs/protocol/SystemErasureCodingPolicies
Searching Google for the latter error shows no results at all.
UPDATE: I discovered that queries do work when connecting to Hive through other tools (e.g., Beeline, Hue, Spark) and independently of the underlying execution engine (i.e., MapReduce or Spark). Thus, the error may lie within the Hive CLI, which is currently deprecated.
UPDATE 2: The same problem actually happened on Beeline and Hue with a CREATE TABLE query; I was able to execute it only with the Hive interpreter in Zeppelin.
I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions here (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) doesn't really help; I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested / supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try to bring your own Spark in an initialization action compiled as described in the wiki.
If you just want to run Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.
I read the documentation of Spark and HBase:
http://hbase.apache.org/book.html#spark
I can see that the last stable version of HBase is 1.1.2, but I also see that the apidocs are for version 2.0.0-SNAPSHOT and that the Spark apidoc is empty.
I am confused: why don't the apidocs and the HBase version match?
My goal is to use Spark and HBase (bulkGet, bulkPut, etc.). How do I know in which HBase version those functions have been implemented?
If someone has complementary documentation on this, that would be awesome.
I am on hbase-0.98.13-hadoop1.
Below is the main JIRA ticket for Spark integration into HBase. The target version is 2.0.0, which is still under development, so you need to wait for the release or build a version from the source code on your own:
https://issues.apache.org/jira/browse/HBASE-13992
Within the ticket, there are several links for documentation.
If you just want to access HBase from a Spark RDD, you can treat it as a normal Hadoop data source, based on the HBase-specific TableInputFormat and TableOutputFormat.
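For example, reading an HBase table into an RDD through TableInputFormat looks roughly like this (a sketch: "my_table" is a placeholder table name, and the HBase client jars are assumed to be on the Spark classpath):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-rdd-sketch"))

    // point the input format at the table to scan ("my_table" is a placeholder)
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // each record comes back as a (row key, Result) pair
    val rdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"rows scanned: ${rdd.count()}")
    sc.stop()
  }
}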
As of now, Spark doesn't come with an HBase API the way it does for Hive; you have to manually put the HBase jars on Spark's classpath via the spark-defaults.conf file.
See the link below; it has complete information about how to connect to HBase:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html