Can I use spark3.3.1 and hive3 together? - apache-spark

I'm new to Spark. I want to use Spark to read some data and write it to tables defined in Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2. Can I download Hive 3 and configure it to work together with Spark 3? Some material I found on the internet says Spark can't work with all versions of Hive.
Thanks.
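What I'm trying to do looks roughly like this (a minimal sketch; the table and path names are made up):

import org.apache.spark.sql.SparkSession

// Hive support must be enabled for Spark to see Hive-defined tables
val spark = SparkSession.builder()
  .appName("write-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Read some data and save it as a Hive table (names are hypothetical)
val df = spark.read.parquet("/data/input")
df.write.mode("overwrite").saveAsTable("mydb.my_table")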

According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If the versions of Spark and Hive can be changed, I would suggest you start with that combination.
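If Spark needs to talk to an existing metastore of a specific version, Spark exposes configuration for that. A minimal sketch, assuming the Hive client jars live under /opt/hive/lib (both the version and the path are assumptions; adjust to your installation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Version of the Hive metastore to talk to (assumed value)
  .config("spark.sql.hive.metastore.version", "3.1.2")
  // Load the Hive client jars from a local path (path is hypothetical)
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")
  .enableHiveSupport()
  .getOrCreate()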

I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork patched for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, and Hive on Spark will work.
But the Spark Thrift Server is incompatible with Hive 3; Apache Kyuubi is suggested as a replacement for both the Spark Thrift Server and HiveServer2:
https://kyuubi.apache.org/
You can just use the standard Hive 3.1.2 and Spark 3.2.1 packages with Kyuubi 1.6.0 to make them work together.
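Since Kyuubi exposes a HiveServer2-compatible Thrift/JDBC endpoint, clients can connect with the ordinary Hive JDBC driver. A minimal sketch (host, user, and query are placeholders; 10009 is Kyuubi's default port):

import java.sql.DriverManager

// Kyuubi speaks the HiveServer2 protocol, so the hive2 JDBC URL works
val conn = DriverManager.getConnection(
  "jdbc:hive2://localhost:10009/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()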

Related

How to adopt Ranger policy in Spark SQL?

I am using Spark 3.0.1 on HDP 3.1.4. Everything runs well except that Spark SQL doesn't honor standard Ranger SQL policies.
Over the past days I have tried the solutions I found in the community: the Hive Warehouse Connector, spark-authorizer, and spark-llap.
Unfortunately I couldn't solve it. It seems the code is no longer maintained, and the latest released versions don't support Spark 3.0. I saw many people also struggling with this problem.
Is there any suggestion to make Spark SQL adopt Ranger column/row-level permission policies? Any ideas are appreciated. Thank you.
Hive Warehouse Connector: works on Spark 2.3.1, but not 3.0.
spark-authorizer and spark-llap: both fail with version-compatibility errors.
The versions are Spark 3.0.1, HDP 3.1.1, Hive 3.1.0, Ranger 1.2.0.

Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

We use Spark 2.4.0 to connect to a Hadoop 2.7 cluster and query a Hive Metastore (version 2.3). But the cluster managing team has decided to upgrade to Hadoop 3.x and Hive 3.x. We could not migrate to Spark 3 yet, which is compatible with Hadoop 3 and Hive 3, because we could not test whether anything breaks.
Is there any way to stick to the Spark 2.4.x line and still be able to use Hadoop 3 and Hive 3?
I got to know that backporting is one option; it would be great if you could point me in that direction.
You can compile Spark 2.4 with the Hadoop 3.1 profile instead of relying on the default version. You need to use the hadoop-3.1 profile as described in the documentation on building Spark, something like:
./build/mvn -Pyarn -Phadoop-3.1 -DskipTests clean package
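After the build, one quick way to confirm which Hadoop version the resulting distribution actually runs against is to ask Hadoop itself from spark-shell (a small sketch using Hadoop's VersionInfo API):

// Prints the Hadoop version found on the driver's classpath
println(org.apache.hadoop.util.VersionInfo.getVersion)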

How to use hive warehouse connector in HDP 2.6.5

I have a requirement to read an ACID-enabled Hive table from Spark.
Spark natively doesn't support reading ORC files from ACID-enabled tables; the only option is Spark JDBC.
We can also use the Hive Warehouse Connector to read the files. Can someone explain the steps to read them using the Hive Warehouse Connector?
Does HWC only work on HDP 3? Kindly advise.
Spark version: 2.3.0
HDP: 2.6.5
Spark can read ORC files; check the documentation on it here: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#orc-files
Here is a sample of code to read an ORC file:
spark.read.format("orc").load("example.orc")
HWC is made for HDP 3, as the Hive and Spark catalogs are no longer compatible in HDP 3 (Hive is at version 3, and Spark at version 2).
See the documentation on it here: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
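For reference, reading an ACID table through HWC on HDP 3 looks roughly like this, based on the Cloudera documentation above (a sketch; it assumes the HWC jar is on the classpath, the HiveServer2 JDBC URL is configured, and the table name is made up):

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession
val hive = HiveWarehouseSession.session(spark).build()

// The query is executed through HiveServer2, which understands ACID tables
val df = hive.executeQuery("SELECT * FROM mydb.acid_table")
df.show()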

Cloudera Hive on Spark 2.x?

Looking at this:
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#hive_on_spark
To summarize, it says Hive doesn't work on Spark 2.x in Cloudera.
However, I assume Hive does run on Spark 2.x in other distributions. Has anyone configured CDH 5.10.x or higher to run Hive on Spark 2.x?
Is Spark 2.x a big leap forward from Spark 1.6?
The latest released version of Hive as of now is 2.1.x, and it does not support Spark 2.x (see https://issues.apache.org/jira/browse/HIVE-14029). When Hive 2.2.0 is released, it will support Spark 2.x.

In which version did HBase integrate a Spark API?

I read the documentation of Spark and HBase:
http://hbase.apache.org/book.html#spark
I can see that the latest stable version of HBase is 1.1.2, but I also see that the apidocs are at version 2.0.0-SNAPSHOT and that the Spark apidoc is empty.
I am confused: why don't the apidocs and the HBase version match?
My goal is to use Spark and HBase (bulkGet, bulkPut, etc.). How do I know in which HBase version those functions were implemented?
If someone has complementary documentation on this, that would be awesome.
I am on hbase-0.98.13-hadoop1.
Below is the main JIRA ticket for Spark integration into HBase. The target version is 2.0.0, which is still under development, so you need to wait for the release or build a version from source yourself:
https://issues.apache.org/jira/browse/HBASE-13992
Within the ticket, there are several links to documentation.
If you just want to access HBase from Spark RDDs, you can treat it as a normal Hadoop data source, based on the HBase-specific TableInputFormat and TableOutputFormat, as sketched below.
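A minimal sketch of that approach, assuming sc is a SparkContext (as in spark-shell), the HBase client jars are on the classpath, and the table name is made up:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Tell the InputFormat which table to scan (table name is hypothetical)
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

// Read the table as a normal Hadoop data source
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(rdd.count())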
As of now, Spark doesn't ship with an HBase API the way it does for Hive; you have to manually put the HBase jars on Spark's classpath, e.g. via the spark-defaults.conf file.
See the link below; it has complete information about how to connect to HBase:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
