Cloudera Hive on Spark 2.x?

Looking at this:
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#hive_on_spark
To summarize, it says Hive doesn't work on Spark 2.x in Cloudera.
However, I assume Hive does run on Spark 2.x in other distributions. Has anyone configured CDH 5.10.x or higher to run Hive on Spark 2.x?
Is Spark 2.x a big leap forward from Spark 1.6?

The latest released version of Hive as of now is 2.1.x, and it does not support Spark 2.x (see https://issues.apache.org/jira/browse/HIVE-14029). When Hive 2.2.0 is released, it will support Spark 2.x.

Related

CDH (Cloudera Distributed Hadoop) to CDP (Cloudera Data Platform) migration: Spark 1.x to 3.x query

We are currently doing a feasibility study on migrating from CDH (Cloudera Distributed Hadoop) to CDP (Cloudera Data Platform) with respect to Spark (currently on version 1.6).
When we checked the documentation, we understood that 1.6 is not supported and we need to refactor to 2.4; the steps to do this manually are given here:
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdh/topics/cdp-one-workload-migration-spark16-to-spark24.html
But we are planning to migrate to Spark 3.x in CDP. One of the Cloudera blogs on the same topic (link below):
https://blog.cloudera.com/upgrade-journey-the-path-from-cdh-to-cdp-private-cloud/
mentions, as part of the pre-upgrade steps, that we need to convert Spark 1.x jobs to 2.4.5:
Phase 2: Pre-upgrade
Backup existing cluster using the backup steps listed here
Confirm if all the prerequisites are addressed. Ensure all outstanding dependencies are met.
Convert Spark 1.x jobs to Spark 2.4.5. Test and validate the jobs to ensure all the required code changes are performed and tested.
My question is:
If the migration is from Spark 1.x to 3.x when moving from CDH to CDP, is it mandatory to have an intermediate step converting Spark 1.x to 2.x and then 2.x to 3.x? If yes, is the 1.x-to-2.x refactoring automated, or must it be done manually following the steps given by Cloudera:
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdh/topics/cdp-one-workload-migration-spark16-to-spark24.html
If not, can we refactor directly from Spark 1.x to 3.x when moving from CDH to CDP? Kindly help.
Thanks in advance.
I tried looking for a solution in the existing Cloudera documentation but couldn't find anything. In terms of migrating Spark workloads to CDP, there are only two options:
Spark 1.6 to Spark 2.4 Refactoring
Because Spark 1.6 is not supported on CDP, you need to refactor Spark workloads from Spark 1.6 on CDH or HDP to Spark 2.4 on CDP.
Spark 2.3 to Spark 2.4 Refactoring
Because Spark 2.3 is not supported on CDP, you need to refactor Spark workloads from Spark 2.3 on CDH or HDP to Spark 2.4 on CDP.
Spark 2.4 to 3.x
But if we have Spark 1.6, then moving it to 2.4 and then to 3.x will be double the effort.
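For context, the Spark 1.6-to-2.x refactoring that the Cloudera migration guide describes is largely an API migration done manually in the job source code. A minimal sketch of the most common change, assuming a simple Hive-backed job (the table and column names here are hypothetical):

import org.apache.spark.sql.SparkSession

// Spark 1.6 style (HiveContext was deprecated in 2.0 and removed in 3.0):
//   val sc = new SparkContext(conf)
//   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
//   val df = hiveContext.sql("SELECT id, amount FROM sales")

// Spark 2.x/3.x style: SparkSession replaces SQLContext and HiveContext
val spark = SparkSession.builder()
  .appName("MigratedJob")
  .enableHiveSupport()  // replaces HiveContext for Hive table access
  .getOrCreate()

val df = spark.sql("SELECT id, amount FROM sales")  // hypothetical table
df.groupBy("id").sum("amount").show()

Because SparkSession exists unchanged in Spark 3.x, code refactored this way for 2.4 generally carries over to 3.x; the remaining 2.4-to-3.x differences are mostly behavior changes covered in the Spark 3 migration guide.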

Can I use Spark 3.3.1 and Hive 3 together?

I'm new to Spark. I want to use Spark to read some data and write it to tables defined by Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2; can I download Hive 3 and configure it to work together with Spark 3? I ask because some materials I found on the internet say that Spark can't work with all versions of Hive.
Thanks
According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If the Spark and Hive versions can be changed, I would suggest starting with that combination.
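To make that concrete, here is a minimal sketch of reading data with Spark and writing it into a Hive-defined table, assuming hive-site.xml is on Spark's classpath; the input path, database, and table names are hypothetical. The two metastore settings are Spark's standard way of talking to a Hive metastore version other than the built-in one:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveReadWrite")
  // Point Spark at a Hive 3.1.x metastore instead of the built-in client;
  // "maven" tells Spark to download matching Hive client jars at startup.
  .config("spark.sql.hive.metastore.version", "3.1.2")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()  // requires hive-site.xml on the classpath
  .getOrCreate()

val df = spark.read.parquet("/data/input/events")      // hypothetical input path
df.write.mode("overwrite").saveAsTable("mydb.events")  // hypothetical Hive table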
I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, and Hive on Spark then works.
But the Spark Thrift Server is incompatible with Hive 3; Apache Kyuubi is suggested as a replacement for both the Spark Thrift Server and HiveServer2:
https://kyuubi.apache.org/
You can just use the standard Hive 3.1.2 and Spark 3.2.1 packages with Kyuubi 1.6.0 to make them work together.
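As a usage note, Kyuubi exposes a HiveServer2-compatible Thrift endpoint, so existing Hive JDBC clients connect to it unchanged. A minimal sketch, assuming a Kyuubi server on its default port 10009 and the Hive JDBC driver on the classpath (the host and credentials are placeholders):

import java.sql.DriverManager

// Kyuubi speaks the HiveServer2 wire protocol, so the standard
// Hive JDBC driver and a jdbc:hive2:// URL work against it.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10009/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT 1")
while (rs.next()) println(rs.getInt(1))
rs.close(); stmt.close(); conn.close()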

How to adopt Ranger policy in Spark SQL?

I am using Spark 3.0.1 on HDP 3.1.4. Everything is running well except that Spark SQL can't honor Ranger's standard SQL policies.
In the past few days I tried the solutions I found in the community: the Hive Warehouse Connector, spark-authorizer, and spark-llap.
Unfortunately I couldn't solve it. It seems the code is no longer maintained, and the latest released versions don't support Spark 3.0. I have seen many people struggling with this problem too.
Is there any suggestion for making Spark SQL adopt Ranger column/row-level permission policies? Any ideas are appreciated. Thank you.
The Hive Warehouse Connector works on Spark 2.3.1 but not 3.0; spark-authorizer and spark-llap both fail with version-incompatibility errors.
The versions are Spark 3.0.1, HDP 3.1.1, Hive 3.1.0, Ranger 1.2.0.

Is it possible to use Hadoop 3.x and Hive 3.x with Spark 2.4?

We use Spark 2.4.0 to connect to a Hadoop 2.7 cluster and query a Hive Metastore, version 2.3. But the cluster management team has decided to upgrade to Hadoop 3.x and Hive 3.x. We have not been able to migrate to Spark 3, which is compatible with Hadoop 3 and Hive 3, because we could not yet test whether anything breaks.
Is there any possible way to stick with Spark 2.4.x and still be able to use Hadoop 3 and Hive 3?
I gather that backporting is one option; it would be great if you could point me in that direction.
You can compile Spark 2.4 with the Hadoop 3.1 profile instead of relying on the default version. Use the hadoop-3.1 profile as described in the documentation on building Spark, something like:
./build/mvn -Pyarn -Phadoop-3.1 -DskipTests clean package
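Once built, a quick way to sanity-check which Hadoop version the resulting Spark distribution is actually linked against is from spark-shell (just an illustrative check, not part of the build itself):

// Prints the Hadoop version on Spark's classpath,
// e.g. "3.1.0" if the hadoop-3.1 profile took effect.
println(org.apache.hadoop.util.VersionInfo.getVersion)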

Apache Spark 2.3.1 compatibility with Hadoop 3.0 in HDP 3.0

I am planning to upgrade from Hortonworks Data Platform (HDP) version 2.6.x to HDP 3.0, but there seem to be some major bugs in Apache Spark 2.3.x and its integration with Hadoop 3.0 that are still unresolved in the Apache Spark JIRA, although the Spark development team is working to resolve them. Do these issues have workarounds or resolutions from the Hortonworks team, or do they still exist in HDP 3.0?
Some unresolved issues concerning my use case:
Spark DataFrames do not work with Hadoop 3.0: https://issues.apache.org/jira/browse/SPARK-18673
Kerberos ticket renewal fails on Hadoop 3: https://issues.apache.org/jira/browse/SPARK-24493
Spark run on Hadoop 3: https://issues.apache.org/jira/browse/SPARK-23534
I checked the integration of HDP Spark 2.3.1 with Hadoop 3.0.1. It works perfectly, and the above issues were resolved in the HDP version of Spark, although this was not mentioned in the HDP 3 release notes.
Check the community answer.
