How to adopt Ranger policy in Spark SQL? - apache-spark

I am using Spark 3.0.1 on HDP 3.1.4. Everything runs well, except that Spark SQL does not honor Ranger's standard SQL policies.
Over the past few days I tried the solutions I found in the community: the Hive Warehouse Connector, spark-authorizer, and spark-llap.
Unfortunately, none of them solved the problem. The code seems to be unmaintained, and the latest releases do not support Spark 3.0. I see that many other people are struggling with this too.
Is there any way to make Spark SQL honor Ranger column-level and row-level permission policies? Any ideas are appreciated. Thank you.
Hive Warehouse Connector: works on Spark 2.3.1, but not on 3.0.
spark-authorizer and spark-llap: both fail with a version-incompatibility error.
Versions: Spark 3.0.1, HDP 3.1.1, Hive 3.1.0, Ranger 1.2.0.

Related

Can I use spark3.3.1 and hive3 together?

I'm new to Spark. I want to use Spark to read some data and write it to tables defined in Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2. Can I download Hive 3 and configure it to work with Spark 3? I ask because some material I found online says that Spark does not work with every version of Hive.
Thanks.
According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If you are free to change the Spark and Hive versions, I would suggest starting with that combination.
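As a rough illustration of what pairing a Spark build with a specific Hive metastore version looks like, the sketch below assembles the relevant settings as `spark-submit` arguments. The property names `spark.sql.catalogImplementation`, `spark.sql.hive.metastore.version`, and `spark.sql.hive.metastore.jars` are standard Spark SQL options; the helper names and the choice of `maven` for the jars are assumptions made for illustration, and the version should be one that the documentation of your Spark release lists as supported.

```python
def hive_metastore_conf(version="3.1.0", jars="maven"):
    """Spark SQL settings that select the Hive metastore client version.

    `version` defaults to the Hive version named in the answer above;
    `jars="maven"` (download the client jars) is an illustrative choice.
    """
    return {
        "spark.sql.catalogImplementation": "hive",
        "spark.sql.hive.metastore.version": version,
        "spark.sql.hive.metastore.jars": jars,
    }


def to_submit_args(conf):
    """Turn a settings dict into a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted after `spark-submit`.
    print(" ".join(to_submit_args(hive_metastore_conf())))
```

The same key/value pairs could instead be placed in `spark-defaults.conf` or passed to the session builder; whether a given Spark and Hive pair actually works still has to be checked against the Spark documentation.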
I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, after which Hive on Spark works.
However, the Spark Thrift Server is incompatible with Hive 3. Apache Kyuubi is suggested as a replacement for the Spark Thrift Server and HiveServer2.
https://kyuubi.apache.org/
You can simply use the standard Hive 3.1.2 and Spark 3.2.1 packages together with Kyuubi 1.6.0 to make them work.

Apache spark cassandra dataframe load error

I get an error when loading a DataFrame through the Spark Cassandra Connector. Please help!
This is a known bug in the alpha version of Spark Cassandra Connector 3.0. You need to use the 3.0.0-beta version, which was released this week.
P.S. You don't need to create a SparkSession instance in Zeppelin; it is already there. You can set the Cassandra properties in the Interpreter settings, or even pass them via option when reading or writing...
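A minimal sketch of the "pass via option when reading" approach, reusing the session that Zeppelin already provides. The format name `org.apache.spark.sql.cassandra` and the `keyspace` and `table` options come from the Spark Cassandra Connector; the helper names are hypothetical, and passing `spark.cassandra.connection.host` per read (rather than in the Interpreter settings) is an assumption that should be checked against the connector version in use.

```python
def cassandra_read_options(keyspace, table, host=None):
    """Build the per-read options for the Cassandra data source."""
    options = {"keyspace": keyspace, "table": table}
    if host is not None:
        # Assumption: connector properties may also be supplied per read
        # instead of in the Zeppelin Interpreter settings.
        options["spark.cassandra.connection.host"] = host
    return options


def load_cassandra_table(spark, keyspace, table, host=None):
    """Load a Cassandra table as a DataFrame using an existing SparkSession."""
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(**cassandra_read_options(keyspace, table, host))
        .load()
    )
```

In a Zeppelin paragraph this could be called as `load_cassandra_table(spark, "my_keyspace", "my_table")`, where `spark` is the pre-created session and the keyspace and table names are placeholders.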

Apache Spark 2.3.1 compatibility with Hadoop 3.0 in HDP 3.0

I am planning to upgrade from Hortonworks Data Platform (HDP) 2.6.x to HDP 3.0. However, there seem to be some major bugs in Apache Spark 2.3.x and its integration with Hadoop 3.0 that are still unresolved in the Apache Spark JIRA, although the Spark development team is working on them. Does the Hortonworks team have workarounds or resolutions for these issues, or do they still exist in HDP 3.0?
Some unresolved issues concerning my use case:
Spark DataFrames does not work with Hadoop 3.0 https://issues.apache.org/jira/browse/SPARK-18673
Kerberos Ticket renewal fails in Hadoop 3 https://issues.apache.org/jira/browse/SPARK-24493
Spark run on Hadoop 3 https://issues.apache.org/jira/browse/SPARK-23534
I checked the integration of HDP Spark 2.3.1 with Hadoop 3.0.1. It works perfectly: the issues above are resolved in the HDP version of Spark, although the fixes are not listed in the HDP 3 release notes.
Check the community answer

Cloudera Hive on Spark 2.x?

Looking at this:
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#hive_on_spark
To summarize, it says Hive doesn't work on Spark 2.x in Cloudera.
However, I assume Hive does run on Spark 2.x in other distributions. Has anyone configured CDH 5.10.x or higher to run Hive on Spark 2.x?
Is Spark 2.x a big leap forward from Spark 1.6?
The latest released version of Hive as of now is 2.1.x and it does not support Spark 2.x (see https://issues.apache.org/jira/browse/HIVE-14029). When Hive version 2.2.0 is released it will support Spark 2.x.

In which version does HBase integrate a Spark API?

I read the documentation on Spark and HBase:
http://hbase.apache.org/book.html#spark
I can see that the latest stable version of HBase is 1.1.2, but I also see that the apidocs are for version 2.0.0-SNAPSHOT and that the Spark apidoc is empty.
I am confused: why don't the apidocs and the HBase version match?
My goal is to use Spark with HBase (bulkGet, bulkPut, etc.). How do I know in which HBase version those functions were implemented?
If someone has complementary documentation on this, that would be awesome.
I am on hbase-0.98.13-hadoop1.
Below is the main JIRA ticket for the Spark integration into HBase. Its target version is 2.0.0, which is still under development, so you need to either wait for the release or build a version from source yourself.
https://issues.apache.org/jira/browse/HBASE-13992
Within the ticket, there are several links for documentation.
If you just want to access HBase from a Spark RDD, you can treat it as a normal Hadoop data source, based on the HBase-specific TableInputFormat and TableOutputFormat.
As of now, Spark doesn't ship with an HBase API as it does for Hive; you have to manually put the HBase jars on Spark's classpath in the spark-defaults.conf file.
See the link below; it has complete information about how to connect to HBase:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
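To make the "normal Hadoop data source" route more concrete, here is a hedged PySpark sketch that reads an HBase table through `TableInputFormat` with `SparkContext.newAPIHadoopRDD`. The class names and the `hbase.zookeeper.quorum` and `hbase.mapreduce.inputtable` keys are HBase's own; the helper names are hypothetical. It assumes the HBase jars are already on Spark's classpath, and that suitable JVM-side key and value converter classes are available to turn HBase's `Result` objects into something Python can read. Those converters are not specified here and are left as parameters.

```python
TABLE_INPUT_FORMAT = "org.apache.hadoop.hbase.mapreduce.TableInputFormat"
KEY_CLASS = "org.apache.hadoop.hbase.io.ImmutableBytesWritable"
VALUE_CLASS = "org.apache.hadoop.hbase.client.Result"


def hbase_scan_conf(zookeeper_quorum, table):
    """Hadoop configuration entries that tell TableInputFormat what to scan."""
    return {
        "hbase.zookeeper.quorum": zookeeper_quorum,
        "hbase.mapreduce.inputtable": table,
    }


def load_hbase_rdd(sc, conf, key_converter=None, value_converter=None):
    """Read an HBase table as an RDD of (row key, result) pairs.

    `key_converter` and `value_converter` are fully qualified names of
    JVM converter classes; which classes to use is an assumption left
    to the caller and depends on what is on the classpath.
    """
    return sc.newAPIHadoopRDD(
        TABLE_INPUT_FORMAT,
        KEY_CLASS,
        VALUE_CLASS,
        keyConverter=key_converter,
        valueConverter=value_converter,
        conf=conf,
    )
```

This only covers plain scans; the bulkGet and bulkPut operations mentioned in the question belong to the HBase-side Spark module tracked in the JIRA ticket above.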
