In which version does HBase integrate a Spark API? - apache-spark

I read the Spark section of the HBase documentation:
http://hbase.apache.org/book.html#spark
I can see that the latest stable version of HBase is 1.1.2, but the apidocs are for version 2.0.0-SNAPSHOT, and the Spark apidoc is empty.
I am confused: why don't the apidocs and the HBase version match?
My goal is to use Spark with HBase (bulkGet, bulkPut, etc.). How do I know in which HBase version those functions were implemented?
If someone has complementary documentation on this, that would be awesome.
I am on hbase-0.98.13-hadoop1.

Below is the main JIRA ticket for the Spark integration into HBase. The target version is 2.0.0, which is still under development, so you need to wait for the release or build a version from source yourself.
https://issues.apache.org/jira/browse/HBASE-13992
The ticket contains several links to documentation.
If you just want to access HBase from a Spark RDD, you can treat it as a normal Hadoop data source, using the HBase-specific TableInputFormat and TableOutputFormat.
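The TableInputFormat route mentioned above can be sketched roughly as follows. This is a minimal sketch rather than a tested recipe: it assumes you run it in spark-shell (so `sc` already exists), that the HBase client and server jars plus hbase-site.xml are on the classpath, and that `my_table` is a hypothetical table name.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Picks up hbase-site.xml from the classpath.
val hbaseConf = HBaseConfiguration.create()
// "my_table" is a hypothetical table name used for illustration.
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// Each record is a (row key, Result) pair produced by a full table scan.
val rows = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Convert to plain strings inside the map, because the HBase types
// are not Java-serializable and should not be collected directly.
val rowKeys = rows.map { case (_, result) => Bytes.toString(result.getRow) }
rowKeys.take(5).foreach(println)
```

This only covers reads through a scan; it is not a substitute for the bulkGet and bulkPut helpers tracked in the ticket.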

As of now, Spark doesn't ship with an HBase API the way it does for Hive; you have to add the HBase jars to Spark's classpath manually in the spark-defaults.conf file.
See the link below for complete information on how to connect to HBase:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

Related

Can I use Spark 3.3.1 and Hive 3 together?

I'm new to Spark. I want to use Spark to read some data and write it to tables defined in Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2. Can I download Hive 3 and configure it to work with Spark 3? Some material I found on the internet says that Spark doesn't work with every version of Hive.
thanks
According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If the Spark and Hive versions can be changed, I would suggest starting with that combination.
I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, and Hive on Spark will then work.
However, Spark Thrift Server is incompatible with Hive 3. Apache Kyuubi is suggested as a replacement for Spark Thrift Server and HiveServer2.
https://kyuubi.apache.org/
You can just use the standard Hive 3.1.2 and Spark 3.2.1 packages with Kyuubi 1.6.0 to make them work.
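For the simpler case in the question (a Spark application that reads and writes Hive tables through the metastore), a session might be configured along these lines. This is only a sketch under assumptions: the metastore version value and the `my_table` name are illustrative, `maven` tells Spark to download matching Hive client jars at startup, and the versions your Spark release accepts should be checked on the Hive Tables page of the Spark SQL guide.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-hive-sketch")
  // Assumed value: the version of the Hive metastore you connect to.
  .config("spark.sql.hive.metastore.version", "3.1.2")
  // "maven" downloads the Hive client jars for that version at startup.
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

// List what the metastore exposes, then append to a hypothetical table.
spark.sql("SHOW DATABASES").show()
spark.range(10).write.mode("append").saveAsTable("my_table")
```

These are static settings, so they need to be in place before the session is created; in spark-shell they would be passed with --conf at launch instead.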

How to adopt Ranger policy in Spark SQL?

I am using Spark 3.0.1 on HDP 3.1.4. Everything runs well except that Spark SQL doesn't honor Ranger's standard SQL policies.
Over the past few days, I tried the solutions I found in the community: the Hive Warehouse Connector, spark-authorizer, and spark-llap.
Unfortunately, I couldn't solve it. It seems the code is not maintained and the latest released versions don't support Spark 3.0. I saw that many people are also struggling with this problem.
Is there any suggestion for making Spark SQL honor Ranger column- and row-level permission policies? Any ideas are appreciated. Thank you.
Hive Warehouse Connector: it works on Spark 2.3.1, but not on 3.0.
spark-authorizer and spark-llap: both fail with version-incompatibility errors.
The versions are Spark 3.0.1, HDP 3.1.1, Hive 3.1.0, Ranger 1.2.0.

Read data from Cassandra in spark-shell

I want to read data from cassandra node in my client node on :
This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar
// Trailing dots let spark-shell read the whole chain as one expression.
val df = spark.read.format("org.apache.spark.sql.cassandra").
  option("keyspace", "my_keyspace").
  option("table", "my_table").
  option("spark.cassandra.connection.host", "Hostname of my Cassandra node").
  option("spark.cassandra.connection.port", "9042").
  option("spark.cassandra.auth.password", "mypassword").
  option("spark.cassandra.auth.username", "myusername").
  load()
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException.
Am I missing any properties? What does this error mean? How do I resolve it?
Spark-version:2.3.2, DSE version 6.7.8
The Spark Cassandra Connector itself depends on a number of other libraries that could be missing here. This happens because you're providing only one jar, and not all the required dependencies.
Basically, in your case you have the following choices:
If you're running this on a DSE node and the cluster has Analytics enabled, you can use the built-in Spark. In this case, all jars and properties are already provided, and you only need to provide the username and password when starting the Spark shell via dse -u user -p password spark
If you're using an external Spark, it's better to use the so-called BYOS (bring your own Spark), a special version of the Spark Cassandra Connector with all dependencies bundled inside. You can download the jar from DataStax's Maven repo and use it with --jars
You can still use the open source Spark Cassandra Connector, but in this case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so Spark will be able to fetch all dependencies automatically.
P.S. For the open source Spark Cassandra Connector, I would recommend version 2.5.1 or higher. It officially requires Spark 2.4.x (although 2.3.x may work), but it has improved support for DSE plus a lot of new functionality not available in earlier versions. That version also comes with a build that includes all required dependencies (a so-called assembly), which you can use with --jars if your machine doesn't have access to the internet.

Apache spark cassandra dataframe load error

I have an error with Spark-Cassandra load. Pls help!
This is a known bug in the alpha version of Spark Cassandra Connector 3.0. You need to use the 3.0.0-beta version that was released this week.
P.S. You don't need to create a SparkSession instance in Zeppelin - it's already there. You can set the Cassandra properties in the Interpreter settings, or even pass them via options when reading or writing...

Source API in Spark 2.0+

I would like to write a data source using the Spark data source API. The examples and documentation I found on the internet were written for Spark 1.x using RDDs.
Is that still relevant for Spark 2.0+?
It is still relevant. RDD is a core data structure in Spark and it didn't change with Spark 2.0.
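As an illustration of why the RDD-based examples carry over, here is a minimal sketch of a data source written against the org.apache.spark.sql.sources interfaces, which still hand Spark an RDD[Row]. The RangeRelation class and the count option are hypothetical names invented for this example; it assumes the classes are compiled into a jar on Spark's classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical relation exposing the integers 0 until `count` as one column.
class RangeRelation(count: Int)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("value", IntegerType, nullable = false)))

  // The scan is still expressed as an RDD[Row], just as in Spark 1.x.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until count).map(Row(_))
}

// Spark looks for a class named DefaultSource in the package given to format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // "count" is a hypothetical option; default to 10 rows when it is absent.
    val count = parameters.getOrElse("count", "10").toInt
    new RangeRelation(count)(sqlContext)
  }
}
```

If these classes lived in a hypothetical package com.example.range, they could be loaded with spark.read.format("com.example.range").option("count", "5").load().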
