Reading TinkerPop library-generated files using Spark - apache-spark

Is there a direct way to read TinkerPop-format org.apache.tinkerpop.gremlin.hadoop.structure.io.ObjectWritable files using Spark?
Spark version: 3.*
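A minimal sketch of one possible approach in Scala, assuming the files are Hadoop SequenceFiles whose keys and values are TinkerPop ObjectWritable instances (for example, side-effect output persisted by SparkGraphComputer); the path is a placeholder, and hadoop-gremlin (with its serializers) must be on the classpath for the Writables to deserialize:

    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
    import org.apache.tinkerpop.gremlin.hadoop.structure.io.ObjectWritable
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("read-objectwritable"))

    // Read the SequenceFile as an ordinary Hadoop data source.
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///tinkerpop/output",   // placeholder path
      classOf[SequenceFileInputFormat[ObjectWritable[AnyRef], ObjectWritable[AnyRef]]],
      classOf[ObjectWritable[AnyRef]],
      classOf[ObjectWritable[AnyRef]])

    // Unwrap the plain Java objects held inside each ObjectWritable.
    val objects = rdd.map { case (k, v) => (k.get(), v.get()) }
    objects.take(5).foreach(println)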

Related

Can I use spark3.3.1 and hive3 together?

I'm new to Spark. I want to use Spark to read some data and write it to tables defined in Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2. Can I download Hive 3 and configure it to work together with Spark 3? Some materials I found on the internet say that Spark can't work with every version of Hive.
Thanks!
According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If the Spark and Hive versions can be changed, I would suggest starting with that combination.
I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, and Hive on Spark will work.
However, the Spark Thrift Server is incompatible with Hive 3; Apache Kyuubi is suggested as a replacement for the Spark Thrift Server and HiveServer2.
https://kyuubi.apache.org/
You can just use the standard Hive 3.1.2 and Spark 3.2.1 packages with Kyuubi 1.6.0 to make them work together.
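For reference, a minimal sketch of the workflow the question describes (reading some data with Spark 3 and writing it to a Hive-defined table), assuming Spark was built with Hive support and hive-site.xml is on the classpath; the input path and table name are placeholders:

    import org.apache.spark.sql.SparkSession

    // Spark 3.x session with Hive support enabled; assumes hive-site.xml
    // is in $SPARK_HOME/conf so the Hive metastore is reachable.
    val spark = SparkSession.builder()
      .appName("spark-hive-example")
      .enableHiveSupport()
      .getOrCreate()

    // Read some data (placeholder CSV path) and append it to a Hive table.
    val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")
    df.write.mode("append").saveAsTable("mydb.my_table")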

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking for a way to load streaming data from Kafka directly into HDFS using Spark Streaming, without using Flume.
I have already tried it with Flume (Kafka source and HDFS sink).
Thanks in advance!
There is an HDFS connector for Kafka Connect. Confluent's documentation has more information.
This is a pretty basic task for Spark Streaming. Depending on which versions of Spark and Kafka you are using, look at the Spark Streaming + Kafka integration documentation for those versions. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename") (see the sketch below).
Spark/Kafka integration guide for latest versions
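A sketch of the DStream approach described above, using the spark-streaming-kafka-0-10 integration; the broker address, topic name, group id, and output directory are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hdfs"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hdfs-sink",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

    // Write each non-empty micro-batch of message values to a timestamped
    // HDFS directory, as suggested above with saveAsTextFile.
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///data/kafka/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()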

Source API in Spark 2.0+

I would like to write a data source using the Spark sources API. The examples and documentation I found on the internet were written on top of Spark 1.x using RDDs.
Is it still relevant for spark 2.0+?
It is still relevant. RDD is a core data structure in Spark and it didn't change with Spark 2.0.
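For orientation, a minimal sketch of a custom data source written against the original (V1) sources API, which still works on Spark 2.x and 3.x and hands Spark an RDD of rows; the class and package names here are purely illustrative:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // A toy relation that exposes the integers 0..9 as a one-column table.
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new NumbersRelation(sqlContext)
    }

    class NumbersRelation(val sqlContext: SQLContext)
        extends BaseRelation with TableScan {
      override def schema: StructType = StructType(Seq(StructField("n", IntegerType)))
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(0 until 10).map(Row(_))
    }

It would then be loaded with spark.read.format("<package containing DefaultSource>").load().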

How to read and write data in Google Cloud Bigtable in PySpark application?

I am using Spark on a Google Cloud Dataproc cluster and I would like to access Bigtable in a PySpark job. Is there any Bigtable connector for Spark, like the Google BigQuery connector?
How can we access Bigtable from a PySpark application?
Cloud Bigtable is usually best accessed from Spark using the Apache HBase APIs.
HBase currently only provides Hadoop MapReduce I/O formats. These can be accessed from Spark (or PySpark) using the SparkContext.newAPIHadoopRDD methods (a sketch follows this answer). However, converting the records into something usable in Python is difficult.
HBase is developing Spark SQL APIs, but these have not been integrated in a released version. Hortonworks has a Spark HBase Connector, but it compiles against Spark 1.6 (which requires Cloud Dataproc version 1.0) and I have not used it, so I cannot speak to how easy it is to use.
Alternatively, you could use a Python-based Bigtable client and simply use PySpark for parallelism.
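A sketch of the newAPIHadoopRDD route in Scala, reading a table through the HBase TableInputFormat; for Cloud Bigtable the same shape should apply, provided the bigtable-hbase client jar is on the classpath and the HBase configuration points at Bigtable (an assumption on my part; the table name is a placeholder):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each record is (row key, Result); Result exposes the individual cells.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    val rowKeys = rows.map { case (key, _) => Bytes.toString(key.get()) }
    println(rowKeys.count())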

In which version HBase integrate a spark API?

I read the documentation of Spark and HBase:
http://hbase.apache.org/book.html#spark
I can see that the latest stable version of HBase is 1.1.2, but I also see that the API docs are for version 2.0.0-SNAPSHOT and that the Spark API doc is empty.
I am confused: why don't the API docs and the HBase version match?
My goal is to use Spark with HBase (bulkGet, bulkPut, etc.). How do I know in which HBase version those functions were implemented?
If someone has complementary documentation on this, that would be awesome.
I am on hbase-0.98.13-hadoop1.
Below is the main JIRA ticket for Spark integration into HBase. The target version is 2.0.0, which is still under development, so you need to wait for the release or build a version from source yourself:
https://issues.apache.org/jira/browse/HBASE-13992
Within the ticket, there are several links to documentation.
If you just want to access HBase from a Spark RDD, you can treat it as a normal Hadoop data source, based on the HBase-specific TableInputFormat and TableOutputFormat.
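The writing counterpart looks similar: build Put mutations and save them through TableOutputFormat with saveAsNewAPIHadoopDataset. A sketch using the HBase 1.x+ client API; the table name, column family, and values are placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-write"))

    val conf = HBaseConfiguration.create()
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Turn each (rowKey, value) pair into a Put against column family "cf".
    val puts = sc.parallelize(Seq("row1" -> "a", "row2" -> "b")).map { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
    }

    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)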
As of now, Spark doesn't ship with an HBase API the way it does for Hive; you have to manually put the HBase jars on Spark's classpath via the spark-defaults.conf file (see the example below the link).
See the link below; it has complete information about how to connect to HBase:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
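For example, the classpath entries can be added in spark-defaults.conf (the jar location is illustrative; point it at your HBase installation):

    spark.driver.extraClassPath     /opt/hbase/lib/*
    spark.executor.extraClassPath   /opt/hbase/lib/*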
