Source API in Spark 2.0+ - apache-spark

I would like to write a data source using the Spark data source API. I found examples and documentation on the internet that were written on top of Spark 1.x using RDDs.
Is this still relevant for Spark 2.0+?

It is still relevant. RDD is a core data structure in Spark and it didn't change with Spark 2.0.
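For Spark 2.x, the RDD-based data source API in org.apache.spark.sql.sources still works. A minimal sketch, with illustrative class and column names that are not from the original question:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Loaded via its package name: spark.read.format("com.example.numbers").load()
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new NumbersRelation(sqlContext)
}

// A relation that exposes its data as an RDD[Row] through TableScan.
class NumbersRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("value", IntegerType, nullable = false)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
}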

Related

How to migrate sql_context.registerDataFrameAsTable from spark 1.x to spark 2.x

How to register a data frame as a table in PySpark/Spark 2.x? I am using the sql_context.registerDataFrameAsTable(input_df, input_tablename) method with Spark 1.x. Now, I have to migrate to Spark 2.x and want to use the spark session instead of the sql_context. However, this method does not exist anymore in the Spark context. What is the best approach to replace that function in Spark 2.x?
Use the Dataset API directly:
input_df.createTempView("input_table_name")
or
input_df.createOrReplaceTempView("input_table_name")
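For example, assuming a SparkSession named spark (the table and column names here are illustrative):

// Register the DataFrame under a temporary view name, then query it with SQL.
input_df.createOrReplaceTempView("input_table_name")
val counts = spark.sql("SELECT COUNT(*) AS cnt FROM input_table_name")
counts.show()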

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking to see if there is a way to load streaming data from Kafka directly into HDFS using Spark Streaming, without using Flume.
I have already tried it using Flume (Kafka source and HDFS sink).
Thanks in advance!
There is an HDFS connector for Kafka Connect. Confluent's documentation has more information.
This is a pretty basic task for Spark Streaming. Depending on which versions of Spark and Kafka you are using, look at the Spark Streaming Kafka integration documentation for those versions. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions
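A minimal sketch using the spark-streaming-kafka-0-10 connector; the broker address, topic name, and output path are placeholders, not taken from the question:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToHdfs"), Seconds(30))

    // Placeholder broker and consumer settings -- replace with your own.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafka-to-hdfs",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my_topic"), kafkaParams))

    // Write each non-empty batch to a time-stamped directory in HDFS.
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/kafka/my_topic/${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}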

Spark Kafka Structured Streaming integration with Apache Ignite

Right now there is no way for me to save Spark DataFrames in Apache Ignite. DataFrame support will be included in Apache Ignite 2.2, as mentioned here: https://issues.apache.org/jira/browse/IGNITE-3084. I am using the Structured Streaming API of Apache Spark with Kafka to consume data. I want to do some aggregations, such as the average value of a particular column or min/max values, on the consumed data.
My question is whether I should use the Spark SQL DataFrame API to do the above-mentioned aggregations, or should I wait for the Apache Ignite 2.2 release? The documentation mentions that Ignite SQL is hundreds of times faster than Spark SQL.
Actually, it's up to you. You could go ahead with Spark now, then, once DataFrame support in Ignite is ready, compare the two approaches and choose whichever fits your needs better.
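If you go with Spark for now, here is a sketch of those aggregations with the Spark SQL DataFrame API on a Kafka-backed Structured Streaming source; the broker, topic, and message schema are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, from_json, max, min}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object KafkaAggregations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaAggregations").getOrCreate()
    import spark.implicits._

    // Assumed message schema -- adjust to the real payload.
    val schema = StructType(Seq(
      StructField("sensor", StringType),
      StructField("value", DoubleType)))

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my_topic")
      .load()
      .select(from_json($"value".cast("string"), schema).as("data"))
      .select("data.*")

    // Average and min/max of the value column, grouped per sensor.
    val aggregated = input
      .groupBy($"sensor")
      .agg(avg($"value").as("avg_value"),
           min($"value").as("min_value"),
           max($"value").as("max_value"))

    aggregated.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}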

Cache Dataframe with Apache Ignite RDD using scala

I have a DataFrame that reads data from an Oracle DB. I need to cache this DataFrame to an Ignite shared RDD so it can be accessed across multiple sessions. I tried the solution posted in "How to cache Dataframe in Apache ignite". It looks like the loadCache() APIs are available in Java. I am not able to find the loadCache() method in Scala when I import the libraries. Any info on this will help. Thanks, VM
The loadCache() method is on the IgniteCache API, not on Ignite. It looks like there is a mistake in that response. All Java APIs can be used from Scala.
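For example, from Scala (the config path and cache name are placeholders):

import org.apache.ignite.{Ignite, IgniteCache, Ignition}
import org.apache.ignite.lang.IgniteBiPredicate

object LoadCacheExample {
  def main(args: Array[String]): Unit = {
    // Start or connect to an Ignite node using a Spring XML config (placeholder path).
    val ignite: Ignite = Ignition.start("ignite-config.xml")

    // The cache is assumed to be declared in the configuration with a CacheStore,
    // e.g. one backed by the Oracle table.
    val cache: IgniteCache[java.lang.Long, String] =
      ignite.cache[java.lang.Long, String]("oracleBackedCache")

    // loadCache is declared on IgniteCache, not Ignite; a null predicate
    // loads every entry the configured CacheStore provides.
    cache.loadCache(null.asInstanceOf[IgniteBiPredicate[java.lang.Long, String]])
  }
}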

In which version HBase integrate a spark API?

I read the documentation of Spark and HBase:
http://hbase.apache.org/book.html#spark
I can see that the latest stable version of HBase is 1.1.2, but I also see that the API docs are for version 2.0.0-SNAPSHOT and that the Spark API doc is empty.
I am confused: why don't the API docs and the HBase version match?
My goal is to use Spark and HBase (bulkGet, bulkPut, etc.). How do I know in which HBase version those functions were implemented?
If someone has complementary documentation on this, that would be awesome.
I am on hbase-0.98.13-hadoop1.
Below is the main JIRA ticket for Spark integration into HBase; the target version is 2.0.0, which is still under development. You need to wait for the release, or build a version from the source code yourself:
https://issues.apache.org/jira/browse/HBASE-13992
Within the ticket, there are several links to documentation.
If you just want to access HBase from a Spark RDD, you can treat it as a normal Hadoop data source, based on the HBase-specific TableInputFormat and TableOutputFormat, as sketched below.
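For example, a sketch of reading an HBase table as a plain Hadoop data source (the table name is a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseReadExample"))

    // Placeholder table name -- replace with your own.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Read the table through the standard Hadoop InputFormat machinery.
    val rdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"Row count: ${rdd.count()}")
  }
}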
As of now, Spark doesn't come with an HBase API the way it does for Hive; you have to manually put the HBase jars on Spark's classpath via the spark-defaults.conf file.
See the link below; it has complete information about how to connect to HBase:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
