Spark SQL encapsulation of data sources - apache-spark

I have a dataset where 98% of the data (older than one day) lives in Parquet files and 2% (the current day's real-time feed) lives in HBase. I always need to union them to get the final dataset for a particular table or entity.
I would like my clients to consume the data seamlessly, as below, in whatever language they use to access Spark, via the Spark shell, or through BI tools:
spark.read.format("my.datasource").load("entity1")
Internally I will read entity1's data from Parquet and HBase, union them, and return the result.
I have searched and found a few examples of extending DataSourceV2. Most of them say you need to develop a reader, but here I do not need a new reader; I need to reuse the existing ones (Parquet and HBase).
Since I am not introducing a new data source as such, do I need to create one, or is there a higher-level abstraction/hook available?

You have to implement a new data source, say "parquet+hbase". In the implementation you make use of the existing Parquet and HBase readers, for example by composing them in your own classes and unioning their results.
For reference, here are some links that can help you implement a new data source.
spark "bigquery" datasource implementation
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
Implementing custom datasource
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
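As a rough illustration (not taken from the linked examples), a V1-style RelationProvider that simply delegates to the existing readers and unions the results could look like the sketch below. The package name, the Parquet path layout, and the HBase connector format/option names are assumptions you would replace with your own.

package my.datasource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{Row, SQLContext}

// Resolved by spark.read.format("my.datasource"); load("entity1") arrives in `parameters` as "path".
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new UnionRelation(sqlContext, parameters("path"))
}

class UnionRelation(val sqlContext: SQLContext, entity: String)
    extends BaseRelation with TableScan {

  // Delegate to the existing Parquet and HBase readers and union the results.
  private lazy val combined = {
    val spark = sqlContext.sparkSession
    val parquetDf = spark.read.parquet(s"/data/parquet/$entity")   // assumed path layout
    val hbaseDf = spark.read
      .format("org.apache.hadoop.hbase.spark")                     // assumed connector and option names
      .option("hbase.table", entity)
      .load()
    parquetDf.unionByName(hbaseDf)                                 // columns must match by name
  }

  override def schema: StructType = combined.schema
  override def buildScan(): RDD[Row] = combined.rdd
}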

After going through various resources, below is what I found and implemented.
It might help someone, so I am adding it as an answer.
A custom data source is required only if we are introducing a new data source. To combine existing data sources, we can extend SparkSession and DataFrameReader. In the extended DataFrameReader we can invoke the Spark Parquet read method and the HBase reader, get the corresponding Datasets, combine them, and return the combined Dataset.
In Scala we can use implicits to add the custom logic to the SparkSession and DataFrame.
In Java we need to extend SparkSession and DataFrame, and then import the extended classes wherever they are used.
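A minimal Scala sketch of the implicit approach; the helper name, the Parquet layout, and the HBase connector format/options are all hypothetical.

import org.apache.spark.sql.{DataFrame, SparkSession}

object EntityReads {
  // Adds spark.readEntity("entity1") to any SparkSession via an implicit class.
  implicit class EntitySession(spark: SparkSession) {
    def readEntity(entity: String): DataFrame = {
      val parquetDf = spark.read.parquet(s"/data/parquet/$entity")  // assumed path layout
      val hbaseDf = spark.read
        .format("org.apache.hadoop.hbase.spark")                    // assumed connector and option names
        .option("hbase.table", entity)
        .load()
      parquetDf.unionByName(hbaseDf)                                // columns must match by name
    }
  }
}

// Usage (e.g. in spark-shell):
// import EntityReads._
// val entity1 = spark.readEntity("entity1")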

Related

Generate Spark schema code/persist and reuse schema

I am implementing some Spark Structured Streaming transformations from a Parquet data source. In order to read the data into a streaming DataFrame, one has to specify the schema (it cannot be inferred automatically). The schema is really complex, and writing the schema code by hand would be a very complex task.
Can you suggest a workaround? Currently I create a batch DataFrame beforehand (using the same data source), let Spark infer the schema, save the schema to a Scala object, and use it as input for the Structured Streaming reader.
I don't think this is a reliable or well-performing solution. Please suggest how to generate the schema code automatically, or how to persist the schema in a file and reuse it.
From the docs:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
You could also open a shell, read one of the Parquet files with automatic schema inference enabled, and save the schema to JSON for later reuse. You only have to do this once, so it might be faster / more efficient than the similar-sounding workaround you're using now.
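A Scala sketch of that one-off step, with placeholder paths: infer the schema with a batch read, persist it as JSON, and later rebuild a StructType from the JSON for the streaming reader.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder.getOrCreate()

// One-off: infer the schema with a batch read and persist it as JSON.
val inferred = spark.read.parquet("/data/input/").schema
Files.write(Paths.get("/conf/input-schema.json"),
  inferred.json.getBytes(StandardCharsets.UTF_8))

// Later (e.g. in the streaming job): load the JSON and rebuild the StructType.
val schemaJson = new String(
  Files.readAllBytes(Paths.get("/conf/input-schema.json")), StandardCharsets.UTF_8)
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

val streamDf = spark.readStream.schema(schema).parquet("/data/input/")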

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and Json. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But, for streaming, I don't want to build a list in memory; I'd rather supply a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but the wrapper cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input has been read, which causes an UnsupportedOperationException.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But the DataFrameReader class methods for this operation require a string specifying a path. The nature of the path is not documented, but I assume it represents a path in the file system, or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in production I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.
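As an aside (not the approach the poster ended up with), spark.read.json also accepts a Dataset[String], which covers the "JSON from a string" case without a filesystem path. A Scala sketch with illustrative JSON content:

import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder.getOrCreate()

// JSON supplied as in-memory strings instead of a path (content is illustrative).
val jsonLines = Seq(
  """{"id": 1, "name": "alice"}""",
  """{"id": 2, "name": "bob"}"""
)

// spark.read.json(Dataset[String]) parses the strings directly (Spark 2.2+).
val df = spark.read.json(spark.createDataset(jsonLines)(Encoders.STRING))
df.show()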

Spark SQL with different data sources

Is it possible to create DataFrames from two different sources and perform operations on them?
For example,
df1 = <create from a file or folder from S3>
df2 = <create from a hive table>
df1.join(df2).where("df1Key" === "df2Key")
If this is possible, what are the implications in doing so?
Yes, it is possible to read from different data sources and perform operations on them.
In fact, many applications need exactly this.
df1.join(df2).where("df1Key" === "df2Key")
This expresses a cross join followed by a filter (Spark's optimizer will usually push the equality predicate into the join). The intent is clearer if you pass the condition explicitly:
df1.join(df2, $"df1Key" === $"df2Key")
This should produce the same output.
A DataFrame is a source-independent abstraction. I would encourage you to read the original paper on RDDs and the wiki.
The abstraction keeps track of the location of the data and the underlying DAG of operations, and the DataFrame API adds a schema on top of an RDD.
You can build a DataFrame from any source, and they are all homogenized to expose the same API. The DataFrame API provides the DataFrameReader interface, which any underlying source can implement to create a DataFrame on top of it. Another example is the Cassandra connector's DataFrame support.
One caveat is that the speed of data retrieval from the different sources may vary. For example, if your data is in S3 versus HDFS, operations on the DataFrame built on top of HDFS will probably be faster. Nonetheless, you can perform any join on DataFrames created from different sources.
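A small Scala sketch of exactly that; the S3 path and Hive table name are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.enableHiveSupport().getOrCreate()

// df1 from files on S3, df2 from a Hive table.
val df1 = spark.read.parquet("s3a://my-bucket/some/prefix/")   // placeholder bucket/prefix
val df2 = spark.table("my_db.my_table")                        // placeholder Hive table

// An explicit equi-join condition; Spark plans this as one job across both sources.
val joined = df1.join(df2, df1("df1Key") === df2("df2Key"))
joined.show()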

Using spark sql DataFrameWriter to create external Hive table

As part of a data integration process I am working on, I have a need to persist a Spark SQL DataFrame as an external Hive table.
My constraints at the moment:
Currently limited to Spark 1.6 (v1.6.0)
Need to persist the data in a specific location, retaining the data even if the table definition is dropped (hence external table)
I have found what appears to be a satisfactory solution to write the dataframe, df, as follows:
df.write.saveAsTable('schema.table_name',
                     format='parquet',
                     mode='overwrite',
                     path='/path/to/external/table/files/')
Doing a describe extended schema.table_name against the resulting table confirms that it is indeed external. I can also confirm that the data is retained (as desired) even if the table itself is dropped.
My main concern is that I can't really find a documented example of this anywhere, nor much mention of it in the official docs (https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter) - particularly the use of a path to enforce the creation of an external table.
Is there a better/safer/more standard way to persist the dataframe?
I would rather create the Hive tables myself (e.g. CREATE EXTERNAL TABLE IF NOT EXISTS) exactly as I need them, and then in Spark just do: df.write.saveAsTable('schema.table_name', mode='overwrite').
This way you have control over the table creation and don't depend on the HiveContext doing what you need. In the past there were issues with Hive tables created this way, and the behavior can change in the future, since that API is generic and cannot guarantee the underlying implementation by HiveContext.
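A Scala sketch of the suggested flow under Spark 1.6; the DDL columns and location are placeholders, and it writes into the pre-created table with insertInto, which is a variation on the saveAsTable call above.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext

// Create the external table yourself, exactly as you need it (placeholder columns/location).
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS schema.table_name (
    id BIGINT,
    value STRING
  )
  STORED AS PARQUET
  LOCATION '/path/to/external/table/files/'
""")

// Then let Spark write into the pre-created table.
df.write.mode("overwrite").insertInto("schema.table_name")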

Any benefit for my case when using Hive as datawarehouse?

Currently, I am trying to adopt a big data stack to replace my current data analysis platform. My current platform is pretty simple: the system receives a lot of structured CSV feed files from various upstream systems, and we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS or the local filesystem, so Hive as a data warehouse does not seem to be a must. However, I could still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the pros/benefits of introducing a Hive layer rather than loading the CSV files directly into Spark DataFrames?
Thanks.
You can always browse and inspect the data through the tables.
Ad-hoc queries/aggregations can be performed using HiveQL.
When accessing that data through Spark, you do not need to specify the schema separately.
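A small Scala illustration of the last point; paths and table names are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.enableHiveSupport().getOrCreate()

// Reading the CSV feed directly means supplying or inferring the schema yourself:
val fromCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/feeds/entity.csv")                        // placeholder path

// Once the feed is loaded into a Hive table, the schema comes from the metastore:
val fromHive = spark.table("warehouse.entity")     // placeholder database.table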
