Generate Spark schema code/persist and reuse schema - apache-spark

I am implementing some Spark Structured Streaming transformations from a Parquet data source. In order to read the data into a streaming DataFrame, one has to specify the schema (it cannot be automatically inferred). The schema is really complex, and writing the schema code by hand would be a tedious, error-prone task.
Can you suggest a workaround? Currently I am creating a batch DataFrame beforehand (using the same data source), letting Spark infer the schema, saving the schema to a Scala object, and using it as an input for the Structured Streaming reader.
I don't think this is a reliable or well-performing solution. Please suggest how to generate the schema code automatically, or how to persist the schema in a file and reuse it.

From the docs:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can re-enable schema inference by setting spark.sql.streaming.schemaInference to true.
You could also open a shell, read one of the parquet files with automatic schema inference enabled, and save the schema to JSON for later reuse. You only have to do this one time, so it might be faster / more efficient than doing the similar-sounding workaround you're using now.
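For example, here is a rough Scala sketch of that one-time extraction and later reuse (the file paths are placeholders, and spark is assumed to be an existing SparkSession, e.g. in the shell):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{DataType, StructType}

// One-time step: infer the schema from a batch read of a sample Parquet file
// and persist it as JSON.
val sampleDf = spark.read.parquet("/data/source/sample.parquet")   // placeholder path
Files.write(Paths.get("/etc/schemas/source-schema.json"),
  sampleDf.schema.json.getBytes(StandardCharsets.UTF_8))

// Later, in the streaming job: load the persisted schema and pass it to the stream reader.
val schemaJson = new String(
  Files.readAllBytes(Paths.get("/etc/schemas/source-schema.json")), StandardCharsets.UTF_8)
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val streamingDf = spark.readStream.schema(schema).parquet("/data/source/")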

Related

Spark SQL encapsulation of data sources

I have a dataset where 98% of the data (older than one day) lives in Parquet files and 2% (the current day's real-time feed) lives in HBase, and I always need to union them to get the final data set for that particular table or entity.
So I would like my clients to consume the data seamlessly, as below, in whatever language they use to access Spark, via the Spark shell, or from any BI tool:
spark.read.format("my.datasource").load("entity1")
Internally I will read entity1's data from Parquet and HBase, union them, and return the result.
I googled and found a few examples of extending DataSourceV2. Most of them say you need to develop a reader, but here I do not need a new reader; I need to make use of the existing ones (Parquet and HBase).
Since I am not introducing any new data source as such, do I need to create a new data source? Or is there a higher-level abstraction/hook available?
You do have to implement a new data source, say "parquet+hbase"; in the implementation you would make use of the existing Parquet and HBase readers, perhaps extending your classes with both of them, unioning the results, and so on.
For your reference, here are some links which can help you implement a new DataSource.
spark "bigquery" datasource implementation
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
Implementing custom datasource
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
After going through various resources, below is what I found and implemented; it might help someone, so I am adding it as an answer.
A custom data source is required only if we are introducing a new data source. For combining existing data sources, we instead have to extend SparkSession and DataFrameReader. In the extended DataFrameReader we can invoke the Spark Parquet read method and the HBase reader, get the corresponding datasets, combine them, and return the combined dataset.
In Scala we can use implicits to add the custom logic to the SparkSession and DataFrame.
In Java we need to extend SparkSession and DataFrame, and then import the extended classes when using them.
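For illustration, here is a rough Scala sketch of the implicit-enrichment approach described above. The HBase format name, options, and paths are assumptions standing in for whichever existing HBase connector is used; both readers must produce the same schema for the union to succeed.

import org.apache.spark.sql.{DataFrame, SparkSession}

object CombinedReaders {
  implicit class CombinedSessionOps(spark: SparkSession) {
    // Reads historical data from Parquet and current-day data from HBase,
    // then returns their union as a single DataFrame.
    def readCombined(entity: String): DataFrame = {
      val parquetDf = spark.read.parquet(s"/warehouse/$entity")  // placeholder path
      val hbaseDf = spark.read
        .format("hbase")              // placeholder: any existing HBase connector
        .option("table", entity)      // placeholder option
        .load()
      parquetDf.unionByName(hbaseDf)
    }
  }
}

// Usage: import CombinedReaders._ and then call spark.readCombined("entity1")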

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and JSON. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and DataFrames. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But for streaming, I don't want to create a list in memory; I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but the wrapper cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before all of the input data has been read, which causes an UnsupportedOperationException.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a DataFrame directly from JSON, which would be my preferred method. But the DataFrameReader class's methods for this operation require a string that specifies a path. The nature of the path is not documented, but I assume it represents a path in the file system, or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in production I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.
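As a side note on the in-memory JSON part of the question: since Spark 2.2, DataFrameReader.json also accepts a Dataset of JSON strings, which sidesteps the path requirement for small or test inputs. A minimal Scala sketch (the sample data is made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-from-strings").master("local[*]").getOrCreate()
import spark.implicits._

// JSON supplied as in-memory strings rather than a filesystem path
val jsonLines = Seq("""{"id": 1, "name": "a"}""", """{"id": 2, "name": "b"}""")
val df = spark.read.json(jsonLines.toDS())
df.show()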

Lazy loading of partitioned parquet in Apache Spark

As I understand it, Apache Spark uses lazy evaluation. So for example code like the following that consists only of transformations will do no actual processing:
val transformed_df = df.filter("some_field = 10").select("some_other_field", "yet_another_field")
Only when we do an "action" will any processing actually occur:
transformed_df.show()
I had been under the impression that load operations are also lazy in Spark. (See How spark loads the data into memory.)
However, my experience with Spark has not borne this out. When I do something like the following,
val df = spark.read.parquet("/path/to/parquet/")
execution seems to depend greatly on the size of the data in the path. In other words, it's not strictly lazy. This is inconvenient if the data is partitioned and I only need to look at a fraction of the partitions.
For example:
df.filter("partitioned_field = 10").show()
If the data is partitioned in storage on "partitioned_field", I would have expected Spark to wait until show() is called, and then read only the data under "/path/to/parquet/partitioned_field=10/". But again, this doesn't seem to be the case. Spark appears to perform at least some operations on all of the data as soon as read or load is called.
I could get around this by only loading /path/to/parquet/partitioned_field=10/ in the first place, but this is much less elegant than just calling "read" and filtering on the partitioned field, and it's harder to generalize.
Is there a more elegant preferred way to lazily load partitions of parquet data?
(To clarify, I am using Spark 2.4.3)
I think I've stumbled on an answer to my question while learning about a key distinction that is often overlooked when talking about lazy evaluation in Spark.
Data is lazily evaluated, but schemas are not. So if we are reading Parquet, which is a structured file format, Spark does have to at least determine the schema of any files it's reading as soon as read() or load() is called. Calling read() on a large number of files will therefore take longer than on a small number of files.
Given that partitions are part of the schema, it's less surprising to me now that Spark has to look at all of the files in the path to determine the schema before filtering on a partition field.
It would be convenient for my purposes if Spark waited until schema evaluation was strictly necessary and could filter on partition fields prior to determining the rest of the schema, but it sounds like this is not the case. I believe Dataset objects must always have a schema, so I'm not sure there's a way around this problem without significant changes to Spark.
In conclusion, it seems like my only option currently is to pass in a list of paths for the partitions that I need rather than the base path if I want to avoid evaluating the schema over the entire data repository.
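A rough Scala sketch of that workaround, reusing the paths from the question; the basePath option keeps partition discovery rooted at the top of the dataset so the partition column still appears in the DataFrame:

// Load only the partitions that are actually needed.
val partitionPaths = Seq("/path/to/parquet/partitioned_field=10/")
val df = spark.read
  .option("basePath", "/path/to/parquet/")
  .parquet(partitionPaths: _*)

// Optionally, supplying a known schema up front avoids inferring it from the files:
// spark.read.schema(knownSchema).option("basePath", "/path/to/parquet/").parquet(partitionPaths: _*)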

Any benefit for my case when using Hive as a data warehouse?

Currently, I am trying to adopt big data tooling to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and then we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS or the local filesystem, so Hive as a data warehouse does not seem to be a must. However, I could still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the pros/benefits of introducing a Hive layer rather than loading the CSV files directly into a Spark DataFrame?
Thanks.
You can always browse and inspect the data through the tables.
Ad-hoc queries/aggregations can be performed using HiveQL.
When accessing that data through Spark, you do not need to specify the schema of the data separately.
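To make that concrete, here is a rough Scala sketch contrasting the two paths; the table name, columns, and CSV path are placeholders, and the SparkSession is assumed to have been built with enableHiveSupport():

import org.apache.spark.sql.types._

// Reading the raw CSV requires schema inference or an explicit schema.
val csvSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("amount", DoubleType)
))
val csvDf = spark.read.schema(csvSchema).option("header", "true").csv("/feeds/current/")

// Reading the same data once loaded into a Hive table: the schema comes from the
// metastore, and the table is also available for ad-hoc HiveQL queries.
val hiveDf = spark.table("feed_db.feed_table")
val agg = spark.sql("SELECT id, SUM(amount) AS total FROM feed_db.feed_table GROUP BY id")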

Spark changes the schema when writing to Avro

I have a Spark job (in CDH 5.5.1) that loads two Avro files (both with the same schema), combines them into a DataFrame (also with the same schema), then writes the result back out to Avro.
The job explicitly compares the two input schemas to ensure they are the same.
This is used to combine existing data with a few updates (since the files are immutable). I then replace the original file with the new combined file by renaming them in HDFS.
However, if I repeat the update process (i.e. try to add some further updates to the previously updated file), the job fails because the schemas are now different! What is going on?
This is due to the behaviour of the spark-avro package.
When writing to Avro, spark-avro writes everything as unions of the given type along with a null option.
In other words, "string" becomes ["string", "null"] so every field becomes nullable.
If your input schema already contains only nullable fields, then this problem doesn't become apparent.
This isn't mentioned on the spark-avro page, but is described as one of the limitations of spark-avro in some Cloudera documentation:
Because Spark is converting data types, watch for the following:
Enumerated types are erased - Avro enumerated types become strings when they are read into Spark, because Spark does not support enumerated types.
Unions on output - Spark writes everything as unions of the given type along with a null option.
Avro schema changes - Spark reads everything into an internal representation. Even if you just read and then write the data, the schema for the output will be different.
Spark schema reordering - Spark reorders the elements in its schema when writing them to disk so that the elements being partitioned on are the last elements.
See also this GitHub issue: spark-avro #92.
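One way to keep an explicit schema check working across repeated update cycles (a suggestion of mine, not something stated in the answer above) is to compare the schemas as Spark sees them with nullability normalized, rather than comparing the raw Avro schemas. A rough Scala sketch for flat (non-nested) schemas:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Treat every top-level field as nullable before comparing, since spark-avro
// rewrites types to unions with null on write. Nested structs are not handled here.
def normalized(schema: StructType): StructType =
  StructType(schema.fields.map((f: StructField) => f.copy(nullable = true)))

def sameSchema(a: DataFrame, b: DataFrame): Boolean =
  normalized(a.schema) == normalized(b.schema)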
