Spark parquet schema evolution - apache-spark

I have a partitioned hdfs parquet location which is having different schema is different partition.
Say 5 columns in first partition, 4 cols in 2nd partition. Now I try to read the base Parquet path and then filter the 2nd partition.
This gives me 5 columns in the DF even though I have only 4 columns in Parquet files in 2nd partition.
When I read the 2nd partition directly, it gives correct 4 cols. How to fix this.

You can specify the required schema(4 columns) while reading the parquet file!
Then spark only reads the fields that included in the schema, if field not exists in the data then null will be returned.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val sch=new StructType().add("i",IntegerType).add("z",StringType)
spark.read.schema(sch).parquet("<parquet_file_path>").show()
//here i have i in my data and not have z field
//+---+----+
//| i| z|
//+---+----+
//| 1|null|
//+---+----+

I would really like to hep you but I am not sure what you actually want to achieve. What's your intention about this?
If you to read the parquet file with all it's partitions and you just wanna get the columns both partitions have, maybe the read option "mergeSchema" fits your need.
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a
necessity in most cases, we turned it off by default starting from
1.5.0. You may enable it by setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
refer to spark documentation
so it would be interesting which version of spark you are using and how the properties spark.sql.parquet.mergeSchema (spark setting) and mergeSchema (client) are set

Related

Performance of spark while reading from hive vs parquet

Assuming I have an external hive table on top parquet/orc files partitioned on date, what would be the performance impact of using
spark.read.parquet("s3a://....").filter("date_col='2021-06-20'")
v/s
spark.sql("select * from table").filter("date_col='2021-06-20'")
After reading into a dataframe, It will be followed by a series of transformations and aggregations.
spark version : 2.3.0 or 3.0.2
hive version : 1.2.1000
number of records per day : 300-700 Mn
My hunch is that there won't be any performance difference while using either of the above queries since parquet natively has most of the optimizations that a hive metastore can provide and spark is capable of using it. Like, predicate push-down, advantages of columnar storage etc.
As a follow-up question, what happens if
The underlying data was csv instead of parquet. Does having a hive table on top improves performance ?
Hive table was bucketed. Does it make sense to read the underlying file system in this case instead of reading from table ?
Also, are there any situations where reading directly from parquet is a better option compared to hive ?
Hive should actually be faster here because they both have pushdowns, Hive already has the schema stored. The parquet read as you have it here will need to infer the merged schema. You can make them about the same by providing the schema.
You can make the Parquet version even faster by navigating directly to the partition. This avoids having to do the initial filter on the available partitions.
So something like this would do it:
spark.read.option("basePath", "s3a://....").parquet("s3a://..../date_col=2021-06-20")
Note this works best if you already have a schema, because this also skips schema merging.
As to your follow-ups:
It would make a huge difference if it's CSV, as it would then have to parse all of the data and then filter out those columns. CSV is really bad for large datasets.
Shouldn't really gain you all that much and may get you into trouble. The metadata that Hive stores can allow Spark to navigate your data more efficiently here than you trying to do it yourself.

How to select columns that contain any of the given strings as part of the column name in Pyspark [duplicate]

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, parquet is a columnar storage and it is exactly meant for this type of use case. Try running df.explain and spark will tell you that only the corresponding columns are read (it prints the execution plan). explain would also tell you what filters are pushed down to the physical plan of execution in case you also use a where condition. Finally use the following code to convert the dataframe (dataset of rows) to a dataset of your case class.
case class MyData...
val ds = df.as[MyData]
At least in some cases getting dataframe with all columns + selecting a subset won't work. E.g. the following will fail if parquet contains at least one field with type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to provide schema that contains only requested columns to load:
spark.read.format("parquet").load("<path_to_file>",
schema="col1 bigint, col2 float")
Using this you will be able to load a subset of Spark-supported parquet columns even if loading the full file is not possible. I'm using pyspark here, but would expect Scala version to have something similar.
Spark supports pushdowns with Parquet so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context Spark 2.0 Dataset vs DataFrame
Parquet is a columnar file format. It is exactly designed for these kind of use cases.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Spark - Stream kafka to file that changes every day?

I have a kafka stream I will be processing in spark. I want to write the output of this stream to a file. However, I want to partition these files by day, so everyday it will start writing to a new file. Can something like this be done? I want this to be left running and when a new day occurs, it will switch to write to a new file.
val streamInputDf = spark.readStream.format("kafka")
.option("kafka.bootstrapservers", "XXXX")
.option("subscribe", "XXXX")
.load()
val streamSelectDf = streamInputDf.select(...)
streamSelectDf.writeStream.format("parquet)
.option("path", "xxx")
???
Adding partition from spark can be done with partitionBy provided in
DataFrameWriter for non-streamed or with DataStreamWriter for
streamed data.
Below are the signatures :
public DataFrameWriter partitionBy(scala.collection.Seq
colNames)
DataStreamWriter partitionBy(scala.collection.Seq colNames)
Partitions the output by the given columns on the file system.
DataStreamWriter partitionBy(String... colNames) Partitions the
output by the given columns on the file system.
Description :
partitionBy public DataStreamWriter partitionBy(String... colNames)
Partitions the output by the given columns on the file system. If
specified, the output is laid out on the file system similar to Hive's
partitioning scheme. As an example, when we partition a dataset by
year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize
physical data layout. It provides a coarse-grained index for skipping
unnecessary data reads when queries have predicates on the partitioned
columns. In order for partitioning to work well, the number of
distinct values in each column should typically be less than tens of
thousands.
Parameters: colNames - (undocumented) Returns: (undocumented) Since:
2.0.0
so if you want to partition data by year and month spark will save the data to folder like:
year=2019/month=01/05
year=2019/month=02/05
Option 1 (Direct write):
You have mentioned parquet - you can use saving as a parquet format with:
df.write.partitionBy('year', 'month','day').format("parquet").save(path)
Option 2 (insert in to hive using same partitionBy ):
You can also insert into hive table like:
df.write.partitionBy('year', 'month', 'day').insertInto(String tableName)
Getting all hive partitions:
Spark sql is based on hive query language so you can use SHOW PARTITIONS
To get list of partitions in the specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveParquetTable")
Conclusion :
I would suggest option 2 ... since Advantage is later you can query data based on partition (aka query on raw data to know what you have received) and underlying file can be parquet or orc.
Note :
Just make sure you have .enableHiveSupport() when you are creating session with SparkSessionBuilder and also make sure whether you have hive-conf.xml etc. configured properly.
Based on this answer spark should be able to write to a folder based on the year, month and day, which seems to be exactly what you are looking for. Have not tried it in spark streaming, but hopefully this example gets you on the right track:
df.write.partitionBy("year", "month", "day").format("parquet").save(outPath)
If not, you might be able to put in a variable filepath based on current_date()

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, parquet is a columnar storage and it is exactly meant for this type of use case. Try running df.explain and spark will tell you that only the corresponding columns are read (it prints the execution plan). explain would also tell you what filters are pushed down to the physical plan of execution in case you also use a where condition. Finally use the following code to convert the dataframe (dataset of rows) to a dataset of your case class.
case class MyData...
val ds = df.as[MyData]
At least in some cases getting dataframe with all columns + selecting a subset won't work. E.g. the following will fail if parquet contains at least one field with type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to provide schema that contains only requested columns to load:
spark.read.format("parquet").load("<path_to_file>",
schema="col1 bigint, col2 float")
Using this you will be able to load a subset of Spark-supported parquet columns even if loading the full file is not possible. I'm using pyspark here, but would expect Scala version to have something similar.
Spark supports pushdowns with Parquet so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context Spark 2.0 Dataset vs DataFrame
Parquet is a columnar file format. It is exactly designed for these kind of use cases.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Optimized hive data aggregation using hive

I have a hive table (80 million records) with the followig schema (event_id ,country,unit_id,date) and i need to export this data to a text file as with the following requirments:
1-Rows are aggregated(combined) by event_id.
2-Aggregated rows must be sorted according to date.
For example rows with same event_id must be combined as a list of lists, ordered according to date.
What is the best performance wise solution to make this job using spark ?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a spark program (scala or python) to read in the underlying files to the hive table, do your transformations, and then write the output as a file.
I've found that it's much quicker to just read the files in spark rather than querying hive through spark and pulling the result into a dataframe.

Resources