Efficient way to read specific columns from parquet file in spark - apache-spark

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a typesafe Dataset with case classes to pre-define my schema, but I'm not sure.

val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, Parquet is a columnar storage format and is meant exactly for this type of use case. Try running df.explain and Spark will tell you that only the corresponding columns are read (it prints the execution plan). explain will also tell you which filters are pushed down to the physical plan in case you also use a where condition. Finally, use the following code to convert the DataFrame (a Dataset of rows) to a Dataset of your case class.
case class MyData...
val ds = df.as[MyData]
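Putting those pieces together, a minimal end-to-end sketch (the field names, types, and path here are hypothetical, and a SparkSession named spark is assumed, as in spark-shell):
import spark.implicits._

case class MyData(col1: Long, col2: Double)   // hypothetical columns

val df = spark.read.parquet("fs://path/file.parquet").select("col1", "col2")
df.explain()  // ReadSchema in the printed plan should list only col1 and col2
val ds = df.as[MyData]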

At least in some cases, loading a DataFrame with all columns and then selecting a subset won't work. E.g. the following will fail if the Parquet file contains at least one field with a type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to pass load a schema that contains only the requested columns:
spark.read.format("parquet").load("<path_to_file>", schema="col1 bigint, col2 float")
Using this you will be able to load a subset of the Spark-supported Parquet columns even if loading the full file is not possible. I'm using PySpark here, but I would expect the Scala version to have something similar.
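For reference, a rough Scala equivalent of the call above (same hypothetical columns and types) hands the reduced schema to the reader as a DDL string before pointing it at the file:
val df = spark.read
  .schema("col1 bigint, col2 float")  // only these columns are materialized
  .parquet("<path_to_file>")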

Spark supports pushdowns with Parquet, so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use a typesafe Dataset with case classes to pre-define my schema, but I'm not sure.
This could be an issue, as it looks like some optimizations don't work in this context; see Spark 2.0 Dataset vs DataFrame.

Parquet is a columnar file format. It is designed exactly for this kind of use case.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Related

Column Indexing in Parquet

Has anyone tried to create column indexes while writing to Parquet? Parquet 2.0 added support for column indexes in https://issues.apache.org/jira/browse/PARQUET-1201, but I'm not able to figure out how to use it.
Basically, while writing to Parquet from Spark, I want one column to be indexed, such that when I read it again, I can have faster queries. But I am not able to figure out how to proceed with this.
The column index is written to the Parquet file automatically when using the parquet-mr library v1.11+. So Spark only needed to implement the read side, and that has been implemented in Spark v3.2.0.
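So nothing extra is needed at write time on recent versions. What can help, as a hedged sketch (paths and column names here are hypothetical), is clustering the data by the column you filter on before writing, so that the per-page min/max statistics stored in the column index are narrow and selective reads can skip pages:
import org.apache.spark.sql.functions.col

spark.read.parquet("/data/events_raw")
  .sort("event_id")  // keeps event_id values clustered within each page
  .write
  .mode("overwrite")
  .parquet("/data/events_sorted")

// With Spark 3.2+ readers, a selective filter on event_id can skip whole pages.
spark.read.parquet("/data/events_sorted")
  .where(col("event_id") === 42)
  .show()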

Spark parquet schema evolution

I have a partitioned HDFS Parquet location that has a different schema in different partitions.
Say 5 columns in the first partition and 4 columns in the 2nd partition. Now I try to read the base Parquet path and then filter on the 2nd partition.
This gives me 5 columns in the DataFrame even though I have only 4 columns in the Parquet files of the 2nd partition.
When I read the 2nd partition directly, it gives the correct 4 columns. How do I fix this?
You can specify the required schema (4 columns) while reading the Parquet file!
Then Spark only reads the fields that are included in the schema; if a field does not exist in the data, null will be returned.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val sch = new StructType().add("i", IntegerType).add("z", StringType)
spark.read.schema(sch).parquet("<parquet_file_path>").show()
// here the data contains field i but does not contain field z
//+---+----+
//| i| z|
//+---+----+
//| 1|null|
//+---+----+
I would really like to help you, but I am not sure what you actually want to achieve. What is your intention here?
If you want to read the Parquet file with all its partitions and just get the columns both partitions have, maybe the read option "mergeSchema" fits your need.
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by setting the data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or setting the global SQL option spark.sql.parquet.mergeSchema to true.
Refer to the Spark documentation.
So it would be interesting to know which version of Spark you are using and how the properties spark.sql.parquet.mergeSchema (Spark setting) and mergeSchema (per-read option) are set.
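For illustration, a minimal sketch of enabling it for a single read (the base path is hypothetical):
val merged = spark.read
  .option("mergeSchema", "true")  // merge the schemas found across all files/partitions
  .parquet("/path/to/base")

merged.printSchema()  // shows the union of the columns from both partitions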

Optimized Hive data aggregation using Spark

I have a Hive table (80 million records) with the following schema (event_id, country, unit_id, date) and I need to export this data to a text file with the following requirements:
1- Rows are aggregated (combined) by event_id.
2- Aggregated rows must be sorted according to date.
For example, rows with the same event_id must be combined as a list of lists, ordered according to date.
What is the best performance-wise solution for this job using Spark?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a Spark program (Scala or Python) that reads the underlying files of the Hive table, does your transformations, and then writes the output as a file.
I've found that it's much quicker to just read the files in Spark rather than querying Hive through Spark and pulling the result into a DataFrame.
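A hedged Scala sketch of that approach (the input path, the exact column set, and the choice of JSON as the text output format are assumptions, not part of the original question):
import org.apache.spark.sql.functions._

val events = spark.read.parquet("/warehouse/mydb.db/events")  // the table's underlying files

val grouped = events
  .groupBy("event_id")
  // sort_array on an array of structs orders by the first struct field, i.e. date
  .agg(sort_array(collect_list(struct(col("date"), col("country"), col("unit_id")))).as("rows"))

// JSON preserves the nested per-event list; plain .text would require serializing
// the list into a single string column first.
grouped.write.mode("overwrite").json("/output/events_by_event_id")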

spark: case sensitive partitionBy column

I am trying to write out a dataframe in hiveContext (for ORC format) with a partition key:
df.write().partitionBy("event_type").mode(SaveMode.Overwrite).orc("/path");
However, the column on which I am trying to partition has case-sensitive values, and this is throwing an error while writing:
Caused by: java.io.IOException: File already exists: file:/path/_temporary/0/_temporary/attempt_201607262359_0001_m_000000_0/event_type=searchFired/part-r-00000-57167cfc-a9db-41c6-91d8-708c4f7c572c.orc
The event_type column has both searchFired and SearchFired as values. However, if I remove one of them from the dataframe, then I am able to write successfully. How do I solve this?
It is generally not a good idea to rely on case differences in file systems.
The solution is to combine values that differ only by case into the same partition, using something like this (with the Scala DSL):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

df
  .withColumn("par_event_type", expr("lower(event_type)"))
  .write
  .partitionBy("par_event_type")
  .mode(SaveMode.Overwrite)
  .orc("/path")
This adds an extra column for partitioning. If that causes problems, you can use drop to remove it when you read the data.
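For example, a minimal sketch of dropping the helper column when reading the data back (same path as above):
val restored = spark.read.orc("/path").drop("par_event_type")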
