Column Indexing in Parquet - apache-spark

Has anyone tried to create column indexes while writing to parquet? Parquet 2.0 provided support for Column Indexes in https://issues.apache.org/jira/browse/PARQUET-1201 but I'm not able to figure out how to use that.
Basically while writing to parquet from spark, I want one column to be indexed, such that when I read it again, I can have faster queries. But I am not able to figure out how to proceed with this.

The column index is written to the Parquet file automatically when using the parquet-mr library v1.11+. So Spark only needed to implement the read side, which was done in Spark 3.2.0.

Related

How can parquet columns be skipped when reading from HDFS?

We all know Parquet is column-oriented, so we can read only the columns we need and reduce I/O.
But if the parquet file is stored in HDFS, do we have to download the entire file first and then apply the column filter locally?
For example, if we use spark to read a parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Do we still have to download the entire parquet file?
Or is there a way to filter the columns before the network transfer?
Actually, "predicate pushdown", a feature of Spark SQL, will try to use column filters to reduce the amount of information that is processed by Spark. Technically the entire HDFS block is still read into memory, but smart logic is used to only return relevant results. This is normally called out in the physical plan: run .explain() on your query to see whether the feature is being used. (Not all versions of hdfs support this.)

Optimal method to check the length of a parquet table in dbfs with pyspark?

I have a table on dbfs I can read with pyspark, but I only need to know the length of it (nrows). I know I could just read the file and do a table.count() to get it, but that would take some time.
Is there a better way to solve this?
I am afraid not.
Since you are using dbfs, I suppose you are using the Delta format with Databricks. So, theoretically, you could check the metastore, but:
"The metastore is not the source of truth about the latest information of a Delta table."
https://docs.delta.io/latest/delta-batch.html#control-data-location

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, Parquet is a columnar storage format, and it is exactly meant for this type of use case. Try running df.explain: Spark will tell you that only the corresponding columns are read (it prints the execution plan). explain will also tell you which filters are pushed down to the physical plan in case you also use a where condition. Finally, use the following code to convert the dataframe (a dataset of rows) to a dataset of your case class.
case class MyData...
val ds = df.as[MyData]
At least in some cases, getting a dataframe with all columns and then selecting a subset won't work. E.g. the following will fail if the parquet file contains at least one field with a type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to provide schema that contains only requested columns to load:
spark.read.format("parquet").load("<path_to_file>", schema="col1 bigint, col2 float")
Using this you will be able to load a subset of Spark-supported parquet columns even if loading the full file is not possible. I'm using pyspark here, but would expect Scala version to have something similar.
Spark supports pushdowns with Parquet so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context; see Spark 2.0 Dataset vs DataFrame.
Parquet is a columnar file format. It is exactly designed for these kind of use cases.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Optimized Hive data aggregation using Spark

I have a Hive table (80 million records) with the following schema (event_id, country, unit_id, date), and I need to export this data to a text file with the following requirements:
1. Rows are aggregated (combined) by event_id.
2. Aggregated rows must be sorted by date.
For example, rows with the same event_id must be combined into a list of lists, ordered by date.
What is the best solution, performance-wise, to run this job using Spark?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a Spark program (Scala or Python) that reads the underlying files of the Hive table, does your transformations, and then writes the output as a file.
I've found that it's much quicker to read the files in Spark directly rather than querying Hive through Spark and pulling the result into a dataframe.
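To make the requirement concrete, here is the target output shape on a toy sample in plain Python (column names are from the question; the values are made up). In the actual Spark job the same grouping would typically be expressed with groupBy plus collect_list:

```python
from itertools import groupby

# Toy sample of (event_id, country, unit_id, date) rows.
rows = [
    (2, "DE", "u3", "2020-01-05"),
    (1, "US", "u1", "2020-01-02"),
    (1, "FR", "u2", "2020-01-01"),
]

# Requirement 1: combine rows by event_id.
# Requirement 2: within each event, order rows by date.
rows.sort(key=lambda r: (r[0], r[3]))          # sort by event_id, then date
aggregated = [
    (event_id, [list(r[1:]) for r in group])   # event_id -> list of lists
    for event_id, group in groupby(rows, key=lambda r: r[0])
]

print(aggregated)
# [(1, [['FR', 'u2', '2020-01-01'], ['US', 'u1', '2020-01-02']]),
#  (2, [['DE', 'u3', '2020-01-05']])]
```

At 80 million records the sort-then-group step is exactly what Spark's shuffle does for you when you group by event_id, which is why a batch Spark job fits this well.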

Reading orc/parquet files from hdfs in spark

I have a question about loading data into a Spark cluster (standalone mode) from HDFS, say from a parquet or ORC file: will it read just the metadata or the entire data at first? If I apply a filter, will it load the entire data and then apply the filter, or does it read only the selected columns? And is the entire data copied to Spark; if so, where is it kept?
Thanks in advance.
Not sure exactly what you mean by filtering, but generally speaking, when accessing columnar file formats like Parquet or ORC, if you select specific columns, only those columns will be read into memory, not the others.
Specifically, if you are asking about something like ds.filter("..."), the ability to read into memory only the data that satisfies the filter is called "predicate pushdown". Generally speaking it is available in Spark, but it depends on exactly what you are trying to do. For example, AFAIK, Spark can't do predicate pushdown on complex-typed columns in Parquet files.
So I would suggest, if possible, selecting only the relevant columns and then filtering. If you use complex types or Spark SQL, check whether predicate pushdown is supported.
Also, it doesn't matter whether the files are on HDFS or somewhere else like S3; the behaviour should be the same.
If I apply filter whether it will load the entire data and apply filter to it or it reads only..
Spark doesn't load data into memory when a filter transformation is defined; it will not read data from the file until an action is performed on it. This is because of lazy evaluation.
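The same lazy pattern can be illustrated in plain Python with generators (an analogy only, not Spark code; the helper and names here are made up):

```python
read_log = []  # records when reading actually happens

def read_file(lines):
    # Stand-in for a data source, like a parquet file on HDFS.
    for line in lines:
        read_log.append(line)
        yield line

data = ["a", "bb", "ccc"]

# Building the pipeline (like df.filter) reads nothing yet.
filtered = (line for line in read_file(data) if len(line) > 1)
assert read_log == []            # no data touched so far

# Consuming the result (like an action, e.g. count or collect)
# is what finally triggers the read.
result = list(filtered)
print(result)  # ['bb', 'ccc']
```

In Spark the analogue is that transformations only build the execution plan, and the plan runs (reading data, applying pushed-down filters) when an action is called.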
