Filtering Spark dataframe based on partition URIs that do not follow the standard Spark format - apache-spark

In Spark, for partitioned datasets that don't follow the standard URI format, the Catalyst does not pick up the partitioning information automatically. This means that filters can be highly inefficient. For instance, let's say the partitioning URI format is /{year}/{month}/{day} instead of /year={year}/month={month}/day={day}.
PySpark solutions that don't involve rewriting the data, registering each partition on the metastore or renaming the directories are appreciated. My guess is that this is possible in Scala?
To make the Catalyst recognize the partitioning structure, a potential solution is to use F.input_file_name() to load the partitioning, but this still loads unnecessary partitions instead of filtering. It's easy to test this doesn't work by corrupting a single partition that was not supposed to be read and also specifying the schema.

Related

How parquet columns can be skipped when reading from hdfs?

We all know parquet is column-oriented so we can get only columns we desired and reduce IO.
But what if the parquet file is stored in HDFS, should we download the entire file first, and then apply column filter locally?
For example, if we use spark to read a parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Still we must download the entire parquet file, is that right?
Or maybe there is a way we can filter the columns just before the network transfer?
Actually "predicate pushdown" which is a feature of Spark SQL will try to use column filters to reduce the amount of information that is processed by spark. Technically the entire hdfs block is still read into memory, but it uses smart logic to only return relevant results. This is normally called out in the physical plan. You can read this by using .explain() on your query to see if the feature is being used. (Not all versions of hdfs support this.)

Performance of spark while reading from hive vs parquet

Assuming I have an external hive table on top parquet/orc files partitioned on date, what would be the performance impact of using
spark.read.parquet("s3a://....").filter("date_col='2021-06-20'")
v/s
spark.sql("select * from table").filter("date_col='2021-06-20'")
After reading into a dataframe, It will be followed by a series of transformations and aggregations.
spark version : 2.3.0 or 3.0.2
hive version : 1.2.1000
number of records per day : 300-700 Mn
My hunch is that there won't be any performance difference while using either of the above queries since parquet natively has most of the optimizations that a hive metastore can provide and spark is capable of using it. Like, predicate push-down, advantages of columnar storage etc.
As a follow-up question, what happens if
The underlying data was csv instead of parquet. Does having a hive table on top improves performance ?
Hive table was bucketed. Does it make sense to read the underlying file system in this case instead of reading from table ?
Also, are there any situations where reading directly from parquet is a better option compared to hive ?
Hive should actually be faster here because they both have pushdowns, Hive already has the schema stored. The parquet read as you have it here will need to infer the merged schema. You can make them about the same by providing the schema.
You can make the Parquet version even faster by navigating directly to the partition. This avoids having to do the initial filter on the available partitions.
So something like this would do it:
spark.read.option("basePath", "s3a://....").parquet("s3a://..../date_col=2021-06-20")
Note this works best if you already have a schema, because this also skips schema merging.
As to your follow-ups:
It would make a huge difference if it's CSV, as it would then have to parse all of the data and then filter out those columns. CSV is really bad for large datasets.
Shouldn't really gain you all that much and may get you into trouble. The metadata that Hive stores can allow Spark to navigate your data more efficiently here than you trying to do it yourself.

When should we go for Spark-sql and when should we go for Spark RDD

On which scenario we should prefer spark RDD to write a solution and on which scenario we should choose to go for spark-sql. I know spark-sql gives better performance and it works best with structure and semistructure data. But what else factors are there that we need to take into consideration while choosing betweeen spark Rdd and spark-sql.
I don't see much reasons to still use RDDs.
Assuming you are using JVM based language, you can use DataSet that is the mix of SparkSQL+RDD (DataFrame == DataSet[Row]), according to spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is python is not support DataSet so, you will use RDD and lose spark-sql optimization when you work with non-structed data.
I found using DFs easier to use than DSs - the latter are still subject to development imho. The comment on pyspark indeed still relevant.
RDDs still handy for zipWithIndex to put asc, contiguous sequence numbers on items.
DFs / DSs have a columnar store and have a better Catalyst (Optimizer) support.
Also, may things with RDDs are painful, like a JOIN requiring a key, value and multi-step join if needing to JOIN more than 2 tables. They are legacy. Problem is the internet is full of legacy and thus RDD jazz.
RDD
RDD is a collection of data across the clusters and it handles both unstructured and structured data. It's typically a function part of handling data.
DF
Data frames are basically two dimensional array of objects defining the data in a rows and columns. It's similar to relations tables in the database. Data frame handles only the structured data.

Reading orc/parquet files from hdfs in spark

I have a doubt while loading data into spark cluster(standalone mode) from hdfs say parquet or orc file whether it will just read the meta data or the entire data at the first instance. If I apply filter whether it will load the entire data and apply filter to it or it reads only the selected column and whether the entire data will be copied to spark if so where it will be present ...
thanks in advance..
Not sure exactly what you mean by filtering but generally speaking, when accessing columnar structured files like Parquet or ORC, if you select specific columns you will only read them to memory and not the other columns.
Specifically if you are asking for something like ds.filter("..."), the ability to only read to memory the data that answers the filter is called "Predicate pushdown". generally speaking it is available in Spark but depends on exactly what you are trying to do. for example AFAIA, Spark can't do predicate pushdown on complex typed columns in Parquet files.
So I would suggest if possible only selecting relevant columns and then filtering. if you use complex types or spark SQL check on Google if predicate pushdown is supported.
Also, it doesn't matter if files are on HDFS or somewhere else like S3, behaviour should be the same
If I apply filter whether it will load the entire data and apply filter to it or it reads only..
Spark doesn't load data into memory when filter transformation is done, it will not load data from the file till any action is done on it. This is because of lazy evaluation.

How to combine small parquet files with Spark?

I have a Hive table that has a lot of small parquet files and I am creating a Spark data frame out of it to do some processing using SparkSQL. Since I have a large number of splits/files my Spark job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provides, that is, to combine these small input splits into larger ones by specifying a max split size setting. How can I achieve this with Spark? I tried using the coalesce function, but I can only specify the number of partitions with it (I can only control the number of output files with it). Instead I really want some control over the (combined) input split size that a task processes.
Edit: I am using Spark itself, not Hive on Spark.
Edit 2: Here is the current code I have:
//create a data frame from a test table
val df = sqlContext.table("schema.test_table").filter($"my_partition_column" === "12345")
//coalesce it to a fixed number of partitions. But as I said in my question
//with coalesce I cannot control the file sizes, I can only specify
//the number of partitions
df.coalesce(8).write.mode(org.apache.spark.sql.SaveMode.Overwrite)
.insertInto("schema.test_table")
I have not tried but read it in getting started guide that setting this property should work "hive.merge.sparkfiles=true"
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
In case using Spark on Hive, than Spark's abstraction doesn't provide explicit split of data. However we can control the parallelism in several ways.
You can leverage DataFrame.repartition(numPartitions: Int) to explicitly control the number of partitions.
In case you are using Hive Context than ensure hive-site.xml contains the CombinedInputFormat. That may help.
For more info, take a look at following documentation about Spark data parallelism - http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism.

Resources