Why is my parquet partitioned data slower than non-partitioned one? - apache-spark

My understanding is: If I partition my data on a column I will query by it should be faster. However, when I tried it, it seem to be slower instead why?
I have a users dataframe which I tried partitioning my yearmonth and not.
So I have 1 dataset partitioned by creation_yearmonth.
questionsCleanedDf.repartition("creation_yearmonth") \
.write.partitionBy('creation_yearmonth') \
.parquet('wasb://.../parquet/questions.parquet')
I have another not partitioned
questionsCleanedDf \
.write \
.parquet('wasb://.../parquet/questions_nopartition.parquet')
Then I tried creating a dataframe from these 2 parquet files and running the same query
questionsDf = spark.read.parquet('wasb://.../parquet/questions.parquet')
and
questionsDf = spark.read.parquet('wasb://.../parquet/questions_nopartition.parquet')
The query
spark.sql("""
SELECT * FROM questions
WHERE creation_yearmonth = 201606
""")
It seem like the no partition one is consistently faster or have similar times (~2 - 3s) while partitioned one is slighly slower (~3 - 4s).
I tried to do an explain:
For the partitioned dataset:
== Physical Plan ==
*FileScan parquet [id#6404,title#6405,tags#6406,owner_user_id#6407,accepted_answer_id#6408,view_count#6409,answer_count#6410,comment_count#6411,creation_date#6412,favorite_count#6413,creation_yearmonth#6414] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 1, PartitionFilters: [isnotnull(creation_yearmonth#6414), (creation_yearmonth#6414 = 201606)], PushedFilters: [], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...
PartitionCount: 1 I should since in this case, it can just go directly to the parition it should be faster?
For the non-paritioned one:
== Physical Plan ==
*Project [id#6440, title#6441, tags#6442, owner_user_id#6443, accepted_answer_id#6444, view_count#6445, answer_count#6446, comment_count#6447, creation_date#6448, favorite_count#6449, creation_yearmonth#6450]
+- *Filter (isnotnull(creation_yearmonth#6450) && (creation_yearmonth#6450 = 201606))
+- *FileScan parquet [id#6440,title#6441,tags#6442,owner_user_id#6443,accepted_answer_id#6444,view_count#6445,answer_count#6446,comment_count#6447,creation_date#6448,favorite_count#6449,creation_yearmonth#6450] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions_nopartition.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_yearmonth), EqualTo(creation_yearmonth,201606)], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...
Also very surprising. At first the dataset has dates as strings, so I need to do a query like:
spark.sql("""
SELECT * FROM questions
WHERE CAST(creation_date AS date) BETWEEN '2017-06-01' AND '2017-07-01'
""").show(20, False)
I expected this to be even slower but it turns out, it performs the best ~1-2s. Why is that? I thought in this case, it needs to cast each row?
The explain output here:
== Physical Plan ==
*Project [id#6521, title#6522, tags#6523, owner_user_id#6524, accepted_answer_id#6525, view_count#6526, answer_count#6527, comment_count#6528, creation_date#6529, favorite_count#6530]
+- *Filter ((isnotnull(creation_date#6529) && (cast(cast(creation_date#6529 as date) as string) >= 2017-06-01)) && (cast(cast(creation_date#6529 as date) as string) <= 2017-07-01))
+- *FileScan parquet [id#6521,title#6522,tags#6523,owner_user_id#6524,accepted_answer_id#6525,view_count#6526,answer_count#6527,comment_count#6528,creation_date#6529,favorite_count#6530] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/filtered/questions.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_date)], ReadSchema: struct<id:string,title:string,tags:array<string>,owner_user_id:string,accepted_answer_id:string,v...

Overpartitioning can actually reduce performance:
If a column has only a few rows matching each value, the number of
directories to process can become a limiting factor, and the data file
in each directory could be too small to take advantage of the Hadoop
mechanism for transmitting data in multi-megabyte blocks.
This excerpt was taken from the documentation of a different Hadoop component, Impala, but the presented argument should be valid to all components of the Hadoop stack.
I think that regardless of the partitioning scheme used, the advantages of partitioning will not be apparent until the table grows way beyond 900 MB-s.

Related

Does spark bring entire hive table to memory

I am in the process of learning the working of Apache Spark and have some basic queries. Let's say I have a Spark application running which connects to a Hive table.
My hive table is as follows:
Name
Age
Marks
A
50
100
B
50
100
C
75
200
When I run the following code snippets, which rows and columns will be loaded into memory during the execution? Will the filtering of rows/columns be done after the entire table is loaded into the memory?
1. spark_session.sql("SELECT name, age from table").collect()
2. spark_session.sql("SELECT * from table WHERE age=50").collect()
3. spark_session.sql("SELECT * from table").select("name", "age").collect()
4. spark_session.sql("SELECT * from table").filter(df.age = 50).collect()
If the datasource supports predicate pushdown then spark will not load entire data to memory while filtering the data.
Let's check the spark plan for hive table with parquet as file format:
>>> df = spark.createDataFrame([('A', 25, 100),('B', 30, 100)], ['name', 'age', 'marks'])
>>> df.write.saveAsTable('table')
>>> spark.sql('select * from table where age=25').explain(True)
== Physical Plan ==
*(1) Filter (isnotnull(age#1389L) AND (age#1389L = 25))
+- *(1) ColumnarToRow
+- FileScan parquet default.table[name#1388,age#1389L,marks#1390L] Batched: true, DataFilters: [isnotnull(age#1389L), (age#1389L = 25)],
Format: Parquet, Location: InMemoryFileIndex[file:/Users/mohan/spark-warehouse/table],
PartitionFilters: [], PushedFilters: [IsNotNull(age), EqualTo(age,25)], ReadSchema: struct<name:string,age:bigint,marks:bigint>
You can verify if filter pushed to underlying storage by looking at PushedFilters: [IsNotNull(age), EqualTo(age,25)]

The difference on reading files in PySpark between reading the whole directory then filtering and reading a part of the directory?

Suppose I have a data model that runs daily and the sample HDFS path is
data_model/sales_summary/grass_date=2021-04-01
If I want to read all the models in Feb and March, what is the difference if I read in the following two ways:
A:
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*')
B:
spark.read.parquet('data_model/sales_summary/').filter(col('grass_date').between('2021-02-01', '2021-03-30'))
Are these two reading methods equivalent? If not, under what circumstances which one can be more efficient?
Spark will do a partition filter when reading the files, so the performance of the two methods should be similar. The query plans below show how the partition filters are used in the filescan operation.
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*').explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary/grass_date=2021-02-21, file:/tmp/data_model/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
spark.read.parquet('data_model/sales_summary/').filter(F.col('grass_date').between('2021-02-01', '2021-03-30')).explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#24,grass_date#25] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary], PartitionFilters: [isnotnull(grass_date#25), (grass_date#25 >= 18659), (grass_date#25 <= 18716)], PushedFilters: [], ReadSchema: struct<id:int>
But note that the partitioning column will be missing from the dataframe if you use the first method to read the files, so you'd probably prefer the second method.
We should use First one (A) is right .
A- We are selecting specific folders (we are reading required data only).
B - We are reading all data and then applying filter ( here we are reading all data which is costly ).

perform massively parallelizable tasks in pyspark

I often find myself performing massively parallelizable tasks in spark, but for some reason, spark keeps on dying. For instance, right now I have two tables (both stored on s3) that are essentially just collections of (unique) strings. I want to cross join, compute levenshtein distance, and write it out to s3 as a new table. So my code looks like:
OUT_LOC = 's3://<BUCKET>/<PREFIX>/'
if __name__ == '__main__':
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder \
.appName('my-app') \
.config("hive.metastore.client.factory.class",
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport() \
.getOrCreate()
tb0 = spark.sql('SELECT col0 FROM db0.table0')
tb1 = spark.sql('SELECT col1 FROM db1.table1')
spark.sql("set spark.sql.crossJoin.enabled=true")
tb0.join(tb1).withColumn("levenshtein_distance",
F.levenshtein(F.col("col0"), F.col("col1"))) \
.write.format('parquet').mode('overwrite') \
.options(path=OUT_LOC, compression='snappy', maxRecordsPerFile=10000) \
.saveAsTable('db2.new_table')
It seems to me that this is massively parallelizable, and spark should be able to chug through this while only reading in a minimal amount of data at a time. But for some reason, the tasks keep on ghost dying.
So my questions are:
Is there a setting I'm missing? Or just more generally, what's going on here?
There's no reason for the whole thing to be stored locally, right?
What are some best practices here that I should consider?
For what it's worth, I have googled around extensively, but couldn't find anyone else with this issue. Maybe my google-foo isn't strong enough, or maybe I'm just doing something stupid.
edit
To #egordoe's advice...
I ran the explain and got back the following...
== Parsed Logical Plan ==
'Project [col0#0, col1#3, levenshtein('col0, 'col1) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Analyzed Logical Plan ==
col0: string, col1: string, levenshtein_distance: int
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Optimized Logical Plan ==
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Relation[col0#0] parquet
+- Relation[col1#3] parquet
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- BroadcastNestedLoopJoin BuildRight, Inner
:- FileScan parquet db0.table0[col0#0] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col0:string>
+- BroadcastExchange IdentityBroadcastMode
+- FileScan parquet db1.table1[col1#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:string>
========== finished
Seems reasonable to me, but the explanation doesn't include the actual writing of the data. I assume that's because it likes to build up a cache of results locally then ship the whole thing to s3 as a table after? That would be pretty lame.
edit 1
I also ran the foreach example you suggested with a simple print statement in there. It hung around for 40 minutes without printing anything before I killed it. I'm now running the job with a function that does nothing (it's just a pass statement) to see if it even finishes.

How to load a bucketed DataFrame that would preserve bucketing?

I have bucketized a dataframe, i.e. bucketBy and saveAsTable.
If I load it with spark.read.parquet, I don't benefit from optimization (no shuffling).
scala> spark.read.parquet("${spark-warehouse}/tab1").groupBy("a").count.explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#35117], functions=[count(1)], output=[a#35117, count#35126L])
+- Exchange hashpartitioning(a#35117, 200)
+- *HashAggregate(keys=[a#35117], functions=[partial_count(1)], output=[a#35117, count#35132L])
+- *FileScan parquet [a#35117] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I need to load it with spark.table to benefit from optimization.
scala> spark.table("tab1").groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#35140L])
+- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#35146L])
+- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I don't understand why Spark do not detect automatically the bucketization in the first case, by using the filename for example that is a bit different in this case part-00007-ca117fc2-2552-4693-b6f7-6b27c7c4bca7_00001.snappy.parquet ?
I don't understand why Spark do not detect automatically the bucketization in the first case
Simple. No support for bucketed dataframes that are not loaded as bucketed tables using spark.table.

How to load only the data of the last partition

I have some data partitioned this way:
/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1
When using this data, I'd like to load the last version only of each month.
A simple way to do this is to do load("/data/year=2016/month=11/version=3") instead of doing load("/data").
The drawback of this solution is the loss of partitioning information such as year and month, which means it would not be possible to apply operations based on the year or the month anymore.
Is it possible to ask Spark to load the last version only of each month? How would you go about this?
Well, Spark supports predicate push-down, so if you provide a filter following the load, it will only read in the data fulfilling the criteria in the filter. Like this:
spark.read.option("basePath", "/data").load("/data").filter('version === 3)
And you get to keep the partitioning information :)
Just an addition to previous answers for reference
I have a below ORC format table in hive which is partitioned on year,month & date column.
hive (default)> show partitions test_dev_db.partition_date_table;
OK
year=2019/month=08/day=07
year=2019/month=08/day=08
year=2019/month=08/day=09
If I set below properties, I can read the latest partition data in spark sql as shown below:
spark.sql("SET spark.sql.orc.enabled=true");
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SET spark.sql.orc.filterPushdown=true")
spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day='07' """).explain(True)
we can see PartitionCount: 1 in plan and it's obvious that it has filtered the latest partition.
== Physical Plan ==
*(1) FileScan orc test_dev_db.partition_date_table[emp_id#212,emp_name#213,emp_salary#214,emp_date#215,year#216,month#217,day#218] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxxx/dev/hadoop/database/test_dev..., **PartitionCount: 1**, PartitionFilters: [isnotnull(year#216), isnotnull(month#217), isnotnull(day#218), (year#216 = 2019), (month#217 = 0..., PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
whereas same will not work if I use below query:
even if we create dataframe using spark.read.format("orc").load(hdfs absolute path of table) and create a temporary view and run spark sql on that. It will still scan all the partitions available for that table until and unless we use specific filter condition on top of that.
spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day in (select max(day) from test_dev_db.partition_date_table)""").explain(True)
It still has scanned all the three partitions, here PartitionCount: 3
== Physical Plan ==
*(2) BroadcastHashJoin [day#282], [max(day)#291], LeftSemi, BuildRight
:- *(2) FileScan orc test_dev_db.partition_date_table[emp_id#276,emp_name#277,emp_salary#278,emp_date#279,year#280,month#281,day#282] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 3, PartitionFilters: [isnotnull(year#280), isnotnull(month#281), (year#280 = 2019), (month#281 = 08)], PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
To filter out the data based on the max partition using spark sql, we can use the below approach. we can use below technique for partition pruning to limits the number of files and partitions that Spark reads when querying the Hive ORC table data.
rdd=spark.sql("""show partitions test_dev_db.partition_date_table""").rdd.flatMap(lambda x:x)
newrdd=rdd.map(lambda x : x.replace("/","")).map(lambda x : x.replace("year=","")).map(lambda x : x.replace("month=","-")).map(lambda x : x.replace("day=","-")).map(lambda x : x.split('-'))
max_year=newrdd.map(lambda x : (x[0])).max()
max_month=newrdd.map(lambda x : x[1]).max()
max_day=newrdd.map(lambda x : x[2]).max()
prepare your query to filter Hive partition table using these max values.
query = "select * from test_dev_db.partition_date_table where year ='{0}' and month='{1}' and day ='{2}'".format(max_year,max_month,max_day)
>>> spark.sql(query).show();
+------+--------+----------+----------+----+-----+---+
|emp_id|emp_name|emp_salary| emp_date|year|month|day|
+------+--------+----------+----------+----+-----+---+
| 3| Govind| 810000|2019-08-09|2019| 08| 09|
| 4| Vikash| 5500|2019-08-09|2019| 08| 09|
+------+--------+----------+----------+----+-----+---+
spark.sql(query).explain(True)
If you see the plan of this query, you can see that it has scanned only one partition of given Hive table.
here PartitionCount is 1
== Optimized Logical Plan ==
Filter (((((isnotnull(day#397) && isnotnull(month#396)) && isnotnull(year#395)) && (year#395 = 2019)) && (month#396 = 08)) && (day#397 = 09))
+- Relation[emp_id#391,emp_name#392,emp_salary#393,emp_date#394,year#395,month#396,day#397] orc
== Physical Plan ==
*(1) FileScan orc test_dev_db.partition_date_table[emp_id#391,emp_name#392,emp_salary#393,emp_date#394,year#395,month#396,day#397] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 1, PartitionFilters: [isnotnull(day#397), isnotnull(month#396), isnotnull(year#395), (year#395 = 2019), (month#396 = 0..., PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
I think you have to use Spark's Window Function and then find and filter out the latest version.
import org.apache.spark.sql.functions.{col, first}
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("year","month").orderBy(col("version").desc)
spark.read.load("/data")
.withColumn("maxVersion", first("version").over(windowSpec))
.select("*")
.filter(col("maxVersion") === col("version"))
.drop("maxVersion")
Let me know if this works for you.
Here's a Scala general function:
/**
* Given a DataFrame, use keys (e.g. last modified time), to show the most up to date record
*
* #param dF DataFrame to be parsed
* #param groupByKeys These are the columns you would like to groupBy and expect to be duplicated,
* hence why you're trying to obtain records according to a latest value of keys.
* #param keys The sequence of keys used to rank the records in the table
* #return DataFrame with records that have rank 1, this means the most up to date version of those records
*/
def getLastUpdatedRecords(dF: DataFrame, groupByKeys: Seq[String], keys: Seq[String]): DataFrame = {
val part = Window.partitionBy(groupByKeys.head, groupByKeys.tail: _*).orderBy(array(keys.head, keys.tail: _*).desc)
val rowDF = dF.withColumn("rn", row_number().over(part))
val res = rowDF.filter(col("rn")===1).drop("rn")
res
}

Resources