spark order by followed by partition by

I have the code snippet below and I am a little confused about the execution order of orderBy and partitionBy:
MY_DATA_FRAME.orderBy(ORDER_BY_FIELD)
  .coalesce(NUM_OF_PARTITIONS)
  .write.format("parquet")
  .option("compression", "zip")
  .partitionBy(PARTITION_BY_FIELD)
  .option("path", LOCATION)
  .save(FILE_NAME)
After partitionBy and the write to output files, are those output files still ordered by ORDER_BY_FIELD?
Thank you.

Looking at the Spark physical plan, it seems no additional ordering is performed while writing the partitioned files after the order by; hence I think the ordering of the rows as specified in the order by should be maintained.
spark.sql(
"""
|CREATE TABLE IF NOT EXISTS data_source_tab1 (col1 INT, p1 STRING, p2 STRING)
| USING PARQUET PARTITIONED BY (p1, p2)
""".stripMargin).show(false)
val table = spark.sql("select p2, col1 from values ('bob', 1), ('sam', 2), ('bob', 1) T(p2,col1)")
table.createOrReplaceTempView("table")
spark.sql(
"""
|INSERT INTO data_source_tab1 PARTITION (p1 = 'part1', p2)
| SELECT p2, col1 FROM table order by col1
""".stripMargin).explain(true)
Physical plan:
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/.../spark-warehouse/data_source_tab1, Map(p1 -> part1), false, [p1#14, p2#13], Parquet, Map(path -> file:/.../spark-warehouse/data_source_tab1), Append, CatalogTable(
Database: default
Table: data_source_tab1
Created Time: Wed Jun 10 11:25:12 IST 2020
Last Access: Thu Jan 01 05:29:59 IST 1970
Created By: Spark 2.4.5
Type: MANAGED
Provider: PARQUET
Location: file:/.../spark-warehouse/data_source_tab1
Partition Provider: Catalog
Partition Columns: [`p1`, `p2`]
Schema: root
-- col1: integer (nullable = true)
-- p1: string (nullable = true)
-- p2: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex#bbb7b43b, [col1, p1, p2]
+- *(1) Project [cast(p2#1 as int) AS col1#12, part1 AS p1#14, cast(col1#2 as string) AS p2#13]
+- *(1) Sort [col1#2 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(col1#2 ASC NULLS FIRST, 2)
+- LocalTableScan [p2#1, col1#2]
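If you would rather not rely on the planner preserving the sort through the write, a common pattern is to sort within partitions explicitly just before writing. Below is a minimal sketch only, reusing the placeholder names from the question and assuming they are strings holding column names and a path; note also that, as far as I know, "zip" is not an accepted Parquet codec, so the sketch uses snappy instead.
import org.apache.spark.sql.functions.col

MY_DATA_FRAME
  .repartition(col(PARTITION_BY_FIELD))          // co-locate rows with the same partition value in one task
  .sortWithinPartitions(col(PARTITION_BY_FIELD), col(ORDER_BY_FIELD))  // order rows inside each writer task
  .write
  .format("parquet")
  .option("compression", "snappy")               // "zip" is not a Parquet codec; snappy/gzip/zstd are the usual choices
  .partitionBy(PARTITION_BY_FIELD)
  .save(LOCATION)
Each writer task then emits the rows for its partition values already sorted by ORDER_BY_FIELD, so the per-file ordering no longer depends on how the write plan treats the earlier global sort.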

Related

Spark ENSURE_REQUIREMENTS explanation

Can someone explain with a practical example how ENSURE_REQUIREMENTS comes into effect?
Reading up on this topic has not really made it clear.
I looked at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala but I am not sure what to make of it. Is it some sort of insurance by Spark that things go well? I find the documentation cryptic.
You can refer to another SO question of mine: Spark JOIN on 2 DF's with same Partitioner in 2.4.5 vs 3.1.2 appears to differ in approach, unfavourably for newer version. There I experimented but did not get the gist of why this is occurring.
None of my colleagues can explain it either.
Let's assume we want to find out how weather affects tourist visits to Acadia National Park:
scala> spark.sql("SET spark.sql.shuffle.partitions=10")
scala> val ds = spark.sql("SELECT Date, AVG(VisitDuration) AvgVisitDuration FROM visits GROUP BY Date")
scala> ds.createOrReplaceTempView("visit_stats")
scala> val dwv = spark.sql("SELECT /*+ MERGEJOIN(v) */ w.*, v.AvgVisitDuration FROM weather w JOIN visit_stats v ON w.Date = v.Date")
scala> dwv.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [Date#91, MaxTemp#92, AverageTemp#93, MinTemp#94, Precip#95, Conditions#96, AvgVisitDuration#216]
+- SortMergeJoin [Date#91], [Date#27], Inner
:- Sort [Date#91 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(Date#91, 10), ENSURE_REQUIREMENTS, [id=#478]
: +- Filter isnotnull(Date#91)
: +- FileScan ...
+- Sort [Date#27 ASC NULLS FIRST], false, 0
+- HashAggregate(keys=[Date#27], functions=[avg(cast(VisitDuration#31 as double))])
+- Exchange hashpartitioning(Date#27, 10), ENSURE_REQUIREMENTS, [id=#474]
+- HashAggregate(keys=[Date#27], functions=[partial_avg(cast(VisitDuration#31 as double))])
+- Filter isnotnull(Date#27)
+- FileScan ...
Worth noting that a) Spark decided to shuffle both datasets into 10 partitions, both to calculate the average and to perform the join, and b) the shuffle origin in both cases is ENSURE_REQUIREMENTS.
Now let's say the visits dataset is quite large, so we want to increase the parallelism of our stats calculation, and we repartition it to a higher number of partitions.
scala> val dvr = dv.repartition(100,col("Date"))
scala> dvr.createOrReplaceTempView("visits_rep")
scala> val ds = spark.sql("SELECT Date, AVG(AvgDuration) AvgVisitDuration FROM visits_rep GROUP BY Date")
scala> ds.createOrReplaceTempView("visit_stats")
scala> val dwv = spark.sql("SELECT /*+ MERGEJOIN(v) */ w.*, v.AvgVisitDuration from weather w join visit_stats v on w.Date = v.Date")
scala> dwv.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [Date#91, MaxTemp#92, AverageTemp#93, MinTemp#94, Precip#95, Conditions#96, AvgVisitDuration#231]
+- SortMergeJoin [Date#91], [Date#27], Inner
:- Sort [Date#91 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(Date#91, 100), ENSURE_REQUIREMENTS, [id=#531]
: +- Filter isnotnull(Date#91)
: +- FileScan ...
+- Sort [Date#27 ASC NULLS FIRST], false, 0
+- HashAggregate(keys=[Date#27], functions=[avg(cast(VisitDuration#31 as double))])
+- HashAggregate(keys=[Date#27], functions=[partial_avg(cast(VisitDuration#31 as double))])
+- Exchange hashpartitioning(Date#27, 100), REPARTITION_BY_NUM, [id=#524]
+- Filter isnotnull(Date#27)
+- FileScan ...
Here, the REPARTITION_BY_NUM shuffle origin dictated the need for 100 partitions, so Spark optimized the other, ENSURE_REQUIREMENTS, origin to also use a hundred, eliminating the need for another shuffle.
This is just one simple case, but I'm sure there are many other optimizations Spark can apply to a DAG that contains shuffles with the ENSURE_REQUIREMENTS origin.
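For anyone who wants to reproduce this without the tourism datasets, here is a small self-contained sketch with made-up column names, assuming a Spark 3.x spark-shell; explain() should show an Exchange hashpartitioning node with the ENSURE_REQUIREMENTS origin under each side of the sort-merge join.
import org.apache.spark.sql.functions._

val weather = spark.range(0, 1000).select(col("id").as("Date"), rand().as("MaxTemp"))
val visits  = spark.range(0, 100000).select((col("id") % 1000).as("Date"), rand().as("VisitDuration"))

val visitStats = visits.groupBy("Date").agg(avg("VisitDuration").as("AvgVisitDuration"))

// the merge hint forces a sort-merge join, which requires both sides to be
// hash-partitioned on Date; since neither side declares such a partitioning,
// the planner inserts the exchanges itself with the ENSURE_REQUIREMENTS origin
weather.join(visitStats.hint("merge"), "Date").explain()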

How does pyspark join happen on dataframes that are already suitably partitioned?

With the example of join:
A typical workflow of spark join is:
Shuffle the datasets to bring the same keys to the same partitions for the respective dataset
sort
join across partitions
What if I repartition both datasets beforehand with the same number of partitions and the same merge key? Then the join should not need a shuffle, since I have already achieved that.
How does pyspark know this? Is it told by the user explicitly (and if so, how?), or does pyspark check it by iterating over all the keys in all the partitions once?
Is this true for all wide transformations? If I repartition beforehand, how does Spark decide not to shuffle?
The following code was used to JOIN 2 DF's that were already suitably partitioned, with the shuffle.partitions parameter matching for good measure. In addition, I compared Spark 2.4.5 and 3.1.2.
%python
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import rand, randn
df1 = sqlContext.range(0, 10000000)
df2 = sqlContext.range(0, 10000000)
df3 = sqlContext.range(0, 10000000)
# Bump up numbers
df1 = df1.select("id", (20 * rand(seed=10)).cast(IntegerType()).alias("v1"))
df2 = df2.select("id", (50 * randn(seed=27)).cast(IntegerType()).alias("v1"))
df3 = df3.select("id", (50 * randn(seed=27)).cast(IntegerType()).alias("v2"))
df1rc = df1.repartition(23, "v1")
df2rc = df2.repartition(6, "v1")
df3rc = df3.repartition(23, "v2")
spark.sparkContext.setCheckpointDir("/foo/bar")
df1rc = df1rc.checkpoint()
df2rc = df2rc.checkpoint()
df3rc = df3rc.checkpoint()
spark.conf.set("spark.sql.shuffle.partitions", 23)
res = df1rc.join(df3rc, df1rc.v1 == df3rc.v2).explain()
.explain() returns the Physical Plan in 2.4.5 as per below; it shows that Catalyst (non-AQE) correctly chooses not to shuffle, as both DF's have the same Partitioner (albeit on different columns) and thus the same number of partitions:
== Physical Plan ==
*(3) SortMergeJoin [v1#84], [v2#90], Inner
:- *(1) Sort [v1#84 ASC NULLS FIRST], false, 0
: +- *(1) Filter isnotnull(v1#84)
: +- *(1) Scan ExistingRDD[id#78L,v1#84]
+- *(2) Sort [v2#90 ASC NULLS FIRST], false, 0
+- *(2) Filter isnotnull(v2#90)
+- *(2) Scan ExistingRDD[id#82L,v2#90]
.explain() returns the Physical Plan in 3.1.2 as per below, in which we see hash partitioning, i.e. a shuffle being applied. To me that seems to be a bug: I think unnecessary shuffles are occurring, and ENSURE_REQUIREMENTS seems to add a shuffle that is redundant in our case.
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [v1#91], [v2#97], Inner
:- Sort [v1#91 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(v1#91, 23), ENSURE_REQUIREMENTS, [id=#331]
: +- Filter isnotnull(v1#91)
: +- Scan ExistingRDD[id#85L,v1#91]
+- Sort [v2#97 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(v2#97, 23), ENSURE_REQUIREMENTS, [id=#332]
+- Filter isnotnull(v2#97)
+- Scan ExistingRDD[id#89L,v2#97]
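As an aside: if the goal is a join that needs no Exchange even across separate jobs, the mechanism Spark supports for persisting a partitioning is bucketing rather than repartition plus checkpoint. Below is a rough Scala sketch with made-up table names (the PySpark DataFrameWriter calls are the same); whether the Exchange actually disappears depends on the Spark version and on spark.sql.sources.bucketing.enabled, so check the resulting plan.
import org.apache.spark.sql.functions._

// write both sides bucketed on their join key with the same bucket count
spark.range(0, 1000000).withColumn("v1", (rand(10) * 20).cast("int"))
  .write.bucketBy(23, "v1").sortBy("v1").saveAsTable("left_bucketed")

spark.range(0, 1000000).withColumn("v2", (randn(27) * 50).cast("int"))
  .write.bucketBy(23, "v2").sortBy("v2").saveAsTable("right_bucketed")

// with matching bucket counts, the sort-merge join can read both sides
// without inserting an Exchange
spark.table("left_bucketed")
  .join(spark.table("right_bucketed"), col("v1") === col("v2"))
  .explain()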

Does Spark SQL consider limit when join?

I did the following experiment.
Query 1:
select f1, f2 from A where id = 10 limit 1
| f1 | f2 |
-----------
|  1 |  2 |
Query 2:
select * from B as b where b.f1 = 1 and b.f2 = 2 limit 1
Both Query 1 and Query 2 run very fast.
However, when I ran the following:
select B.*
from B join A
on B.f1 = A.f1 and B.f2 = A.f2
where A.id = 10 limit 1
It runs slowly, with many stages and tasks...
I had assumed the last query would not be much more expensive than Query 1 and Query 2, given the 'limit 1'. Its plan is as follows. Does this indicate that the limit 1 is applied only after the whole join has finished...?
== Optimized Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
+- Join Inner, ((obj_id#352L = obj_id#342L) && (obj_type#351 = obj_type#341))
:- Project [uid#350L, obj_type#351, obj_id#352L]
: +- Filter ...
: +- Relation[...] parquet
+- Aggregate [obj_id#342L, obj_type#341], [obj_id#342L, obj_type#341]
+- Project [obj_type#341, obj_id#342L]
+- Filter ...
+- Relation[...] parquet
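The plan above does suggest exactly that: GlobalLimit/LocalLimit sit on top of the Join, so with a sort-merge join both inputs are still fully scanned, shuffled and sorted before the first joined row (and hence the limit) can be produced. One workaround, sketched below with the table and column names from Query 1 and Query 2 (which may not match the real schema), is to run the selective, limited lookup first and join its tiny result back, which typically lets Spark broadcast that single row instead of shuffling both tables.
import org.apache.spark.sql.functions.col

// same as Query 1: a cheap, limited lookup on A
val aKeys = spark.table("A")
  .where(col("id") === 10)
  .select("f1", "f2")
  .limit(1)

// same idea as Query 2, expressed as a join against the tiny result
val result = spark.table("B")
  .join(aKeys, Seq("f1", "f2"))
  .limit(1)

result.show()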

Spark 2.1 Hive Partition Adding Issue ORC Format

I am using pyspark 2.1 to create partitions dynamically from table A into table B. Below are the DDLs:
create table A (
    objid bigint,
    occur_date timestamp)
STORED AS ORC;

create table B (
    objid bigint,
    occur_date timestamp)
PARTITIONED BY (
    occur_date_pt date)
STORED AS ORC;
I am then using pyspark code in which I try to determine the partitions that need to be merged; below is the portion of the code where I am actually doing that:
from pyspark.sql.functions import col

for row in incremental_df.select(partitioned_column).distinct().collect():
    path = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    old_df = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    output_df = old_df.subtract(new_df)
    output_df = output_df.unionAll(new_df)
    output_df.write.option("compression", "none").mode("overwrite").format("orc").save(path)

refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
sqlContext.sql(refresh_metadata_sql)
On execution of the code, I am able to see the partitions in HDFS:
Found 3 items
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03
But when I try to access the table inside Spark, I get an index out of bounds error:
>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:252)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:251)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
Any help or pointers to resolve the issue will be appreciated.
Posting the comment as an answer for easier reference:
Please ensure that the partition column is not included in the DataFrame being written.
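In other words, because the directory name occur_date_pt=... already carries the partition value, the files written into it should not contain that column themselves, otherwise Hive's ORC reader sees a schema mismatch and fails as above. Below is a sketch of the fix, shown in Scala to match most of this page (in PySpark it is simply output_df.drop(partitioned_column) before the save); outputDf and path stand in for the question's output_df and path.
// drop the partition column before writing into the partition directory,
// since the directory name occur_date_pt=... already encodes that value
val cleanedDf = outputDf.drop("occur_date_pt")

cleanedDf.write
  .option("compression", "none")
  .mode("overwrite")
  .format("orc")
  .save(path)   // e.g. /apps/hive/warehouse/B/occur_date_pt=2017-06-01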

Spark: How to time range join two lists in memory?

I am new to Spark and I’m having difficulties wrapping my mind around this way of thinking.
The following problems seem generic, but I have no idea how I can solve them using Spark and the memory of its nodes only.
I have two lists (i.e.: RDDs):
List1 - (id, start_time, value) where the tuple (id, start_time) is unique
List2 - (id, timestamp)
First problem: go over List2 and for each (id, timestamp) find in List1 a value that has the same id and the maximal start_time that is before the timestamp.
For example:
List1:
(1, 10:00, a)
(1, 10:05, b)
(1, 10:30, c)
(2, 10:02, d)
List2:
(1, 10:02)
(1, 10:29)
(2, 10:03)
(2, 10:04)
Result:
(1, 10:02) => a
(1, 10:29) => b
(2, 10:03) => d
(2, 10:04) => d
Second problem: very similar to the first problem, but now the start_time and timestamp are fuzzy. This means that a time t may be anywhere between (t - delta) and (t + delta). Again, I need to time join the lists.
Notes:
There is a solution to the first problem using Cassandra, but I'm interested in solving it using Spark and the memory of the nodes only.
List1 has thousands of entries.
List2 has tens of millions of entries.
For brevity I have converted your time data 10:02 to decimal data 10.02; just use a function that converts the time string to a number.
The first problem can be solved easily using Spark SQL, as shown below.
val list1 = spark.sparkContext.parallelize(Seq(
  (1, 10.00, "a"),
  (1, 10.05, "b"),
  (1, 10.30, "c"),
  (2, 10.02, "d"))).toDF("col1", "col2", "col3")

val list2 = spark.sparkContext.parallelize(Seq(
  (1, 10.02),
  (1, 10.29),
  (2, 10.03),
  (2, 10.04))).toDF("col1", "col2")

list1.createOrReplaceTempView("table1")
list2.createOrReplaceTempView("table2")
scala> spark.sql("""
| SELECT col1,col2,col3
| FROM
| (SELECT
| t2.col1, t2.col2, t1.col3,
| ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
| FROM table2 t2
| LEFT JOIN table1 t1
| ON t1.col1 = t2.col1
| AND t2.col2 > t1.col2) tmp
| WHERE tmp.rank = 1""").show()
+----+-----+----+
|col1| col2|col3|
+----+-----+----+
| 1|10.02| a|
| 1|10.29| b|
| 2|10.03| d|
| 2|10.04| d|
+----+-----+----+
Similarly, the solution to the second problem can be derived by just changing the join condition, as shown below (delta needs to be defined with the actual tolerance value before running):
spark.sql("""
SELECT col1,col2,col3
FROM
(SELECT
t2.col1, t2.col2, t1.col3,
ROW_NUMBER() over(PARTITION BY t2.col1, t2.col2 ORDER BY t1.col2 DESC) as rank
FROM table2 t2
LEFT JOIN table1 t1
ON t1.col1 = t2.col1
AND t2.col2 between t1.col2 - ${delta} and t1.col2 + ${delta} ) tmp // replace delta with actual value
WHERE tmp.rank = 1""").show()
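If you prefer the DataFrame API over SQL strings, the same "latest start_time strictly before the timestamp" logic can be written with a window function directly over the list1/list2 frames defined above; this is only an alternative formulation of the query, not a different algorithm.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// left-join each (id, timestamp) to all earlier start_times, then keep the latest one
val ranked = list2.as("t2")
  .join(list1.as("t1"),
    col("t1.col1") === col("t2.col1") && col("t2.col2") > col("t1.col2"),
    "left")
  .withColumn("rank",
    row_number().over(
      Window.partitionBy(col("t2.col1"), col("t2.col2")).orderBy(col("t1.col2").desc)))

ranked.where(col("rank") === 1)
  .select(col("t2.col1"), col("t2.col2"), col("t1.col3"))
  .show()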
