Does Spark perform transformations row by row or together? - apache-spark

I am trying to understand Spark. I have the following code, where df, df2, df3, and df4 are Dataset<Row>:
df=df.join(df2,"ID");
df4=df4.join(df3,"ID");
df=df.union(df4);
long count=df.count();
My question is: how do the transformations happen? In the above example, does the union wait for both joins to complete fully (i.e., the join is done for all rows) before the union starts? Or does it go row by row in a pipelined way, where once the join for one row is complete, the union transform starts on it (even though the join for other rows is still going on)?
I tried searching for this but was not able to find an answer.

Use explain() to see what happens.
df.explain()
== Physical Plan ==
Union
:- *(5) Project [ID#3599L, VALUE#3600, VALUE#3604]
:  +- *(5) SortMergeJoin [ID#3599L], [ID#3603L], Inner
:     :- *(2) Sort [ID#3599L ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(ID#3599L, 200), true, [id=#1332]
:     :     +- *(1) Filter isnotnull(ID#3599L)
:     :        +- *(1) Scan ExistingRDD[ID#3599L,VALUE#3600]
:     +- *(4) Sort [ID#3603L ASC NULLS FIRST], false, 0
:        +- ReusedExchange [ID#3603L, VALUE#3604], Exchange hashpartitioning(ID#3599L, 200), true, [id=#1332]
+- *(10) Project [ID#3599L, VALUE#3600, VALUE#3609]
   +- *(10) SortMergeJoin [ID#3599L], [ID#3608L], Inner
      :- *(7) Sort [ID#3599L ASC NULLS FIRST], false, 0
      :  +- ReusedExchange [ID#3599L, VALUE#3600], Exchange hashpartitioning(ID#3599L, 200), true, [id=#1332]
      +- *(9) Sort [ID#3608L ASC NULLS FIRST], false, 0
         +- ReusedExchange [ID#3608L, VALUE#3609], Exchange hashpartitioning(ID#3599L, 200), true, [id=#1332]
The plan consists of Sort, SortMergeJoin, Project, and Union operators. Execution only happens when you take an action such as count(); before that, Spark just plans how to proceed. Roughly speaking, operators inside the same whole-stage-codegen block (the *(n) prefixes) are pipelined and pass rows through as they are produced, whereas an Exchange (shuffle) marks a stage boundary, and blocking operators such as Sort and SortMergeJoin need all rows of a partition before they can emit output. The Union itself does not shuffle; it simply concatenates the partitions produced by the two join branches.
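To see this for yourself, here is a minimal Scala sketch of the same pattern (made-up data, assuming a spark-shell session where spark and its implicits are available; with inputs this small the planner may pick broadcast joins, so the exact plan will differ from the one above):
import spark.implicits._

// Small stand-ins for df, df2, df3, df4 from the question (made-up data).
val df  = Seq((1L, "a"), (2L, "b")).toDF("ID", "VALUE")
val df2 = Seq((1L, "x"), (2L, "y")).toDF("ID", "VALUE")
val df3 = Seq((1L, "m")).toDF("ID", "VALUE")
val df4 = Seq((1L, "n")).toDF("ID", "VALUE")

// Transformations only: Spark just records the lineage and builds a plan here.
val joined1 = df.join(df2, "ID")
val joined2 = df4.join(df3, "ID")
val unioned = joined1.union(joined2)

unioned.explain()            // prints the physical plan; nothing has run yet
val count = unioned.count()  // the action: only now are the joins and the union executed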

Related

Spark ENSURE_REQUIREMENTS explanation

Can someone explain, with a practical example, how ENSURE_REQUIREMENTS takes effect?
Reading up on this topic has not really made it clear.
I looked here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala but am not sure what to make of it. Is it some sort of assurance by Spark that things go well? I find the documentation cryptic.
You can refer to another SO question of mine: Spark JOIN on 2 DF's with same Partitioner in 2.4.5 vs 3.1.2 appears to differ in approach, unfavourably for newer version. There I experimented, but I do not get the gist of why this is occurring.
None of my colleagues can explain it either.
Let's assume we want to find out how weather affects tourist visits to Acadia National Park.
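For context, the weather and visits views used below could be registered along these lines (a hypothetical setup; the file paths and CSV layout are assumptions, not part of the original post, and dv is the visits DataFrame that gets repartitioned further down):
// Hypothetical setup: register the raw data as the two views used in the queries below.
val weather = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/data/acadia/weather.csv")   // Date, MaxTemp, AverageTemp, MinTemp, Precip, Conditions
weather.createOrReplaceTempView("weather")

val dv = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/data/acadia/visits.csv")    // Date, VisitDuration, ...
dv.createOrReplaceTempView("visits")
With those views registered, the stats aggregation and the join look like this: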
scala> spark.sql("SET spark.sql.shuffle.partitions=10")
scala> val ds = spark.sql("SELECT Date, AVG(VisitDuration) AvgVisitDuration FROM visits GROUP BY Date")
scala> ds.createOrReplaceTempView("visit_stats")
scala> val dwv = spark.sql("SELECT /*+ MERGEJOIN(v) */ w.*, v.AvgVisitDuration FROM weather w JOIN visit_stats v ON w.Date = v.Date")
scala> dwv.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [Date#91, MaxTemp#92, AverageTemp#93, MinTemp#94, Precip#95, Conditions#96, AvgVisitDuration#216]
   +- SortMergeJoin [Date#91], [Date#27], Inner
      :- Sort [Date#91 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(Date#91, 10), ENSURE_REQUIREMENTS, [id=#478]
      :     +- Filter isnotnull(Date#91)
      :        +- FileScan ...
      +- Sort [Date#27 ASC NULLS FIRST], false, 0
         +- HashAggregate(keys=[Date#27], functions=[avg(cast(VisitDuration#31 as double))])
            +- Exchange hashpartitioning(Date#27, 10), ENSURE_REQUIREMENTS, [id=#474]
               +- HashAggregate(keys=[Date#27], functions=[partial_avg(cast(VisitDuration#31 as double))])
                  +- Filter isnotnull(Date#27)
                     +- FileScan ...
Worth noting that a) Spark decided to shuffle both datasets into 10 partitions, both to calculate the average and to perform the join, and b) the shuffle origin in both cases is ENSURE_REQUIREMENTS.
Now let's say the visits dataset is quite large, so we want to increase the parallelism of our stats calculation and we repartition it to a higher number of partitions.
scala> val dvr = dv.repartition(100,col("Date"))
scala> dvr.createOrReplaceTempView("visits_rep")
scala> val ds = spark.sql("SELECT Date, AVG(AvgDuration) AvgVisitDuration FROM visits_rep GROUP BY Date")
scala> ds.createOrReplaceTempView("visit_stats")
scala> val dwv = spark.sql("SELECT /*+ MERGEJOIN(v) */ w.*, v.AvgVisitDuration from weather w join visit_stats v on w.Date = v.Date")
scala> dwv.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [Date#91, MaxTemp#92, AverageTemp#93, MinTemp#94, Precip#95, Conditions#96, AvgVisitDuration#231]
   +- SortMergeJoin [Date#91], [Date#27], Inner
      :- Sort [Date#91 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(Date#91, 100), ENSURE_REQUIREMENTS, [id=#531]
      :     +- Filter isnotnull(Date#91)
      :        +- FileScan ...
      +- Sort [Date#27 ASC NULLS FIRST], false, 0
         +- HashAggregate(keys=[Date#27], functions=[avg(cast(VisitDuration#31 as double))])
            +- HashAggregate(keys=[Date#27], functions=[partial_avg(cast(VisitDuration#31 as double))])
               +- Exchange hashpartitioning(Date#27, 100), REPARTITION_BY_NUM, [id=#524]
                  +- Filter isnotnull(Date#27)
                     +- FileScan ...
Here, the REPARTITION_BY_NUM shuffle origin dictated the need for 100 partitions, so Spark optimized the other, ENSURE_REQUIREMENTS, exchange to also use a hundred, thus eliminating the need for another shuffle.
This is just one simple case, but I'm sure there are many other optimizations Spark can apply to DAGs that contain shuffles with the ENSURE_REQUIREMENTS origin.
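A quick way to confirm that the join actually ran with that common partitioning is to check the number of partitions of the result (hedged: with adaptive query execution enabled, the final number can still be coalesced at runtime):
// Number of partitions the joined result is actually split into; with the
// repartition(100, col("Date")) in place this should reflect the 100-partition
// layout (AQE may coalesce small partitions further).
println(dwv.rdd.getNumPartitions)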

How does pyspark join happen on dataframes that are already suitably partitioned?

With the example of a join:
A typical Spark join workflow is:
shuffle the datasets so that the same keys end up in the same partitions of the respective datasets
sort
join across partitions
What if I repartition both datasets beforehand, with the same number of partitions and on the merge key? Then the join should not need to shuffle, since I have already achieved that.
How does PySpark know this? Is it told by the user explicitly (in which case, how do I tell it?), or does PySpark check this by iterating over all the keys in all the partitions once?
Is this true for all wide transformations? If I repartition beforehand, how does Spark decide not to shuffle?
The following code was used to JOIN two DF's that are already suitably partitioned, with the shuffle.partitions parameter matching for good measure. In addition, I compared Spark 2.4.5 and 3.1.2.
%python
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import rand, randn
df1 = sqlContext.range(0, 10000000)
df2 = sqlContext.range(0, 10000000)
df3 = sqlContext.range(0, 10000000)
# Bump up numbers
df1 = df1.select("id", (20 * rand(seed=10)).cast(IntegerType()).alias("v1"))
df2 = df2.select("id", (50 * randn(seed=27)).cast(IntegerType()).alias("v1"))
df3 = df3.select("id", (50 * randn(seed=27)).cast(IntegerType()).alias("v2"))
df1rc = df1.repartition(23, "v1")
df2rc = df2.repartition(6, "v1")
df3rc = df3.repartition(23, "v2")
spark.sparkContext.setCheckpointDir("/foo/bar")
df1rc = df1rc.checkpoint()
df2rc = df2rc.checkpoint()
df3rc = df3rc.checkpoint()
spark.conf.set("spark.sql.shuffle.partitions", 23)
res = df1rc.join(df3rc, df1rc.v1 == df3rc.v2).explain()
.explain() returns the Physical Plan below in 2.4.5; it shows that Catalyst (non-AQE) takes the correct course and does not shuffle, as both DF's have the same partitioner (on different columns) and thus the same number of partitions:
== Physical Plan ==
*(3) SortMergeJoin [v1#84], [v2#90], Inner
:- *(1) Sort [v1#84 ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(v1#84)
:     +- *(1) Scan ExistingRDD[id#78L,v1#84]
+- *(2) Sort [v2#90 ASC NULLS FIRST], false, 0
   +- *(2) Filter isnotnull(v2#90)
      +- *(2) Scan ExistingRDD[id#82L,v2#90]
.explain() returns the Physical Plan below in 3.1.2, in which we see hash partitioning, i.e. a shuffle being applied. To me that seems to be a bug; I think unnecessary shuffles are occurring. ENSURE_REQUIREMENTS seems to add a shuffle that is, in our case, redundant.
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [v1#91], [v2#97], Inner
   :- Sort [v1#91 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(v1#91, 23), ENSURE_REQUIREMENTS, [id=#331]
   :     +- Filter isnotnull(v1#91)
   :        +- Scan ExistingRDD[id#85L,v1#91]
   +- Sort [v2#97 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(v2#97, 23), ENSURE_REQUIREMENTS, [id=#332]
         +- Filter isnotnull(v2#97)
            +- Scan ExistingRDD[id#89L,v2#97]
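For completeness, one way to get a shuffle-free sort-merge join on 3.x regardless of how ENSURE_REQUIREMENTS treats a checkpointed repartition is to persist both sides as bucketed tables on the join keys. This is a hedged sketch, not the original poster's setup: the table names are made up, and the DataFrames are re-created in Scala to mirror the PySpark code above.
import org.apache.spark.sql.functions.{col, rand, randn}
import org.apache.spark.sql.types.IntegerType

// Re-create the two sides (mirrors the PySpark setup above).
val df1 = spark.range(0, 10000000L)
  .select(col("id"), (rand(10) * 20).cast(IntegerType).alias("v1"))
val df3 = spark.range(0, 10000000L)
  .select(col("id"), (randn(27) * 50).cast(IntegerType).alias("v2"))

// Persist both sides bucketed (and sorted) on the join keys with the same
// bucket count, so the on-disk layout already satisfies the join's requirement.
df1.write.bucketBy(23, "v1").sortBy("v1").saveAsTable("df1_bucketed")
df3.write.bucketBy(23, "v2").sortBy("v2").saveAsTable("df3_bucketed")

val b1 = spark.table("df1_bucketed")
val b3 = spark.table("df3_bucketed")

// Expect a SortMergeJoin with no Exchange on either side: the bucket counts
// match, so EnsureRequirements has nothing left to add.
b1.join(b3, b1("v1") === b3("v2")).explain()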

spark order by followed by partition by

I have the code snippet below, and I am a little confused about the execution order of orderBy and partitionBy.
MY_DATA_FRAME.orderBy(ORDER_BY_FIELD).coalesce(NUM_OF_PARTITIONS).write.format("parquet").option("compression", "zip").partitionBy(PARTITION_BY_FIELD).option("path",LOCATION).save(FILE_NAME)
After the partitionBy and the write to output files, do the output files still satisfy the ordering by ORDER_BY_FIELD?
Thank you.
Looking at the Spark physical plan, it seems no additional ordering operation is performed while saving the partition files after the order by. Hence I think the ordering of the rows, as specified in the order by, should be maintained.
spark.sql(
"""
|CREATE TABLE IF NOT EXISTS data_source_tab1 (col1 INT, p1 STRING, p2 STRING)
| USING PARQUET PARTITIONED BY (p1, p2)
""".stripMargin).show(false)
val table = spark.sql("select p2, col1 from values ('bob', 1), ('sam', 2), ('bob', 1) T(p2,col1)")
table.createOrReplaceTempView("table")
spark.sql(
"""
|INSERT INTO data_source_tab1 PARTITION (p1 = 'part1', p2)
| SELECT p2, col1 FROM table order by col1
""".stripMargin).explain(true)
Physical plan:
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/.../spark-warehouse/data_source_tab1, Map(p1 -> part1), false, [p1#14, p2#13], Parquet, Map(path -> file:/.../spark-warehouse/data_source_tab1), Append, CatalogTable(
Database: default
Table: data_source_tab1
Created Time: Wed Jun 10 11:25:12 IST 2020
Last Access: Thu Jan 01 05:29:59 IST 1970
Created By: Spark 2.4.5
Type: MANAGED
Provider: PARQUET
Location: file:/.../spark-warehouse/data_source_tab1
Partition Provider: Catalog
Partition Columns: [`p1`, `p2`]
Schema: root
 |-- col1: integer (nullable = true)
 |-- p1: string (nullable = true)
 |-- p2: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex@bbb7b43b, [col1, p1, p2]
+- *(1) Project [cast(p2#1 as int) AS col1#12, part1 AS p1#14, cast(col1#2 as string) AS p2#13]
   +- *(1) Sort [col1#2 ASC NULLS FIRST], true, 0
      +- Exchange rangepartitioning(col1#2 ASC NULLS FIRST, 2)
         +- LocalTableScan [p2#1, col1#2]
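That said, if per-file ordering is important, it can be made explicit rather than relied upon. A hedged sketch using the placeholders from the question (MY_DATA_FRAME, ORDER_BY_FIELD, PARTITION_BY_FIELD, NUM_OF_PARTITIONS, LOCATION):
import org.apache.spark.sql.functions.col

// Cluster rows by the output partition column, then sort explicitly within each
// task partition, so every file written under a partition directory is ordered
// by ORDER_BY_FIELD.
MY_DATA_FRAME
  .repartition(NUM_OF_PARTITIONS, col(PARTITION_BY_FIELD))
  .sortWithinPartitions(col(PARTITION_BY_FIELD), col(ORDER_BY_FIELD))
  .write
  .format("parquet")
  .partitionBy(PARTITION_BY_FIELD)
  .save(LOCATION)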

Why is my pyspark command on a 100-row dataframe taking more than 30 minutes to execute?

What do I need to tune? This output dataframe is the output of featuretools4s. I have extracted only 100 rows and 2 columns, and it is still performing badly.
features_2=features.limit(100)
features_2.groupBy('id').count()
This is my Spark test:
df.limit(100).groupBy("id").count().explain
Then,
== Physical Plan ==
*(2) HashAggregate(keys=[id#514], functions=[count(1)])
+- *(2) HashAggregate(keys=[id AS id#514], functions=[partial_count(1)])
   +- *(2) GlobalLimit 100
      +- Exchange SinglePartition
         +- *(1) LocalLimit 100
            +- *(1) FileScan parquet table[] Batched: true, Format: Parquet, Location: InMemoryFileIndex[location], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
The plan indicates that Spark will:
scan the table
limit to 100 rows
group by id
count
That is, the limit happens first and the aggregation afterwards, so there is no doubt about your query itself. I think the problem lies elsewhere, such as in your data source connection or its status.
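If the upstream lineage (here, the featuretools4s pipeline that produced features) is the expensive part, one way to confirm that is to materialize the 100-row sample once and run the aggregation on the materialized copy. A hedged sketch in Scala, with features standing for the dataframe from the question:
// Materialize the small sample once so the expensive upstream plan runs a single time.
val sample = features.limit(100).cache()
sample.count()                         // triggers the upstream computation and fills the cache

// Subsequent work runs against the cached 100 rows only.
sample.groupBy("id").count().show()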

Does Spark SQL consider the limit when joining?

I did the following experiment.
Query 1:
select f1, f2 from A where id = 10 limit 1
| f1 | f2 |
--------------
| 1 | 2 |
Query 2:
select * from B as b where b.f1 = 1 and b.f2 = 2 limit 1
Both Query 1 and Query 2 run very fast.
However, when I ran the following:
select B.*
from B join A
on B.f1 = A.f1 and B.f2 = A.f2
where A.id = 10 limit 1
It runs slowly, with many stages and tasks...
I had assumed the last query would not be much more expensive than Query 1 and Query 2, given the 'limit 1'. Its plan is as follows. Does this indicate that the limit 1 is applied only after the whole join has finished...?
== Optimized Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
   +- Join Inner, ((obj_id#352L = obj_id#342L) && (obj_type#351 = obj_type#341))
      :- Project [uid#350L, obj_type#351, obj_id#352L]
      :  +- Filter ...
      :     +- Relation[...] parquet
      +- Aggregate [obj_id#342L, obj_type#341], [obj_id#342L, obj_type#341]
         +- Project [obj_type#341, obj_id#342L]
            +- Filter ...
               +- Relation[...] parquet
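One way to probe that is to apply the limit to the filtered side first and compare the resulting plan and stages with the original query (a sketch reusing the tables and predicates from the queries above; like Query 1, the pre-limit picks an arbitrary matching row from A):
// Limit the filtered side of A before joining, then compare with the original
// query, where the limit sits above the join in the plan.
val aLimited = spark.sql("SELECT f1, f2 FROM A WHERE id = 10 LIMIT 1")
aLimited.createOrReplaceTempView("a_limited")

spark.sql("""
  SELECT B.*
  FROM B JOIN a_limited
    ON B.f1 = a_limited.f1 AND B.f2 = a_limited.f2
  LIMIT 1
""").explain()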
