Does Spark SQL optimize lower() on both sides? - apache-spark

Say I have this pseudo-code in Spark SQL, where t1 is a temp view built off of partitioned Parquet files in HDFS and t2 is a small lookup file used to filter that temp view:
select t1.*
from t1
where exists(select *
from t2
where t1.id=t2.id and
lower(t1.col) like lower(t2.pattern)) --to mimic ilike functionality
Will the optimizer treat lower(t1.col) like lower(t2.pattern) as a case-insensitive match, or will it actually run the lower transformations on both columns before performing the match?
I don't have access to the DAG to see what exactly happens behind the scenes, so I am asking here to find out whether this is a known/documented optimization.
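For reference (not part of the original question), the plan for the exact SQL above can be printed without UI access by calling explain on the query itself; a minimal sketch, assuming t1 and t2 are already registered as temp views:
spark.sql("""
  select t1.*
  from t1
  where exists(select *
               from t2
               where t1.id = t2.id and
                     lower(t1.col) like lower(t2.pattern))
""").explain(true) // prints the parsed, analyzed, optimized, and physical plans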

I tried to reproduce that case using Scala and then called explain() to get the physical plan (I'm pretty sure SQL and Scala will produce the same physical plan, because behind the scenes it's the same optimizer, Catalyst).
import spark.implicits._
import org.apache.spark.sql.functions.{col, lower}

val df1 = spark.sparkContext.parallelize(Seq(("Car", "car"), ("bike", "Rocket"), ("Bus", "BUS"), ("Auto", "Machine"))).toDF("c1", "c2")
df1.filter(lower(col("c1")).equalTo(lower(col("c2")))).explain()
== Physical Plan ==
*(1) Project [_1#3 AS c1#8, _2#4 AS c2#9]
+- *(1) Filter ((isnotnull(_1#3) AND isnotnull(_2#4)) AND (lower(_1#3) = lower(_2#4)))
+- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false, true) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false, true) AS _2#4]
+- Scan[obj#2]
As you can see in the physical plan, lower is called on both sides of every comparison: lower(_1#3) = lower(_2#4). There is no special case-insensitive match; the transformations are applied before the comparison is performed.
By the way, I tried the same thing by joining two DataFrames and then filtering on lower, and I got the same result.
I hope this answers your question.
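As a side note that is not part of the original answer: since lower is evaluated on both sides for every compared pair, one workaround is to pre-lower the pattern column of the small lookup view once, so only lower(t1.col) remains in the predicate. A minimal sketch, assuming the views from the question:
import org.apache.spark.sql.functions.{col, lower}

// Build a lowered copy of the small lookup side and register it as a view.
spark.table("t2")
  .withColumn("pattern_lc", lower(col("pattern")))
  .createOrReplaceTempView("t2_lc")

spark.sql("""
  select t1.*
  from t1
  where exists(select *
               from t2_lc
               where t1.id = t2_lc.id and
                     lower(t1.col) like t2_lc.pattern_lc)
""")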

Related

Applying Pandas UDF without shuffling

I am trying to apply a pandas UDF to each partition of a Spark (3.3.0) DataFrame separately so as to avoid any shuffling requirements. However, when I run the query below, a lot of data is getting shuffled around. The execution plan contains a SORT stage; this might be the culprit.
import pandas as pd
from pyspark.sql.functions import spark_partition_id

query = df.groupBy(spark_partition_id())\
    .applyInPandas(lambda x: pd.DataFrame([x.shape]), "n_rows long, n_cols long")
query.explain()
Output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapGroupsInPandas [SPARK_PARTITION_ID()#1562], <lambda>(id#0L, date#1L, feature#2, partition_id#926)#1561, [nr#1563L, nc#1564L]
+- Sort [SPARK_PARTITION_ID()#1562 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(SPARK_PARTITION_ID()#1562, 200), ENSURE_REQUIREMENTS, [id=#748]
+- Project [SPARK_PARTITION_ID() AS SPARK_PARTITION_ID()#1562, id#0L, date#1L, feature#2, partition_id#926]
+- Scan ExistingRDD[id#0L,date#1L,feature#2,partition_id#926]
In contrast, if I request the execution plan for a very similar query below, the SORT stage is not there and I detect no shuffling upon execution.
df.groupBy(spark_partition_id()).count().explain()
Output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[_nondeterministic#1532], functions=[count(1)])
+- Exchange hashpartitioning(_nondeterministic#1532, 200), ENSURE_REQUIREMENTS, [id=#704]
+- HashAggregate(keys=[_nondeterministic#1532], functions=[partial_count(1)])
+- Project [SPARK_PARTITION_ID() AS _nondeterministic#1532]
+- Scan ExistingRDD[id#0L,date#1L,feature#2,partition_id#926]
What is happening here and how do I achieve the goal I had stated? Thank you!
After some tinkering, it seems I am able to do what I want as follows, although it is probably not ideal.
spark_session.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", '0')

def get_shape(iterator):
    for pdf in iterator:
        yield pd.DataFrame([pdf.shape])

df.mapInPandas(get_shape, "nr long, nc long").toPandas()
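For comparison only (not from the original question): the same shuffle-free, per-partition counting can be sketched in Scala with mapPartitions, which likewise runs purely map-side and adds no Exchange to the plan; df here is any DataFrame:
import org.apache.spark.sql.DataFrame

// Sketch: one row count per input partition, with no grouping key and no shuffle,
// analogous to the mapInPandas workaround above.
def rowsPerPartition(df: DataFrame): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._
  df.mapPartitions(iter => Iterator(iter.size)).toDF("n_rows")
}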

perform massively parallelizable tasks in pyspark

I often find myself performing massively parallelizable tasks in spark, but for some reason, spark keeps on dying. For instance, right now I have two tables (both stored on s3) that are essentially just collections of (unique) strings. I want to cross join, compute levenshtein distance, and write it out to s3 as a new table. So my code looks like:
OUT_LOC = 's3://<BUCKET>/<PREFIX>/'

if __name__ == '__main__':
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder \
        .appName('my-app') \
        .config("hive.metastore.client.factory.class",
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport() \
        .getOrCreate()

    tb0 = spark.sql('SELECT col0 FROM db0.table0')
    tb1 = spark.sql('SELECT col1 FROM db1.table1')

    spark.sql("set spark.sql.crossJoin.enabled=true")

    tb0.join(tb1).withColumn("levenshtein_distance",
                             F.levenshtein(F.col("col0"), F.col("col1"))) \
        .write.format('parquet').mode('overwrite') \
        .options(path=OUT_LOC, compression='snappy', maxRecordsPerFile=10000) \
        .saveAsTable('db2.new_table')
It seems to me that this is massively parallelizable, and Spark should be able to chug through it while only reading in a minimal amount of data at a time. But for some reason the tasks keep dying silently.
So my questions are:
Is there a setting I'm missing? Or just more generally, what's going on here?
There's no reason for the whole thing to be stored locally, right?
What are some best practices here that I should consider?
For what it's worth, I have googled around extensively but couldn't find anyone else with this issue. Maybe my google-fu isn't strong enough, or maybe I'm just doing something stupid.
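For reference, and purely as an illustration rather than something from this thread: one way to force the cross join into many small tasks is to repartition the non-broadcast side before joining. A rough Scala sketch (untested, using the same table names as above):
import org.apache.spark.sql.functions.{col, levenshtein}

// Sketch: fan the scan of table0 out over many partitions so each task only
// holds a small slice of it against the (broadcast) other side.
val tb0 = spark.sql("SELECT col0 FROM db0.table0").repartition(2000)
val tb1 = spark.sql("SELECT col1 FROM db1.table1")

tb0.crossJoin(tb1)
  .withColumn("levenshtein_distance", levenshtein(col("col0"), col("col1")))
  .write.format("parquet").mode("overwrite")
  .option("path", "s3://<BUCKET>/<PREFIX>/")
  .option("compression", "snappy")
  .option("maxRecordsPerFile", 10000)
  .saveAsTable("db2.new_table")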
edit
In response to @egordoe's advice...
I ran the explain and got back the following...
== Parsed Logical Plan ==
'Project [col0#0, col1#3, levenshtein('col0, 'col1) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Analyzed Logical Plan ==
col0: string, col1: string, levenshtein_distance: int
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Optimized Logical Plan ==
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Relation[col0#0] parquet
+- Relation[col1#3] parquet
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- BroadcastNestedLoopJoin BuildRight, Inner
:- FileScan parquet db0.table0[col0#0] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col0:string>
+- BroadcastExchange IdentityBroadcastMode
+- FileScan parquet db1.table1[col1#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:string>
========== finished
Seems reasonable to me, but the explain output doesn't include the actual writing of the data. I assume that's because it likes to build up a cache of results locally and then ship the whole thing to S3 as a table afterwards? That would be pretty lame.
edit 1
I also ran the foreach example you suggested with a simple print statement in there. It hung around for 40 minutes without printing anything before I killed it. I'm now running the job with a function that does nothing (it's just a pass statement) to see if it even finishes.

Spark 1.6 DataFrame optimize join partitioning

I have a question about Spark DataFrame partitioning. I'm currently using Spark 1.6 due to project requirements. This is my code excerpt:
sqlContext.getConf("spark.sql.shuffle.partitions") // 6
val df = sc.parallelize(List(("A",1),("A",4),("A",2),("B",5),("C",2),("D",2),("E",2),("B",7),("C",9),("D",1))).toDF("id_1","val_1")
df.rdd.getNumPartitions // 4
val df2 = sc.parallelize(List(("B",1),("E",4),("H",2),("J",5),("C",2),("D",2),("F",2))).toDF("id_2","val_2")
df2.rdd.getNumPartitions // 4
val df3 = df.join(df2,$"id_1" === $"id_2")
df3.rdd.getNumPartitions // 6
val df4 = df3.repartition(3,$"id_1")
df4.rdd.getNumPartitions // 3
df4.explain(true)
The following is the explain plan that was created:
== Parsed Logical Plan ==
'RepartitionByExpression ['id_1], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Analyzed Logical Plan ==
id_1: string, val_1: int, id_2: string, val_2: int
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Optimized Logical Plan ==
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Physical Plan ==
TungstenExchange hashpartitioning(id_1#42,3), None
+- SortMergeJoin [id_1#42], [id_2#46]
:- Sort [id_1#42 ASC], false, 0
: +- TungstenExchange hashpartitioning(id_1#42,6), None
: +- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- Scan ExistingRDD[_1#40,_2#41]
+- Sort [id_2#46 ASC], false, 0
+- TungstenExchange hashpartitioning(id_2#46,6), None
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- Scan ExistingRDD[_1#44,_2#45]
As far as I know, a DataFrame is an abstraction layer over an RDD, so partitioning should be delegated to the Catalyst optimizer.
In fact, unlike RDDs, where many transformations accept a number-of-partitions parameter in order to optimize co-partitioning and co-location whenever possible, with DataFrames the only way to alter partitioning is to invoke the method repartition; otherwise the number of partitions for joins and aggregations is taken from the configuration parameter spark.sql.shuffle.partitions.
From what I can see and understand from the explain plan above, it seems there is a useless repartition (and therefore a shuffle) to 6 partitions (the default value), followed by another repartition to the final value imposed by the repartition call.
I believe the optimizer could set the number of partitions of the join directly to the final value of 3.
Could someone help me clarify this point? Maybe I am missing something.
If you use Spark SQL, the number of shuffle partitions is always equal to spark.sql.shuffle.partitions. But if you enable spark.sql.adaptive.enabled, an ExchangeCoordinator is added. Right now, the job of this coordinator is to determine the number of post-shuffle partitions for a stage that needs to fetch shuffle data from one or more stages.
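Beyond that answer, a workaround that is sometimes tried (a sketch only, not verified on 1.6): repartition both inputs by their join keys with the target partition count before joining, so the join can reuse that partitioning instead of shuffling to spark.sql.shuffle.partitions and then repartitioning again afterwards:
// Sketch (spark-shell, with spark.implicits._ in scope): pre-partition both
// sides on their join keys with the desired count, so the SortMergeJoin may
// reuse this partitioning and the trailing repartition/exchange may disappear.
val dfPart  = df.repartition(3, $"id_1")
val df2Part = df2.repartition(3, $"id_2")

val df3 = dfPart.join(df2Part, $"id_1" === $"id_2")
df3.rdd.getNumPartitions // expected: 3
df3.explain(true)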

Spark Plan for collect_list

This is with reference to Jacek's answer to How to get the size of the result generated using concat_ws?
The DSL query in the answer calls collect_list twice, once for concat and once for size.
input.groupBy($"col1").agg(
  concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
  size(collect_list($"COL2".cast("string"))) as "size"
)
With an optimizedPlan like:
Aggregate [COL1#9L],
[COL1#9L,
concat_ws(,,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#470a4e26),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS concat#13,
size((hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#2602f45),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS size#14]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
How will it differ performance-wise if I use collect_list only once and then the withColumn API to generate the other two columns?
input
  .groupBy("COL1")
  .agg(collect_list($"COL2".cast("string")).as("list"))
  .withColumn("concat", concat_ws(",", $"list"))
  .withColumn("size", size($"list"))
  .drop("list")
Which has an optimizedPlan like:
Project [COL1#9L,
concat_ws(,,list#17) AS concat#18,
size(list#17) AS size#19]
+- Aggregate [COL1#9L],
[COL1#9L,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList#5cb88b6b),
cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false) AS list#17]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
I see collect_list being called twice in the former example but just wanted to know if there are any significant differences apart from that. Using Spark 1.6.
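One side observation, not from the linked answer: since both collect_list and count skip nulls, size(collect_list(col)) should equal count(col) within a group, so the second collect_list can be avoided altogether. A sketch of that variant, assuming the same input DataFrame:
import org.apache.spark.sql.functions.{collect_list, concat_ws, count}

// Replace size(collect_list(...)) with a plain count so the list is built once.
input.groupBy($"COL1").agg(
  concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
  count($"COL2") as "size"
)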

What is Project node in execution query plan?

What is the meaning of a Project node in Spark's execution plan?
I have a plan containing the following:
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, dm_country#population#country#839, population#17 AS dm_country#population#population#844]
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12 AS dm_country#population#country#839, population#17]
+- Project [6a1ad864-235f-4761-9a6d-0ca2a2b40686#22 AS dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12, population#17]
+- RepartitionByExpression [country#12], 1000
+- Union
:- Project [ind#7 AS 6a1ad864-235f-4761-9a6d-0ca2a2b40686#22, country#12, population#17]
: +- Project [ind#7, country#12, population#2 AS population#17]
: +- Project [ind#7, country#1 AS country#12, population#2]
: +- Project [ind#0 AS ind#7, country#1, population#2]
: +- Relation[ind#0,country#1,population#2] JDBCRelation(schema_dbadmin.t_350) [numPartitions=100]
+- LogicalRDD [ind#45, country#46, population#47]
NOTE: Since the plan uses a RepartitionByExpression node, it must be a logical query plan.
A Project node in a logical query plan stands for the Project unary logical operator and is created whenever you use some kind of projection, explicitly or implicitly.
Quoting Wikipedia's Projection (relational algebra):
In practical terms, it can be roughly thought of as picking a subset of all available columns.
A Project node can appear in a logical query plan explicitly for the following:
Dataset operators, e.g. joinWith, select, unionByName
KeyValueGroupedDataset operators, e.g. keys, mapValues
SQL's SELECT queries
A Project node can also appear during the analysis and optimization phases.
In Spark SQL, the Dataset API gives you high-level operators, e.g. select, filter or groupBy, that ultimately build a Catalyst logical plan for a structured query. In other words, the simple-looking Dataset.select operator just creates a LogicalPlan with a Project node.
val query = spark.range(4).select("id")
scala> println(query.queryExecution.logical)
'Project [unresolvedalias('id, None)]
+- Range (0, 4, step=1, splits=Some(8))
(You could have used query.explain(extended = true) for the above, but that would have given you all 4 plans, which might have hidden the point.)
You could have a look at the code of Dataset.select operator.
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
This simple-looking select operator is a mere wrapper around Catalyst operators to build a Catalyst tree of logical operators that gives a logical plan.
NOTE What's nice about Spark SQL's Catalyst is that it uses this recursive LogicalPlan abstraction, which represents a logical operator or a tree of logical operators.
NOTE The same applies to the good ol' SQL, where after being parsed the SQL text is transformed into an AST of logical operators. See the example below.
Project nodes can come and go, since projection only concerns the columns in the output, and so they may or may not appear in your plans and queries.
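As a small illustration of my own (output not reproduced here): stacking redundant select calls produces nested Project nodes in the parsed plan, and the optimizer should collapse or remove them, much like the SELECT * FROM nums example further below:
val q = spark.range(4).select("id").select("id")
println(q.queryExecution.logical.numberedTreeString)       // stacked 'Project nodes
println(q.queryExecution.optimizedPlan.numberedTreeString) // expected: just the Range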
Catalyst DSL
You can use Spark SQL's Catalyst DSL (in org.apache.spark.sql.catalyst.dsl package object) for constructing Catalyst data structures using Scala implicit conversions. That could be very useful if you are into Spark testing.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.catalyst.dsl.plans._ // <-- gives table and select
import org.apache.spark.sql.catalyst.dsl.expressions.star
val plan = table("a").select(star())
scala> println(plan.numberedTreeString)
00 'Project [*]
01 +- 'UnresolvedRelation `a`
Good ol' SQL
scala> spark.range(4).createOrReplaceTempView("nums")
scala> spark.sql("SHOW TABLES").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | nums| true|
+--------+---------+-----------+
scala> spark.sql("SELECT * FROM nums").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
scala> spark.sql("SELECT * FROM nums").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `nums`
== Analyzed Logical Plan ==
id: bigint
Project [id#40L]
+- SubqueryAlias nums
+- Range (0, 4, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 4, step=1, splits=Some(8))
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
