How does Spark calculate the number of reducers in a hash shuffle? - apache-spark

I am trying to understand hash shuffle in Spark. I am reading this article:
Hash Shuffle:
Each mapper task creates separate file for each separate reducer, resulting in M * R total files on the cluster, where M is the number of “mappers” and R is the number of “reducers”. With high amount of mappers and reducers this causes big problems, both with the output buffer size, amount of open files on the filesystem, speed of creating and dropping all these files.
The logic of this shuffler is pretty dumb: it calculates the amount of “reducers” as the amount of partitions on the “reduce” side
Can you help me understand the emboldened part? How does it know the amount of partitions on the reduce side, or what does "amount of partitions on the reduce side" even mean? Is it equal to spark.sql.shuffle.partitions? If it is indeed equal to that, then what is even there to calculate? A very small example would be very helpful.

spark.sql.shuffle.partitions is just the default used when the number of partitions for a shuffle isn't set explicitly. So the "calculation", at a minimum, involves checking whether a specific number of partitions was requested or whether Spark should fall back to the default.
Quick example:
scala> df.repartition(400, col("key")).groupBy("key").avg("value").explain()
== Physical Plan ==
*(2) HashAggregate(keys=[key#178], functions=[avg(value#164)])
+- *(2) HashAggregate(keys=[key#178], functions=[partial_avg(value#164)])
   +- Exchange hashpartitioning(key#178, 400)   <<<<< shuffle across increased number of partitions
      +- *(1) Project [key#178, value#164]
         +- *(1) FileScan parquet adb.atable[value#164,key#178,othercolumns...] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://ns1/hive/adb/atable/key=123..., PartitionCount: 3393, PartitionFilters: [isnotnull(key#178), (cast(key#178 as string) > 100)], PushedFilters: [], ReadSchema: struct<value:double,othercolumns...>
scala>
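For contrast, without the explicit repartition the same aggregation falls back to spark.sql.shuffle.partitions (200 unless you change it). A sketch of roughly what the plan would look like in that case, for the same hypothetical df:

scala> df.groupBy("key").avg("value").explain()
== Physical Plan ==
*(2) HashAggregate(keys=[key#178], functions=[avg(value#164)])
+- Exchange hashpartitioning(key#178, 200)   <<<<< default spark.sql.shuffle.partitions
   +- *(1) HashAggregate(keys=[key#178], functions=[partial_avg(value#164)])
      +- *(1) FileScan parquet adb.atable[...]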
In Spark 3 and up, Adaptive Query Execution (AQE) can also step in and revise that number at runtime, in an attempt to optimize execution by coalescing, preserving (e.g. ENSURE_REQUIREMENTS) or increasing the number of partitions.
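Enabling that behavior is just a matter of configuration; a sketch, again with the hypothetical df from above (AQE is on by default since Spark 3.2):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
df.groupBy("key").avg("value").explain()
// The plan is now wrapped in AdaptiveSparkPlan; once the actual shuffle sizes are
// known at runtime, AQE may coalesce the 200 default shuffle partitions into fewer.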
EDIT: A side note -- your article is quite old (2015 was ages ago :)) and talks about pre-SparkSQL/pre-dataframe times. I'd try to find something more relevant.
EDIT 2: ...But even there, in the comments section, author rightfully says: In fact, here the question is more general. For most of the transformations in Spark you can manually specify the desired amount of output partitions, and this would be your amount of “reducers”...
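To see what that quote means with plain RDDs, here is a minimal sketch (the pair RDD is made up):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = pairs.reduceByKey(_ + _, 8)   // explicitly request 8 reduce-side partitions
reduced.getNumPartitions                    // 8 -- this is the "amount of reducers"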

Related

Why would finding an aggregate of a partition column in Spark 3 take a very long time?

I'm trying to query the MIN(dt) in a table partitioned by the dt column, using the following query in both Spark 2 and Spark 3:
SELECT MIN(dt) FROM table_name
The table is stored in Parquet format in S3, where each dt is a separate folder, so this seems like a pretty simple operation. There are about 3,200 days of data.
In Spark 2, this query completes in ~1 minute, while in Spark 3 it takes over an hour (not sure how long exactly, since it hasn't finished yet).
In Spark 3, the execution plan is:
AdaptiveSparkPlan (10)
+- == Current Plan ==
   HashAggregate (6)
   +- ShuffleQueryStage (5)
      +- Exchange (4)
         +- * HashAggregate (3)
            +- * ColumnarToRow (2)
               +- Scan parquet table_name (1)
+- == Initial Plan ==
   HashAggregate (9)
   +- Exchange (8)
      +- HashAggregate (7)
         +- Scan parquet table_name (1)
It's confusing to me how this would take a long time, as the data is already partitioned by dt. Spark only needs to determine which partitions have any rows and return the min of those.
What you're suggesting was implemented once as the OptimizeMetadataOnlyQuery optimizer rule, via JIRA SPARK-15752 "Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators".
However, it was found to cause correctness issues sometimes, when some of the partitions contained zero-row files, see JIRA SPARK-26709 "OptimizeMetadataOnlyQuery does not correctly handle the files with zero record".
Along with the fix, an internal Spark config spark.sql.optimizer.metadataOnly was added to provide a way to circumvent full-table scans "at your own risk", i.e. when you are certain that none of your partitions are empty. Possibly, in your Spark 2 you have it set to true (or your Spark 2 doesn't include the fix at all). See also SPARK-34194 for additional discussion around it.
Spark 3.0 deprecated this config (SPARK-31647), so most likely it is set to false in your environment, which causes Spark to scan all table partitions before aggregating the result to find the min. But for the time being, you can still try setting it to true to speed up your query, just beware of the consequences.
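If you do decide to try it, it is just a session-level setting; a sketch, with the caveats above in mind:

// At your own risk: only safe when no partition consists solely of zero-row files.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
spark.sql("SELECT MIN(dt) FROM table_name").show()
// With the rule active, MIN(dt) can be answered from the list of dt= partition
// directories instead of scanning every Parquet file.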

Is a groupBy transformation on data that is already partitioned wide or narrow?

My understanding of Narrow and Wide transformations is as follows:
Narrow transformation - The data within a given partition is all that is needed to apply the transformation to that partition, so these transformations don't require a data shuffle. Examples: map, filter
Wide transformation - The data within a given partition is not all that is needed to apply the transformation to that partition, so these transformations do require a data shuffle. Example: sort
Question:
If I already have my dataset partitioned, then apart from sort, what transformation is wide? I keep reading that groupBy is wide, but I don't see how. If I have all the data for a given key on a given partition (which is how it would be if the dataset is already partitioned), then I do not need data from other partitions to apply groupBy. What am I missing here?
You can get a better idea of what Spark is doing by using the explain method on a DataFrame.
Using a small example:
case class T(a: String, b: Int)
val df = Seq(T("a", 1), T("b", 2), T("a", 1)).toDF
df.groupBy("a").sum().explain
Looking at the output:
== Physical Plan ==
*(2) HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint))], output=[a#11, sum(b)#21L])
+- Exchange hashpartitioning(a#11, 200), true, [id=#13]
   +- *(1) HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint))], output=[a#11, sum#25L])
      +- *(1) LocalTableScan [a#11, b#12]
The HashAggregate lines are summing the values of b by key (a partial sum before the shuffle and a final sum after it). More relevant for us is the line Exchange hashpartitioning(...), which tells us that Spark is going to redistribute the data: it hashes the keys so that all rows with the same key end up in the same partition. Spark doesn't know that your data may already be laid out that way, so it plans the exchange anyway; if the data is in fact already partitioned by the key, the hashing step won't actually result in any data being moved.
Other common methods which result in data shuffling are joins and aggregation methods (collect_list, count, sum, etc.).
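As a side note, if you hash-partition the DataFrame by the grouping key yourself, no extra exchange is planned for the groupBy. A small sketch continuing the example above:

import org.apache.spark.sql.functions.col

val repartitioned = df.repartition(col("a"))   // hash-partition by the grouping key first
repartitioned.groupBy("a").sum().explain()
// The only Exchange in this plan is the one introduced by repartition itself; no
// additional shuffle is inserted for the aggregation, because the existing hash
// partitioning on "a" already satisfies its distribution requirement.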

Should Spark JDBC partitionColumns be recognized as DataFrame partitions?

I've used partitionColumn options to read a 300 million row table, hoping to achieve low memory/disk requirements for my ETL job (in Spark 3.0.1).
However, the explain plan shows at the start/leaf:
+- Exchange hashpartitioning(partitionCol#1, 200), true, [id=#201]
   +- *(1) Scan JDBCRelation(table)[numPartitions=200] (partitionCol#1, time#2)...
I would have expected that shuffling was not necessary here, since the partitionCol was specified in the JDBC option.
There's a whole lot going on in the full plan, but every window operation partitions by partitionCol first and then other columns.
I've tried:
Ensuring my columns are declared not-null (since I saw Sort[partitionCol#1 ASC NULLS FIRST...] being injected and thought that might be an issue)
Checking dataframe partitioning: jdbcDF.rdd.partitioner is None (which seems to confirm it's not understood)
"How to join two JDBC tables and avoid Exchange?" leads to the Datasource v2 partitioning reporting interface (fixed in 2.3.1), but perhaps that doesn't extend to JDBC loading?
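For reference, the kind of read being described is presumably along these lines (the connection details, bounds, and column names here are placeholders):

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // hypothetical connection
  .option("dbtable", "table")
  .option("partitionColumn", "partitionCol")
  .option("lowerBound", "1")
  .option("upperBound", "20000000")
  .option("numPartitions", "200")
  .load()

jdbcDF.rdd.partitioner   // None -- the JDBC source splits the read into 200 range queries,
                         // but it does not report any output partitioning to the planner,
                         // so later operations that cluster by partitionCol still plan an Exchange.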

Is there a way to get the optimization of dataframe.writer.partitionBy at the dataframe level?

If I am understanding the documentation correctly, partitioning a dataframe vs partitioning a Hive or other on-disk table seem to be different things. For on-disk storage, partitioning by, say, date creates a set of partitions for each date which occurs in my dataset. This seems useful; if I query records for a given date, every node in my cluster processes only the partitions corresponding to the date I want.
Dataframe.repartition, on the other hand, creates one partition for each date which occurs in my dataset. If I search for records from a specific date, they will all be found in a single partition and thus all processed by a single node.
Is this right? If so, what is the use case? What is the way to get the speed advantage of on-disk partitioning schemes in the context of a dataframe?
For what it's worth, I need the advantage after I do an aggregation of on-disk data, so the on-disk partitioning doesn't necessarily help me even with delayed execution.
In your example, Spark will be able to recover all the records linked to that date very quickly. That's an improvement.
In the following piece of code, you can see that the filter has been categorized as partition filter.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

inputRdd = sc.parallelize([("fish", 1), ("cats", 2), ("dogs", 3)])
schema = StructType([StructField("animals", StringType(), True),
                     StructField("ID", IntegerType(), True)])
my_dataframe = inputRdd.toDF(schema)
my_dataframe.write.partitionBy('animals').parquet("home")
sqlContext.read.parquet('home').filter(col('animals') == 'fish').explain()
== Physical Plan ==
*(1) FileScan parquet [ID#35,animals#36] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/home], PartitionCount: 1, PartitionFilters: [isnotnull(animals#36), (animals#36 = fish)], PushedFilters: [], ReadSchema: struct<ID:int>
For a deeper insight, you may want to have a look at this.
I am actually not sure about your other question. You are probably right: in my example, df.rdd.getNumPartitions() gives 1, and with one partition performance is not so great (but you have already read from the disk at this point). For the following steps, calling repartition(n) will fix the problem, but it may be quite costly.
Another possible improvement is related to joining two data frames that share the same partitioning (with the join keys being the partition columns): you will avoid a lot of shuffles in the join phase.

Disable spark catalyst optimizer

To give some background, I am trying to run the TPC-DS benchmark on Spark with and without Spark's Catalyst optimizer. For complicated queries on smaller datasets, we might be spending more time optimizing the plans than actually executing them, hence I wanted to measure the performance impact of the optimizer on the overall execution of the query.
Is there a way to disable some or all of the Spark Catalyst optimization rules?
This ability was added in Spark 2.4.0 as part of SPARK-24802.
val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
  .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
    "specified by their rule names and separated by comma. It is not guaranteed that all the " +
    "rules in this configuration will eventually be excluded, as some rules are necessary " +
    "for correctness. The optimizer will log the rules that have indeed been excluded.")
  .stringConf
  .createOptional
You could find the list of optimizer rules here.
But ideally, we shouldn't be disabling the rules, since most of them provide performance benefits. We should identify the rule that consumes the time, check whether it is actually useful for the query, and only then disable it.
I know it's not the exact answer but it can help you.
Assuming your driver is not multithreaded. (hint for optimization if Catalyst is slow? :) )
If you want to measure time spent in Catalyst, just go to Spark UI and check how much time your executors are idle, or check the list of stages/jobs.
If you have a job that started at 15:30 with a duration of 30 seconds, and the next one starts at 15:32, it probably means Catalyst is taking 1:30 to optimize (assuming no driver-heavy work is done).
Or even better, just put logs before calling every action in Spark and then just check how much time passes until the task is actually sent to the executor.
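A minimal sketch of that idea, using a SparkListener to timestamp when a job actually starts (the DataFrame df and the measurement approach here are just illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started at ${jobStart.time}")   // epoch millis
})

println(s"Action called at ${System.currentTimeMillis()}")
df.count()   // the gap between this line and the job-start timestamp above is roughly
             // the driver-side work: analysis, optimization and physical planning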
Just to complete the answer: I asked on the pull request for SPARK-24802 how to do it, and Takeshi Yamamuro kindly answered:
scala> Seq("abc", "def").toDF("v").write.saveAsTable("t")
scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND EndsWith(v#18, bc))
                                    ^^^^^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...
scala> sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LikeSimplification")
scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND v#18 LIKE %bc)
                                         ^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...
I hope this helps.
