Should Spark JDBC partitionColumns be recognized as DataFrame partitions?

I've used partitionColumn options to read a 300 million row table, hoping to achieve low memory/disk requirements for my ETL job (in Spark 3.0.1).
However, the explain plan shows at the start/leaf:
+- Exchange hashpartitioning(partitionCol#1, 200), true, [id=#201]
+- *(1) Scan JDBCRelation(table)[numPartitions=200] (partitionCol#1, time#2)...
I would have expected that shuffling was not necessary here, since the partitionCol was specified in the JDBC option.
There's a whole lot going on in the full plan, but every window operation partitions by partitionCol first and then other columns.
I've tried:
Ensuring my columns are declared not-null (since I saw Sort[partitionCol#1 ASC NULLS FIRST...] being injected and thought that might be an issue)
Checking dataframe partitioning: jdbcDF.rdd.partitioner is None (which seems to confirm it's not understood)
Following "How to join two JDBC tables and avoid Exchange?" leads to the DataSource V2 partitioning-reporting interface (fixed in 2.3.1), but perhaps that doesn't extend to JDBC loading?
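For reference, the read is set up roughly like this (the URL, table name, and bounds below are placeholders, not the real job):
# Rough sketch of the JDBC read; partitionColumn/lowerBound/upperBound/numPartitions
# control how the read is split into 200 parallel JDBC queries.
jdbcDF = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")   # placeholder connection string
    .option("dbtable", "table")
    .option("partitionColumn", "partitionCol")
    .option("lowerBound", 1)
    .option("upperBound", 300000000)
    .option("numPartitions", 200)
    .load())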

Related

Why would finding an aggregate of a partition column in Spark 3 take a very long time?

I'm trying to query MIN(dt) in a table partitioned by the dt column, using the following query in both Spark 2 and Spark 3:
SELECT MIN(dt) FROM table_name
The table is stored in Parquet format in S3, where each dt is a separate folder, so this seems like a pretty simple operation. There are about 3,200 days of data.
In Spark 2, this query completes in ~1 minute, while in Spark 3 it takes over an hour (not sure how long exactly, since it hasn't finished yet).
In Spark3, the execution plan is:
AdaptiveSparkPlan (10)
+- == Current Plan ==
   HashAggregate (6)
   +- ShuffleQueryStage (5)
      +- Exchange (4)
         +- * HashAggregate (3)
            +- * ColumnarToRow (2)
               +- Scan parquet table_name (1)
+- == Initial Plan ==
   HashAggregate (9)
   +- Exchange (8)
      +- HashAggregate (7)
         +- Scan parquet table_name (1)
It's confusing to me how this would take a long time, as the data is already partitioned by dt. Spark only needs to determine which partitions have any rows and return the min of those.
What you're suggesting was implemented once as the OptimizeMetadataOnly query optimizer rule, via JIRA SPARK-15752 "Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators".
However, it was found to sometimes cause correctness issues when some of the partitions contained zero-row files; see JIRA SPARK-26709 "OptimizeMetadataOnlyQuery does not correctly handle the files with zero record".
Along with the fix, an internal Spark config, spark.sql.optimizer.metadataOnly, was added to provide a way to circumvent the full-table scan "at your own risk", i.e. when you are certain that none of your partitions are empty. Possibly your Spark 2 has it set to true (or your Spark 2 doesn't include the fix at all). See also SPARK-34194 for additional discussion around it.
Spark 3.0 deprecated this config (SPARK-31647), so most likely it is set to false in your environment, which causes Spark to scan all table partitions before aggregating the result to find the min. For the time being you can still try setting it to true to speed up your query; just beware of the consequences.
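If you do decide to try it, it is an ordinary SQL conf you can flip at runtime. A small PySpark sketch (only do this if you are sure none of your partitions contain zero-row files):
# At your own risk: re-enable the metadata-only optimization (deprecated in Spark 3.0).
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
# With the rule active, MIN(dt) can be answered from partition metadata
# instead of scanning every Parquet file.
spark.sql("SELECT MIN(dt) FROM table_name").explain()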

How does spark calculate the number of reducers in a hash shuffle?

I am trying to understand hash shuffle in Spark. I am reading this article
Hash Shuffle:
Each mapper task creates separate file for each separate reducer, resulting in M * R total files on the cluster, where M is the number of “mappers” and R is the number of “reducers”. With high amount of mappers and reducers this causes big problems, both with the output buffer size, amount of open files on the filesystem, speed of creating and dropping all these files.
The logic of this shuffler is pretty dumb: it calculates the amount of “reducers” as the amount of partitions on the “reduce” side
Can you help me understand the part in bold? How does it know the number of partitions on the reduce side, and what does "amount of partitions on the 'reduce' side" even mean? Is it equal to spark.sql.shuffle.partitions? If it is indeed equal to that, then what is there even to calculate? A very small example would be very helpful.
spark.sql.shuffle.partitions is just the default used when the number of partitions for a shuffle isn't set explicitly. So the "calculation", at a minimum, involves checking whether a specific number of partitions was requested or whether Spark should use the default.
Quick example:
scala> df.repartition(400,col("key")).groupBy("key").avg("value").explain()
== Physical Plan ==
*(2) HashAggregate(keys=[key#178], functions=[avg(value#164)])
+- *(2) HashAggregate(keys=[key#178], functions=[partial_avg(value#164)])
+- Exchange hashpartitioning(key#178, 400) <<<<< shuffle across increased number of partitions
+- *(1) Project [key#178, value#164]
+- *(1) FileScan parquet adb.atable[value#164,key#178,othercolumns...] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://ns1/hive/adb/atable/key=123..., PartitionCount: 3393, PartitionFilters: [isnotnull(key#178), (cast(key#178 as string) > 100)], PushedFilters: [], ReadSchema: struct<value:double,othercolumns...>
scala>
In Spark 3 and up, Adaptive Query Execution (AQE) can also interject and revise that number, attempting to optimize execution by coalescing, preserving (e.g. ENSURE_REQUIREMENTS), or increasing the number of partitions.
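The relevant knobs, if you want to see that in action, are plain SQL confs (shown here via the PySpark conf API; the names are the same in any language):
# With AQE on, the 400 shuffle partitions requested above are treated as an upper bound;
# Spark can coalesce small shuffle partitions after seeing the actual map output sizes.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")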
EDIT: A side note -- your article is quite old (2015 was ages ago :)) and talks about pre-SparkSQL/pre-dataframe times. I'd try to find something more relevant.
EDIT 2: ...But even there, in the comments section, author rightfully says: In fact, here the question is more general. For most of the transformations in Spark you can manually specify the desired amount of output partitions, and this would be your amount of “reducers”...

Is a groupby transformation on data that is already partitioned wide or narrow?

My understanding of Narrow and Wide transformations is as follows:
Narrow transformation - The data within a given partition is all that is needed to apply the transformation to that partition, so these transformations don't require a data shuffle. Examples: map, filter
Wide transformation - The data within a given partition is not all that is needed to apply the transformation to that partition, so these transformations require a data shuffle. Example: sort
Question:
If I already have my dataset partitioned, then apart from sort, what transformations are wide? I keep reading that groupBy is wide, but I don't see how. If all the data with a given key is on a given partition (which is how it would be if the dataset were already partitioned), then I do not need data from other partitions to apply groupBy. What am I missing here?
You can get a better idea of what Spark is doing by using the explain method on a DataFrame.
Using a small example:
import spark.implicits._   // needed for .toDF outside the spark-shell

case class T(a: String, b: Int)
val df = Seq(T("a", 1), T("b", 2), T("a", 1)).toDF
df.groupBy("a").sum().explain
Looking at the output:
== Physical Plan ==
*(2) HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint))], output=[a#11, sum(b)#21L])
+- Exchange hashpartitioning(a#11, 200), true, [id=#13]
+- *(1) HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint))], output=[a#11, sum#25L])
+- *(1) LocalTableScan [a#11, b#12]
The HashAggregate lines are summing the values of b by key. More relevant for us is the Exchange hashpartitioning,... line, which tells us that Spark is going to redistribute the data: it plans to hash the keys and collect all rows with the same key into the same partition. Spark doesn't know that the data is already partitioned, so it plans this exchange anyway; if your data is already partitioned that way, the hashing step won't actually result in much data being moved.
Other common methods which result in data shuffling are joins and aggregation methods (collect_list, count, sum, etc)
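For example, a plain join shows the same pattern. A small PySpark sketch with made-up data (broadcast joins are disabled so the shuffle shows up in the plan):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # force a shuffle join

left = spark.range(100).withColumnRenamed("id", "key")
right = spark.range(100).withColumnRenamed("id", "key")

# Both sides get an Exchange hashpartitioning(key, ...) so that rows with the
# same key meet in the same partition before the join is evaluated.
left.join(right, "key").explain()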

Is there a way to get the optimization of dataframe.writer.partitionBy at the dataframe level?

If I am understanding the documentation correctly, partitioning a DataFrame and partitioning a Hive or other on-disk table seem to be different things. For on-disk storage, partitioning by, say, date creates a set of partitions for each date that occurs in my dataset. This seems useful: if I query records for a given date, every node in my cluster processes only the partitions corresponding to the date I want.
Dataframe.repartition, on the other hand, creates one partition for each date which occurs in my dataset. If I search for records from a specific date, they will all be found in a single partition and thus all processed by a single node.
Is this right? If so, what is the use case? What is the way to get the speed advantage of on-disk partitioning schemes in the context of a dataframe?
For what it's worth, I need the advantage after I do an aggregation of on-disk data, so the on-disk partitioning doesn't necessarily help me even with delayed execution.
In your example, Spark will be able to quickly retrieve all the records linked to that date. That's an improvement.
In the following piece of code, you can see that the filter has been categorized as a partition filter.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

inputRdd = sc.parallelize([("fish", 1), ("cats", 2), ("dogs", 3)])
schema = StructType([StructField("animals", StringType(), True),
                     StructField("ID", IntegerType(), True)])
my_dataframe = inputRdd.toDF(schema)
my_dataframe.write.partitionBy('animals').parquet("home")
sqlContext.read.parquet('home').filter(col('animals') == 'fish').explain()
== Physical Plan ==
*(1) FileScan parquet [ID#35,animals#36] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/home], PartitionCount: 1, PartitionFilters: [isnotnull(animals#36), (animals#36 = fish)], PushedFilters: [], ReadSchema: struct<ID:int>
For a deeper insight, you may want to have a look at this.
I am actually not sure about your other question. You are probably right; in my example df.rdd.getNumPartitions() gives 1, and with a single partition performance is not great (but you have already read from the disk at this point). For the following steps, calling repartition(n) will fix the problem, but it may be quite costly.
Another possible improvement is related to joining two data frames that share the same partitioning (with the join keys being the partition columns): you will avoid a lot of shuffling in the join phase.
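A rough sketch of that idea (made-up names; here the matching partitioning comes from an explicit repartition on the join key, and broadcast is disabled so the effect is visible):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # keep it a shuffle join

a = spark.range(1000).withColumnRenamed("id", "animal_id")
b = spark.range(1000).withColumnRenamed("id", "animal_id")

# Repartition both sides identically on the join key; the join can then reuse this
# hash partitioning, and no additional Exchange is inserted on top of it.
a2 = a.repartition(200, col("animal_id"))
b2 = b.repartition(200, col("animal_id"))
a2.join(b2, "animal_id").explain()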

Understanding Spark Explain: Collect vs Global vs Local Limit

I am trying to see the difference between ways of doing limits in Spark / AWS Glue.
I tried using Spark SQL
spark.sql("SELECT * FROM flights LIMIT 10")
The explain looks something like:
CollectLimit 10
+- *FileScan parquet xxxxxx.flights[Id#31,...] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://xxxxxx/flights], PartitionCount: 14509, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
Then I tried using the AWS Glue Data Catalog to see if it's any faster:
gdf = glueContext.create_dynamic_frame.from_catalog(database = "xxxxxx", table_name = "xxxxxx")
df = gdf.toDF()
df = df.limit(10)
df.explain(True)
df.show(10)
The explain looks like:
GlobalLimit 10
+- LocalLimit 10
+- LogicalRDD [Id#70, ...]
The first runs in 5 minutes and the second in 4 minutes. Not that significant yet, but it appears that either querying the Data Catalog is faster, or doing a limit on the DataFrame is better than doing the limit in Spark SQL?
What's the difference between a collect vs global vs local limit? I am guessing a local limit means each partition is limited locally and then the driver does the global limit to give the final result. But why is Spark SQL not also doing this optimization?
Does Spark read all the underlying Parquet files before doing any limit? Is there a way to tell Spark to read only until it gets just 10 rows in this example case?
SQL query or programmatic Dataset creation: the control flow is the same in both cases; it goes through the Spark SQL Catalyst optimizer. In your case, when the query was run for the first time, it fetched metadata about the table from the metastore and cached it; in subsequent queries the cached metadata is reused, which might be the reason for the slowness of the first query.
There is no LogicalPlan node called CollectLimit; there is only a CollectLimitExec physical plan node. And limit is implemented as a LocalLimit followed by a GlobalLimit (link to code).
Spark performs limit incrementally.
It tries to retrieve the given number of rows using one partition.
If the number of rows is not satisfied, Spark then queries the next 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4), then 16, and so on, until the limit is satisfied or the data is exhausted.
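If you want to tune how aggressively that ramp-up happens, the factor is an ordinary SQL conf. A small PySpark sketch:
# spark.sql.limit.scaleUpFactor controls how many more partitions are tried on each
# round when a limit is not yet satisfied: 1 partition, then 4, then 16, and so on.
spark.conf.set("spark.sql.limit.scaleUpFactor", 4)   # the default
spark.sql("SELECT * FROM flights LIMIT 10").show()   # should stop scanning once 10 rows are collected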

Resources