Understanding Spark Explain: Collect vs Global vs Local Limit - apache-spark

I am trying to see the difference between doing limits in Spark/AWS Glue.
First I tried using Spark SQL:
spark.sql("SELECT * FROM flights LIMIT 10")
The explain looks something like:
CollectLimit 10
+- *FileScan parquet xxxxxx.flights[Id#31,...] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://xxxxxx/flights], PartitionCount: 14509, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
Then I tried using the AWS Glue Data Catalog to see if it's any faster:
gdf = glueContext.create_dynamic_frame.from_catalog(database = "xxxxxx", table_name = "xxxxxx")
df = gdf.toDF()
df = df.limit(10)
df.explain(True)
df.show(10)
The explain looks like:
GlobalLimit 10
+- LocalLimit 10
   +- LogicalRDD [Id#70, ...]
The first runs in 5 minutes, the second in 4 minutes. That is not a significant difference yet, but it seems that either querying the Data Catalog is faster, or that doing the limit on the DataFrame is better than doing the limit in Spark SQL?
What's the difference between a collect vs global vs local limit? I am guessing a local limit means the limit is applied locally on each partition, and then the driver applies the global limit to produce the final result. But why is Spark SQL not also doing this optimization?
Does Spark read all the underlying Parquet files before applying any limit? Is there a way to tell Spark to stop reading once it has the 10 rows in this example?

SQL query or programmatic Dataset creation: the control flow is the same in both cases; both go through the Spark SQL Catalyst optimizer. In your case, when the query was run for the first time, it fetched metadata about the table from the metastore and cached it; subsequent queries reuse the cached metadata, which is likely why the first query was slower.
There is no CollectLimit logical plan node; there is only a CollectLimitExec physical plan node. A limit is implemented as a LocalLimit followed by a GlobalLimit (link to code).
Spark performs limit incrementally.
It tries to retrieve the given number of rows using one partition.
If that does not satisfy the limit, Spark then queries the next 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4), then 16, and so on until the limit is satisfied or the data is exhausted.
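To illustrate both points, here is a minimal PySpark sketch (the flights table follows the question; spark.sql.limit.scaleUpFactor is the real config, the value of 8 is just an example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Widen the scan more aggressively on each retry (the default factor is 4).
spark.conf.set("spark.sql.limit.scaleUpFactor", 8)

df = spark.sql("SELECT * FROM flights LIMIT 10")

# explain(True) prints all plans: the logical plans show GlobalLimit/LocalLimit,
# while the physical plan shows CollectLimit.
df.explain(True)

# Spark first scans one partition; if that returns fewer than 10 rows, it
# multiplies the number of partitions scanned by the scale-up factor and
# retries until the limit is satisfied or the data is exhausted.
df.show()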

Related

How does Spark calculate the number of reducers in a hash shuffle?

I am trying to understand hash shuffle in Spark. I am reading this article:
Hash Shuffle:
Each mapper task creates separate file for each separate reducer, resulting in M * R total files on the cluster, where M is the number of “mappers” and R is the number of “reducers”. With high amount of mappers and reducers this causes big problems, both with the output buffer size, amount of open files on the filesystem, speed of creating and dropping all these files.
The logic of this shuffler is pretty dumb: it calculates the amount of “reducers” as the amount of partitions on the “reduce” side
Can you help me understand the emboldened part? How does it know the amount of partitions on the reduce side or, what does "amount of partitions on the reduce side" even mean? Is it equal to spark.sql.shuffle.partitions? If it is indeed equal to that, then what is even there to calculate? A very small example would be very helpful.
spark.sql.shuffle.partitions is just the default used when the number of partitions for a shuffle isn't set explicitly. So the "calculation", at a minimum, would involve a check whether a specific number of partitions was requested or whether Spark should use the default.
Quick example:
scala> df.repartition(400,col("key")).groupBy("key").avg("value").explain()
== Physical Plan ==
*(2) HashAggregate(keys=[key#178], functions=[avg(value#164)])
+- *(2) HashAggregate(keys=[key#178], functions=[partial_avg(value#164)])
   +- Exchange hashpartitioning(key#178, 400) <<<<< shuffle across increased number of partitions
      +- *(1) Project [key#178, value#164]
         +- *(1) FileScan parquet adb.atable[value#164,key#178,othercolumns...] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://ns1/hive/adb/atable/key=123..., PartitionCount: 3393, PartitionFilters: [isnotnull(key#178), (cast(key#178 as string) > 100)], PushedFilters: [], ReadSchema: struct<value:double,othercolumns...>
In Spark 3 and up, Adaptive Query Execution can also step in and revise that number, in an attempt to optimize execution by coalescing, preserving (e.g. ENSURE_REQUIREMENTS), or increasing the number of partitions.
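For completeness, a small PySpark sketch of the Spark 3 knobs involved (shown in PySpark rather than the Scala shell above; df, key and value follow that example, and spark is the active SparkSession):
# Enable Adaptive Query Execution and post-shuffle partition coalescing
# (both are on by default in recent Spark 3.x releases).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Without an explicit repartition, the shuffle starts from
# spark.sql.shuffle.partitions, and AQE may coalesce small post-shuffle
# partitions into fewer, larger ones at runtime; the plan is wrapped in
# AdaptiveSparkPlan.
df.groupBy("key").avg("value").explain()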
EDIT: A side note -- your article is quite old (2015 was ages ago :)) and talks about pre-SparkSQL/pre-dataframe times. I'd try to find something more relevant.
EDIT 2: ...But even there, in the comments section, author rightfully says: In fact, here the question is more general. For most of the transformations in Spark you can manually specify the desired amount of output partitions, and this would be your amount of “reducers”...

Can Spark in Foundry use Partition Pruning

We have a dataset which runs as an incremental build on our Foundry instance.
The dataset is a large time series dataset (56.5 billion rows, 10 columns, 965GB), with timestamps in 1 hour buckets. The dataset grows by around 10GB per day.
In order to optimise the dataset for analysis purposes, we have repartitioned the dataset on two attributes “measure_date” and “measuring_time”.
This reflects the access pattern - the data set is usually accessed by "measure_date". We sub-partition this by "measuring_time" to decrease the size of parquet files being produced, plus filtering on time is a common access pattern as well.
The code which creates the partitions is as follows:
if ctx.is_incremental:
    return df.repartition(24, "measure_date", "measuring_time")
else:
    return df.repartition(2200, "measure_date", "measuring_time")
Using the hash partition creates unbalanced file sizes, but that is the topic of a different post.
I am now trying to find out how to make Spark on Foundry use the partitions in filter criteria. From what I can see, this is NOT happening.
I created a code workbook and ran the following query on the telemetry data, saving the result to another data set.
SELECT *
FROM telemetry_data
where measure_date = '2022-06-05'
The physical query plan of the build seems to indicate that Spark is not using any partitions, since PartitionFilters is empty in the plan:
Batched: true, BucketedScan: false, DataFilters: [isnotnull(measure_date#170), (measure_date#170 = 19148)],
Format: Parquet, Location: InMemoryFileIndex[sparkfoundry://prodapp06.palantir:8101/datasets/ri.foundry.main.dataset.xxx...,
PartitionFilters: [],
PushedFilters: [IsNotNull(measure_date), EqualTo(measure_date,2022-06-05)],
ReadSchema: struct<xxx,measure_date:date,measuring_time_cet:timestamp,fxxx, ScanMode: RegularMode
How can I make Spark on Foundry use partition pruning?
I believe you need to use
transforms.api.IncrementalTransformOutput.write_dataframe()
with
partitionBy=['measure_date', 'measuring_time']
to achieve what you are looking for.
Check the foundry docs for more.
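A rough sketch of how that might look in an incremental Python transform (dataset paths and the decorator setup are illustrative, and the partitionBy keyword follows the suggestion above; verify the exact write_dataframe signature against the Foundry docs):
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    out=Output("/project/datasets/telemetry_data_partitioned"),  # placeholder path
    source=Input("/project/datasets/telemetry_data"),            # placeholder path
)
def compute(source, out):
    df = source.dataframe()
    # Hive-style partitioning on the access-pattern columns, so that a filter
    # such as measure_date = '2022-06-05' can appear as a PartitionFilter
    # instead of only a PushedFilter.
    out.write_dataframe(
        df,
        partitionBy=["measure_date", "measuring_time"],
    )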

Should Spark JDBC partitionColumns be recognized as DataFrame partitions?

I've used partitionColumn options to read a 300 million row table, hoping to achieve low memory/disk requirements for my ETL job (in Spark 3.0.1).
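For reference, the read is set up roughly like this (shown in PySpark; the URL, table name and bound values are placeholders, and numPartitions=200 matches the plan below):
jdbcDF = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # placeholder connection
    .option("dbtable", "table")
    .option("partitionColumn", "partitionCol")
    .option("lowerBound", "1")                        # placeholder bounds
    .option("upperBound", "1000000")
    .option("numPartitions", "200")
    .load()
)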
However, the explain plan shows at the start/leaf:
+- Exchange hashpartitioning(partitionCol#1, 200), true, [id=#201]
   +- *(1) Scan JDBCRelation(table)[numPartitions=200] (partitionCol#1, time#2)...
I would have expected that shuffling was not necessary here, since the partitionCol was specified in the JDBC option.
There's a whole lot going on in the full plan, but every window operation partitions by partitionCol first and then other columns.
I've tried:
Ensuring my columns are declared not-null (since I saw Sort[partitionCol#1 ASC NULLS FIRST...] being injected and thought that might be an issue)
Checking dataframe partitioning: jdbcDF.rdd.partitioner is None (which seems to confirm it's not understood)
How to join two JDBC tables and avoid Exchange? leads to the Datasource v2 partitioning reporting interface (fixed in 2.3.1), but perhaps that doesn't extend to jdbc loading?

Bucketing in Hive Internal Table and SparkSql

I have created a bucketed internal Hive table through the DataFrame writer saveAsTable API:
df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1)
  .sortBy(col1)
  .saveAsTable(hiveTableName);
Now I run two SELECT queries via Spark SQL, one on the bucketed column and the other on a non-bucketed column, but I don't see any difference in the execution time.
The queries are :
select * from t1 where col1='123' [t1 is bucketed by col1]
select * from t1 where col2='123' [col2 is not a bucketing column]
My questions are:
How can I tell whether a full table scan or a partial scan of only the relevant buckets happened during query execution?
Can I get any information from the DAG or the physical plan? I have looked at both, but I don't see any difference.
This is what I see in the physical plan:
== Physical Plan ==
*(1) Project [col1#0, col2#1, col3#2, col4#3, col5#4, col6#5, col7#6, col8#7, col9#8, col10#9, col11#10, col12#11]
+- *(1) Filter (isnotnull(col2#1) && (col2#1 = 123))
   +- *(1) FileScan parquet default.uk_geocrosswalk[col1#0,col2#1,col3#2,col4#3,col5#4,col6#5,col7#6,col8#7,col9#8,LSOA_MSOA_WEIGHT#9,col11#10,col12#11] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://url/a.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(col2), EqualTo(col2,123)], ReadSchema: struct
In the physical plan, why is it doing a FileScan? Shouldn't it be doing a HiveTableScan, since the table was created as a Hive table?
Are there certain config parameters that I can use to tune my queries while using Spark SQL?
I see that each time I run a query for the first time in Spark SQL it takes considerably longer. Is there a way I can warm up the executors before the query is executed?
Parquet is columnar and, in my experience, very fast. The columnar aspect may well explain the similar performance: whether you filter on the bucketing column or not, the data format is physically columnar.
It is a Hive table, but it uses Parquet and bucketing, which are not accessible to Hive / Impala. As it is Parquet, a HiveTableScan is not appropriate. A Hive table can have many physical formats: text, Parquet, ORC.
You can see the filtering: PartitionFilters: [], PushedFilters: [IsNotNull(col2), EqualTo(col2,123)],
There is no warmup as such. You could .cache things, but I have run and seen tests where caching Parquet tables makes little difference; it depends on the test.
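If you want to double-check that the bucketing spec was actually recorded, and to try caching, a small sketch (shown in PySpark; t1 follows the question, and spark is the active SparkSession):
# The detailed table information should list "Num Buckets" and
# "Bucket Columns" if saveAsTable recorded the bucketing spec.
spark.sql("DESCRIBE EXTENDED t1").show(200, truncate=False)

# Optional: cache the table and re-run both queries; as noted above, this
# often makes little difference for Parquet, but it is easy to test.
spark.table("t1").cache().count()
spark.sql("SELECT * FROM t1 WHERE col1 = '123'").explain()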

How to know the number of Spark jobs and stages in (broadcast) join query?

I use Spark 2.1.2.
I am trying to understand the various Spark UI tabs as a job runs. I use spark-shell --master local and run the following join query:
val df = Seq(
  (55, "Canada", -1, "", 0),
  (77, "Ontario", 55, "/55", 1),
  (100, "Toronto", 77, "/55/77", 2),
  (104, "Brampton", 100, "/55/77/100", 3)
).toDF("id", "name", "parentId", "path", "depth")
val dfWithPar = df.as("df1").
  join(df.as("df2"), $"df1.parentId" === $"df2.Id", "leftouter").
  select($"df1.*", $"df2.name" as "parentName")
dfWithPar.show
This is the physical query plan:
== Physical Plan ==
*Project [Id#11, name#12, parentId#13, path#14, depth#15, name#25 AS parentName#63]
+- *BroadcastHashJoin [parentId#13], [Id#24], LeftOuter, BuildRight
   :- LocalTableScan [Id#11, name#12, parentId#13, path#14, depth#15]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      +- LocalTableScan [Id#24, name#25]
I've got two questions about the query execution.
Why are there two jobs for the query?
Why is the stage view shown for both jobs identical? Below is a screenshot of the stage view of job id 1, which is exactly the same as that of job id 0. Why?
I use Spark 2.3.0 to answer your question (2.3.1-SNAPSHOT actually) since it is the latest and greatest at the time of this writing. That changes very little about query execution (if anything important) as the physical query plans in your 2.1.2 and my 2.3.0 are exactly the same (except the per-query codegen stage ID in round brackets).
After dfWithPar.show, the structured query (that you built using Spark SQL's Dataset API for Scala) is optimized to the following physical query plan (I'm including it in my answer for better comprehension).
scala> dfWithPar.explain
== Physical Plan ==
*(1) Project [Id#11, name#12, parentId#13, path#14, depth#15, name#24 AS parentName#58]
+- *(1) BroadcastHashJoin [parentId#13], [Id#23], LeftOuter, BuildRight
   :- LocalTableScan [Id#11, name#12, parentId#13, path#14, depth#15]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [Id#23, name#24]
Number of Spark Jobs
Why are there two jobs for the query?
I'd say there are even three Spark jobs.
tl;dr One Spark job is for the BroadcastHashJoinExec physical operator, whereas the other two are for Dataset.show.
In order to understand the query execution and the number of Spark jobs of a structured query, it is important to understand the difference between structured queries (described using Dataset API) and RDD API.
Spark SQL's Datasets and Spark Core's RDDs both describe distributed computations in Spark. RDDs are the Spark "assembler" language (akin to the JVM bytecode) while Datasets are higher-level descriptions of structured queries using SQL-like language (akin to JVM languages like Scala or Java as compared to the JVM bytecode I used earlier).
What's important is that structured queries using Dataset API eventually end up as a RDD-based distributed computation (which could be compared to how the Java or Scala compilers transform the higher-level languages to the JVM bytecode).
The Dataset API is an abstraction over the RDD API, and when you call an action on a DataFrame or Dataset, that action transforms it into RDDs.
With that, you should not be surprised that Dataset.show will in the end call an RDD action that in turn will run zero, one, or many Spark jobs.
Dataset.show (with numRows equal to 20 by default) in the end calls showString, which does take(numRows + 1) to get an Array[Row].
val takeResult = newDf.select(castCols: _*).take(numRows + 1)
In other words, dfWithPar.show() is equivalent to dfWithPar.take(21), which in turn is equivalent to dfWithPar.head(21) as far as the number of Spark jobs is concerned.
You can see them and their number of jobs in the SQL tab. They should all be equal.
show or take or head all lead to collectFromPlan that triggers the Spark jobs (by calling executeCollect).
The key to answering your question about the number of jobs is to know how all the physical operators in the query work: you have to know their behaviour at runtime and whether they trigger Spark jobs at all.
BroadcastHashJoinExec and BroadcastExchangeExec Physical Operators
The BroadcastHashJoinExec binary physical operator is used when the right side of a join can be broadcast (i.e., its estimated size is below spark.sql.autoBroadcastJoinThreshold, which is 10MB by default).
The BroadcastExchangeExec unary physical operator is used to broadcast the rows (of a relation) to worker nodes (to support BroadcastHashJoinExec).
When BroadcastHashJoinExec is executed (to generate a RDD[InternalRow]), it creates a broadcast variable that in turn executes BroadcastExchangeExec (on a separate thread).
That's why the "run at ThreadPoolExecutor.java:1149" Spark job 0 was run.
You could see that single Spark job 0 run if you executed just the following:
// Just a single Spark job for the broadcast variable
val r = dfWithPar.rdd
That requires the structured query to be executed to produce an RDD, which then becomes the target of the action that gives the final result.
You would not have had the Spark job if you had not ended up with a broadcast join query.
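To see that for yourself, here is a sketch (shown in PySpark, with the question's DataFrame rebuilt from the same rows; spark is the active SparkSession): disabling the broadcast threshold removes the broadcast, and with it the extra "run at ThreadPoolExecutor" job.
from pyspark.sql.functions import col

# Force the planner away from a broadcast join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

df = spark.createDataFrame(
    [(55, "Canada", -1, "", 0),
     (77, "Ontario", 55, "/55", 1),
     (100, "Toronto", 77, "/55/77", 2),
     (104, "Brampton", 100, "/55/77/100", 3)],
    ["id", "name", "parentId", "path", "depth"],
)

dfWithPar = (
    df.alias("df1")
    .join(df.alias("df2"), col("df1.parentId") == col("df2.id"), "leftouter")
    .select("df1.*", col("df2.name").alias("parentName"))
)

# Expect SortMergeJoin (or a shuffled hash join) instead of BroadcastHashJoin,
# and no broadcast-related Spark job in the UI.
dfWithPar.explain()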
RDD.take Operator
What I missed when I first answered the question is that the Dataset operators, i.e. show, take and head, will in the end lead to RDD.take.
take(num: Int): Array[T] Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
Note what take's documentation says: "It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit." That's the key to understanding the number of Spark jobs in your broadcast join query.
Every iteration (in the description above) is a separate Spark job, starting with the very first partition and scanning 4 times as many partitions in every following iteration:
// RDD.take
def take(num: Int): Array[T] = withScope {
  ...
  while (buf.size < num && partsScanned < totalParts) {
    ...
    val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
    ...
  }
}
Have a look at the following RDD.take with 21 rows.
// The other two Spark jobs
r.take(21)
You will get 2 Spark jobs as in your query.
Guess how many Spark jobs you will have if you executed dfWithPar.show(1).
Why Are Stages Identical?
Why are the stage view shown for both jobs identical? Below is a screenshot of the stage view of job id 1 which is exactly the same of job id 0. Why?
That's easy to answer, since both Spark jobs come from the same RDD.take.
The first Spark job scans the first partition and, since it did not have enough rows, another Spark job is launched to scan more partitions.
