Usage of Repartition in Spark SQL Queries - apache-spark

I am new to Spark-SQL. I read somewhere about using REPARTITION() before Joins in SparkSQL queries to achieve better performance.
However, I use plain Spark SQL queries (not the PySpark DataFrame API) and I am struggling to find the equivalent REPARTITION syntax for plain queries like the sample shown below.
/* how to use repartition() here ? */
select t1.id, t2.name
from table1 t1
inner join table2 t2
on t1.id = t2.id;
Can anyone please share the usage and syntax to be used in the above sample query? Also, I want to understand in which scenarios repartition should be used to achieve better join performance.
Thanks.

As per SPARK-24940, from Spark 2.4 onwards you can use REPARTITION and COALESCE hints in SQL.
Example:
#sample dataframe has 12 partitions
spark.sql(" select * from tmp").rdd.getNumPartitions()
12
#after repartition has 5 partitions
spark.sql(" select /*+ REPARTITION(5) */ * from tmp").rdd.getNumPartitions()
5
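For the query in the question, the hint goes right after the SELECT keyword. A minimal sketch, assuming table1 and table2 from the question are registered as tables or temp views (note that passing column names to the hint, e.g. REPARTITION(5, id), needs a newer Spark release; on 2.4 only a partition count is accepted):
# hedged sketch: REPARTITION hint applied to the join from the question
joined = spark.sql("""
    SELECT /*+ REPARTITION(5) */ t1.id, t2.name
    FROM table1 t1
    INNER JOIN table2 t2
      ON t1.id = t2.id
""")
joined.rdd.getNumPartitions()  # expected: 5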

Related

SQL query DAGs in Spark: 1. spark.sql("select * from titanic_csv").show() and 2. spark.sql("select count(*) from titanic_csv").show()

spark.sql("select * from titanic_csv").show()
spark.sql("select count(*) from titanic_csv").show()
What will the logical plan be, and how can we understand it?
[Image: count(*) DAG in the Spark UI]
[Image: select * from table DAG in the Spark UI]
I would like to know how to read and understand the DAG and its plan.
How to get the plan
You can see the physical plan in the Spark UI: in the SQL tab, find your query and look at its details.
You can also use the explain method on your DataFrame, as shown below.
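A minimal sketch of that, assuming the titanic_csv view from the question:
df = spark.sql("select count(*) from titanic_csv")
df.explain()      # physical plan only
df.explain(True)  # parsed, analyzed, optimized and physical plans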
What is happening in your example
I think the order of operations here is important. In your case it looks like you first called spark.sql("select * from titanic_csv").show() and then spark.sql("select count(*) from titanic_csv").show() (as stated in the title), so I am going to stick to that order.
In the select * case the DAG is simple: there is a FileScan, because Spark needs to load the data into memory, and then a mapPartitions which is connected to your show.
In the count(*) case the left branch is skipped because Spark does not remove shuffle files immediately, so there was no need to recompute that part of the query; the needed data is already there from the previous stage.
If you go to the details of the right branch (stage 18), you will see something like this:
This means that Spark is reading the shuffle files and then, in mapPartitions, doing the actual count to give you the result you need.
Some references
This topic is huge and not easy to master, but in my opinion you can start with these articles:
https://dzone.com/articles/reading-spark-dags
https://www.databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

Apache Spark Partitioning Data Using a SQL Function nTile

I am trying multiple ways to optimize execution over large datasets using partitioning. In particular, I'm using a function commonly used with traditional SQL databases called NTILE.
The objective is to place a certain number of rows into a bucket using a combination of bucketing and repartitioning. This should allow Apache Spark to process the data more efficiently when working with partitioned, or should I say bucketed, datasets.
Below are two examples. The first shows how I've used NTILE to split a dataset into two buckets, followed by repartitioning the data into 2 partitions on the bucketed NTILE column called skew_data.
I then follow with the same query but without any bucketing or repartitioning.
The problem is that the query without bucketing is faster than the query with bucketing, even though the query without bucketing places all the data into one partition whereas the query with bucketing splits it into 2 partitions.
Can someone let me know why that is?
FYI
I'm running the query on an Apache Spark cluster from Databricks.
The cluster has a single node with 2 cores and 15 GB of memory.
First example, with NTILE/bucketing and repartitioning:
from pyspark.sql.functions import col, rand

allin = spark.sql("""
SELECT
    t1.make
    , t2.model
    , NTILE(2) OVER (ORDER BY t2.sale_price) AS skew_data
FROM
    t1 INNER JOIN t2
    ON t1.engine_size = t2.engine_size2
""").repartition(2, col("skew_data"), rand()).drop('skew_data')
The above code splits the data into partitions as follows, with the corresponding partition distribution
Number of partitions: 2
Partitioning distribution: [5556767, 5556797]
The second example, with no NTILE/bucketing or repartitioning:
allin_NO_nTile = spark.sql("""
SELECT
t1.make
,t2.model
FROM
t1 INNER JOIN t2
ON t1.engine_size = t2.engine_size2
""")
The above code puts all the data into a single partition as shown below:
Number of partitions: 1
Partitioning distribution: [11113564]
My question is: why is the second query (without NTILE or repartitioning) faster than the query with NTILE and repartitioning?
I have gone to great lengths to write this question out as fully as possible, but if you need further explanation please don't hesitate to ask. I really want to get to the bottom of this.
I abandoned my original approach and used the PySpark function bucketBy() instead (a minimal sketch follows the link below). If you want to know how to apply bucketBy() to bucket data, go to
https://www.youtube.com/watch?v=dv7IIYuQOXI&list=PLOmMQN2IKdjvowfXo_7hnFJHjcE3JOKwu&index=39
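A rough sketch of that bucketBy() approach, reusing t2 and engine_size2 from the question (the bucket count and output table name are illustrative): buckets are fixed at write time, so later joins or aggregations on the bucketed column can avoid a full shuffle.
# hypothetical example: write t2 bucketed on the join column into a new table
(spark.table("t2")
     .write
     .bucketBy(8, "engine_size2")
     .sortBy("engine_size2")
     .saveAsTable("t2_bucketed"))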

Repartition in Spark - SQL API

We use the SQL API of Spark to execute queries on Hive tables on the cluster. How can I perform a REPARTITION on a column in my query via the SQL API? Please note that we do not use the DataFrame API; instead we use the SQL API (e.g. SELECT * from table WHERE col = 1).
I understand that PySpark offers a repartition function for the same purpose in the DataFrame API.
However, I want to know the syntax to specify a REPARTITION (on a specific column) in a SQL query via the SQL API (through a SELECT statement).
Consider the following query :
select a.x, b.y
from a
JOIN b
on a.id = b.id
Any help is appreciated.
We use Spark 2.4
Thanks
You can provide hints to trigger a repartition in Spark SQL:
spark.sql('''SELECT /*+ REPARTITION(colname) */ col1,col2 from table''')
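For example, applied to one of the tables in your query (note: column arguments in the hint are accepted in newer Spark releases; on Spark 2.4 you can only pass a partition count, e.g. REPARTITION(200)):
# hedged sketch reusing table a and column id from the question
a_repart = spark.sql("SELECT /*+ REPARTITION(id) */ * FROM a")
a_repart.rdd.getNumPartitions()  # defaults to spark.sql.shuffle.partitions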
You can use both approaches, but if you are staying in %sql, use DISTRIBUTE BY. From the manuals:
DISTRIBUTE BY
Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be hashed to the same worker. You cannot use this with ORDER BY or CLUSTER BY.
It all amounts to the same thing: a shuffle occurs, and you cannot eliminate it; these are just alternative interfaces. Of course, this is only possible due to the 'lazy' evaluation Spark employs.
%sql
SELECT * FROM boxes DISTRIBUTE BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
This %sql approach is the alternative to the hint shown in the other answer; a sketch applying it to your join follows.
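A hedged sketch reusing the tables a and b from your query: each side is distributed by the join key in a subquery before the join. Whether the join actually reuses this distribution (rather than adding its own exchange) is best checked with explain().
result = spark.sql("""
    SELECT a2.x, b2.y
    FROM (SELECT * FROM a DISTRIBUTE BY id) a2
    JOIN (SELECT * FROM b DISTRIBUTE BY id) b2
      ON a2.id = b2.id
""")
result.explain()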

CLUSTER BY usage with Spark SQL queries

I recently got introduced to Spark-SQL. I read somewhere about using CLUSTER BY on join columns (before the join) to improve join performance. Example:
create temporary view prod as
select id, name
from product
cluster by id;
create temporary view cust as
select cid, pid, cname
from customer
cluster by pid;
select c.cid, p.name, c.cname
from prod p
join cust c
on p.id = c.pid;
Can anyone please explain in which scenarios this should be leveraged? I understand that for a join the data is shuffled. Then what benefit does CLUSTER BY bring, since it also shuffles the data?
Thanks.
If you use the SQL interface you can do things without having to use the DF interface.
Cluster By is the same as:
df.repartition($"key", n).sortWithinPartitions()
Due to lazy evaluation, Spark will see the JOIN and know that you are asking for a repartition by key (via SQL rather than the statement directly above), so it is just a different interface amounting to the same thing. This makes it easier to stay in SQL mode only, though you can intermix the two.
If you do not do it, then Spark will do it for you (in general) and apply the current shuffle partitions parameter.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
is the same as:
df.repartition($"key", 2).sortWithinPartitions()
Alternatively, the same repartitioning can be requested with a hint:
spark.sql('''SELECT /*+ REPARTITION(col,..) */ cols... from table''')
UPDATE
This does not apply to a JOIN in this way:
val df = spark.sql(""" SELECT /*+ REPARTITION(30, c1) */ T1.c1, T1.c2, T2.c3
FROM T1, T2
WHERE T1.c1 = T2.c1
""")
What this does is repartition after the JOIN has been processed. The JOIN itself will use the higher of the partition counts set on T1 and T2, or the shuffle partitions setting if neither is set explicitly.
Spark will recognize the CLUSTER BY and shuffle the data. However, if you use the same columns in later queries that induce shuffles, Spark might re-use the exchange; see the sketch below.
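A hedged way to check this with the views from the question: run explain() on the join and look at the Exchange nodes. Because prod and cust are already clustered by the join keys, the join should be able to reuse that distribution instead of inserting an additional exchange on p.id / c.pid.
spark.sql("""
    SELECT c.cid, p.name, c.cname
    FROM prod p
    JOIN cust c
      ON p.id = c.pid
""").explain()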

Should we create separate dataframe for each table in a join query in SparkSQL

We need to convert and execute Hive queries in Spark SQL. The query involves a join between 2 tables. We will create a DataFrame and then run Spark SQL queries on top of it. Please find a sample Hive query along with the converted query below.
------Hive query
select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
from table1 a left outer join table2 b
on a.col3=b.col3
-----Spark SQL
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q1 = hiveContext.sql("select col1,col2,col3 from table1")
val q2 = hiveContext.sql("select col3,col4,col5,col6,col7 from table2")
val q3 = q1.join(q2, q1("col3") === q2("col3"), "left_outer")
But it is also possible for us to execute the entire query as a single DataFrame, as below:
val q5 = hiveContext.sql("""select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
from table1 a left outer join table2 b
on a.col3=b.col3""")
I would like to know which of the two approaches (single vs multiple DataFrames) is better to use in such a situation, and its advantages over the other in terms of factors like performance and readability.
The second approach seems wise in all aspects.
When you run SQL on top of Hive data, HiveContext will run the query against Hive and return the result metadata to Spark, so Spark just needs to store the resultant metadata set. But in the multiple-DataFrame case it has to store all the Hive data in its RDDs.
Maintaining a single RDD helps in optimizing the DAG as well.
If you run it as a single query, the Spark Catalyst optimizer can also optimize it further.
It also looks better from a readability standpoint.
Both approaches are identical; it really doesn't matter from a performance standpoint. The Catalyst optimizer will create the same physical plan for both queries.
However, there are other aspects to consider. Writing a SQL query is generally easy, but you lose the compile-time type check. If you have a typo or an incorrect column name in the SQL, it is impossible to find until you run it on the cluster. If you use DataFrame operations instead, the code won't compile, which helps you code faster.
But then again, writing complex SQL with the DataFrame APIs is not a trivial task. So generally I use the DataFrame APIs where the operations are relatively easy and SQL for complex queries.
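If you want to verify the 'same physical plan' claim yourself, a PySpark sketch (assuming the same two Hive tables are visible to the session) is to build both variants and compare their explain output:
q1 = spark.sql("select col1, col2, col3 from table1")
q2 = spark.sql("select col3, col4, col5, col6, col7 from table2")
q_multi = q1.join(q2, q1["col3"] == q2["col3"], "left_outer")
q_single = spark.sql("""
    select a.col1, a.col2, a.col3, b.col4, b.col5, b.col6, b.col7
    from table1 a left outer join table2 b on a.col3 = b.col3
""")
q_multi.explain()
q_single.explain()  # the physical plans should be essentially the same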
