Repartition in Spark - SQL API - apache-spark

We use the SQL API of Spark to execute queries on Hive tables on the cluster. How can I perform a REPARTITION on a column in my query via the SQL API? Please note that we do not use the DataFrame API; we use the SQL API (e.g. SELECT * FROM table WHERE col = 1).
I understand that PySpark offers repartition for the same purpose in the DataFrame API.
However, I want to know the syntax to specify a REPARTITION (on a specific column) in a SQL query issued via the SQL API (through a SELECT statement).
Consider the following query:
SELECT a.x, b.y
FROM a
JOIN b
  ON a.id = b.id
Any help is appreciated.
We use Spark 2.4
Thanks

You can provide hints to trigger a repartition in Spark SQL:
spark.sql('''SELECT /*+ REPARTITION(colname) */ col1,col2 from table''')
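For the join in the question, one way to write it is to hint each side in a subquery. This is only a sketch: it assumes Hive tables a and b with an id column, and a Spark build that accepts column arguments in the REPARTITION hint (the numeric-only form is what SPARK-24940 added in 2.4).
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Repartition each input on the join key before the join
df = spark.sql("""
    SELECT t1.x, t2.y
    FROM (SELECT /*+ REPARTITION(id) */ * FROM a) t1
    JOIN (SELECT /*+ REPARTITION(id) */ * FROM b) t2
      ON t1.id = t2.id
""")
df.explain()  # look for the Exchange hashpartitioning(id, ...) nodes introduced by the hints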

You can use both approaches; when using %sql, the equivalent, from the manuals, is:
DISTRIBUTE BY
Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be hashed to the same worker. You cannot use this with ORDER BY or CLUSTER BY.
It all amounts to the same thing: a shuffle occurs either way, so you cannot eliminate it; these are just alternative interfaces. Of course, this is only possible due to the 'lazy' evaluation Spark employs.
%sql
SELECT * FROM boxes DISTRIBUTE BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
This is the %sql alternative to the hint approach in the other answer.
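The same statements can be issued from PySpark through spark.sql as well; a minimal sketch, assuming the boxes table with a width column from the example above is registered:
distributed = spark.sql("SELECT * FROM boxes DISTRIBUTE BY width")
distributed.explain()  # expect an Exchange hashpartitioning(width, ...) in the plan

# DISTRIBUTE BY ... SORT BY on the same column is what CLUSTER BY does in one clause
clustered = spark.sql("SELECT * FROM boxes DISTRIBUTE BY width SORT BY width")
clustered.explain()    # expect a partition-local Sort on top of the Exchange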

Related

CLUSTER BY usage with Spark SQL queries

I recently got introduced to Spark-SQL. I read somewhere about using CLUSTER BY on join columns (before the join) to improve join performance. Example:
create temporary view prod as
select id, name
from product
cluster by id;
create temporary view cust as
select cid, pid, cname
from customer
cluster by pid;
select p.id, p.name, c.cname
from prod p
join cust c
on p.id = c.pid;
Can anyone please explain in which scenarios this should be leveraged? I understand that data is shuffled for the join, so what benefit does CLUSTER BY bring, given that it also shuffles the data?
Thanks.
If you use the SQL interface you can do things without having to use the DF interface.
Cluster By is the same as:
df.repartition($"key", n).sortWithinPartitions()
Due to lazy evaluation, Spark will see the JOIN and know that you have indicated you want a repartition by key - via SQL, not like the statement directly above - so it is just a different interface amounting to the same thing. It makes it easier to stay in SQL-only mode, and you can intermix the two.
If you do not do it, then Spark will do it for you (in general) and apply the current shuffle partitions parameter.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
is the same as:
df.repartition($"key", 2).sortWithinPartitions()
spark.sql('''SELECT /*+ REPARTITION(col,..) */ cols... from table''')
UPDATE
This does not apply to a JOIN in this way:
val df = spark.sql(""" SELECT /*+ REPARTITION(30, c1) */ T1.c1, T1.c2, T2.c3
FROM T1, T2
WHERE T1.c1 = T2.c1
""")
What this does is repartition after the JOIN has been processed. The JOIN itself will use the higher of the partition counts set on T1 and T2, or the shuffle partitions setting if not set explicitly.
Spark will recognize the cluster by and shuffle the data. However, if you use the same columns in later queries that induce shuffles, Spark might re-use the exchange.
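A PySpark sketch of the question's setup, assuming product and customer tables exist, that lets you inspect where the exchanges end up and whether the join can pick them up:
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW prod AS
    SELECT id, name FROM product CLUSTER BY id
""")
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW cust AS
    SELECT cid, pid, cname FROM customer CLUSTER BY pid
""")

joined = spark.sql("""
    SELECT p.id, p.name, c.cname
    FROM prod p
    JOIN cust c
      ON p.id = c.pid
""")
# Compare this plan with the same join written without CLUSTER BY; the difference
# shows whether the pre-shuffle is reused by the join or simply duplicated.
joined.explain(True)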

Usage of Repartition in Spark SQL Queries

I am new to Spark-SQL. I read somewhere about using REPARTITION() before Joins in SparkSQL queries to achieve better performance.
However, I use plain Spark SQL queries (not the PySpark DataFrame API) and I am struggling to find the equivalent syntax for REPARTITION in plain queries like the sample shown below.
/* how to use repartition() here ? */
select t1.id, t2.name
from table1 t1
inner join table2 t2
on t1.id = t2.id;
Can anyone please share the usage and syntax to be used in the above sample query? Also, I want to understand in which scenarios repartition should be used to achieve better join performance.
Thanks.
As per SPARK-24940, from Spark 2.4 you can use REPARTITION and COALESCE hints in SQL.
Example:
#sample dataframe has 12 partitions
spark.sql(" select * from tmp").rdd.getNumPartitions()
12
#after repartition has 5 partitions
spark.sql(" select /*+ REPARTITION(5) */ * from tmp").rdd.getNumPartitions()
5
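Applied to the join in the question, a sketch assuming table1 and table2 are registered (the column form of the hint may require a newer Spark than the purely numeric form shown above):
joined = spark.sql("""
    SELECT /*+ REPARTITION(200, id) */ t1.id, t2.name
    FROM table1 t1
    INNER JOIN table2 t2
      ON t1.id = t2.id
""")
# As discussed in the CLUSTER BY answer above, this repartitions the join output,
# not the join inputs.
joined.rdd.getNumPartitions()  # should report 200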

What is the best way to join multiple jdbc connection tables in spark?

I'm trying to migrate a query to pyspark and need to join multiple tables in it. All the tables in question are in Redshift and I'm using the jdbc connector to talk to them.
My problem is how do I do these joins optimally without reading too much data in (i.e. load table and join on key) and without just blatantly using:
spark.sql("""join table1 on x=y join table2 on y=z""")
Is there a way to push the queries down to Redshift but still use the Spark DataFrame API for writing the logic, and also use DataFrames from the Spark context without saving them to Redshift just for the joins?
Here are some points to consider:
The connector will push down the specified filters only if a filter is specified in your Spark code, e.g. select * from tbl where id > 10000. You can confirm that yourself; just check the responsible Scala code. Also, here is the corresponding test which demonstrates exactly that: the test test("buildWhereClause with multiple filters") verifies that the variable expectedWhereClause is equal to the whereClause generated by the connector. The generated where clause should be:
"""
|WHERE "test_bool" = true
|AND "test_string" = \'Unicode是樂趣\'
|AND "test_double" > 1000.0
|AND "test_double" < 1.7976931348623157E308
|AND "test_float" >= 1.0
|AND "test_int" <= 43
|AND "test_int" IS NOT NULL
|AND "test_int" IS NULL
"""
which is derived from the Spark filters specified above.
The driver also supports column filtering, meaning it will load only the required columns by pushing down the valid columns to Redshift. You can again verify that from the corresponding Scala tests test("DefaultSource supports simple column filtering") and test("query with pruned and filtered scans").
In your case though, you haven't specified any filters in your join query, hence Spark cannot leverage the two previous optimisations. If you are aware of such filters, please feel free to apply them.
Last but not least, as Salim already mentioned, the official Spark connector for Redshift can be found here. The Spark connector is built on top of the Amazon Redshift JDBC Driver, so it will try to use it anyway, as specified in the connector's code.
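For the plain JDBC route, a rough sketch; the URL, driver class, credentials, and table/column names are placeholders, the query option exists since Spark 2.4, and the dedicated Redshift connector mentioned above has its own format and options:
# Push a whole subquery down to Redshift instead of loading the full table
orders = (spark.read
          .format("jdbc")
          .option("url", "jdbc:redshift://host:5439/db")
          .option("driver", "com.amazon.redshift.jdbc42.Driver")
          .option("user", "...")
          .option("password", "...")
          .option("query", "SELECT id, customer_id, amount FROM public.orders WHERE order_date > '2020-01-01'")
          .load())

# Filters applied on the DataFrame can also be pushed down as a WHERE clause
big_orders = orders.filter("amount > 1000")
big_orders.explain()  # the JDBC scan node should list PushedFilters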

Should we create separate dataframe for each table in a join query in SparkSQL

We need to convert and execute Hive queries in Spark SQL. The query involves a join between 2 tables. We will create a DataFrame and then run Spark SQL queries on top of it. Please find a sample Hive query along with the converted query.
------Hive query
select a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
from table1 a left outer join table2 b
on a.col3=b.col3
-----Spark SQL
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q1 = hiveContext.sql("select col1,col2,col3,col4 from table1")
val q2 = hiveContext.sql("select col3,col5,col6,col7 from table2")
val q3 = q1.join(q2, q1("col3") === q2("col3"), "left_outer")
But it is also possible for us to execute the entire query against a single DataFrame, as below:
val q5 = hiveContext.sql("""select
a.col1,a.col2,a.col3,b.col4,b.col5,b.col6,b.col7
from table1 a left outer join table2 b
on a.col3=b.col3""")
I would like to know which of the 2 approaches (single vs multiple DataFrames) is better to use in such a situation, and the advantages of one over the other on parameters like performance and readability.
The second approach seems wise in all aspects:
When you run SQL on top of Hive data, HiveContext will run the query in Hive and return the result metadata to Spark, so Spark just needs to store the resultant metadata set. But in the case above it has to store all of the Hive data in its RDDs.
Maintaining a single RDD helps in optimizing the DAG as well.
If you run it as a single query, the Spark Catalyst optimizer will optimize it further.
It is also better for readability.
Both the approaches are identical. It doesn't matter really from the performance standpoint. Catalyst optimizer will create the same physical plan for both the queries.
Now, however, there are other aspects to consider. Writing a SQL query is generally easy, but you lose the compile-time type check. If you have a typo or an incorrect column name in the SQL, it is impossible to find until you run it on the cluster, whereas with DataFrame operations the code won't compile, which helps you code faster.
But then again, expressing complex SQL with the DataFrame APIs is not a trivial task. So generally I use the DataFrame APIs where the operations are relatively easy, and SQL for complex queries.
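A quick PySpark sketch to check the same-plan claim in the answer above, assuming table1 and table2 from the question are available as tables or views:
q1 = spark.sql("SELECT col1, col2, col3 FROM table1")
q2 = spark.sql("SELECT col3, col5, col6, col7 FROM table2")
two_step = q1.join(q2, q1["col3"] == q2["col3"], "left_outer")

one_step = spark.sql("""
    SELECT a.col1, a.col2, a.col3, b.col5, b.col6, b.col7
    FROM table1 a LEFT OUTER JOIN table2 b
      ON a.col3 = b.col3
""")

# Compare the join and exchange nodes; the shuffle/join structure should match
two_step.explain()
one_step.explain()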

Spark DataFrames: registerTempTable vs not

I just started with DataFrame yesterday and am really liking it so far.
I don't understand one thing though...
(Referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema)
In this example the DataFrame is registered as a table (I am guessing to provide access via SQL queries?), but the exact same information can also be accessed with peopleDataFrame.select("name").
So question is.. When would you want to register a dataframe as a table instead of just using the given dataframe functions? And is one option more efficient than the other?
The reason to use the registerTempTable( tableName ) method for a DataFrame, is so that in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql( sqlQuery ) method, that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.
val sc: SparkContext = ...
val hc = new HiveContext( sc )
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable( "cust" )
val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql( query )
salesPerCustomer.show()
Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
In my case, I found that certain kinds of aggregation and windowing queries that I needed, like computing a running balance per customer, were available in the Hive SQL query language, and I suspect they would have been very difficult to do otherwise in Spark.
If you want to use SQL, then you most likely will want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than available via a plain SQLContext.
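In PySpark terms, a sketch assuming a DataFrame customerDF with custId and purchaseAmount columns (createOrReplaceTempView is the newer name for registerTempTable):
customerDF.createOrReplaceTempView("cust")

via_sql = spark.sql("SELECT custId, SUM(purchaseAmount) AS total FROM cust GROUP BY custId")
via_df = customerDF.groupBy("custId").agg({"purchaseAmount": "sum"})

via_sql.show()
via_df.show()  # same rows; only the aggregate column's name differs unless aliased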
It's convenient to load the dataframe into a temp view in a notebook for example, where you can run exploratory queries on the data:
df.createOrReplaceTempView("myTempView")
Then in another notebook you can run a SQL query and get all the nice integration features that come out of the box, e.g. table and graph visualisation, etc.
%sql
SELECT * FROM myTempView
