CLUSTER BY usage with Spark SQL queries - apache-spark

I recently got introduced to Spark-SQL. I read somewhere about using CLUSTER BY on join columns (before the join) to improve join performance. Example:
create temporary view prod as
select id, name
from product
cluster by id;
create temporary view cust as
select cid, pid, cname
from customer
cluster by pid;
select c.cid, p.name, c.cname
from prod p
join cust c
on p.id = c.pid;
Can anyone please explain in which scenarios this should be leveraged? I understand that data is shuffled for a join. So what benefit does CLUSTER BY bring, since it also shuffles the data?
Thanks.

If you use the SQL interface you can do things without having to use the DF interface.
Cluster By is the same as:
df.repartition($"key", n).sortWithinPartitions()
Due to lazy evaluation, Spark will see the JOIN and know that you indicate you want a repartition by key - via SQL, not like statement directly above - so it is just the interface amounting to the same thing. Makes it easier to stay in SQL mode only. You can intermix.
If you do not do it, then Spark will do it for you (in general) and apply the current shuffle partitions parameter.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
is the same as:
df.repartition($"key", 2).sortWithinPartitions()
An equivalent repartition hint can also be embedded in the SQL itself:
spark.sql('''SELECT /*+ REPARTITION(col,..) */ cols... from table''')
UPDATE
This does not apply to a JOIN in this way:
val df = spark.sql(""" SELECT /*+ REPARTITION(30, c1) */ T1.c1, T1.c2, T2.c3
FROM T1, T2
WHERE T1.c1 = T2.c1
""")
What this does is repartition after processing the JOIN. The JOIN itself will use the higher of the partition counts set on T1 and T2, or the shuffle partitions setting if neither is set explicitly.
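One way to see that (a PySpark sketch, assuming T1 and T2 are registered views) is to look at where the Exchange for the hint sits in the physical plan:
# Sketch: the Exchange introduced by REPARTITION(30, c1) appears above the
# join in the plan, i.e. the repartition happens after the join is computed.
df = spark.sql("""
    SELECT /*+ REPARTITION(30, c1) */ T1.c1, T1.c2, T2.c3
    FROM T1, T2
    WHERE T1.c1 = T2.c1
""")
df.explain()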

Spark will recognize the cluster by and shuffle the data. However, if you use the same columns in later queries that induce shuffles, Spark might re-use the exchange.
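One place this shows up concretely is within a single query plan (a sketch, reusing the prod view from the question): self-join the clustered view on the same key and look for ReusedExchange in the output of explain.
# Sketch: both sides of this self-join need the same hash partitioning on id,
# so Spark may mark one side as ReusedExchange instead of shuffling twice.
spark.sql("""
    SELECT a.id, a.name, b.name AS name2
    FROM prod a JOIN prod b ON a.id = b.id
""").explain()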

Related

Apache Spark Partitioning Data Using a SQL Function nTile

I am trying multiple ways to optimize executions of large datasets using partitioning. In particular I'm using a function commonly used with traditional SQL databases called nTile.
The objective is to place a certain number of rows into a bucket using a combination of bucketing and repartitioning. This allows Apache Spark to process the data more efficiently when working with partitioned, or rather bucketed, datasets.
Below are two examples. The first shows how I've used nTile to split a dataset into two buckets, followed by repartitioning the data into 2 partitions on the bucketed nTile column called skew_data.
I then follow with the same query but without any bucketing or repartitioning.
The problem is that the query without the bucketing is faster than the query with bucketing, even though the query without bucketing places all the data into one partition whereas the query with bucketing splits it into 2 partitions.
Can someone let me know why that is?
FYI
I'm running the query on an Apache Spark cluster on Databricks.
The cluster has just a single node with 2 cores and 15 GB of memory.
First example, with nTile/bucketing and repartitioning:
from pyspark.sql.functions import col, rand

allin = (
    spark.sql("""
        SELECT
            t1.make
            , t2.model
            , NTILE(2) OVER (ORDER BY t2.sale_price) AS skew_data
        FROM
            t1 INNER JOIN t2
            ON t1.engine_size = t2.engine_size2
    """)
    .repartition(2, col("skew_data"), rand())
    .drop("skew_data")
)
The above code splits the data into partitions as follows, with the corresponding partition distribution
Number of partitions: 2
Partitioning distribution: [5556767, 5556797]
The second example, with no nTile/bucketing or repartitioning:
allin_NO_nTile = spark.sql("""
SELECT
t1.make
,t2.model
FROM
t1 INNER JOIN t2
ON t1.engine_size = t2.engine_size2
""")
The above code puts all the data into a single partition as shown below:
Number of partitions: 1
Partitioning distribution: [11113564]
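For reference, the partition counts and distributions above can be obtained with something like the following (a sketch, not necessarily the asker's exact code):
# Count the partitions of a DataFrame and the number of rows in each one.
print("Number of partitions:", allin.rdd.getNumPartitions())
print("Partitioning distribution:", allin.rdd.glom().map(len).collect())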
My question is: why is the second query (without nTile or repartitioning) faster than the query with nTile and repartitioning?
I have gone to great lengths to write this question out as fully as possible, but if you need further explanation please don't hesitate to ask. I really want to get to the bottom of this.
I abandoned my original approach and used the new PySpark function bucketBy(). If you want to know how to apply bucketBy() to bucket data, go to
https://www.youtube.com/watch?v=dv7IIYuQOXI&list=PLOmMQN2IKdjvowfXo_7hnFJHjcE3JOKwu&index=39
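For completeness, a minimal bucketBy() sketch (table and column names are illustrative, not necessarily the asker's): bucketing is applied when writing a table with saveAsTable, and later joins or aggregations on the bucket column can then avoid a shuffle.
# Sketch: write t2 bucketed by the join column; a subsequent join on
# engine_size2 against a table bucketed the same way can skip the shuffle.
(spark.table("t2")
    .write
    .bucketBy(2, "engine_size2")
    .sortBy("engine_size2")
    .mode("overwrite")
    .saveAsTable("t2_bucketed"))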

Repartition in Spark - SQL API

We use the SQL API of Spark to execute queries on Hive tables on the cluster. How can I perform a REPARTITION on a column in my query via the SQL API? Please note that we do not use the DataFrame API; instead we use the SQL API (e.g. SELECT * from table WHERE col = 1).
I understand that PySpark offers a function for this in the DataFrame API.
However, I want to know the syntax to specify a REPARTITION (on a specific column) in a SQL query via the SQL API (through a SELECT statement).
Consider the following query :
select a.x, b.y
from a
JOIN b
on a.id = b.id
Any help is appreciated.
We use Spark 2.4
Thanks
You can provide hints to trigger a repartition in Spark SQL:
spark.sql('''SELECT /*+ REPARTITION(colname) */ col1,col2 from table''')
You can use both approaches; if you are using %sql, then from the manuals:
DISTRIBUTE BY
Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be hashed to the same worker. You cannot use this with ORDER BY or CLUSTER BY.
It all amounts to the same thing: a shuffle occurs, i.e. you cannot eliminate it; these are just alternative interfaces. Of course, this is only possible due to the 'lazy' evaluation employed.
%sql
SELECT * FROM boxes DISTRIBUTE BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
This is the %sql alternative to the hint shown in the other answer.
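Applied to the join in the question (a sketch, using the asker's table and column names), each side can be distributed by the join key in a subquery while staying entirely in SQL:
# Sketch: DISTRIBUTE BY in the subqueries shuffles each side by the join key
# before the join itself.
spark.sql("""
    SELECT a.x, b.y
    FROM (SELECT * FROM a DISTRIBUTE BY id) a
    JOIN (SELECT * FROM b DISTRIBUTE BY id) b
    ON a.id = b.id
""")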

Usage of Repartition in Spark SQL Queries

I am new to Spark-SQL. I read somewhere about using REPARTITION() before joins in Spark SQL queries to achieve better performance.
However, I use plain Spark SQL queries (not PySpark) and I am struggling to find the equivalent syntax for REPARTITION in such plain queries, like the sample shown below.
/* how to use repartition() here ? */
select t1.id, t2.name
from table1 t1
inner join table2 t2
on t1.id = t2.id;
Can anyone please share the usage and syntax to be used in the above sample query? Also, I want to understand in which scenarios repartition should be used to achieve better join performance.
Thanks.
As per SPARK-24940, from Spark 2.4 you can use REPARTITION and COALESCE hints in SQL.
Example:
#sample dataframe has 12 partitions
spark.sql(" select * from tmp").rdd.getNumPartitions()
12
#after repartition has 5 partitions
spark.sql(" select /*+ REPARTITION(5) */ * from tmp").rdd.getNumPartitions()
5
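Applied to the join from the question, a sketch could look like the following (50 is an arbitrary target partition count; the hint is simply embedded in the SELECT):
# Sketch: table and column names are the asker's; adjust the partition count
# and key to the data.
spark.sql("""
    SELECT /*+ REPARTITION(50, id) */ t1.id, t2.name
    FROM table1 t1
    INNER JOIN table2 t2
    ON t1.id = t2.id
""")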

SELECT DISTINCT Cassandra in Spark

I need a query that lists out the unique composite partition keys inside of Spark.
The query in Cassandra, SELECT DISTINCT key1, key2, key3 FROM schema.table;, is quite fast; however, putting the same sort of data filter in an RDD or spark.sql retrieves results incredibly slowly in comparison.
e.g.
---- SPARK ----
var t1 = sc.cassandraTable("schema","table").select("key1", "key2", "key3").distinct()
var t2 = spark.sql("SELECT DISTINCT key1, key2, key3 FROM schema.table")
t1.count // takes 20 minutes
t2.count // takes 20 minutes
---- CASSANDRA ----
// takes < 1 minute while also printing out all results
SELECT DISTINCT key1, key2, key3 FROM schema.table;
where the table format is like:
CREATE TABLE schema.table (
key1 text,
key2 text,
key3 text,
ckey1 text,
ckey2 text,
v1 int,
PRIMARY KEY ((key1, key2, key3), ckey1, ckey2)
);
Doesn't Spark use Cassandra optimisations in its queries?
How can I retrieve this information efficiently?
Quick Answers
Doesn't Spark use Cassandra optimisations in its queries?
Yes. But with SparkSQL only column pruning and predicate pushdowns. In RDDs it is manual.
How can I retrieve this information efficiently?
Since your request returns quickly enough, I would just use the Java Driver directly to get this result set.
Long Answers
While Spark SQL can provide some C* based optimizations, these are usually limited to predicate pushdowns when using the DataFrame interface. This is because the framework only provides limited information to the datasource. We can see this by doing an explain on the query you have written.
Let's start with the SparkSQL example
scala> spark.sql("SELECT DISTINCT key1, key2, key3 FROM test.tab").explain
== Physical Plan ==
*HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
+- Exchange hashpartitioning(key1#30, key2#31, key3#32, 200)
+- *HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation test.tab[key1#30,key2#31,key3#32] ReadSchema: struct<key1:string,key2:string,key3:string>
So your Spark example will actually be broken into several steps.
Scan: read all the data from this table. This means serializing every value from the C* machine to the Spark executor JVM, in other words lots of work.
*HashAggregate/Exchange/HashAggregate: take the values from each executor, hash them locally, then exchange the data between machines and hash again to ensure uniqueness. In layman's terms this means creating large hash structures, serializing them, running a complicated distributed sort-merge, then running a hash again. (Expensive.)
Why doesn't any of this get pushed down to C*? This is because the Datasource (the CassandraSourceRelation in this case) is not given the information about the Distinct part of the query. This is just part of how Spark currently works. Docs on what is pushable
So what about the RDD version?
With RDDs we give a direct set of instructions to Spark. This means if you want to push something down it must be manually specified. Let's see the debug output of the RDD request
scala> sc.cassandraTable("test","tab").distinct.toDebugString
res2: String =
(13) MapPartitionsRDD[7] at distinct at <console>:45 []
| ShuffledRDD[6] at distinct at <console>:45 []
+-(13) MapPartitionsRDD[5] at distinct at <console>:45 []
| CassandraTableScanRDD[4] at RDD at CassandraRDD.scala:19 []
Here the issue is that your "distinct" call is a generic operation on an RDD and not specific to Cassandra. Since RDDs require all optimizations to be explicit (what you type is what you get) Cassandra never hears about this need for "Distinct" and we get a plan that is almost identical to our Spark SQL version. Do a full scan, serialize all of the data from Cassandra to Spark. Do a Shuffle and then return the results.
So what can we do about this?
With SparkSQL this is about as good as we can get without adding new rules to Catalyst (the SparkSQL/Dataframes Optimizer) to let it know that Cassandra can handle some distinct calls at the server level. It would then need to be implemented for the CassandraRDD subclasses.
For RDDs we would need to add a function like the already existing where, select, and limit calls on the Cassandra RDD. A new Distinct call could be added here, although it would only be allowable in specific situations. This is a function that currently does not exist in the SCC but could be added relatively easily, since all it would do is prepend DISTINCT to requests and probably add some checking to make sure it is a DISTINCT that makes sense.
What can we do right now, without modifying the underlying connector?
Since we know the exact CQL request that we would like to make we can always use the Cassandra driver directly to get this information. The Spark Cassandra connector provides a driver pool we can use or we could just use the Java Driver natively. To use the pool we would do something like
import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(sc.getConf).withSessionDo{ session =>
session.execute("SELECT DISTINCT key1, key2, key3 FROM test.tab;").all()
}
Then parallelize the results if they are needed for further Spark work. If we really wanted to distribute this, we would most likely need to add the function to the Spark Cassandra Connector as described above.
As long as we are selecting the partition key, we can use the .perPartitionLimit function of the CassandraRDD:
val partition_keys = sc.cassandraTable("schema","table").select("key1", "key2", "key3").perPartitionLimit(1)
This works because, per SPARKC-436
select key from some_table per partition limit 1
gives the same result as
select distinct key from some_table
This feature was introduced in spark-cassandra-connector 2.0.0-RC1
and requires at least C* 3.6
Distinct has bad performance.
Here is a good answer with some alternatives:
How to efficiently select distinct rows on an RDD based on a subset of its columns
You can make use of toDebugString to get an idea of how much data your code shuffles.

Spark SQL broadcast hash join

I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL%20%26%20DataFrames/05%20BroadcastHashJoin%20-%20scala.html
In that example, the (small) DataFrame is persisted via saveAsTable and then there's a join via spark SQL (i.e. via sqlContext.sql("..."))
The problem I have is that I need to use the sparkSQL API to construct my SQL (I am left joining ~50 tables with an ID list, and don't want to write the SQL by hand).
How do I tell spark to use the broadcast hash join via the API? The issue is that if I load the ID list (from the table persisted via `saveAsTable`) into a `DataFrame` to use in the join, it isn't clear to me if Spark can apply the broadcast hash join.
You can explicitly mark the DataFrame as small enough for broadcasting using the broadcast function:
Python:
from pyspark.sql.functions import broadcast
small_df = ...
large_df = ...
large_df.join(broadcast(small_df), ["foo"])
or broadcast hint (Spark >= 2.2):
large_df.join(small_df.hint("broadcast"), ["foo"])
Scala:
import org.apache.spark.sql.functions.broadcast
val smallDF: DataFrame = ???
val largeDF: DataFrame = ???
largeDF.join(broadcast(smallDF), Seq("foo"))
or broadcast hint (Spark >= 2.2):
largeDF.join(smallDF.hint("broadcast"), Seq("foo"))
SQL
You can use hints (Spark >= 2.2):
SELECT /*+ MAPJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo
or
SELECT /*+ BROADCASTJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo
or
SELECT /*+ BROADCAST(small) */ *
FROM large JOIN small
ON large.foo = small.foo
R (SparkR):
With hint (Spark >= 2.2):
join(large, hint(small, "broadcast"), large$foo == small$foo)
With broadcast (Spark >= 2.3)
join(large, broadcast(small), large$foo == small$foo)
Note:
A broadcast join is useful if one of the structures is relatively small. Otherwise it can be significantly more expensive than a full shuffle.
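A quick way to confirm the hint took effect (a sketch, keeping the large/small names from above) is to look for BroadcastHashJoin in the physical plan:
# Sketch: BroadcastHashJoin should appear in the plan when the hint is honoured.
spark.sql("""
    SELECT /*+ BROADCAST(small) */ *
    FROM large JOIN small ON large.foo = small.foo
""").explain()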
join_rdd = sqlContext.sql("""select * from people_in_india p
                             join states s
                             on p.state = s.name""")
join_rdd.toDebugString() / join_rdd.explain():
shuffledHashJoin:
all the data for India will be shuffled into only 29 keys, one for each of the states.
Problems:
uneven sharding.
Limited parallelism with 29 output partitions.
broadcastHashJoin:
broadcast the small RDD to all worker nodes.
Parallelism of the large RDD is still maintained and a shuffle is not even required.
With a broadcast join, one side of the join equation is materialized and sent to all mappers. It is therefore considered a map-side join.
As the data set is materialized and sent over the network, it only brings a significant performance improvement if it is considerably small.
So if you are trying to perform smallDF.join(largeDF), wait! Another constraint is that the broadcast side also needs to fit completely into the memory of each executor. It also needs to fit into the memory of the driver!
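Related to that size constraint: Spark also broadcasts a join side automatically when it estimates it below a configurable threshold (10 MB by default); a small sketch of adjusting it:
# Raise or lower the automatic broadcast threshold (in bytes); -1 disables it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)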
Broadcast variables are shared among executors using the Torrent protocol, i.e. a peer-to-peer protocol. The advantage of the Torrent protocol is that peers share blocks of a file among each other without relying on a central entity holding all the blocks.
The above example is sufficient to start playing with broadcast joins.
Note:
You cannot modify the value after creation.
If you try, the change will only be visible on one node.
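A minimal broadcast-variable sketch (illustrative values) showing the read-only behaviour: executors read lookup.value, and re-assigning it on a worker would not propagate anywhere else.
# Sketch: a small lookup table shared with every executor; it is read-only
# on the workers.
lookup = sc.broadcast({"AP": "Andhra Pradesh", "MH": "Maharashtra"})
codes = sc.parallelize(["AP", "MH", "AP"])
print(codes.map(lambda c: lookup.value[c]).collect())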
