Spark Clustered By/Bucket by dataset not using memory - apache-spark

I recently came across Spark's bucketBy/CLUSTERED BY feature here.
I tried to replicate it for a 1.1 TB source dataset in S3 (already in Parquet). The plan is to avoid the shuffle completely, since most of the datasets are always joined on the "id" column. Here is what I am doing:
.option("mode", "DROPMALFORMED")
.option("compression", "snappy")
On a different EMR cluster, I create a table and access it.
CREATE TABLE newtable_on_diff_cluster (id string, day date, col1 double, col2 double) USING PARQUET OPTIONS (
path "s3://my-bucket/folder/1year_data_bucketed/"
I then create a Scala DataFrame and join it with another table that is also bucketed into 20 buckets on the id column.
val myTableBucketedDf = spark.table("newtable_on_diff_cluster")
val myDimTableBucketedDf = spark.table("another_table_with_same_bucketing")
val joinedOutput = myTableBucketedDf.join(myDimTableBucketedDf, "id")
Here are my questions:
I see that even with repartition, the shuffle is still removed in the explain plan, which is good. Is there any issue with using repartition, partition, and bucketBy in the above fashion?
According to Ganglia, the above join does not appear to be using memory on my EMR cluster. When joining regular Parquet files without bucketing, the joins seem to run faster, in memory, for a small number of day partitions. I haven't tested it for more days. How exactly is the join processed here?
Is there any way to avoid the CREATE TABLE SQL statement and instead use the Parquet metadata to define the table schema from Scala? I don't want to repeat the column names and data types when they are already available in the Parquet files.
What is the ideal number of buckets, or individual file size after bucketBy, in terms of available memory on the executor? If the number of unique values in the ID column is in the ~100 MM range, then, if I understand correctly, 20 buckets will hold roughly 5 MM unique IDs each. I understand that the sortBy here is not respected because Spark produces multiple files per bucket for bucketBy. What is the recommendation for repartition, final file sizes, and number of buckets in this case?


Avoid data shuffle and coalesce-numPartitions is not applied to individual partition while doing left anti-join in spark dataframe

I have two dataframes, target_df and reference_df. I need to remove the account_ids in target_df that are present in reference_df.
target_df is created from a Hive table and will have hundreds of partitions. It is partitioned by date (20220101 to 20221101).
I am doing a left anti-join and writing the data to an HDFS location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
I need to apply the same numPartitions value to each date partition, but it is being applied to the entire dataframe. For example, if there are 100 date partitions, I need to end up with 100 * 10 = 1000 part files. This code is not working as expected. I tried repartitioning by "date", but this causes a huge data shuffle.
Can anyone please provide an optimized solution? Thanks!
I am afraid that you cannot skip the shuffle in this case. repartition, coalesce, and partitionBy all work at the dataset level, and I don't think there is a way to split each partition into 10 without a shuffle.
You tried to use coalesce, which does not cause a shuffle, and that is true, but coalesce can only be used to decrease the number of partitions, so it is not going to help you.
You can try to achieve what you want by using a combination of repartition and partitionBy. Here is a description of both functions from an article (the same applies to Scala):
PySpark repartition() is a DataFrame method that is used to increase
or reduce the partitions in memory and when written to disk, it create
all part files in a single directory.
PySpark partitionBy() is a method of DataFrameWriter class which is
used to write the DataFrame to disk in partitions, one sub-directory
for each unique value in partition columns.
If you first repartition your dataset with repartition(1000), Spark is going to create 1000 partitions in memory. Later, when you call partitionBy on the writer, Spark creates a sub-directory for each value and writes one part file into it for each in-memory partition that contains the given key.
So if, after the repartition, date X is present in 500 partitions out of 1000, you will find 500 files in the sub-directory for this date.
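The file-count rule described above can be modelled without Spark. The following is only a sketch in plain Python, not Spark code: rows are scattered over in-memory partitions with a uniform random choice (a stand-in for repartition(n), not Spark's actual distribution scheme), and each partition then contributes one file per distinct date it holds (a stand-in for partitionBy). The function name and the sample dates are made up for illustration.

```python
import random
from collections import defaultdict

def files_per_directory(row_dates, num_partitions, seed=0):
    """Model repartition(num_partitions) followed by write.partitionBy("date").

    Returns {date: number of part files in that date's sub-directory}.
    """
    rng = random.Random(seed)
    partitions_holding = defaultdict(set)
    for date in row_dates:
        # repartition(n) spreads rows over n in-memory partitions
        # (modelled as a uniform random choice; an assumption, not Spark's exact scheme)
        pid = rng.randrange(num_partitions)
        partitions_holding[date].add(pid)
    # each in-memory partition writes one file per date value it contains
    return {date: len(pids) for date, pids in partitions_holding.items()}

# two large dates and one date with only three rows
rows = ["20220101"] * 5000 + ["20220102"] * 5000 + ["20220103"] * 3
counts = files_per_directory(rows, num_partitions=10)
print(counts)
```

In this model a date with plenty of rows ends up with as many files as there are in-memory partitions, while a date with very few rows gets at most one file per row. That is why repartitioning to 10 and then writing with partitionBy on the date tends to give about 10 files per date directory, at the cost of the shuffle that the repartition itself performs.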
In the article I mentioned previously you can find a simple example of this behaviour; check chapter 1.3, "partitionBy(colNames : String*) Example":
#Use repartition() and partitionBy() together
.write.option("header",True) \
.partitionBy("state") \
.mode("overwrite") \

Apache Spark Partitioning Data Using a SQL Function nTile

I am trying multiple ways to optimize the execution of queries on large datasets using partitioning. In particular, I'm using nTile, a function commonly used with traditional SQL databases.
The objective is to place a certain number of rows into a bucket using a combination of bucketing and repartitioning. This should allow Apache Spark to process the data more efficiently when working with partitioned, or should I say bucketed, datasets.
Below are two examples. The first shows how I've used nTile to split a dataset into two buckets, followed by repartitioning the data into 2 partitions on the nTile bucket column, skew_data.
I then follow with the same query but without any bucketing or repartitioning.
The problem is that the query without bucketing is faster than the query with bucketing, even though the query without bucketing places all the data into one partition, whereas the query with bucketing splits the data into 2 partitions.
Can someone let me know why that is?
I'm running the query on an Apache Spark cluster from Databricks.
The cluster has a single node with 2 cores and 15 GB of memory.
First example: with nTile/bucketing and repartitioning
allin = spark.sql("""
, t2.model
, NTILE(2) OVER (ORDER BY t2.sale_price) AS skew_data
ON t1.engine_size = t2.engine_size2
.repartition(2, col("skew_data"), rand())
The above code splits the data into partitions as follows, with the corresponding partition distribution
Number of partitions: 2
Partitioning distribution: [5556767, 5556797]
The second example: with no nTile/bucketing or repartitioning
allin_NO_nTile = spark.sql("""
ON t1.engine_size = t2.engine_size2
The above code puts all the data into a single partition as shown below:
Number of partitions: 1
Partitioning distribution: [11113564]
My question is: why is the second query (without nTile or repartitioning) faster than the query with nTile and repartitioning?
I have gone to great lengths to write this question out as fully as possible, but if you need further explanation please don't hesitate to ask. I really want to get to the bottom of this.
I abandoned my original approach and used the newer PySpark function called bucketBy() instead. If you want to know how to apply bucketBy() to bucket data go to

Date partition size 10GB read efficiently

We are using Cassandra (DataStax 6.0) with Spark enabled. We have 10 GB of data coming in every day, and all queries are based on date. We have one huge table with 40 columns, and we are planning to generate reports using Spark. What is the best way to set up this data, given that we keep receiving data every day and keep around 1 year of data in one table?
We tried different partitioning schemes, but most of our keys are based on date.
There is no code to show; we just need suggestions.
Our queries should be fast enough. We have 9 nodes with 256 GB of RAM and 44 CPU cores.
Organizing the data in daily partitions isn't a very good design: in that case only RF nodes (as many as the replication factor) will be active during the day while the data is written, and again when the report is generated.
Because you'll be accessing that data only from Spark, you can use the following approach: have a bucket field as the partition key (for example, a uniformly generated random number), a timestamp as a clustering column, and maybe another uuid column to guarantee uniqueness of records, something like this:
create table test.sdtest (
b int,
ts timestamp,
uid uuid,
v1 int,
primary key(b, ts, uid));
The maximum value used when generating b should be chosen so that partitions are neither too big nor too small, so that we can read them efficiently.
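As a rough way to pick that maximum, the number of buckets can be estimated from the data volume and a target partition size. The sketch below uses the 10 GB/day and one-year retention from the question; the 100 MB target partition size and the function name are assumptions made for illustration, not values given in this answer, and should be tuned by testing.

```python
import math

def bucket_count(daily_gb, retention_days, target_partition_mb):
    """Number of distinct values for the bucket column b so that each
    Cassandra partition holds roughly target_partition_mb of data."""
    total_mb = daily_gb * 1024 * retention_days
    return math.ceil(total_mb / target_partition_mb)

# 10 GB/day kept for about one year (from the question);
# 100 MB per partition is an assumed target, not a value from the answer
n_buckets = bucket_count(daily_gb=10, retention_days=365, target_partition_mb=100)
print(n_buckets)
```

On write, b would then be drawn uniformly from 0 to n_buckets - 1 for every record.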
And then we can run Spark code like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T00:00:00+0000' as timestamp) AND ts < cast('2019-03-11T00:00:00+0000' as timestamp)")
The trick here is that we distribute the data across the nodes by using the random partition key, so all nodes share the load both while the data is written and during report generation.
If we look into physical plan for that Spark code (formatted for readability):
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [b#23,ts#24,v1#25]
PushedFilters: [*GreaterThanOrEqual(ts,2019-03-10 00:00:00.0),
*LessThan(ts,2019-03-11 00:00:00.0)], ReadSchema: struct<b:int,ts:timestamp,v1:int>
We can see that both conditions are pushed down to DSE at the CQL level. This means that Spark won't load all the data into memory and filter it there; instead, the filtering happens in Cassandra and only the necessary data is returned. And because we're spreading requests across multiple nodes, reading could be faster (this needs testing) than reading one giant partition. Another benefit of this design is that it is easy to delete old data using Spark, with something like this:
val toDel = sc.cassandraTable("test", "sdtest").where("ts < '2019-08-10T00:00:00+0000'")
toDel.deleteFromCassandra("test", "sdtest", keyColumns = SomeColumns("b", "ts"))
In this case, Spark will perform a very efficient range/row deletion that generates fewer tombstones.
P.S. it's recommended to use DSE's version of the Spark connector as it may have more optimizations.
P.P.S. theoretically, we can merge ts and uid into one timeuuid column, but I'm not sure that it will work with Dataframes.

In Apache Spark's `bucketBy`, how do you generate 1 file per bucket instead of 1 file per bucket per partition?

I am trying to use Spark's bucketBy feature on a pretty large dataset.
.bucketBy(500, bucketColumn1, bucketColumn2)
.option("path", "s3://my-bucket")
The problem is that my Spark cluster has about 500 partitions/tasks/executors (I'm not sure of the terminology), so I end up with files that look like:
That's 500x500=250000 bucketed parquet files! It takes forever for the FileOutputCommitter to commit that to S3.
Is there a way to generate one file per bucket, like in Hive? Or is there a better way to deal with this problem? As of now it seems like I have to choose between lowering the parallelism of my cluster (reduce number of writers) or reducing the parallelism of my parquet files (reduce number of buckets).
To get 1 file per final bucket, do the following: right before writing the dataframe as a table, repartition it using exactly the same columns as the ones you are using for bucketing, and set the number of new partitions equal to the number of buckets you will use in bucketBy (or a smaller number that is a divisor of the number of buckets, though I don't see a reason to use a smaller number here).
In your case that would probably look like this:
dataframe.repartition(500, bucketColumn1, bucketColumn2)
.bucketBy(500, bucketColumn1, bucketColumn2)
.option("path", "s3://my-bucket")
When you're saving to an existing table, you need to make sure the column types match exactly. For example, if column X is INT in the dataframe but BIGINT in the table you're inserting into, repartitioning by X into 500 partitions won't match the bucketing of X treated as BIGINT, and you'll end up with each of the 500 executors writing 500 files again.
Just to be 100% clear - this repartitioning will add another step into your execution which is to gather the data for each bucket on 1 executor (so one full data reshuffle if the data was not partitioned same way before). I'm assuming that is exactly what you want.
It was also mentioned in comments to another answer that you'll need to be prepared for possible issues if your bucketing keys are skewed. That is true, but the default Spark behavior doesn't help you much if the first thing you do after loading the table is aggregate or join on the same columns you bucketed by (which seems like a very likely scenario for someone who chose to bucket by these columns). Instead, you get a delayed issue and only see the skew when you try to load the data after writing it.
In my opinion it would be really nice if Spark offered a setting to always repartition your data before writing a bucketed table (especially when inserting into existing tables).
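The effect of repartitioning on the bucket columns before writing can be illustrated with a small model in plain Python. This is a sketch, not Spark code: it assumes each write task emits one file per bucket it holds rows for, and it uses an MD5-based hash as a stand-in for Spark's own hash function. The helper names and the sizes (50 tasks, 50 buckets) are made up for illustration.

```python
import hashlib

def bucket_of(key, n):
    # Stand-in hash for illustration only; Spark uses its own hash function.
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

def count_files(keys, num_buckets, partition_of):
    """Each write task emits one file per bucket it holds rows for."""
    files = {(partition_of(i, k), bucket_of(k, num_buckets))
             for i, k in enumerate(keys)}
    return len(files)

keys = range(100_000)
B = 50  # number of buckets

# No repartition: rows of every bucket are spread over all 50 tasks
# (modelled as round-robin by row position)
unaligned = count_files(keys, B, lambda i, k: i % 50)
# repartition(B, key): task id and bucket id come from the same hash
aligned = count_files(keys, B, lambda i, k: bucket_of(k, B))
# repartition(10, key): 10 divides 50, so each bucket still lives in one task
divisor = count_files(keys, B, lambda i, k: bucket_of(k, 10))
print(unaligned, aligned, divisor)
```

In this model the unaligned write approaches tasks x buckets files, while both the aligned case and the divisor case produce one file per bucket. With a divisor, each task simply writes several buckets instead of one.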
This should solve it.
.bucketBy(1, bucketColumn1, bucketColumn2)
.option("path", "s3://my-bucket")
That is, set the first argument of bucketBy (the number of buckets) to 1.
You can look at the code of bucketBy from spark's git repository -
The first split (part-00001, part-00002, and so on) is based on the number of parallel tasks running when you save the bucketed table. In your case you had 500 parallel tasks running. The number of files each of those tasks writes is determined by the number of buckets you pass to bucketBy.

How to auto calculate numRepartition while using spark dataframe write

When I write a dataframe to a Hive Parquet partitioned table,
it creates a lot of blocks in HDFS, each of which holds only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and then writes its data to it.
I also understand that a larger number of blocks improves Hadoop performance up to a point, but decreases it once a threshold is reached.
If I want to set numPartition automatically, does anyone have a good idea?
numPartition = ??? // auto calc basing on df size or something
First of all, why do you want an extra repartition step when you are already using partitionBy(key)? Your data would already be partitioned by the key.
In general, you can repartition by a column value. That is a common scenario and helps with operations like reduceByKey, filtering on a column value, and so on. For example,
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark
val birthYears = List(
  (2000, "name1"),
  (2000, "name2"),
  (2001, "name3"),
  (2000, "name4"),
  (2001, "name5")
)
val df = birthYears.toDF("year", "name")
By default, Spark creates 200 partitions for shuffle operations, so 200 files/blocks (if the files are small) will be written to HDFS.
Configure the number of partitions created after a shuffle, based on your data, using the configuration below:
spark.conf.set("spark.sql.shuffle.partitions", <number of partitions>)
For example, with spark.conf.set("spark.sql.shuffle.partitions", "5"), Spark will create 5 partitions and 5 files will be written to HDFS.
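One simple way to derive that number automatically is to divide the estimated data size by a target file size. The sketch below is plain Python, not a Spark API: the function name is hypothetical, the 128 MiB target is an assumption (chosen to match a common HDFS block size), and how you estimate the total size, for example from the size of the input files, is left to you.

```python
import math

def num_partitions_for(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Partitions needed so each written file is about target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

print(num_partitions_for(10 * 1024**3))  # 10 GiB of data
print(num_partitions_for(5 * 1024**2))   # a tiny input still gets one partition
```

The result could then be used as the value for spark.sql.shuffle.partitions or passed to repartition before the write.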
