How to control partition size in Spark SQL - apache-spark

I have a requirement to load data from a Hive table using Spark SQL HiveContext and write it to HDFS. By default, the DataFrame from the SQL output has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no overloaded method in HiveContext that takes a number-of-partitions parameter.
Repartitioning the RDD causes shuffling and results in more processing time.
val result = sqlContext.sql("select * from bt_st_ent")
produces the log output:
Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)
I would like to know if there is any way to increase the number of partitions of the SQL output.

Spark < 2.0:
You can use Hadoop configuration options:
mapred.min.split.size
mapred.max.split.size
as well as the HDFS block size to control partition size for filesystem-based formats*.
val minSplit: Int = ???
val maxSplit: Int = ???
sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
Spark 2.0+:
You can use spark.sql.files.maxPartitionBytes configuration:
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
In both cases these values may not be used by a specific data source API, so you should always check the documentation / implementation details of the format you use.
* Other input formats can use different settings. See for example
Partitioning in spark while reading from RDBMS via JDBC
Difference between mapreduce split and spark partition
Furthermore, Datasets created from RDDs will inherit the partition layout of their parents.
Similarly, bucketed tables will use the bucket layout defined in the metastore, with a 1:1 relationship between buckets and Dataset partitions.

A very common and painful problem. You should look for a key that distributes the data into uniform partitions. Then you can use the DISTRIBUTE BY and CLUSTER BY operators to tell Spark to group rows into partitions. This incurs some overhead on the query itself, but results in evenly sized partitions. Deepsense has a very good tutorial on this.
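A minimal sketch of that approach, assuming a hypothetical column st_key that spreads rows evenly; the number of output partitions comes from spark.sql.shuffle.partitions:
// st_key is a hypothetical, evenly distributed column in bt_st_ent.
sqlContext.setConf("spark.sql.shuffle.partitions", "64")
val distributed = sqlContext.sql("select * from bt_st_ent distribute by st_key")
println(distributed.rdd.getNumPartitions)  // expected: 64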

If your SQL performs a shuffle (for example it has a join, or some sort of group by), you can set the number of partitions via the spark.sql.shuffle.partitions property:
sqlContext.setConf("spark.sql.shuffle.partitions", "64")
Following up on what Fokko suggests, you could use a random variable to cluster by.
val result = sqlContext.sql("""
  select * from (
    select *, floor(rand() * 64) as rand_part from bt_st_ent
  ) cluster by rand_part""")
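As a follow-up sketch, you can check the resulting parallelism and drop the helper column before writing out:
println(result.rdd.getNumPartitions)    // should match spark.sql.shuffle.partitions
val cleaned = result.drop("rand_part")  // remove the helper column before writing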

Related

How is a Spark Dataframe partitioned by default?

I know that an RDD is partitioned based on its key values using a HashPartitioner. But how is a Spark Dataframe partitioned by default, given that it does not have the concept of key/value?
A Dataframe is partitioned depending on the number of tasks that run to create it.
There is no "default" partitioning logic applied. Here are some examples of how partitions are set:
A Dataframe created through val df = Seq(1 to 500000: _*).toDF() will have only a single partition.
A Dataframe created through val df = spark.range(0,100).toDF() has as many partitions as the number of available cores (e.g. 4 when your master is set to local[4]). Also see the remark below on the "default parallelism" that comes into effect for operations like parallelize with no parent RDD.
A Dataframe derived from an RDD (spark.createDataFrame(rdd, schema)) will have the same number of partitions as the underlying RDD. In my case, since I have 6 cores locally, the RDD was created with 6 partitions.
A Dataframe consuming from a Kafka topic will have as many partitions as the topic, because it can use as many cores/slots as the topic has partitions to consume it.
A Dataframe created by reading a file, e.g. from HDFS, will have as many partitions as the file has splits, unless individual files have to be split into multiple partitions based on spark.sql.files.maxPartitionBytes, which defaults to 128MB.
A Dataframe derived from a transformation requiring a shuffle will have the configurable number of partitions set by spark.sql.shuffle.partitions (200 by default).
...
One of the major distinctions between the RDD and Structured APIs is that you do not have as much control over the partitions as you have with RDDs, where you can even define a custom partitioner. This is not possible with Dataframes.
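A small sketch to check a few of the cases above with rdd.getNumPartitions (run e.g. in spark-shell; exact counts depend on your Spark version and master setting):
import spark.implicits._  // assumes an active SparkSession named spark

val fromSeq   = Seq(1 to 500000: _*).toDF()     // single partition, per the case above
val fromRange = spark.range(0, 100).toDF()      // roughly the number of available cores
val shuffled  = fromRange.groupBy("id").count() // governed by spark.sql.shuffle.partitions

println(fromSeq.rdd.getNumPartitions)
println(fromRange.rdd.getNumPartitions)
println(shuffled.rdd.getNumPartitions)          // 200 by default (adaptive execution may coalesce this)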
Default Parallelism
The documentation of the Execution Behavior configuration spark.default.parallelism explains:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
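For completeness, a hedged example of overriding this default when the session is built (the value 48 is purely illustrative; the property cannot be changed after the context exists):
import org.apache.spark.sql.SparkSession

// 48 is an arbitrary illustrative value.
val spark = SparkSession.builder()
  .appName("default-parallelism-example")
  .config("spark.default.parallelism", "48")
  .getOrCreate()

println(spark.sparkContext.defaultParallelism)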

How to auto calculate numRepartition while using spark dataframe write

When I tried to write a dataframe to a Hive Parquet partitioned table
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
it created a lot of blocks in HDFS, each of which holds only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and then writes data to it.
I also understand that more blocks can increase Hadoop parallelism, but past a threshold the number of blocks starts to hurt performance.
If I want to set numPartition automatically, does anyone have a good idea how?
val numPartition: Int = ???  // auto calc based on df size or something
df.repartition(numPartition).write
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")
First of all, why do you want an extra repartition step when you are already using partitionBy(key)? Your data would be partitioned based on the key.
Generally, you could re-partition by a column value; that's a common scenario and helps in operations like reduceByKey, filtering based on column value, etc. For example:
import spark.implicits._  // needed for toDF and the $ column syntax

val birthYears = List(
  (2000, "name1"),
  (2000, "name2"),
  (2001, "name3"),
  (2000, "name4"),
  (2001, "name5")
)

val df = birthYears.toDF("year", "name")

df.repartition($"year")
By default, Spark will create 200 partitions for shuffle operations, so 200 files/blocks (each quite small if the data is small) will be written to HDFS.
Configure the number of partitions to be created after a shuffle based on your data, using the configuration below:
spark.conf.set("spark.sql.shuffle.partitions", <number of partitions>)
For example, spark.conf.set("spark.sql.shuffle.partitions", "5") makes Spark create 5 partitions, so 5 files will be written to HDFS.
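The original ask, deriving numPartition automatically, could be sketched roughly as below, assuming the source data lives at a known HDFS path and that output files of about 128 MB are acceptable; both the path and the target size are assumptions, not values from the question:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical path and target size; compressed output may come out smaller than the input.
val targetFileBytes = 128L * 1024 * 1024
val fs = FileSystem.get(new Configuration())
val inputBytes = fs.getContentSummary(new Path("/warehouse/db/source_table")).getLength
val numPartition = math.max(1, (inputBytes / targetFileBytes).toInt)

df.repartition(numPartition)
  .write
  .partitionBy("key")
  .mode("append")
  .format("hive")
  .saveAsTable("db.table")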

How to set multiple spark configurations for same spark job

I am dealing with a weird situation where I have small tables and big tables to process using Spark, and it must be a single Spark job.
To achieve best performance targets, I need to set a property called
spark.sql.shuffle.partitions = 12 for small tables and
spark.sql.shuffle.partitions = 500 for bigger tables
I want to know how I can change these properties dynamically in Spark.
Can I have multiple configuration files and call them from within the program?
The apache-spark documentation describes this property as follows:
spark.sql.shuffle.partitions (default: 200) -- Configures the number of partitions to use when shuffling data for joins or aggregations.
According to the above description, you just need to load your data into RDDs/DataFrames, and when you want to apply a join or aggregation on them you can change the number of partitions using the repartition method.
As mentioned in the documentation, the default value is 200, so you need to increase the partition count in one case and decrease it in the other. For that you can use the repartition and coalesce methods: repartition works for both increasing and decreasing, but when decreasing it carries shuffle overhead, whereas coalesce avoids that overhead and is more optimized than repartition for reducing the partition count.
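A brief sketch of the two methods (smallDf and bigDf are placeholder DataFrames):
// repartition shuffles fully and can increase or decrease the partition count.
val widened  = smallDf.repartition(500)

// coalesce only merges existing partitions, so it is the cheaper way to reduce the count.
val narrowed = bigDf.coalesce(12)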
If you have the tables in Parquet files, for instance, you could inspect the files in advance to determine whether a table is small or big, and change the value of your shuffle partitions accordingly:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)

// Total size of the table's files on HDFS, in bytes; x is your chosen small/big threshold.
val total = hdfs.getContentSummary(new Path(pathTable)).getLength
val partitions = if (total < x) "12" else "500"
spark.conf.set("spark.sql.shuffle.partitions", partitions)

Number of Partitions of Spark Dataframe

Can anyone explain how the number of partitions is determined for a Spark Dataframe?
I know that for an RDD we can specify the number of partitions while creating it, like below:
val RDD1 = sc.textFile("path", 6)
But for a Spark dataframe there does not seem to be an option to specify the number of partitions at creation time, as there is for an RDD.
The only possibility I can think of is to use the repartition API after creating the dataframe:
df.repartition(4)
So can anyone please let me know if we can specify the number of partitions while creating a dataframe?
You cannot, or at least not in the general case, but it is not that different compared to RDDs. For example, the textFile code you've provided sets only a lower limit on the number of partitions.
In general:
Datasets generated locally using methods like range, or toDF on a local collection, will use spark.default.parallelism.
Datasets created from an RDD inherit the number of partitions from their parent.
Datasets created using the data source API:
In Spark 1.x it typically depends on the Hadoop configuration (min / max split size).
In Spark 2.x there is a Spark SQL specific configuration in use.
Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, value range, and desired number of partitions.
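A hedged sketch of those JDBC options (the URL, table, column, and bounds are placeholders, not taken from the question):
// All connection details below are illustrative placeholders.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "schema.some_table")
  .option("partitionColumn", "id")   // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "16")
  .load()

println(jdbcDf.rdd.getNumPartitions)  // 16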
The default number of shuffle partitions for a Spark dataframe is 200 (spark.sql.shuffle.partitions).
The default number of partitions for an RDD is governed by spark.default.parallelism.

Spark SQL(Hive query through HiveContext) always creating 31 partitions

I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it involves, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions.
I am using this code snippet to execute the hive query:
var pairedRDD = hqlContext.sql(hql).rdd.map(...)
I am using Spark 1.3.1
The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple of exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents' numbers of partitions, and cartesian creates an RDD with their product.
To increase the number of partitions:
Use the repartition transformation, which will trigger a shuffle (see the sketch below).
Configure your InputFormat to create more splits.
Write the input data out to HDFS with a smaller block size.
This link has a good explanation of how the number of partitions is determined and how to increase it.
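A minimal sketch of the repartition option applied to the question's snippet (128 is an arbitrary illustrative target):
// repartition triggers a shuffle; replace identity with the original mapping function.
var pairedRDD = hqlContext.sql(hql).rdd
  .repartition(128)
  .map(identity)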
