How is a Spark Dataframe partitioned by default? - apache-spark

I know that an RDD is partitioned based on the key values using the HashPartitioner. But how is a Spark Dataframe partitioned by default as it does not have the concept of key/value.

A Dataframe's partitioning depends on the number of tasks that run to create it.
There is no single "default" partitioning logic applied. Here are some examples of how the partitions are set:
A Dataframe created through val df = Seq(1 to 500000: _*).toDF() will have only a single partition.
A Dataframe created through val df = spark.range(0,100).toDF() has as many partitions as the number of available cores (e.g. 4 when your master is set to local[4]). Also, see remark below on the "default parallelism" that comes into effect for operations like parallelize with no parent RDD.
A Dataframe derived from an RDD (spark.createDataFrame(rdd, schema)) will have the same number of partitions as the underlying RDD. In my case, since I have 6 cores locally, the RDD was created with 6 partitions.
A Dataframe consuming from a Kafka topic will have a number of partitions matching the partitions of the topic, because it can use as many cores/slots as the topic has partitions to consume the topic.
A Dataframe created by reading a file, e.g. from HDFS, will have a number of partitions matching that of the file, unless individual files have to be split into multiple partitions based on spark.sql.files.maxPartitionBytes, which defaults to 128MB.
A Dataframe derived from a transformation requiring a shuffle will have the configurable number of partitions set by spark.sql.shuffle.partitions (200 by default).
...
One of the major distinctions between the RDD and the Structured API is that you do not have as much control over the partitions as you have with RDDs, where you can even define a custom partitioner. This is not possible with Dataframes.
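To make this concrete, here is a small sketch that just prints the partition count for a few of the cases listed above. It assumes a local SparkSession; the exact numbers can differ depending on your core count, Spark version and settings such as adaptive execution.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("partition-demo").getOrCreate()
import spark.implicits._

// Local collection via toDF: a single partition
val df1 = Seq(1 to 500000: _*).toDF()
println(df1.rdd.getNumPartitions)

// spark.range: one partition per available core (4 with local[4])
val df2 = spark.range(0, 100).toDF()
println(df2.rdd.getNumPartitions)

// After a shuffle: governed by spark.sql.shuffle.partitions (200 by default)
spark.conf.set("spark.sql.shuffle.partitions", "50")
val grouped = df2.groupBy($"id" % 10).count()
println(grouped.rdd.getNumPartitions)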
Default Parallelism
The documentation of the Execution Behavior configuration spark.default.parallelism explains:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
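You can verify the value in effect for your own session directly on the SparkContext, e.g. (assuming the SparkSession from the sketch above):
// 4 with master local[4]
println(spark.sparkContext.defaultParallelism)

// parallelize with no explicit numSlices falls back to defaultParallelism
val numbers = spark.sparkContext.parallelize(1 to 1000)
println(numbers.getNumPartitions)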

Related

Spark group by Key and partitioning the data

I have a large csv file with data in following format.
cityId1,name,address,.......,zip
cityId2,name,address,.......,zip
cityId1,name,address,.......,zip
........
cityIdN,name,address,.......,zip
I am performing following operation on the above csv file:
Group by cityId as key and list of resources as value
df1.groupBy($"cityId").agg(collect_list(struct(cols.head, cols.tail: _*)) as "resources")
Change it to jsonRDD
val jsonDataRdd2 = df2.toJSON.rdd
Iterate through each Partition and upload to s3 per key
I cannot use the DataFrame partitionBy write because of business logic constraints (how other services read from S3).
My Questions:
What is the default size of a spark partition?
Let's say the default partition size is X MB and there is one large record in the DataFrame whose key carries Y MB of data (Y > X); what would happen in this scenario?
Do I need to worry about having the same key in different partitions in that case?
In answer to your questions:
When reading from secondary storage (S3, HDFS), the partitions correspond to the block size of the file system, 128MB or 256MB; but you can repartition RDDs immediately at read time, not Data Frames. (For JDBC and Spark Structured Streaming the partitions are dynamic in size.)
When applying 'wide transformations' and re-partitioning, the number and size of partitions will most likely change. The size of a given partition has a maximum value; in Spark 2.4.x the maximum partition size was increased to 8GB. So, if any transformation (e.g. collect_list in combination with groupBy) generates more than this maximum size, you will get an error and the program aborts. You need to partition wisely, or in your case have a sufficient number of partitions for the aggregation; see the spark.sql.shuffle.partitions parameter.
Spark's parallel processing model relies on 'keys' being allocated, via hash partitioning, range partitioning, etc., to one and only one partition by the shuffle. So, iterating through a partition with foreachPartition or mapPartitions poses no issue.
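A minimal sketch of that per-partition upload, following the question's own df1/cols code; uploadToS3 and the bucket/key names are hypothetical placeholders for whatever S3 client the job actually uses.
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._  // assumes an existing SparkSession `spark`

// Hypothetical stand-in for the real S3 upload logic
def uploadToS3(bucket: String, key: String, body: String): Unit = ???

val df2 = df1
  .groupBy($"cityId")
  .agg(collect_list(struct(cols.head, cols.tail: _*)) as "resources")

// After the groupBy shuffle, each cityId lives in exactly one partition,
// so every partition can be uploaded independently of the others.
df2.toJSON.rdd.foreachPartition { rows =>
  rows.foreach { json =>
    // object key naming is illustrative only
    uploadToS3("my-bucket", s"resources/${java.util.UUID.randomUUID()}.json", json)
  }
}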

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We have a Hive table partitioned into 2000 partitions and stored in parquet format. When this table is used in Spark, exactly 2000 tasks are executed among the executors. But we have a block size of 256 MB and we were expecting (total size / 256) partitions, which would surely be much less than 2000. Is there any internal logic by which Spark uses the physical structure of the data to create partitions? Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Actually our table is very huge, around 3 TB, with 2000 partitions. 3TB/256MB would actually come to about 11720, but we have exactly the same number of partitions as the table is physically partitioned into. I just want to understand how the tasks are generated based on data volume.
In general, Hive partitions are not mapped 1:1 to Spark partitions. One Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple Hive partitions.
The number of Spark partitions when you load a Hive table depends on the parameters:
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
You can check the partitions e.g. using
spark.table("yourtable").rdd.partitions
This will give you an Array of FilePartitions which contain the physical paths of your files.
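For example, a quick way to see the effect of the maxPartitionBytes setting; the table name is a placeholder, an existing SparkSession spark is assumed, and the resulting counts depend on your file sizes and layout.
// Partition count with the default 128MB spark.sql.files.maxPartitionBytes
println(spark.table("my_partitioned_table").rdd.getNumPartitions)

// Halving the limit typically yields more, smaller partitions for large files
spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)
println(spark.table("my_partitioned_table").rdd.getNumPartitions)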
Why you got exactly 2000 Spark partitions from your 2000 Hive partitions seems like a coincidence to me; in my experience this is very unlikely to happen. Note that the situation in Spark 1.6 was different: there the number of Spark partitions resembled the number of files on the filesystem (1 Spark partition for 1 file, unless the file was very large).
I just want to understand how the tasks are generated on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.

Number of Partitions of Spark Dataframe

Can anyone explain about the number of partitions that will be created for a Spark Dataframe.
I know that for a RDD, while creating it we can mention the number of partitions like below.
val RDD1 = sc.textFile("path", 6)
But for a Spark dataframe it looks like we do not have an option to specify the number of partitions at creation time like we do for an RDD.
The only possibility I can think of is to use the repartition API after creating the dataframe.
df.repartition(4)
So can anyone please let me know if we can specify the number of partitions while creating a dataframe?
You cannot, or at least not in the general case, but it is not that different compared to an RDD. For example, the textFile code you've provided sets only a lower limit on the number of partitions.
In general:
Datasets generated locally using methods like range or toDF on a local collection will use spark.default.parallelism.
Datasets created from an RDD inherit the number of partitions from their parent.
Datasets created using the data source API:
In Spark 1.x it typically depends on the Hadoop configuration (min / max split size).
In Spark 2.x there is a Spark SQL specific configuration in use.
Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, the range of values and the desired number of partitions (see the sketch below).
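A hedged sketch of those JDBC options; the connection URL, credentials, table and column names are all hypothetical.
// Each of the 10 partitions reads a slice of the id range [1, 1000000)
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.events")
  .option("user", "reader")
  .option("password", "secret")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()

println(jdbcDF.rdd.getNumPartitions)  // 10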
Default number of shuffle partitions in a Spark DataFrame: 200
Default number of partitions in an RDD: 10

Spark SQL(Hive query through HiveContext) always creating 31 partitions

I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions.
I using this code snippet to execute hive query:
var pairedRDD = hqlContext.sql(hql).rdd.map(...)
I am using Spark 1.3.1
Thanks,
Nitin
The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple of exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents’ number of partitions, and cartesian creates an RDD with their product.
To increase the number of partitions:
Use the repartition transformation, which will trigger a shuffle (see the sketch after this list).
Configure your InputFormat to create more splits.
Write the input data out to HDFS with a smaller block size.
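For the first option, a minimal sketch against the Spark 1.3-style code in the question; the target of 100 partitions and the key extraction are only illustrative.
// repartition triggers a shuffle into the requested number of partitions
var pairedRDD = hqlContext.sql(hql)
  .repartition(100)
  .rdd
  .map(row => (row.getString(0), row))  // stand-in for the original map(...)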
This link here has a good explanation of how the number of partitions is defined and how to increase the number of partitions.

RDD and partition in Apache Spark

So, in Spark, when an application is started, an RDD containing the dataset for the application (e.g. the words dataset for WordCount) is created.
So far what I understand is that an RDD is a collection of those words in WordCount plus the operations that have been applied to that dataset (e.g. map, reduceByKey, etc...).
However, afaik, Spark also has HadoopPartition (or in general: partition) which is read by every executor from HDFS. And I believe that the RDD in the driver also contains all of these partitions.
So, what is getting divided among executors in Spark? Does every executor get its sub-dataset as a single RDD which contains less data compared to the RDD in the driver, or does every executor only deal with these partitions and read them directly from HDFS? Also, when are the partitions created? On RDD creation?
Partitions are configurable provided the RDD is key-value based.
There are 3 main properties of partitions:
Tuples in the same partition are guaranteed to be on the same machine.
Each node in a cluster can contain more than one partition.
The total number of partitions is configurable; by default it is set to the total number of cores on all the executor nodes.
Spark supports two types of partitioning:
Hash Partitioning
Range Partitioning
When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read this file.
When you call rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in rdd to the x partitions you want to have; the data is redistributed on a round-robin basis.
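Since hash and range partitioning apply to key-value RDDs, here is a brief sketch; it assumes an existing SparkSession spark, and the data and partition counts are arbitrary.
import org.apache.spark.{HashPartitioner, RangePartitioner}

// A key-value RDD: (word, count) pairs
val pairs = spark.sparkContext
  .parallelize(Seq(("spark", 1), ("hdfs", 2), ("kafka", 3), ("hive", 4)))

// Hash partitioning: key.hashCode determines the target partition
val hashed = pairs.partitionBy(new HashPartitioner(4))
println(hashed.partitioner)       // Some(org.apache.spark.HashPartitioner@...)
println(hashed.getNumPartitions)  // 4

// Range partitioning: keys are split into sorted, contiguous ranges
val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))
println(ranged.getNumPartitions)  // 2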
Please see more details here and here
Your RDD has rows in it. If it is a text file, it has lines separated by \n.
Those rows get divided into partitions across different nodes in the Spark cluster.
