Spark group by Key and partitioning the data - apache-spark

I have a large csv file with data in following format.
cityId1,name,address,.......,zip
cityId2,name,address,.......,zip
cityId1,name,address,.......,zip
........
cityIdN,name,address,.......,zip
I am performing following operation on the above csv file:
Group by cityId as key and list of resources as value
df1.groupBy($"cityId").agg(collect_list(struct(cols.head, cols.tail: _*)) as "resources")
Change it to jsonRDD
val jsonDataRdd2 = df2.toJSON.rdd
Iterate through each Partition and upload to s3 per key
I can not use dataframe partitionby write because of business logic constraints (how other services read from S3 )
My Questions:
What is the default size of a spark partition?
Let's say default size of partition is X MBs and there is one large record present in the dataFrame with key having Y MBs of data (Y > X) , what would happen in this scenario?
Do I need to worry about having the same key in different partitions in that case?

In answer to your questions:
When reading from secondary storage (S3, HDFS) the partitions are equal to block size of file system, 128MB or 256MB; but you can repartition RDDs immediately, not Data Frames. (For JDBC and Spark Structured Streaming the partitions are dynamic in size.)
When applying 'wide transformations' and re-partitioning the number and size of partitions most likely change. The size of a given partition has a maximum value. In Spark 2.4.x the partition size increased to 8GB. So, if any transformation (e.g. collect_list in combination with groupBy) gens more than this maximum size, you will get an error and the program aborts. So you need to partition wisely or in your case have sufficient number of partitions for aggregation - see spark.sql.shuffle.partitions parameter.
The parallel model for processing by Spark relies on 'keys' being allocated via hash, range partitioning, etc. being distributed to one and only one partition - shuffling. So, iterating through a partition foreachPartition, mapPartitions there is no issue.

Related

Would repartitioning in spark change number of spark partitions whilst reading data from cassandra source?

I am reading a table from cassandra table in spark. I have big partition in cassandra and when partition size of cassandra exceeds 64 MB , in that case cassandra partition is going to be equal to spark partition. Due to large partition I am getting memory issues in spark.
My question is if I do repartition at the beginning after reading data from cassandra, would number of spark partitions change ? and would it not lead to spark memory issues ?
My assumption is at very first place spark would read data from cassandra and hence at this stage cassandra large partition won't split due to repartition . Repartition will work on underlying data loaded from cassandra.
I am just wondering for answer if repartition could change data distribution when reading data from spark , rather than doing partitioning again ?
If you repartition your data using some arbitrary key then yes, it will be redistributed among the Spark partitions.
Technically, Cassandra partitions do not get split into Spark partitions when you retrieve the data but once you're done reading, you can repartition on a different key to break up the rows of a large Cassandra partition.
For the record, it doesn't avoid the memory issues of reading large Cassandra partitions in the first place because ​the default input split size of 64MB is just a notional target that Spark uses to calculate how many Spark partitions are required based on the estimated Cassandra table size and C* partition sizes. But since the calculation is based on estimates, the Spark partitions don't actually end up being 64MB in size.
If you are interested, I've explained in detail how Spark partitions are calculated in this post -- https://community.datastax.com/questions/11500/.
To illustrate with an example, let's say that based on the estimated table size and estimated number of C* partitions, each Spark partition is mapped to 200 token ranges in Cassandra.
For the first Spark partition, the token range might only contain 2 Cassandra partitions of size 3MB and 15MB so the actual size of the data in Sthe park partition is just 18MB.
But in the next Spark partition, the token range contains 28 Cassandra partitions that are mostly 1 to 4MB but there is one partition that is 56MB. The total size of this Spark partition ends up being a lot more than 64MB.
In these 2 cases, one Spark partition was just 18MB in size while the other is bigger than the 64MB target size. I've explained this issue in a bit more detail in this post -- https://community.datastax.com/questions/11565/. Cheers!

How is a Spark Dataframe partitioned by default?

I know that an RDD is partitioned based on the key values using the HashPartitioner. But how is a Spark Dataframe partitioned by default as it does not have the concept of key/value.
A Dataframe is partitioned dependent on the number of tasks that run to create it.
There is no "default" partitioning logic applied. Here are some examples how partitions are set:
A Dataframe created through val df = Seq(1 to 500000: _*).toDF() will have only a single partition.
A Dataframe created through val df = spark.range(0,100).toDF() has as many partitions as the number of available cores (e.g. 4 when your master is set to local[4]). Also, see remark below on the "default parallelism" that comes into effect for operations like parallelize with no parent RDD.
A Dataframe derived from an RDD (spark.createDataFrame(rdd, schema)) will have the same amount of partitions as the underlying RDD. In my case, as I have locally 6 cores, the RDD got created with 6 partitions.
A Dataframe consuming from a Kafka topic will have the amount of partitions matching with the partitions of the topic because it can use as many cores/slots as the topic has partitions to consume the topic.
A Dataframe created by reading a file e.g. from HDFS will have the amount of partitions matching them of the file unless individual files have to be splitted into multiple partitions based on spark.sql.files.maxPartitionBytes which defaults to 128MB.
A Dataframe derived from a transformation requiring a shuffle will have the configurable amount of partitions set by spark.sql.shuffle.partitions (200 by default).
...
One of the major disctinctions between RDD and Structured API is that you do not have as much control over the partitions as you have with RDDs where you can even define a custom partitioner. This is not possible with Dataframes.
Default Parallelism
The documentation of the Execution Behavior configuration spark.default.parallelism explains:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Size of a partition or RDD

How can we calculate the size of a partition in a RDD? Is it not a recommended to calculate the partition size ? I want to dynamically set the number of shuffle partition before I call any action, hence need to calculate the partition size and depending on the number of executors want to set the shuffle partition count.
"I want to dynamically set the number of shuffle partition before I call any action"
unfortunately that's challenging todo in spark without diving deep into the low level code. In fact this is something that adaptive execution in spark 3.0 is bringing to the table. What it will do is over partition the dataset and then dynamically combine small partitions to reach a certain threshold.
https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
you can get the RDD partition size using below command:
someRDD.partitions.size
you can use different methods of partitioning like:
based on the columns
based on the (ataset size)/(block size)
based on the cores available

How the number of partitions is decided by Spark when a file is read?

How the number of partitions is decided by Spark when a file is read ?
Suppose we have a 10 GB single file in a hdfs directory and multiple part files of total 10 GB volume a another hdfs location .
If these two files are read in two separate spark data frames what would be their number of partitions and based on what logic ?
Found the information in How to: determine partition
It says:
How is this number determined? The way Spark groups RDDs into stages is described in the previous post. (As a quick reminder, transformations like repartition and reduceByKey induce stage boundaries.) The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage. The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple exceptions: thecoalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents’ number of partitions, and cartesian creates an RDD with their product.
What about RDDs with no parents? RDDs produced by textFile or hadoopFile have their partitions determined by the underlying MapReduce InputFormat that’s used. Typically there will be a partition for each HDFS block being read. Partitions for RDDs produced by parallelize come from the parameter given by the user, or spark.default.parallelism if none is given.
When Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop InputFormat used to read this file. For instance, if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (but the split between partitions would be done on line split, not the exact block split), unless you have a compressed text file. In case of compressed file you would get a single partition for a single file (as compressed text files are not splittable).
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128MB) it would be stored in 79 blocks, which means that the RDD you read from this file would have 79 partitions.
Also, we can pass the number of partitions we want if we are not satisfied by the number of partitions provided by spark by default as shown below:
>>> rdd1 = sc.textFile("statePopulations.csv",10) // 10 is number of partitions

RDD and partition in Apache Spark

So, in Spark when an application is started then an RDD containing the dataset for the application (e.g. words dataset for WordCount) is created.
So far what I understand is that RDD is a collection of those words in WordCount and the operations that have been done to those dataset (e.g. map, reduceByKey, etc...)
However, afaik, Spark also has HadoopPartition (or in general: partition) which is read by every executor from HDFS. And I believe that an RDD in driver also contains all of these partitions.
So, what is getting divided among executors in Spark? Does every executor get those sub-dataset as a single RDD which contains less data compared to RDD in the driver or does every executor only deals with these partitions and read them directly from HDFS? Also, when are the partitions created? On the RDD creation?
Partitions are configurable provided the RDD is key-value based.
There are 3 main partition's property:
Tuples in the same partition are guaranteed to be in the same
machine.
Each node in a cluster can contain more than one partition.
The total number of partitions are configurable, by default it is
set to the total number of cores on all the executor nodes.
Spark supports two types of partitioning:
Hash Partitioning
Range Partitioning
When Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop InputFormat used to read this file.
When you call rdd.repartition(x) it would perform a shuffle of the data from N partitions you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
Please see more details here and here
Your RDD have rows in it. If it is a text file, it have lines separated by \n.
Those rows are getting divided into partitions across different nodes in Spark cluster.

Resources