Is it possible to do repartition after using partitionBy in a spark DF? - apache-spark

I am asking this question because if I specify repartition as 5, then all my data (>200 GB) is moved to 5 executors and 98% of the resources sit unused. Then the partitionBy happens, which again creates a lot of shuffle. Is there a way to have the partitionBy happen first and then run repartition on the data?

Although the question is not entirely easy to follow, the approach below aligns with the other answer and should avoid the unnecessary shuffling mentioned:
val n = [... some calculation for number of partitions / executors based on cluster config and volume of data to process ...]
df.repartition(n, $"field_1", $"field_2", ...)
  .sortWithinPartitions("field_x", "field_y")
  .write
  .partitionBy("field_1", "field_2", ...)
  .format("parquet")                 // or whatever output format you use
  .save("/path/to/output/location")
where [field_1, field_2, ...] is the same set of fields for both repartition and partitionBy.

You can use repartition(5, col("$colName")).
Then, when you do partitionBy("$colName") on write, the shuffle for '$colName' is skipped, since the data has already been repartitioned on it.
Also consider using as many partitions as the number of executors times the number of cores per executor times 3 (this factor may vary between 2 and 4).
As we know, Spark can only run 1 concurrent task per partition of an RDD. Assuming you have 8 cores per executor and 5 executors:
you would want 8 * 5 * 3 = 120 partitions.
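As a sketch of how those two pieces fit together (the executor/core counts, the column name colName, and the output path below are placeholders, not values from the question):

import org.apache.spark.sql.functions.col

// Rough sketch: derive a partition count from the rule of thumb above,
// pre-shuffle on the write key, then write partitioned by that same key.
val executors = 5
val coresPerExecutor = 8
val factor = 3                                              // commonly 2 to 4
val numPartitions = executors * coresPerExecutor * factor   // 120

df.repartition(numPartitions, col("colName"))  // single shuffle, on the write key
  .write
  .partitionBy("colName")                      // no second shuffle on colName
  .parquet("/path/to/output")                  // placeholder output path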

Related

How to create partitions for two dataframes while couple of partitions can be located on the same instance/machine on Spark?

We have two DataFrames: df_A, df_B
Let's say both have a huge number of rows, and we need to partition them.
How can we partition them as matching couples?
For example, partition number is 5:
df_A partitions: partA_1, partA_2, partA_3, partA_4, partA_5
df_B partitions: partB_1, partB_2, partB_3, partB_4, partB_5
If we have 5 machines:
machine_1: partA_1 and partB_1
machine_2: partA_2 and partB_2
machine_3: partA_3 and partB_3
machine_4: partA_4 and partB_4
machine_5: partA_5 and partB_5
If we have 3 machines:
machine_1: partA_1 and partB_1
machine_2: partA_2 and partB_2
machine_3: partA_3 and partB_3
...(then, as machines free up)...
machine_1: partA_4 and partB_4
machine_2: partA_5 and partB_5
Note: If one of the DataFrames is small enough, we can use the broadcast technique.
What should we do (how should we partition) when both (or more than two) DataFrames are large?
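For the broadcast case mentioned in the note above, a minimal sketch (the join key "id" is a placeholder) could look like:

import org.apache.spark.sql.functions.broadcast

// Ship the smaller frame to every executor so the larger one is not shuffled for the join.
val joined = df_A.join(broadcast(df_B), Seq("id"))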
I think we need to take a step back here and look only at the large-size aspect, not at broadcast.
Spark is a framework that manages the co-location of DataFrame partitions for your app, taking into account resources allocated vs. resources available and the type of Action, and thus whether Workers need to acquire partitions for processing.
Repartitions are Transformations. Only when an Action, such as a write:
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
occurs do things kick in.
If you have a JOIN, Spark will work out whether re-partitioning and data movement are required.
That is to say, if you join both DataFrames on c1, re-partitioning will most likely occur on the c1 column, so that rows with the same c1 value from both DataFrames are shuffled to the same nodes, where a free Executor is waiting to serve that JOIN of 2 or more partitions.
That only happens when an Action is invoked; this way, if you apply unnecessary Transformations, Catalyst can eliminate them.
Also, on the number of partitions to use, this is a good link imho: spark.sql.shuffle.partitions of 200 default partitions conundrum
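To make the JOIN case above concrete, here is a minimal sketch (dfA, dfB and the column c1 are placeholders); explain() prints the plan Spark will run when an Action is eventually invoked:

// Both sides get hash-partitioned on the join key before the join executes;
// look for "Exchange hashpartitioning(c1, ...)" in the printed plan.
val joined = dfA.join(dfB, Seq("c1"))
joined.explain()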

spark creating num of partitions in RDD more than the data size

I am a noob and learning PySpark now. My question about RDDs is: what happens when we try to create more partitions than the data size? E.g.,
data = sc.parallelize(range(5), numSlices=8)
I understand the intention of partitions is to use the CPU cores of a cluster effectively, and that making partitions too small adds more scheduling overhead than benefit from distributed computing. What I am curious about is: does Spark still create 8 partitions here, or does it optimize to the number of cores? If it does create 8 partitions, is there data replication in each partition?
My question about RDD is what happens when we try to create more partitions than the data size
You can easily see how many partitions a given RDD has by using data.getNumPartitions. I tried creating the RDD you mentioned and running this command, and it showed 8 partitions: 4 partitions had one number each and the other 4 were empty.
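For reference, a quick way to check this yourself in the Scala shell (a small sketch mirroring the question's RDD):

val data = sc.parallelize(0 until 5, 8)   // 5 elements, 8 requested slices
println(data.getNumPartitions)            // prints 8 -- empty partitions are kept
// glom() turns each partition into an array so its contents can be inspected
data.glom().collect().foreach(p => println(p.mkString("[", ",", "]")))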
If it's creating 8 partitions then there is data replication in each partition?
You can try the following code and check the executor output to see how many records there are in each partition. Note the print statement in the code below; the API requires the function to return an iterator, so each element is returned multiplied by 2.
data.mapPartitionsWithIndex((x, y) => {
  val recs = y.toList   // materialize first: calling .length on the iterator directly would consume it
  println(s"partitions $x has ${recs.length} records")
  recs.map(_ * 2).iterator
}).collect.foreach(println)
I got the following partition counts in the output for the above code -
partitions 0 has 0 records
partitions 1 has 1 records
partitions 2 has 0 records
partitions 3 has 1 records
partitions 4 has 0 records
partitions 5 has 1 records
partitions 6 has 0 records
partitions 7 has 1 records
I am curious about is does spark still create 8 partitions here or optimize it to the number of cores?
The number of partitions defines how much data you want Spark to process in one task. If there are 8 partitions and 4 virtual cores, Spark will start running 4 tasks (corresponding to 4 partitions) at once. As these tasks finish, it will schedule the remaining ones on those cores.

Why is Drill fastest with one partition?

My cluster has 6 nodes, each with 2 cores. I have a Spark job saving a Parquet file of ~150 MB to HDFS. If I repartition my DataFrame to 6 partitions before saving, Drill queries are actually 30-40% slower than when I repartition it to 1 partition. Why is that? Is it expected? Can it indicate an issue with my setup?
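For context, the two write variants being compared presumably look roughly like this (a sketch; the paths are placeholders):

// Single output file vs. six files, written before Drill queries the data.
df.repartition(1).write.mode("overwrite").parquet("hdfs:///data/out_1_part")
df.repartition(6).write.mode("overwrite").parquet("hdfs:///data/out_6_parts")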
Update
Results of the same SQL query in seconds (3 runs per number of partitions):
1 partition: 1.238, 1.29, 1.404
2 partitions: 1.286, 1.175, 1.259
3 partitions: 1.699, 1.8, 1.7
6 partitions: 2.223, 1.96, 1.772
12 partitions: 1.311, 1.335, 1.339
24 partitions: 1.261, 1.302, 1.235
48 partitions: 1.664, 1.757, 2.133
As you can see, 1, 2, 12 and 24 partitions are fast, while 3, 6 and 48 partitions are clearly slower. What could be causing that?
When saving the Parquet file in Spark with a single partition, you write the whole file to a single node. Once that happens, HDFS replication needs to kick in and distribute the file over the other nodes.
When saving the Parquet file in Spark with multiple partitions, Spark writes the file already distributed, although perhaps not exactly the way HDFS needs it. Replication and re-distribution still need to kick in, but now in a much more complex situation.
Then, depending on your Spark process, the data may already be sorted differently (1 vs. multiple partitions), potentially making it more suitable for the next process in line (Drill).
I really cannot pinpoint a reason, but with such a small difference in time (you are talking about seconds), I am not sure the difference is even significant.
We might also need to question the test method: Java garbage collection, background processes running (including the replication process), and so on.
One suggestion I would have is to leave your HDFS cluster at rest for a while, so that replication and other processes quiet down, before you start the Drill queries.

When should I repartition an RDD?

I know that I can repartition an RDD to increase its partitions and use coalesce to decrease them. I have two questions about this that I cannot completely understand after reading different resources.
Spark uses a sensible default (1 partition per block, which is 64 MB in earlier HDFS versions and 128 MB now) when generating an RDD, but I also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here comes the question:
How many partitions should I use for a given file? For example, suppose I have a 10 GB .parquet file and 3 executors with 2 cores and 3 GB of memory each.
Should I repartition? How many partitions should I use? What is the better way to make that choice?
Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.
For example :
val rdd = sc.textFile("file.txt", 5)
The above line of code creates an RDD named rdd from file.txt, with 5 partitions.
Suppose you have a cluster with 4 cores and that each partition takes 5 minutes to process. For the above RDD with 5 partitions, 4 partition tasks run in parallel, as there are 4 cores, and the 5th partition is processed after 5 minutes, when one of the 4 cores becomes free.
The entire processing completes in 10 minutes, and while the 5th partition is being processed, the remaining 3 cores sit idle.
The best way to decide on the number of partitions of an RDD is to make the number of partitions equal to the number of cores in the cluster, so that all partitions are processed in parallel and the resources are used in an optimal way.
Question : Are all data types (ie .txt, .parquet, etc..) repartitioned by default if no partitioning is provided?
Every RDD has a default number of partitions.
To check, you can use rdd.partitions.length right after the RDD is created.
To use the existing cluster resources optimally and to speed things up, we have to consider re-partitioning, so that all cores are utilized and all partitions hold enough records, uniformly distributed.
For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
Note: there is no fixed formula for this. The general convention most people follow is
(number of executors * number of cores per executor) * factor (where the factor is typically 2 to 3)
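A small sketch of that convention (assuming a SparkContext sc and an existing rdd, e.g. in spark-shell; the factor is a tunable guess, not a fixed rule):

// defaultParallelism is roughly the total number of cores given to the application.
val factor = 2                                   // commonly 2 to 3
val target = sc.defaultParallelism * factor
val balanced = rdd.repartition(target)
println(s"partitions: ${balanced.partitions.length}")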

Why join in spark in local mode is so slow?

I am using Spark in local mode and a simple join is taking too long. I fetched two DataFrames: A (8 columns and 2.3 million rows) and B (8 columns and 1.2 million rows), joined them using A.join(B, condition, 'left'), and called an action at the end. It creates a single job with three stages: one for the extraction of each DataFrame and one for the join. Surprisingly, the stage extracting DataFrame A takes around 8 minutes, the one for DataFrame B takes 1 minute, and the join happens within seconds. My important configuration settings are:
spark.master local[*]
spark.driver.cores 8
spark.executor.memory 30g
spark.driver.memory 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 16
The only executor is the driver itself. While extracting the DataFrames, I partitioned them into 32 parts (also tried 16, 64, 50, 100, 200). I saw 100 MB of shuffle write for the stage extracting DataFrame A. So, to avoid the shuffle, I made 16 initial partitions for both DataFrames and broadcast DataFrame B (the smaller one) using the broadcast(B) syntax, but it is not helping; there is still shuffle write. Am I doing something wrong? Why is shuffling still happening? Also, the event timeline shows that only four cores are processing at any point in time, although I have a 2-core * 4 processor machine. Why is that?
In short, "Join"<=>Shuffling, the big question here is how uniformly are your data distributed over partitions (see for example https://0x0fff.com/spark-architecture-shuffle/ , https://www.slideshare.net/SparkSummit/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning and just Google the problem).
A few possibilities to improve efficiency:
think more about your data (A and B) and partition it wisely;
analyze whether your data are skewed;
go into the UI and look at the task timings;
choose join keys for partitioning so that during the "join" only a few partitions of dataset A shuffle with a few partitions of B (see the sketch below).
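As a sketch of that last point (the column name "key" and the partition count are placeholders): pre-partitioning both frames on the join key, with the same number of partitions, lets the join pair up co-partitioned data instead of reshuffling everything:

import org.apache.spark.sql.functions.col

val aByKey = A.repartition(16, col("key"))
val bByKey = B.repartition(16, col("key"))
// With matching hash partitioning on the join key on both sides,
// the join itself should not need another full shuffle of A and B.
val joined = aByKey.join(bByKey, Seq("key"), "left")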
