How can repartitioning a data frame on a frequently used filter column be helpful in Spark? - apache-spark

I have been watching some materials here and there about repartitioning and coalescing Spark data frames. Some say repartitioning can improve performance if it is done on a frequently filtered column. I don't understand why that is. I know my question isn't specific, since the video I watched didn't elaborate and I couldn't get any responses.
Is that because filtering will result in fewer partitions?
Any insight will be welcomed.

No matter what transformation you are doing, if you do it on a partitioned column it will be faster, because it allows the engine to know where each piece of data lives. Therefore the filtering will be faster, because it does not have to scan all your data: it can directly "delete" or "select" only the rows you are interested in.
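A minimal sketch of that idea (the events DataFrame, its input path and its country column are hypothetical, not from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("repartition-then-filter").getOrCreate()

// Hypothetical input: a table that is filtered on `country` very often.
val events = spark.read.parquet("/data/events")

// Repartition so that all rows for a given country end up in the same partition...
val byCountry = events.repartition(col("country"))

// ...which is the situation the answer above argues makes this filter cheaper.
val frOnly = byCountry.filter(col("country") === "FR")
frOnly.count()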

I wouldn't recommend it. For instance:
Assume that, on an initial dataset, you repartition on a column and then filter on the same column. This creates a situation where many partitions have 0 records after the filter. As your pipeline moves ahead, the tasks for these partitions finish within 0-0.2 seconds, while a small number of tasks (mapping to the partitions that still hold data) actually work on the data, making the whole pipeline slower.
Rather, I would want the filtered data to be present in all the partitions, so that all the executor cores are utilised working on the data.
E.g.:
Dataset
id|category|partition
01|A       |1
02|A       |1
03|B       |1
04|B       |1
05|C       |2
06|C       |2
07|D       |2
08|D       |2
09|E       |2
10|E       |2
Assume that you now repartition on category:
id|category|partition
01|A       |1
02|A       |1
03|B       |2
04|B       |2
05|C       |3
06|C       |3
07|D       |4
08|D       |4
09|E       |5
10|E       |5
Now, when you filter on a single category (say B, which now lives only in partition 2), you have 0 records in most partitions. Tasks further down your pipeline that run on these partitions essentially do no work, and those executor cores are wasted.
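A small sketch of the comparison, reusing the df / category names from the example above (hedged; this is not the asker's actual code, and spark is assumed to be the SparkSession):
import org.apache.spark.sql.functions.col

// What the answer argues against: one category per partition, so filtering
// a single category leaves every other partition empty.
val skewed = df.repartition(col("category")).filter(col("category") === "B")

// What it prefers: filter first, then spread the surviving rows across all
// partitions so every executor core has work in the later stages.
val balanced = df
  .filter(col("category") === "B")
  .repartition(spark.sparkContext.defaultParallelism)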
Hope this makes sense.

Related

Spark starting job with all workers, but at some point using only 1 executor in a single worker when doing count()

I have a DF which is partitioned and relatively small.
I try to do a simple count().
At the start, all workers and executors participate in the task, but at some point in the job only 1 worker and 1 core are working, even though the data is distributed in a balanced way among the workers.
I have tried coalescing to 1 and also repartitioning to 2 * the number of cores, still no effect: no matter what kind of action I run on this DF, it always starts with all workers and ends up working on only a single one.
I'd appreciate it if anyone has any idea what could be wrong.
Information on the DF:
Total Count:
13065
Partitions:
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 9| 5557|
| 10| 62|
| 11| 167|
| 0| 128|
| 1| 83|
| 2| 110|
| 3| 129|
| 4| 131|
| 5| 78|
| 6| 6429|
| 7| 39|
| 8| 152|
+--------------------+-----+
Screenshots from the application master (DAG, event timeline, tasks) are attached to the question.
A task always takes place on a single executor: it never gets chopped up into pieces and distributed since it is the most atomic bit of work a Spark executor does.
By looking at the image with the table of tasks we see 12 tasks (are these all the tasks in your stage or are there more?): most of them take <1s but then there is one that takes 4.6min.
Interesting observation indeed! It makes sense that if you have a task that takes MUCH longer than all of the other tasks, you can end up with a single executor calculating alone at the end.
So your problem is really a data-skew question: why does this single task take so much longer than the other ones? To answer this, we don't have enough information from your question. I would start by looking at the following:
As you can see in the screenshot of your stage, the stage reads from a ShuffledRowRDD, which means it is partitioned by something. By what is it partitioned? How many shuffled partitions does it read? What is the value of your spark.sql.shuffle.partitions configuration?
In the task table, you can see that you have a GC time of 1s (WAY larger than for the other tasks). Are you counting large objects? Is it possible that the difference in size of your objects can be really big?
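A sketch of those checks (assuming the DataFrame being counted is called df and spark is your SparkSession; neither name comes from the question):
import org.apache.spark.sql.functions.spark_partition_id

// How many partitions does a shuffle produce by default?
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Row count per partition (this is also how a table like the one in the
// question can be produced); one huge partition confirms data skew.
df.groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()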
Hope this helps you in figuring this one out!

Spark partitionBy | save by column value rather than columnName={value}

I am using scala and spark, my spark version is 2.4.3
My dataframe looks like this; there are other columns which I have not included and which are not relevant.
+-----------+---------+---------+
|ts_utc_yyyy|ts_utc_MM|ts_utc_dd|
+-----------+---------+---------+
|       2019|       01|       20|
|       2019|       01|       13|
|       2019|       01|       12|
|       2019|       01|       19|
|       2019|       01|       19|
+-----------+---------+---------+
Basically I want to store the data in a folder structure like
2019/01/12/data
2019/01/13/data
2019/01/19/data
2019/01/20/data
I am using the following code snippet:
df.write
.partitionBy("ts_utc_yyyy","ts_utc_MM","ts_utc_dd")
.format("csv")
.save(outputPath)
But the problem is that the data is stored with the column name in the folder name, like below.
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=13/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=19/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=20/data
How do I save without the column name in the folder name?
Thanks.
This is the expected behaviour. Spark uses Hive partitioning so it writes using this convention, which enables partition discovery, filtering and pruning. In short, it optimises your queries by ensuring that the minimum amount of data is read.
Spark isn't really designed for the output you need. The easiest way for you to solve this is to have a downstream task that will simply rename the directories by splitting on the equals sign.
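A hedged sketch of that downstream rename step, assuming the output is on an HDFS-compatible filesystem and that spark and outputPath are the values from the snippet above:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Recursively rename "column=value" directories to just "value",
// e.g. ts_utc_yyyy=2019/ts_utc_MM=01 becomes 2019/01.
def stripColumnNames(dir: Path): Unit =
  fs.listStatus(dir)
    .filter(s => s.isDirectory && s.getPath.getName.contains("="))
    .foreach { child =>
      stripColumnNames(child.getPath)                  // fix nested levels first
      val value = child.getPath.getName.split("=").last
      fs.rename(child.getPath, new Path(dir, value))
    }

stripColumnNames(new Path(outputPath))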

transform GroupBy+aggregate to groupByKey

I work on a DF that looks like this:
+-------+-------------+
|A      |B            |
+-------+-------------+
|1      |"foo"        |
|1      |"bar"        |
|1      |"foobar"     |
|2      |"bar"        |
|2      |"foo"        |
+-------+-------------+
and I want to transform it into something like this:
+-------+-----------------+
|A      |B                |
+-------+-----------------+
|1      |"foo/bar/foobar" |
|2      |"bar/foo"        |
+-------+-----------------+
So, I wrote this code to do so :
df.groupby("A")
.agg(concat_ws("/", collect_list(col("B"))))
.collect()
However, since I work on a large DF, groupBy + agg is not that good and does a lot of shuffling. I did some research and found that reduceByKey could be better (less shuffling). So, my question is: how can I replace groupBy + agg with reduceByKey?
Thank you !
You shouldn't replace it. groupBy in Spark SQL is not the same as groupByKey in Spark Core; it is a more complex operation.
In Spark SQL, groupBy just adds a node to the query plan. How it is executed is decided when the logical plan is transformed into a physical plan, and Spark optimizes the grouping as much as it can at that point.
So, for now: use groupBy + agg when you can; it is the fastest solution in most cases.
One case where Spark SQL is less efficient is tree aggregation - currently there is no such API in Spark SQL, and Spark Core is faster when you need treeAggregate. However, the community is working on tree aggregation for Datasets and DataFrames as well.
As @user8371915 mentioned in the comment, in your case there is nothing to reduce - groupBy will work much the same as RDD.groupByKey, because collecting a list means every value has to be kept rather than combined. The key point is still the same, though: Spark SQL's groupBy will choose how to do the grouping.
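For completeness, this is roughly what the RDD version mentioned above would look like (a sketch only, assuming column A is an integer; it is not recommended over groupBy + agg):
// The RDD groupByKey equivalent of the groupBy + collect_list snippet above.
val grouped = df.rdd
  .map(row => (row.getAs[Int]("A"), row.getAs[String]("B")))
  .groupByKey()
  .mapValues(_.mkString("/"))

grouped.collect().foreach(println)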

How to sort within partitions (and avoid sort across the partitions) using RDD API?

It is the default behavior of the Hadoop MapReduce shuffle to sort the shuffle key within each partition, but not across partitions (it is total ordering that makes the keys sorted across partitions).
I would like to ask how to achieve the same thing using a Spark RDD (sort within each partition, but not across partitions).
The RDD sortByKey method does a total ordering.
The RDD repartitionAndSortWithinPartitions method sorts within partitions but not across them; unfortunately, it adds an extra step to do the repartition.
Is there a direct way to sort within partitions but not across partitions?
You can use a Dataset and the sortWithinPartitions method:
import spark.implicits._
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
.toDF("text")
.sortWithinPartitions($"text")
.show
+----+
|text|
+----+
| d|
| e|
| f|
| a|
| b|
| c|
+----+
In general, the shuffle is an important factor when sorting partitions, because Spark reuses shuffle structures to sort without loading all the data into memory at once.
I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition.
Since they give you a Scala Iterator, you could use it.toSeq and then apply any of the sorting methods of Seq, e.g. sortBy or sortWith or sorted.
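A sketch of that idea on a plain RDD, reusing the data from the first answer:
val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

// Sort each partition in isolation: no shuffle, no ordering across partitions.
val sortedWithin = rdd.mapPartitions(iter => iter.toArray.sorted.iterator)

// glom() turns each partition into an array so the per-partition order is visible.
sortedWithin.glom().collect().foreach(p => println(p.mkString(", ")))
// Prints "d, e, f" and "a, b, c" for the 2 partitions above.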

Describe on Dataframe is not displaying the complete resultset

I am using Spark 1.6 with Scala. describe() on a data frame is not displaying the column headers and the values. Please see below:
val data=sc.textFile("/tmp/sample.txt")
data.toDF.describe().show
This gives the below result:
+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    max|
+-------+
Please let me know why it is not displaying the entire result set.
I think you just need to use the show method.
sc.textFile("/tmp/sample.txt").toDF.show
As far as displaying the complete data set goes, be careful: you will need to collect the results on the driver to do this. You may want to consider using take instead if the file is large.
val data = sc.textFile("/tmp/sample.txt").toDF
data.collect.foreach(println)
or
data.take(100).foreach(println)
This is because Spark 1.6 considered every field as a String by default, and it does not provide summary stats on String columns. In Spark 2.1, the columns are correctly inferred as their respective data types (Int/String/Double etc.), and the summary stats include all the columns in the file rather than being restricted to the numerical fields.
I feel df.describe() works more elegantly in Spark 2.1 than in Spark 1.6.
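As a hedged illustration of the Spark 2.x behaviour (the file path and layout are assumptions carried over from the question):
// With inferSchema, numeric columns come back as Int/Double, so describe()
// returns actual count/mean/stddev/min/max values instead of an empty summary.
val df = spark.read
  .option("inferSchema", "true")
  .csv("/tmp/sample.txt")

df.describe().show()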
