I work on a DataFrame that looks like this:
+-------+-------------+
|A      |B            |
+-------+-------------+
|1      |"foo"        |
|1      |"bar"        |
|1      |"foobar"     |
|2      |"bar"        |
|2      |"foo"        |
+-------+-------------+
and I want to transform it to something like this:
+-------+-----------------+
|A      |B                |
+-------+-----------------+
|1      |"foo/bar/foobar" |
|2      |"bar/foo"        |
+-------+-----------------+
So, I wrote this code to do so:
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

df.groupBy("A")
  .agg(concat_ws("/", collect_list(col("B"))))
  .collect()
However, since I work on a large DataFrame, groupBy + agg is not that good and does a lot of shuffling. I did some research and found that reduceByKey could be better (less shuffling). So, my question is: how can I replace groupBy + agg with reduceByKey?
Thank you!
You shouldn't replace it. groupBy in Spark SQL is not the same as groupByKey in Spark Core; it is a more complex operation.
In Spark SQL, groupBy just adds a node to the query plan. How it will actually be executed is decided during the transformation from the logical plan to the physical plan, and Spark will optimize the grouping as much as it can.
So, for now: use groupBy + agg when you can; it's the fastest solution in most cases.
One case where Spark SQL is less efficient is treeAggregate: there is currently no such API in Spark SQL, and Spark Core is faster when you need tree aggregation (see the sketch below). However, the community is also working on tree aggregation for Datasets and DataFrames.
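For reference, this is roughly what RDD-level tree aggregation looks like. A toy sketch (a made-up numeric RDD, purely illustrative):
// Sum one million integers using a tree-shaped reduction.
// treeAggregate merges partial results over several levels instead of sending
// every partition's partial result straight to the driver.
val nums = sc.parallelize(1 to 1000000, 100)

val total = nums.treeAggregate(0L)(
  (acc, x) => acc + x, // seqOp: fold a value into the partition-local accumulator
  (a, b) => a + b,     // combOp: merge accumulators from different partitions
  2                    // depth of the aggregation tree
)
// total == 500000500000L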
As @user8371915 mentioned in the comments, in your case there is nothing to reduce: collect_list has to keep every value, so groupBy will work much the same as RDD.groupByKey, because the values cannot be partially aggregated. However, the key point is still the same: Spark SQL's groupBy will choose how to do the grouping.
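For completeness, if you did want to see the RDD-level reduceByKey version the question asks about, a minimal sketch (assuming column A is an Int and B a String, as in the example) could look like this. Note that reduceByKey gives no guarantee on concatenation order within a key, so "foo/bar/foobar" might come out as "bar/foobar/foo":
import spark.implicits._   // `spark` is the active SparkSession

// Illustrative only: map each row to a (key, value) pair, then reduce by key.
val result = df.rdd
  .map(row => (row.getAs[Int]("A"), row.getAs[String]("B")))
  .reduceByKey((left, right) => left + "/" + right)
  .toDF("A", "B")

result.show()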
I have been watching some materials here and there about repartition and coalesce on Spark DataFrames. Some said that repartitioning can improve performance if it is done on the column you later filter on, and I don't understand why that is. I know my question isn't specific, since the video I watched didn't elaborate and I couldn't get any responses.
Is that because filtering will result in fewer partitions?
Any insight will be welcomed.
No matter what transformation you are doing, if you do it on a partitioned column it will be faster, because Spark knows where each piece of data lives. Filtering, in particular, will be faster because it does not have to scan all your data: it can directly skip or select only the rows you are interested in.
I wouldn't recommend it. For instance:
Assume that, on an initial dataset, you repartition on a column and then filter on the same column. This creates a situation where many partitions will have 0 records after the filter. As your pipeline moves ahead, the tasks for these empty partitions finish within 0-0.2 seconds, while a small number of tasks (those mapped to the partitions that do have data) do all the actual work, making the whole pipeline slower.
Rather, I would want the filtered data to be spread across all the partitions, so that all the executor cores are utilised when working on the data.
E.g.:
Dataset:
id|category|partition
01|A       |1
02|A       |1
03|B       |1
04|B       |1
05|C       |2
06|C       |2
07|D       |2
08|D       |2
09|E       |2
10|E       |2
Assume that you now repartition on category:
id|category|partition
01|A       |1
02|A       |1
03|B       |2
04|B       |2
05|C       |3
06|C       |3
07|D       |4
08|D       |4
09|E       |5
10|E       |5
Now, when you filter on a single category (say the one that landed in partition 2), you have 0 records in most partitions. Tasks further in your pipeline on these empty partitions essentially do no work, and those executor cores are wasted.
Hope this makes sense.
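If you want to see this effect, here is a small sketch (the toy data mirrors the example above) that repartitions on category, filters down to one category, and then counts the rows left in each partition; most partitions report 0:
import spark.implicits._   // `spark` is the active SparkSession

// Toy dataset mirroring the example above.
val ds = Seq(
  ("01", "A"), ("02", "A"), ("03", "B"), ("04", "B"), ("05", "C"),
  ("06", "C"), ("07", "D"), ("08", "D"), ("09", "E"), ("10", "E")
).toDF("id", "category")

// Repartition on category, then filter down to a single category.
val filtered = ds.repartition($"category").filter($"category" === "C")

// Count the rows in each partition: almost every partition will report 0.
filtered.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }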
I am using Scala and Spark; my Spark version is 2.4.3.
My DataFrame looks like this (there are other columns which I have not included and which are not relevant):
+-----------+---------+---------+
|ts_utc_yyyy|ts_utc_MM|ts_utc_dd|
+-----------+---------+---------+
|2019 |01 |20 |
|2019 |01 |13 |
|2019 |01 |12 |
|2019 |01 |19 |
|2019 |01 |19 |
+-----------+---------+---------+
Basically I want to store the data in a date-bucketed folder structure like:
2019/01/12/data
2019/01/13/data
2019/01/19/data
2019/01/20/data
I am using the following code snippet:
df.write
  .partitionBy("ts_utc_yyyy", "ts_utc_MM", "ts_utc_dd")
  .format("csv")
  .save(outputPath)
But the problem is that it is getting stored along with the column name in the folder name, like below:
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=13/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=19/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=20/data
How do I save it without the column name in the folder name?
Thanks.
This is the expected behaviour. Spark uses Hive partitioning so it writes using this convention, which enables partition discovery, filtering and pruning. In short, it optimises your queries by ensuring that the minimum amount of data is read.
Spark isn't really designed for the output you need. The easiest way for you to solve this is to have a downstream task that will simply rename the directories by splitting on the equals sign.
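For example, a minimal sketch of such a cleanup step using the Hadoop FileSystem API (the three-level depth and outputPath come from the question; treat this as an illustration, not a drop-in solution):
import org.apache.hadoop.fs.{FileSystem, Path}

// Rename "column=value" directories to just "value", recursing through the
// three partition levels (yyyy/MM/dd) written above.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

def stripColumnNames(dir: Path, depth: Int): Unit = {
  if (depth > 0) {
    fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
      val name    = status.getPath.getName                  // e.g. "ts_utc_yyyy=2019"
      val newName = name.substring(name.indexOf('=') + 1)   // e.g. "2019"
      val target  = new Path(dir, newName)
      fs.rename(status.getPath, target)
      stripColumnNames(target, depth - 1)
    }
  }
}

stripColumnNames(new Path(outputPath), depth = 3)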
I am wondering how the HAVING clause works in Spark SQL without GROUP BY or any aggregate function.
1) Can we rely on HAVING without an aggregate function?
2) Is there any other way to filter on columns that are generated at that SELECT level?
I have tried executing the Spark SQL below and it works fine, but can we rely on this?
spark.sql("""
select 1 as a having a=1
""").show()
spark.sql("""
select 1 as a having a=2
""").show()
+---+
| a|
+---+
| 1|
+---+
+---+
| a|
+---+
+---+
In some databases/engines, when GROUP BY is not used in conjunction with HAVING, HAVING is treated as a WHERE clause.
Normally the WHERE clause is used.
I would not rely on HAVING without a GROUP BY.
To answer this: 1) Can we rely on HAVING without an aggregate function?
No, you cannot rely on this behavior. Treating HAVING without GROUP BY as a WHERE clause is no longer the default behavior as of Spark 2.4. But if you really want to use HAVING like a WHERE clause, you can get the old behavior back by setting the configuration spark.sql.legacy.parser.havingWithoutGroupByAsWhere = true.
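For example, a minimal sketch of enabling that legacy behavior for a session (the flag name is the one quoted above, available from Spark 2.4 onward):
// Restore the pre-2.4 behavior where HAVING without GROUP BY acts like WHERE.
spark.conf.set("spark.sql.legacy.parser.havingWithoutGroupByAsWhere", "true")

// With the flag on, this is parsed as: select 1 as a where a = 2
spark.sql("select 1 as a having a = 2").show()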
It is the Hadoop MapReduce shuffle's default behavior to sort the shuffle keys within each partition, but not across partitions (it is total ordering that makes keys sorted across partitions).
I would like to ask how to achieve the same thing with a Spark RDD (sort within a partition, but not across partitions).
The RDD sortByKey method does a total ordering.
The RDD repartitionAndSortWithinPartitions method sorts within partitions but not across partitions; unfortunately, it adds an extra repartition step.
Is there a direct way to sort within partitions but not across partitions?
You can use a Dataset and the sortWithinPartitions method:
import spark.implicits._
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show
+----+
|text|
+----+
| d|
| e|
| f|
| a|
| b|
| c|
+----+
In general, shuffle is an important factor in sorting partitions, because Spark can reuse shuffle structures to sort without loading all the data into memory at once.
I've never had this need before, but my first guess would be to use one of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition.
Since they give you a Scala Iterator, you could use it.toSeq and then apply any of the sorting methods of Seq, e.g. sortBy, sortWith or sorted, as in the sketch below.
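A minimal sketch of that idea on an RDD (the sample data matches the Dataset example above; elements just need an Ordering, here Strings):
// Sort each partition independently; partition boundaries are left untouched.
val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

val sortedWithin = rdd.mapPartitions(
  iter => iter.toSeq.sorted.iterator,
  preservesPartitioning = true
)

sortedWithin.glom().collect().foreach(p => println(p.mkString(", ")))
// Prints "d, e, f" and "a, b, c": each partition is sorted, with no global order.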
I am using Spark 1.6. describe() on a DataFrame is not displaying the column headers and the values. Please see below:
val data = sc.textFile("/tmp/sample.txt")
data.toDF.describe().show
This gives the result below. Please let me know why it is not displaying the entire result set:
+-------+
|summary|
+-------+
| count|
| mean|
| stddev|
| min|
| max|
+-------+
I think you just need to use the show method.
sc.textFile("/tmp/sample.txt").toDF.show
As far as displaying the complete DataFrame goes, be careful with this, as you will need to collect the results on the driver in order to do it. You may want to consider using take instead if the file is large.
val data = sc.textFile("/tmp/sample.txt").toDF
data.collect.foreach(println)
or
data.take(100).foreach(println)
This was because Spark 1.6 was treating every field as a String by default, and it does not provide summary stats on String columns. In Spark 2.1, the columns were correctly inferred as their respective data types (Int/String/Double, etc.), and the summary stats included all the columns in the file rather than being restricted to the numerical fields.
I feel df.describe() works more elegantly in Spark 2.1 than in Spark 1.6.
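If you are stuck on the older version, one workaround is to cast the columns to numeric types yourself before calling describe(). A sketch, assuming sample.txt is comma-separated with two numeric fields (the column names c1/c2 and the delimiter are made up for illustration):
import org.apache.spark.sql.functions.col
import sqlContext.implicits._   // in Spark 2.x: import spark.implicits._

// Parse the text file into a two-column DataFrame (all strings at this point).
val raw = sc.textFile("/tmp/sample.txt")
  .map(_.split(","))
  .map(a => (a(0), a(1)))
  .toDF("c1", "c2")

// Cast the string columns to numeric types so describe() can compute stats on them.
val typed = raw
  .withColumn("c1", col("c1").cast("int"))
  .withColumn("c2", col("c2").cast("double"))

typed.describe().show()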