Spark aggregate on multiple columns within partition without shuffle - apache-spark

I'm trying to aggregate a dataframe on multiple columns. I know that everything I need for the aggregation is within the partition- that is, there's no need for a shuffle because all of the data for the aggregation are local to the partition.
Taking an example, if I have something like
val sales=sc.parallelize(List(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5))).repartition(2)
val tdf = sales.map{ case (store, prod, amt, units) => ((store, prod), (amt, amt, amt, units)) }.
reduceByKey((x, y) => (x._1 + y._1, math.min(x._2, y._2), math.max(x._3, y._3), x._4 + y._4))
println(tdf.toDebugString)
I get a result like
(2) ShuffledRDD[12] at reduceByKey at Test.scala:59 []
+-(2) MapPartitionsRDD[11] at map at Test.scala:58 []
| MapPartitionsRDD[10] at repartition at Test.scala:57 []
| CoalescedRDD[9] at repartition at Test.scala:57 []
| ShuffledRDD[8] at repartition at Test.scala:57 []
+-(1) MapPartitionsRDD[7] at repartition at Test.scala:57 []
| ParallelCollectionRDD[6] at parallelize at Test.scala:51 []
You can see the MapPartitionsRDD, which is good. But then there's the ShuffledRDD, which I want to prevent, because I want the per-partition summarization, grouped by column values within the partition.
zero323's suggestion is tantalizingly close, but I need the "group by columns" functionality.
Referring to my sample above, I'm looking for the result that would be produced by
select store, prod, sum(amt), avg(units) from sales group by partition_id, store, prod
(I don't really need the partition id- that's just to illustrate that I want per-partition results)
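For illustration, the DataFrame form of that query would be roughly the following (salesDf here is an assumption: the same tuples converted with toDF("store", "prod", "amt", "units"); spark_partition_id is only there to show the per-partition shape, and this form still plans a shuffle):
import org.apache.spark.sql.functions.{avg, col, spark_partition_id, sum}
salesDf
  .groupBy(spark_partition_id().as("pid"), col("store"), col("prod"))
  .agg(sum("amt"), avg("units"))
  .show()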
I've looked at lots of examples but every debug string I've produced has the Shuffle. I really hope to get rid of the shuffle. I guess I'm essentially looking for a groupByKeysWithinPartitions function.

The only way to achieve that is by using mapPartitions and writing custom code to group and compute your values while iterating over the partition.
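For example, a minimal sketch of that idea on your original RDD of tuples, assuming each partition's distinct (store, prod) groups fit in memory:
import scala.collection.mutable

val perPartition = sales.mapPartitions { iter =>
  // group locally with a hash map so nothing leaves the partition
  val acc = mutable.Map.empty[(String, String), (Double, Double, Double, Int)]
  for ((store, prod, amt, units) <- iter) {
    val (sumAmt, minAmt, maxAmt, sumUnits) =
      acc.getOrElse((store, prod), (0.0, Double.MaxValue, Double.MinValue, 0))
    acc((store, prod)) = (sumAmt + amt, math.min(minAmt, amt), math.max(maxAmt, amt), sumUnits + units)
  }
  acc.iterator.map { case ((store, prod), (sumAmt, minAmt, maxAmt, sumUnits)) =>
    (store, prod, sumAmt, minAmt, maxAmt, sumUnits)
  }
}
// only a MapPartitionsRDD is added on top of the existing lineage; no new shuffle stage
println(perPartition.toDebugString)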
Since, as you mention, the data is already sorted by the grouping keys (store, prod), we can compute your aggregations efficiently in a pipelined fashion:
(1) Define helper classes:
:paste
case class MyRec(store: String, prod: String, amt: Double, units: Int)
case class MyResult(store: String, prod: String, total_amt: Double, min_amt: Double, max_amt: Double, total_units: Int)
object MyResult {
def apply(rec: MyRec): MyResult = new MyResult(rec.store, rec.prod, rec.amt, rec.amt, rec.amt, rec.units)
def aggregate(result: MyResult, rec: MyRec) = {
new MyResult(result.store,
result.prod,
result.total_amt + rec.amt,
math.min(result.min_amt, rec.amt),
math.max(result.max_amt, rec.amt),
result.total_units + rec.units
)
}
}
(2) Define pipelined aggregator:
:paste
def pipelinedAggregator(iter: Iterator[MyRec]): Iterator[Seq[MyResult]] = {
var prev: MyResult = null
var res: Seq[MyResult] = Nil
for (crt <- iter) yield {
// emit only the groups completed at this element; without this reset a finished
// group would be yielded again (and then duplicated by the flatMap) on every
// following row of the next group
res = Nil
if (prev == null) {
prev = MyResult(crt)
}
else if (prev.prod != crt.prod || prev.store != crt.store) {
// key changed: the previous group is complete, start a new one
res = Seq(prev)
prev = MyResult(crt)
}
else {
prev = MyResult.aggregate(prev, crt)
}
if (!iter.hasNext) {
// last row of the partition: also flush the group still in progress
res = res ++ Seq(prev)
}
res
}
}
(3) Run aggregation:
:paste
val sales = sc.parallelize(
List(MyRec("West", "Apple", 2.0, 10),
MyRec("West", "Apple", 3.0, 15),
MyRec("West", "Orange", 5.0, 15),
MyRec("South", "Orange", 3.0, 9),
MyRec("South", "Orange", 6.0, 18),
MyRec("East", "Milk", 5.0, 5),
MyRec("West", "Apple", 7.0, 11)), 2).toDS
sales.mapPartitions(iter => Iterator(iter.toList)).show(false)
val result = sales
.mapPartitions(recIter => pipelinedAggregator(recIter))
.flatMap(identity)
result.show
result.explain
Output:
+-------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------+
|[[West,Apple,2.0,10], [West,Apple,3.0,15], [West,Orange,5.0,15]] |
|[[South,Orange,3.0,9], [South,Orange,6.0,18], [East,Milk,5.0,5], [West,Apple,7.0,11]]|
+-------------------------------------------------------------------------------------+
+-----+------+---------+-------+-------+-----------+
|store| prod|total_amt|min_amt|max_amt|total_units|
+-----+------+---------+-------+-------+-----------+
| West| Apple| 5.0| 2.0| 3.0| 25|
| West|Orange| 5.0| 5.0| 5.0| 15|
|South|Orange| 9.0| 3.0| 6.0| 27|
| East| Milk| 5.0| 5.0| 5.0| 5|
| West| Apple| 7.0| 7.0| 7.0| 11|
+-----+------+---------+-------+-------+-----------+
== Physical Plan ==
*SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).store, true) AS store#31, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).prod, true) AS prod#32, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).total_amt AS total_amt#33, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).min_amt AS min_amt#34, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).max_amt AS max_amt#35, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).total_units AS total_units#36]
+- MapPartitions <function1>, obj#30: $line14.$read$$iw$$iw$MyResult
+- MapPartitions <function1>, obj#20: scala.collection.Seq
+- Scan ExternalRDDScan[obj#4]
sales: org.apache.spark.sql.Dataset[MyRec] = [store: string, prod: string ... 2 more fields]
result: org.apache.spark.sql.Dataset[MyResult] = [store: string, prod: string ... 4 more fields]
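One caveat worth adding as a sketch (not in the original answer): the pipelined aggregator relies on rows with the same (store, prod) being adjacent within each partition. If that ordering is not already guaranteed by how the data arrives, sortWithinPartitions can establish it locally, without adding an exchange (shuffle) to the plan:
// sorts each partition in place; the plan gains a local sort but no shuffle
val sortedSales = sales.sortWithinPartitions("store", "prod")
val sortedResult = sortedSales
  .mapPartitions(recIter => pipelinedAggregator(recIter))
  .flatMap(identity)
sortedResult.explain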

If this is the output you're looking for
+-----+------+--------+----------+
|store|prod |max(amt)|avg(units)|
+-----+------+--------+----------+
|South|Orange|6.0 |13.5 |
|West |Orange|5.0 |15.0 |
|East |Milk |5.0 |5.0 |
|West |Apple |3.0 |12.5 |
+-----+------+--------+----------+
then the Spark DataFrame API has all the functionality you're asking for, with concise shorthand syntax:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object TestJob2 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5)
).toDF("store", "prod", "amt", "units")
rawDf.show(false)
rawDf.printSchema
val aggDf = rawDf
.groupBy("store", "prod")
.agg(
max(col("amt")),
avg(col("units"))
// in case you need to retain more info
// , collect_list(struct("*")).as("horizontal")
)
aggDf.printSchema
aggDf.show(false)
}
}
Uncomment the collect_list line to also retain the original rows of each group:
+-----+------+--------+----------+---------------------------------------------------+
|store|prod  |max(amt)|avg(units)|horizontal                                         |
+-----+------+--------+----------+---------------------------------------------------+
|South|Orange|6.0     |13.5      |[[South, Orange, 3.0, 9], [South, Orange, 6.0, 18]]|
|West |Orange|5.0     |15.0      |[[West, Orange, 5.0, 15]]                          |
|East |Milk  |5.0     |5.0       |[[East, Milk, 5.0, 5]]                             |
|West |Apple |3.0     |12.5      |[[West, Apple, 2.0, 10], [West, Apple, 3.0, 15]]   |
+-----+------+--------+----------+---------------------------------------------------+
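For the exact aggregates in the question's SQL (sum(amt) and avg(units)) the same shape applies; a minimal sketch on the rawDf above:
val sumAvgDf = rawDf
  .groupBy("store", "prod")
  .agg(sum(col("amt")), avg(col("units")))
sumAvgDf.show(false)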

The maximum and average aggregations you specify are computed over multiple rows.
If you want to keep all the original rows, use a window function, which partitions by the grouping columns.
If you want to reduce the rows in each partition, you must specify reduction logic or a filter.
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
object TestJob7 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
sc.setLogLevel("ERROR")
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5)
).toDF("store", "prod", "amt", "units")
rawDf.show(false)
rawDf.printSchema
val storeProdWindow = Window
.partitionBy("store", "prod")
val aggDf = rawDf
.withColumn("max(amt)", max("amt").over(storeProdWindow))
.withColumn("avg(units)", avg("units").over(storeProdWindow))
aggDf.printSchema
aggDf.show(false)
}
}
Here is the result. Note that it is already grouped (the window shuffles the data into partitions):
+-----+------+---+-----+--------+----------+
|store|prod |amt|units|max(amt)|avg(units)|
+-----+------+---+-----+--------+----------+
|South|Orange|3.0|9 |6.0 |13.5 |
|South|Orange|6.0|18 |6.0 |13.5 |
|West |Orange|5.0|15 |5.0 |15.0 |
|East |Milk |5.0|5 |5.0 |5.0 |
|West |Apple |2.0|10 |3.0 |12.5 |
|West |Apple |3.0|15 |3.0 |12.5 |
+-----+------+---+-----+--------+----------+
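If, after the window, you do want a single row per (store, prod), a hedged sketch continuing with the names above (the ordering column amt is an assumption; pick whatever tie-break you need) is to add an ordering to the same window and keep the first row:
import org.apache.spark.sql.functions.row_number
// order the existing (store, prod) window and keep only the top row of each group
val orderedWindow = storeProdWindow.orderBy(col("amt").desc)
val onePerGroupDf = aggDf
  .withColumn("rn", row_number().over(orderedWindow))
  .where(col("rn") === 1)
  .drop("rn")
onePerGroupDf.show(false)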

Aggregate functions reduce the values of rows within a group for the specified columns. You can perform multiple different aggregations, producing new columns with values derived from the input rows, in a single pass, using only DataFrame functionality. If you wish to retain other row values, you need to implement reduction logic that specifies which row each value comes from, for instance keeping all values of the row with the maximum amt. To this end you can use a UDAF (user defined aggregate function) to reduce the rows within each group. In the example I also aggregate the max amt and the average units with standard aggregate functions in the same pass.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object ReduceAggJob {
def main (args: Array[String]): Unit = {
val appName = this.getClass.getName.replace("$", "")
println(s"appName: $appName")
val sparkSession = SparkSession
.builder()
.appName(appName)
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
sc.setLogLevel("ERROR")
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("West", "Orange", 17.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5)
).toDF("store", "prod", "amt", "units")
rawDf.printSchema
rawDf.show(false)
// Create an instance of the UDAF KeepRowWithMaxAmt.
val maxAmtUdaf = new KeepRowWithMaxAmt
// Keep the row with max amt
val aggDf = rawDf
.groupBy("store", "prod")
.agg(
max("amt"),
avg("units"),
maxAmtUdaf(
col("store"),
col("prod"),
col("amt"),
col("units")).as("KeepRowWithMaxAmt")
)
aggDf.printSchema
aggDf.show(false)
}
}
The UDAF
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class KeepRowWithMaxAmt extends UserDefinedAggregateFunction {
// These are the input fields for your aggregate function.
override def inputSchema: org.apache.spark.sql.types.StructType =
StructType(
StructField("store", StringType) ::
StructField("prod", StringType) ::
StructField("amt", DoubleType) ::
StructField("units", IntegerType) :: Nil
)
// These are the internal fields you keep for computing your aggregate.
override def bufferSchema: StructType = StructType(
StructField("store", StringType) ::
StructField("prod", StringType) ::
StructField("amt", DoubleType) ::
StructField("units", IntegerType) :: Nil
)
// This is the output type of your aggregation function.
override def dataType: DataType =
StructType((Array(
StructField("store", StringType),
StructField("prod", StringType),
StructField("amt", DoubleType),
StructField("units", IntegerType)
)))
override def deterministic: Boolean = true
// This is the initial value for your buffer schema.
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = ""
buffer(1) = ""
buffer(2) = 0.0
buffer(3) = 0
}
// This is how to update your buffer schema given an input.
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val amt = buffer.getAs[Double](2)
val candidateAmt = input.getAs[Double](2)
amt match {
case a if a < candidateAmt =>
buffer(0) = input.getAs[String](0)
buffer(1) = input.getAs[String](1)
buffer(2) = input.getAs[Double](2)
buffer(3) = input.getAs[Int](3)
case _ =>
}
}
// This is how to merge two objects with the bufferSchema type.
// Keep whichever buffer holds the larger amt, so the result stays correct when
// partial aggregates from different partitions are combined.
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
if (buffer2.getAs[Double](2) > buffer1.getAs[Double](2)) {
buffer1(0) = buffer2.getAs[String](0)
buffer1(1) = buffer2.getAs[String](1)
buffer1(2) = buffer2.getAs[Double](2)
buffer1(3) = buffer2.getAs[Int](3)
}
}
// This is where you output the final value, given the final value of your bufferSchema.
override def evaluate(buffer: Row): Any = {
buffer
}
}
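Worth noting (an addition, not part of the answer above): UserDefinedAggregateFunction is deprecated as of Spark 3.0 in favour of the typed Aggregator API. A rough sketch of the same keep-the-row-with-max-amt idea with an Aggregator (the Sale case class and all names here are my own):
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Sale(store: String, prod: String, amt: Double, units: Int)

// keeps the whole input row that carries the largest amt
object KeepRowWithMaxAmtAgg extends Aggregator[Sale, Sale, Sale] {
  def zero: Sale = Sale("", "", Double.MinValue, 0)
  def reduce(buf: Sale, row: Sale): Sale = if (row.amt > buf.amt) row else buf
  def merge(b1: Sale, b2: Sale): Sale = if (b2.amt > b1.amt) b2 else b1
  def finish(buf: Sale): Sale = buf
  def bufferEncoder: Encoder[Sale] = Encoders.product[Sale]
  def outputEncoder: Encoder[Sale] = Encoders.product[Sale]
}

// typed usage on a Dataset[Sale] (sketch):
// salesDs.groupByKey(s => (s.store, s.prod)).agg(KeepRowWithMaxAmtAgg.toColumn.name("maxAmtRow"))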

Related

Merging rows to map type based on max value in a column

I am doing a small POC to ingest user events (CSV file) from a website. Below is the sample input:
Input Schema:
Output should be in the format as below
The logic required is to group by the id column and merge the name and value columns into a Map type, where the name column provides the key and the value column provides the value. The value to be picked for each key in the Map is the one with the highest value in the timestamp column.
I was able to achieve the part where it needs to be grouped by id and the maximum of the timestamp column extracted. I am facing difficulty with selecting one value (from the corresponding max timestamp) for each id and merging it with the other names (using a map).
Below is my code
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types._
val schema = StructType(List(
StructField("id", LongType, nullable = true),
StructField("name", StringType, nullable = true),
StructField("value", StringType, nullable = true),
StructField("timestamp", LongType, nullable = true)))
val myDF = spark.read.schema(schema).option("header", "true").option("delimiter", ",").csv("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/tru.csv")
val df = myDF.toDF("id","name","value","timestamp")
//df.groupBy("id","name","value").agg(max("timestamp")).show()
val windowSpecAgg = Window.partitionBy("id")
df.withColumn("max", max(col("timestamp")).over(windowSpecAgg)).where(col("timestamp") === col("max")).drop("max").show()
Use a window function and filter to keep the latest data by partitioning on "id", "name";
then use the map_from_arrays and to_json functions to recreate the desired JSON.
Example:
df.show()
//sample data
//+---+----+-------+---------+
//| id|name| value|timestamp|
//+---+----+-------+---------+
//| 1| A| Exited| 3201|
//| 1| A|Running| 5648|
//| 1| C| Exited| 3547|
//| 2| C|Success| 3612|
//+---+----+-------+---------+
val windowSpecAgg = Window.partitionBy("id","name").orderBy(desc("timestamp"))
df.withColumn("max", row_number().over(windowSpecAgg)).filter(col("max")===1).
drop("max").
groupBy("id").
agg(to_json(map_from_arrays(collect_list(col("name")),collect_list(col("value")))).as("settings")).
show(10,false)
//+---+----------------------------+
//|id |settings |
//+---+----------------------------+
//|1 |{"A":"Running","C":"Exited"}|
//|2 |{"C":"Success"} |
//+---+----------------------------+
You can use ranking function - row_number() to get the latest records per partition.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df = Seq((1, "A", "Exited", 1546333201),
(3, "B", "Failed", 1546334201),
(2, "C", "Success", 1546333612),
(3, "B", "Hold", 1546333444),
(1, "A", "Running", 1546335648),
(1, "C", "Exited", 1546333547)).toDF("id", "name", "value", "timestamp")
df.withColumn("rn",
row_number().over(Window.partitionBy("id", "name").orderBy('timestamp.desc_nulls_last)))
.where('rn === 1)
.drop("rn")
.groupBy("id")
.agg(collect_list(map('name, 'value)).as("settings"))
.show(false)
/*
+---+-------------------------------+
|id |settings |
+---+-------------------------------+
|1 |[[A -> Running], [C -> Exited]]|
|3 |[[B -> Failed]] |
|2 |[[C -> Success]] |
+---+-------------------------------+ */

Need to get fields from lookup file with calculated values using spark dataframe

I have a products file as below, and the formulas for cost price and discount are in the lookup file. We need to get the discount and cost price fields from the lookup file.
product_id,product_name,marked_price,selling_price,profit
101,AAAA,5500,5400,500
102,ABCS,7000,6500,1000
103,GHMA,6500,5600,700
104,PSNLA,8450,8000,800
105,GNBC,1250,1200,600
lookup file:
key,value
cost_price,(selling_price+profit)
discount,(marked_price-selling_price)
Final output:
product_id,product_name,marked_price,selling_price,profit,cost_price,discount
101,AAAA,5500,5400,500,5900,100
102,ABCS,7000,6500,1000,7500,500
103,GHMA,6500,5600,700,6300,900
104,PSNLA,8450,8000,800,8800,450
105,GNBC,1250,1200,600,1800,50
First, you should make a Map out of your lookup file; then you can add the columns using expr:
val lookup = Map(
"cost_price" -> "selling_price+profit",
"discount" -> "(marked_price-selling_price)"
)
val df = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000)
)
.toDF("product_id", "product_name", "marked_price", "selling_price", "profit")
df
.withColumn("cost_price",expr(lookup("cost_price")))
.withColumn("discount",expr(lookup("discount")))
.show()
gives :
+----------+------------+------------+-------------+------+----------+--------+
|product_id|product_name|marked_price|selling_price|profit|cost_price|discount|
+----------+------------+------------+-------------+------+----------+--------+
| 101| AAAA| 5500| 5400| 500| 5900| 100|
| 102| ABCS| 7000| 6500| 1000| 7500| 500|
+----------+------------+------------+-------------+------+----------+--------+
You can also iterate over your lookup:
val finalDF = lookup.foldLeft(df){case (df,(k,v)) => df.withColumn(k,expr(v))}
import spark.implicits._
val dataheader = Seq("product_id", "product_name", "marked_price", "selling_price", "profit")
val data = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000),
(103, "GHMA", 6500, 5600, 700),
(104, "PSNLA", 8450, 8000, 800),
(105, "GNBC", 1250, 1200, 600))
val dataDF = data.toDF(dataheader: _*)
val lookupMap = Seq(
("cost_price", "(selling_price+profit)"),
("discount", "(marked_price-selling_price)")) //You can read your data from look up file and construct a dataframe
.toDF("key", "value").select("key", "value").as[(String, String)].collect.toMap
lookupMap.foldLeft(dataDF)((df, look) => {
df.withColumn(look._1, expr(look._2))
}).show()
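If the lookup really lives in a file (as in the question), here is a sketch of building that Map straight from it; the path and the header option are assumptions about the file layout:
import org.apache.spark.sql.functions.expr
import spark.implicits._

// assumed layout: a CSV with header "key,value"; the path is hypothetical
val lookupFromFile: Map[String, String] = spark.read
  .option("header", "true")
  .csv("/path/to/lookup.csv")
  .select("key", "value")
  .as[(String, String)]
  .collect()
  .toMap

val finalDf = lookupFromFile.foldLeft(dataDF) { case (df, (k, v)) => df.withColumn(k, expr(v)) }
finalDf.show()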

Flatten rows in sliding window using Spark

I'm processing a large number of rows from either a database or a file using Apache Spark. Part of the processing creates a sliding window of 3 rows, where the rows need to be flattened and additional calculations performed on the flattened rows. Below is a simplified example of what I am trying to do.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions.Window
object Main extends App {
val ss = SparkSession.builder().appName("DataSet Test")
.master("local[*]").getOrCreate()
import ss.implicits._
case class Foo(a:Int, b:String )
// rows from database or file
val foos = Seq(Foo(-18, "Z"),
Foo(-11, "G"),
Foo(-8, "A"),
Foo(-4, "C"),
Foo(-1, "F")).toDS()
// work on 3 rows
val sliding_window_spec = Window.orderBy(desc("a")).rowsBetween( -2, 0)
// flattened object with example computations
case class FooResult(a1:Int, b1:String, a2:Int, b2:String, a3:Int, b3:String, computation1:Int, computation2:String )
// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
// expected results
val fooResults = Seq(FooResult( -1, "F", -4, "C", -8, "A", -5, "FCA" ),
FooResult( -4, "C", -8, "A", -11, "G", -12, "CAG" ),
FooResult( -8, "A", -11, "G", -18, "Z", -19, "AGZ" )).toDS()
ss.stop()
}
How can I convert the foos into the fooResults? I'm using Apache Spark 2.3.0
// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
You can simply use the built-in collect_list function with the window function you've already defined, and then, with a udf, do the computation and the flattening. Finally, you can filter and expand the struct column to get your final desired result:
import org.apache.spark.sql.functions.{col, collect_list, udf}
def slidingUdf = udf((list1: Seq[Int], list2: Seq[String]) => {
if(list1.size < 3) null
else {
val zipped = list1.zip(list2)
FooResult(zipped(0)._1, zipped(0)._2, zipped(1)._1, zipped(1)._2, zipped(2)._1, zipped(2)._2, zipped(0)._1+zipped(1)._1, zipped(0)._2+zipped(1)._2+zipped(2)._2)
}
})
foos.select(slidingUdf(collect_list("a").over(sliding_window_spec), collect_list("b").over(sliding_window_spec)).as("test"))
.filter(col("test").isNotNull)
.select(col("test.*"))
.show(false)
which should give you
+---+---+---+---+---+---+------------+------------+
|a1 |b1 |a2 |b2 |a3 |b3 |computation1|computation2|
+---+---+---+---+---+---+------------+------------+
|-1 |F |-4 |C |-8 |A |-5 |FCA |
|-4 |C |-8 |A |-11|G |-12 |CAG |
|-8 |A |-11|G |-18|Z |-19 |AGZ |
+---+---+---+---+---+---+------------+------------+
Note: Remember that the case classes should be defined outside the scope of the current session

Spark GroupBy Aggregate functions

case class Step (Id : Long,
stepNum : Long,
stepId : Int,
stepTime: java.sql.Timestamp
)
I have a Dataset[Step] and I want to perform a groupBy operation on the "Id" col.
My output should look like Dataset[(Long, List[Step])]. How do I do this?
Let's say the variable "inquiryStepMap" is of type Dataset[Step]; then we can do this with RDDs as follows:
val inquiryStepGrouped: RDD[(Long, Iterable[Step])] = inquiryStepMap.rdd.groupBy(x => x.Id)
It seems you need groupByKey:
Sample:
import java.sql.Timestamp
// note: this deprecated constructor treats the year as an offset from 1900, hence the 3917 dates in the output below
val t = new Timestamp(2017, 5, 1, 0, 0, 0, 0)
val ds = Seq(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t)).toDS()
groupByKey and then mapGroups:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList))
// res18: org.apache.spark.sql.Dataset[(Long, List[Step])] = [_1: bigint, _2: array<struct<Id:bigint,stepNum:bigint,stepId:int,stepTime:timestamp>>]
And the result looks like:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList)).show()
+---+--------------------+
| _1| _2|
+---+--------------------+
| 1|[[1,21,1,3917-06-...|
| 2|[[2,10,3,3917-06-...|
+---+--------------------+

How to get all columns after groupby on Dataset<Row> in spark sql 2.1.0

First, I am very new to SPARK
I have millions of records in my Dataset and I want to group by the name column and find the names that have the maximum age. I am getting correct results, but I need all columns in my result set.
Dataset<Row> resultset = studentDataSet.select("*").groupBy("name").max("age");
resultset.show(1000,false);
I am getting only name and max(age) in my result set.
For your solution you have to try a different approach. You were almost there, but let me help you understand.
Dataset<Row> resultset = studentDataSet.groupBy("name").max("age");
now what you can do is you can join the resultset with studentDataSet
Dataset<Row> joinedDS = studentDataset.join(resultset, "name");
The problem with groupBy is that after applying it you get a RelationalGroupedDataset, so what you get back depends on the next operation you perform (sum, min, mean, max, etc.); the result of that operation is combined with the groupBy columns.
In your case the name column is combined with the max of age, so it returns only two columns. If you instead grouped by age and then applied max on the 'age' column, you would likewise get two columns: age and max(age).
Note: the code is not tested, please make changes if needed.
Hope this clears up your query.
The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.
Let's create a sample data set and test the code:
val df = Seq(
("bob", 20, "blah"),
("bob", 40, "blah"),
("karen", 21, "hi"),
("monica", 43, "candy"),
("monica", 99, "water")
).toDF("name", "age", "another_column")
This code should run faster with large DataFrames.
df
.groupBy("name")
.agg(
max("name").as("name1_dup"),
max("another_column").as("another_column"),
max("age").as("age")
).drop(
"name1_dup"
).show()
+------+--------------+---+
| name|another_column|age|
+------+--------------+---+
|monica| water| 99|
| karen| hi| 21|
| bob| blah| 40|
+------+--------------+---+
What you're trying to achieve is:
group rows by name
reduce each group to the single row with the maximum age
This alternative achieves that output without using aggregate functions:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
object TestJob5 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
sc.setLogLevel("ERROR")
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("Moe", "Slap", 7.9, 118),
("Larry", "Spank", 8.0, 115),
("Curly", "Twist", 6.0, 113),
("Laurel", "Whimper", 7.53, 119),
("Hardy", "Laugh", 6.0, 118),
("Charley", "Ignore", 9.7, 115),
("Moe", "Spank", 6.8, 118),
("Larry", "Twist", 6.0, 115),
("Charley", "fall", 9.0, 115)
).toDF("name", "requisite", "funniness_of_requisite", "age")
rawDf.show(false)
rawDf.printSchema
val nameWindow = Window
.partitionBy("name")
val aggDf = rawDf
.withColumn("id", monotonically_increasing_id)
.withColumn("maxFun", max("funniness_of_requisite").over(nameWindow))
.withColumn("count", count("name").over(nameWindow))
.withColumn("minId", min("id").over(nameWindow))
.where(col("maxFun") === col("funniness_of_requisite") && col("minId") === col("id") )
.drop("maxFun")
.drop("minId")
.drop("id")
aggDf.printSchema
aggDf.show(false)
}
}
Bear in mind that a group could potentially have more than one row with the max age, so you need to pick one by some logic. In the example I assume it doesn't matter, so I just assign a unique number to each row and pick the smallest one.
Noting that a subsequent join means extra shuffling, and that some of the other solutions seem inaccurate in what they return or even turn the Dataset into a DataFrame, I sought a better solution. Here is mine:
case class People(name: String, age: Int, other: String)
val df = Seq(
People("Rob", 20, "cherry"),
People("Rob", 55, "banana"),
People("Rob", 40, "apple"),
People("Ariel", 55, "fox"),
People("Vera", 43, "zebra"),
People("Vera", 99, "horse")
).toDS
val oldestResults = df
.groupByKey(_.name)
.mapGroups{
case (nameKey, peopleIter) => {
var oldestPerson = peopleIter.next
while(peopleIter.hasNext) {
val nextPerson = peopleIter.next
if(nextPerson.age > oldestPerson.age) oldestPerson = nextPerson
}
oldestPerson
}
}
oldestResults.show
The following produces:
+-----+---+------+
| name|age| other|
+-----+---+------+
|Ariel| 55| fox|
| Rob| 55|banana|
| Vera| 99| horse|
+-----+---+------+
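A hedged variant of the same idea: KeyValueGroupedDataset also provides reduceGroups, which expresses the keep-the-oldest-person reduction without the manual while loop (the trailing map(_._2) simply drops the grouping key):
// same result as the mapGroups version above, expressed as a typed per-key reduce
val oldestResults2 = df
  .groupByKey(_.name)
  .reduceGroups((p1, p2) => if (p2.age > p1.age) p2 else p1)
  .map(_._2)
oldestResults2.show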
You need to remember that aggregate functions reduce the rows, and therefore you need to specify which row's age you want with a reducing function. If you want to retain all rows of a group (warning: this can cause explosions or skewed partitions), you can collect them as a list. You can then use a UDF (user defined function) to reduce them by your criteria, in this example funniness_of_requisite, and then expand the columns belonging to the reduced row from that single row with another UDF.
For the purpose of this answer I assume you wish to retain the age of the person who has the max funniness_of_requisite.
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType}
import scala.collection.mutable
object TestJob4 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
(1, "Moe", "Slap", 7.9, 118),
(2, "Larry", "Spank", 8.0, 115),
(3, "Curly", "Twist", 6.0, 113),
(4, "Laurel", "Whimper", 7.53, 119),
(5, "Hardy", "Laugh", 6.0, 18),
(6, "Charley", "Ignore", 9.7, 115),
(2, "Moe", "Spank", 6.8, 118),
(3, "Larry", "Twist", 6.0, 115),
(3, "Charley", "fall", 9.0, 115)
).toDF("id", "name", "requisite", "funniness_of_requisite", "age")
rawDf.show(false)
rawDf.printSchema
val rawSchema = rawDf.schema
val fUdf = udf(reduceByFunniness, rawSchema)
val nameUdf = udf(extractAge, IntegerType)
val aggDf = rawDf
.groupBy("name")
.agg(
count(struct("*")).as("count"),
max(col("funniness_of_requisite")),
collect_list(struct("*")).as("horizontal")
)
.withColumn("short", fUdf($"horizontal"))
.withColumn("age", nameUdf($"short"))
.drop("horizontal")
aggDf.printSchema
aggDf.show(false)
}
def reduceByFunniness= (x: Any) => {
val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
val red = d.reduce((r1, r2) => {
val funniness1 = r1.getAs[Double]("funniness_of_requisite")
val funniness2 = r2.getAs[Double]("funniness_of_requisite")
val r3 = funniness1 match {
case a if a >= funniness2 =>
r1
case _ =>
r2
}
r3
})
red
}
def extractAge = (x: Any) => {
val d = x.asInstanceOf[GenericRowWithSchema]
d.getAs[Int]("age")
}
}
Here is the output:
+-------+-----+---------------------------+-------------------------------+---+
|name   |count|max(funniness_of_requisite)|short                          |age|
+-------+-----+---------------------------+-------------------------------+---+
|Hardy  |1    |6.0                        |[5, Hardy, Laugh, 6.0, 18]     |18 |
|Moe    |2    |7.9                        |[1, Moe, Slap, 7.9, 118]       |118|
|Curly  |1    |6.0                        |[3, Curly, Twist, 6.0, 113]    |113|
|Larry  |2    |8.0                        |[2, Larry, Spank, 8.0, 115]    |115|
|Laurel |1    |7.53                       |[4, Laurel, Whimper, 7.53, 119]|119|
|Charley|2    |9.7                        |[6, Charley, Ignore, 9.7, 115] |115|
+-------+-----+---------------------------+-------------------------------+---+
