How to use a Dataset to group by - apache-spark

I can do this with an RDD like so:
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)
The result is:
(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))
How can I do this with a Dataset in Spark 2.0?
I know how to do it with a custom function, but that feels overly complicated. Is there a simpler, more direct method?

I would suggest you start by creating a case class:
case class Monkey(city: String, firstName: String)
This case class should be defined outside the main class. Then you can just use the toDS function, followed by groupBy and the collect_list aggregation function, as below:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test)
.map(row => Monkey(row._1, row._2))
.toDS()
.groupBy("city")
.agg(collect_list("firstName") as "list")
.show(false)
You will get output like this:
+-----------+------------------------+
|city |list |
+-----------+------------------------+
|Los Angeles|[Tom] |
|Detroit |[Michael, Peter, George]|
|Chicago |[David, Andrew] |
|Houston |[John] |
|New York |[Jack] |
+-----------+------------------------+
You can always convert back to an RDD by just calling the .rdd function.
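For instance, a minimal sketch continuing the example above (the same pipeline as before, just ending with .rdd instead of .show):
// the same aggregation as above, converted back to an RDD of Rows
val aggregatedRdd = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")
  .rdd

aggregatedRdd.foreach(println)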

To create a Dataset, first define a case class outside your class:
case class Employee(city: String, name: String)
Then you can convert the list to a Dataset:
val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
).toDF("city", "name")
val data = test.as[Employee]
Or
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
val data = test.map(r => Employee(r._1, r._2)).toDS()
Now you can group by and perform any aggregation (collect_list needs import org.apache.spark.sql.functions._):
import org.apache.spark.sql.functions._
data.groupBy("city").count().show
data.groupBy("city").agg(collect_list("name")).show
Hope this helps!

First I would turn your RDD into a Dataset:
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val testDs = test.toDS()
Here you can see your column names; use them wisely:
testDs.schema.fields.foreach(x => println(x))
In the end you only need a groupBy on the right columns (remember that groupBy returns a RelationalGroupedDataset, so you still need to follow it with an aggregation):
testDs.groupBy("City?", "Name?")
RDDs are not really the Spark 2.0 way, I think.
If you have any question please just ask.
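For completeness, here is a small sketch of what that could look like, assuming the tuple Dataset prints the default column names _1 (the city) and _2 (the name):
import org.apache.spark.sql.functions._

testDs
  .groupBy("_1")                        // group by the city column
  .agg(collect_list("_2").as("names")) // collect the names for each city
  .show(false)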

Related

Explode array values into multiple columns using PySpark

I am new to PySpark and I want to explode array values in such a way that each value gets assigned to a new column. I tried using explode but I couldn't get the desired output.
This is the code:
from pyspark.sql import *
from pyspark.sql.functions import explode
if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[3]") \
        .appName("DataOps") \
        .getOrCreate()
    dataFrameJSON = spark.read \
        .option("multiLine", True) \
        .option("mode", "PERMISSIVE") \
        .json("data.json")
    dataFrameJSON.printSchema()
    sub_DF = dataFrameJSON.select(explode("values.line").alias("new_values"))
    sub_DF.printSchema()
    sub_DF2 = sub_DF.select("new_values.*")
    sub_DF2.printSchema()
    sub_DF.show(truncate=False)
    new_DF = sub_DF2.select("id", "period.*", "property")
    new_DF.show(truncate=False)
    new_DF.printSchema()
This is the data:
{
    "values" : {
        "line" : [
            {
                "id" : 1,
                "period" : {
                    "start_ts" : "2020-01-01T00:00:00",
                    "end_ts" : "2020-01-01T00:15:00"
                },
                "property" : [
                    {
                        "name" : "PID",
                        "val" : "P120E12345678"
                    },
                    {
                        "name" : "EngID",
                        "val" : "PANELID00000000"
                    },
                    {
                        "name" : "TownIstat",
                        "val" : "12058091"
                    },
                    {
                        "name" : "ActiveEng",
                        "val" : "5678.1"
                    }
                ]
            }
        ]
    }
}
Could you include the data instead of screenshots?
Meanwhile, assuming that df is the dataframe being used, what we need to do is create a new dataframe, extracting the vals from the property array into new columns, and dropping the property column at the end:
from pyspark.sql.functions import col
output_df = df \
    .withColumn("PID", col("property")[0].val) \
    .withColumn("EngID", col("property")[1].val) \
    .withColumn("TownIstat", col("property")[2].val) \
    .withColumn("ActiveEng", col("property")[3].val) \
    .drop("property")
In case the element was of type ArrayType, use the following:
from pyspark.sql.functions import col
output_df = df \
    .withColumn("PID", col("property")[0][1]) \
    .withColumn("EngID", col("property")[1][1]) \
    .withColumn("TownIstat", col("property")[2][1]) \
    .withColumn("ActiveEng", col("property")[3][1]) \
    .drop("property")
Note that explode explodes the arrays into new rows, not columns; see this: pyspark explode
This is a general solution that works even when the JSONs are messy (different ordering of elements, or some elements missing).
You have to flatten first, use regexp_replace to split the 'property' column, and finally pivot. This also avoids hard-coding the new column names.
Constructing your dataframe:
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.functions import *
schema = StructType([StructField("id", IntegerType()), StructField("start_ts", StringType()), StructField("end_ts", StringType()), \
StructField("property", ArrayType(StructType( [StructField("name", StringType()), StructField("val", StringType())] )))])
data = [[1, "2010", "2020", [["PID", "P123"], ["Eng", "PA111"], ["Town", "999"], ["Act", "123.1"]]],\
[2, "2011", "2012", [["PID", "P456"], ["Eng", "PA222"], ["Town", "777"], ["Act", "234.1"]]]]
df = spark.createDataFrame(data,schema=schema)
df.show(truncate=False)
+---+--------+------+------------------------------------------------------+
|id |start_ts|end_ts|property |
+---+--------+------+------------------------------------------------------+
|1 |2010 |2020 |[[PID, P123], [Eng, PA111], [Town, 999], [Act, 123.1]]|
|2 |2011 |2012 |[[PID, P456], [Eng, PA222], [Town, 777], [Act, 234.1]]|
+---+--------+------+------------------------------------------------------+
Flattening and pivoting:
df_flatten = df.rdd.flatMap(lambda x: [(x[0],x[1], x[2], y) for y in x[3]]).toDF(['id', 'start_ts', 'end_ts', 'property'])\
.select('id', 'start_ts', 'end_ts', col("property").cast("string"))
df_split = df_flatten.select('id', 'start_ts', 'end_ts', regexp_replace(df_flatten.property, "[\[\]]", "").alias("replacced_col"))\
.withColumn("arr", split(col("replacced_col"), ", "))\
.select(col("arr")[0].alias("col1"), col("arr")[1].alias("col2"), 'id', 'start_ts', 'end_ts')
final_df = df_split.groupby(df_split.id,)\
.pivot("col1")\
.agg(first("col2"))\
.join(df,'id').drop("property")
Output:
final_df.show()
+---+-----+-----+----+----+--------+------+
| id| Act| Eng| PID|Town|start_ts|end_ts|
+---+-----+-----+----+----+--------+------+
| 1|123.1|PA111|P123| 999| 2010| 2020|
| 2|234.1|PA222|P456| 777| 2011| 2012|
+---+-----+-----+----+----+--------+------+

Need to get fields from lookup file with calculated values using spark dataframe

I have a products file as below, and the formulas for cost price and discount are in a lookup file. We need to get the discount and cost price fields using the lookup file.
product_id,product_name,marked_price,selling_price,profit
101,AAAA,5500,5400,500
102,ABCS,7000,6500,1000
103,GHMA,6500,5600,700
104,PSNLA,8450,8000,800
105,GNBC,1250,1200,600
lookup file:
key,value
cost_price,(selling_price+profit)
discount,(marked_price-selling_price)
Final output:
product_id,product_name,marked_price,selling_price,profit,cost_price,discount
101,AAAA,5500,5400,500,5900,100
102,ABCS,7000,6500,1000,7500,500
103,GHMA,6500,5600,700,6300,900
104,PSNLA,8450,8000,800,8800,450
105,GNBC,1250,1200,600,1800,50
First, you should make a Map out of your lookup-file, then you can add the columns using expr:
val lookup = Map(
"cost_price" -> "selling_price+profit",
"discount" -> "(marked_price-selling_price)"
)
val df = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000)
)
.toDF("product_id", "product_name", "marked_price", "selling_price", "profit")
df
.withColumn("cost_price",expr(lookup("cost_price")))
.withColumn("discount",expr(lookup("discount")))
.show()
gives :
+----------+------------+------------+-------------+------+----------+--------+
|product_id|product_name|marked_price|selling_price|profit|cost_price|discount|
+----------+------------+------------+-------------+------+----------+--------+
| 101| AAAA| 5500| 5400| 500| 5900| 100|
| 102| ABCS| 7000| 6500| 1000| 7500| 500|
+----------+------------+------------+-------------+------+----------+--------+
You can also iterate over your lookup:
val finalDF = lookup.foldLeft(df){case (df,(k,v)) => df.withColumn(k,expr(v))}
import spark.implicits._
val dataheader = Seq("product_id", "product_name", "marked_price", "selling_price", "profit")
val data = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000),
(103, "GHMA", 6500, 5600, 700),
(104, "PSNLA", 8450, 8000, 800),
(105, "GNBC", 1250, 1200, 600))
val dataDF = data.toDF(dataheader: _*)
val lookupMap = Seq(
("cost_price", "(selling_price+profit)"),
("discount", "(marked_price-selling_price)")) //You can read your data from look up file and construct a dataframe
.toDF("key", "value").select("key", "value").as[(String, String)].collect.toMap
lookupMap.foldLeft(dataDF)((df, look) => {
df.withColumn(look._1, expr(look._2))
}).show()
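As a minimal sketch of the step that the comment above hints at, you could build the lookup Map directly from the lookup file itself (assuming a hypothetical path lookup.csv containing the key,value header shown in the question):
import org.apache.spark.sql.functions.expr

// read the lookup file and collect it into a Map of column name -> expression
val lookupFromFile = spark.read
  .option("header", "true")
  .csv("lookup.csv")               // hypothetical path to the lookup file
  .as[(String, String)]
  .collect()
  .toMap

val resultDF = lookupFromFile.foldLeft(dataDF) { case (df, (k, v)) => df.withColumn(k, expr(v)) }
resultDF.show()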

How to use Dataset to group by, but entire rows

Reading this post, I wonder how we can group a Dataset while keeping multiple columns.
Like:
val test = Seq(("New York", "Jack", "jdhj"),
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
I would like to get
Chicago, [( "David", "ff"), ("Andrew", "ddd")]
Create a case class as below
case class TestData (location: String, name: String, value: String)
Dummy Data
val test = Seq(("New York", "Jack", "jdhj"),
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
//change each row to TestData object
.map(x => TestData(x._1, x._2, x._3))
.toDS() // create dataset from above data
To get the output you require (assuming import spark.implicits._ and import org.apache.spark.sql.functions._ are in scope):
test.groupBy($"location")
.agg(collect_list(struct("name", "value")).as("data"))
.show(false)
Output:
+-----------+--------------------------------------------+
|location |data |
+-----------+--------------------------------------------+
|Los Angeles|[[Tom,ff]] |
|Detroit |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Chicago |[[David,ff], [Andrew,ddd]] |
|Houston |[[John,dd]] |
|New York |[[Jack,jdhj]] |
+-----------+--------------------------------------------+
I have suggested a case class way in the link that you have provided in the question. Here's something different.
RDD way
You can simply do the following
val rdd = sc.parallelize(test) //creating rdd from test
val resultRdd = rdd.groupBy(x => x._1) //grouping by the first element
.mapValues(x => x.map(y => (y._2, y._3))) //collecting the second and third elements in the grouped dataset
resultRdd.foreach(println) should give you
(New York,List((Jack,jdhj)))
(Houston,List((John,dd)))
(Chicago,List((David,ff), (Andrew,ddd)))
(Detroit,List((Michael,fff), (Peter,dd), (George,dkdjkd)))
(Los Angeles,List((Tom,ff)))
Converting rdd to dataframe
If you require output in table format you can just call .toDF() after some manipulation as
val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF()
df.show(false) should give you
+-----------+--------------------------------------------+
|_1 |_2 |
+-----------+--------------------------------------------+
|New York |[[Jack,jdhj]] |
|Houston |[[John,dd]] |
|Chicago |[[David,ff], [Andrew,ddd]] |
|Detroit |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Los Angeles|[[Tom,ff]] |
+-----------+--------------------------------------------+

Spark aggregate on multiple columns within partition without shuffle

I'm trying to aggregate a dataframe on multiple columns. I know that everything I need for the aggregation is within the partition; that is, there's no need for a shuffle because all of the data for the aggregation is local to the partition.
Taking an example, if I have something like
val sales=sc.parallelize(List(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5))).repartition(2)
val tdf = sales.map{ case (store, prod, amt, units) => ((store, prod), (amt, amt, amt, units)) }.
reduceByKey((x, y) => (x._1 + y._1, math.min(x._2, y._2), math.max(x._3, y._3), x._4 + y._4))
println(tdf.toDebugString)
I get a result like
(2) ShuffledRDD[12] at reduceByKey at Test.scala:59 []
+-(2) MapPartitionsRDD[11] at map at Test.scala:58 []
| MapPartitionsRDD[10] at repartition at Test.scala:57 []
| CoalescedRDD[9] at repartition at Test.scala:57 []
| ShuffledRDD[8] at repartition at Test.scala:57 []
+-(1) MapPartitionsRDD[7] at repartition at Test.scala:57 []
| ParallelCollectionRDD[6] at parallelize at Test.scala:51 []
You can see the MapPartitionsRDD, which is good. But then there's the ShuffledRDD, which I want to prevent because I want the per-partition summarization, grouped by column values within the partition.
zero323's suggestion is tantalizingly close, but I need the "group by columns" functionality.
Referring to my sample above, I'm looking for the result that would be produced by
select store, prod, sum(amt), avg(units) from sales group by partition_id, store, prod
(I don't really need the partition id; that's just to illustrate that I want per-partition results.)
I've looked at lots of examples but every debug string I've produced has the Shuffle. I really hope to get rid of the shuffle. I guess I'm essentially looking for a groupByKeysWithinPartitions function.
The only way to achieve that is by using mapPartitions and having custom code for grouping and computing your values while iterating over the partition.
As you mention, the data is already sorted by the grouping keys (store, prod), so we can efficiently compute your aggregations in a pipelined fashion:
(1) Define helper classes:
:paste
case class MyRec(store: String, prod: String, amt: Double, units: Int)
case class MyResult(store: String, prod: String, total_amt: Double, min_amt: Double, max_amt: Double, total_units: Int)
object MyResult {
def apply(rec: MyRec): MyResult = new MyResult(rec.store, rec.prod, rec.amt, rec.amt, rec.amt, rec.units)
def aggregate(result: MyResult, rec: MyRec) = {
new MyResult(result.store,
result.prod,
result.total_amt + rec.amt,
math.min(result.min_amt, rec.amt),
math.max(result.max_amt, rec.amt),
result.total_units + rec.units
)
}
}
(2) Define pipelined aggregator:
:paste
def pipelinedAggregator(iter: Iterator[MyRec]): Iterator[Seq[MyResult]] = {
  var prev: MyResult = null
  var res: Seq[MyResult] = Nil
  for (crt <- iter) yield {
    res = Nil // reset, so a finished group is emitted only once
    if (prev == null) {
      prev = MyResult(crt)
    }
    else if (prev.prod != crt.prod || prev.store != crt.store) {
      // key changed: emit the finished group and start a new one
      res = Seq(prev)
      prev = MyResult(crt)
    }
    else {
      prev = MyResult.aggregate(prev, crt)
    }
    if (!iter.hasNext) {
      // last record of the partition: also emit the group in progress
      res = res ++ Seq(prev)
    }
    res
  }
}
(3) Run aggregation:
:paste
val sales = sc.parallelize(
List(MyRec("West", "Apple", 2.0, 10),
MyRec("West", "Apple", 3.0, 15),
MyRec("West", "Orange", 5.0, 15),
MyRec("South", "Orange", 3.0, 9),
MyRec("South", "Orange", 6.0, 18),
MyRec("East", "Milk", 5.0, 5),
MyRec("West", "Apple", 7.0, 11)), 2).toDS
sales.mapPartitions(iter => Iterator(iter.toList)).show(false)
val result = sales
.mapPartitions(recIter => pipelinedAggregator(recIter))
.flatMap(identity)
result.show
result.explain
Output:
+-------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------+
|[[West,Apple,2.0,10], [West,Apple,3.0,15], [West,Orange,5.0,15]] |
|[[South,Orange,3.0,9], [South,Orange,6.0,18], [East,Milk,5.0,5], [West,Apple,7.0,11]]|
+-------------------------------------------------------------------------------------+
+-----+------+---------+-------+-------+-----------+
|store| prod|total_amt|min_amt|max_amt|total_units|
+-----+------+---------+-------+-------+-----------+
| West| Apple| 5.0| 2.0| 3.0| 25|
| West|Orange| 5.0| 5.0| 5.0| 15|
|South|Orange| 9.0| 3.0| 6.0| 27|
| East| Milk| 5.0| 5.0| 5.0| 5|
| West| Apple| 7.0| 7.0| 7.0| 11|
+-----+------+---------+-------+-------+-----------+
== Physical Plan ==
*SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).store, true) AS store#31, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).prod, true) AS prod#32, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).total_amt AS total_amt#33, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).min_amt AS min_amt#34, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).max_amt AS max_amt#35, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).total_units AS total_units#36]
+- MapPartitions <function1>, obj#30: $line14.$read$$iw$$iw$MyResult
+- MapPartitions <function1>, obj#20: scala.collection.Seq
+- Scan ExternalRDDScan[obj#4]
sales: org.apache.spark.sql.Dataset[MyRec] = [store: string, prod: string ... 2 more fields]
result: org.apache.spark.sql.Dataset[MyResult] = [store: string, prod: string ... 4 more fields]
If this is the output you're looking for:
+-----+------+--------+----------+
|store|prod |max(amt)|avg(units)|
+-----+------+--------+----------+
|South|Orange|6.0 |13.5 |
|West |Orange|5.0 |15.0 |
|East |Milk |5.0 |5.0 |
|West |Apple |3.0 |12.5 |
+-----+------+--------+----------+
Spark DataFrames have all the functionality you're asking for, with generic, concise shorthand syntax:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object TestJob2 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5)
).toDF("store", "prod", "amt", "units")
rawDf.show(false)
rawDf.printSchema
val aggDf = rawDf
.groupBy("store", "prod")
.agg(
max(col("amt")),
avg(col("units"))
// in case you need to retain more info
// , collect_list(struct("*")).as("horizontal")
)
aggDf.printSchema
aggDf.show(false)
}
}
Uncomment the collect_list line to aggregate everything:
+-----+------+--------+----------+---------------------------------------------------+
|store|prod  |max(amt)|avg(units)|horizontal                                         |
+-----+------+--------+----------+---------------------------------------------------+
|South|Orange|6.0     |13.5      |[[South, Orange, 3.0, 9], [South, Orange, 6.0, 18]]|
|West |Orange|5.0     |15.0      |[[West, Orange, 5.0, 15]]                          |
|East |Milk  |5.0     |5.0       |[[East, Milk, 5.0, 5]]                             |
|West |Apple |3.0     |12.5      |[[West, Apple, 2.0, 10], [West, Apple, 3.0, 15]]   |
+-----+------+--------+----------+---------------------------------------------------+
The maximum and average aggregations you specify are computed over multiple rows.
If you want to keep all the original rows, use a Window function, which will partition them.
If you want to reduce the rows in each partition, you must specify reduction logic or a filter.
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
object TestJob7 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
sc.setLogLevel("ERROR")
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5)
).toDF("store", "prod", "amt", "units")
rawDf.show(false)
rawDf.printSchema
val storeProdWindow = Window
.partitionBy("store", "prod")
val aggDf = rawDf
.withColumn("max(amt)", max("amt").over(storeProdWindow))
.withColumn("avg(units)", avg("units").over(storeProdWindow))
aggDf.printSchema
aggDf.show(false)
}
}
Here is the result. Take note that it is already grouped (the window shuffles into partitions):
+-----+------+---+-----+--------+----------+
|store|prod |amt|units|max(amt)|avg(units)|
+-----+------+---+-----+--------+----------+
|South|Orange|3.0|9 |6.0 |13.5 |
|South|Orange|6.0|18 |6.0 |13.5 |
|West |Orange|5.0|15 |5.0 |15.0 |
|East |Milk |5.0|5 |5.0 |5.0 |
|West |Apple |2.0|10 |3.0 |12.5 |
|West |Apple |3.0|15 |3.0 |12.5 |
+-----+------+---+-----+--------+----------+
Aggregate functions reduce the values of rows for the specified columns within the group. You can perform multiple different aggregations, producing new columns with values derived from the input rows, in one iteration, exclusively using DataFrame functionality. If you wish to retain other row values, you need to implement reduction logic that specifies the row each value comes from; for instance, keep all values of the row with the maximum amt. To this end you can use a UDAF (user defined aggregate function) to reduce rows within the group. In the example I also aggregate the max amt and the average units with standard aggregate functions in the same iteration.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object ReduceAggJob {
def main (args: Array[String]): Unit = {
val appName = this.getClass.getName.replace("$", "")
println(s"appName: $appName")
val sparkSession = SparkSession
.builder()
.appName(appName)
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
sc.setLogLevel("ERROR")
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("West", "Orange", 17.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5)
).toDF("store", "prod", "amt", "units")
rawDf.printSchema
rawDf.show(false)
// Create an instance of the UDAF KeepRowWithMaxAmt.
val maxAmtUdaf = new KeepRowWithMaxAmt
// Keep the row with max amt
val aggDf = rawDf
.groupBy("store", "prod")
.agg(
max("amt"),
avg("units"),
maxAmtUdaf(
col("store"),
col("prod"),
col("amt"),
col("units")).as("KeepRowWithMaxAmt")
)
aggDf.printSchema
aggDf.show(false)
}
}
The UDAF
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class KeepRowWithMaxAmt extends UserDefinedAggregateFunction {
// These are the input fields for your aggregate function.
override def inputSchema: org.apache.spark.sql.types.StructType =
StructType(
StructField("store", StringType) ::
StructField("prod", StringType) ::
StructField("amt", DoubleType) ::
StructField("units", IntegerType) :: Nil
)
// These are the internal fields you keep for computing your aggregate.
override def bufferSchema: StructType = StructType(
StructField("store", StringType) ::
StructField("prod", StringType) ::
StructField("amt", DoubleType) ::
StructField("units", IntegerType) :: Nil
)
// This is the output type of your aggregation function.
override def dataType: DataType =
StructType((Array(
StructField("store", StringType),
StructField("prod", StringType),
StructField("amt", DoubleType),
StructField("units", IntegerType)
)))
override def deterministic: Boolean = true
// This is the initial value for your buffer schema.
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = ""
buffer(1) = ""
buffer(2) = 0.0
buffer(3) = 0
}
// This is how to update your buffer schema given an input.
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val amt = buffer.getAs[Double](2)
val candidateAmt = input.getAs[Double](2)
amt match {
case a if a < candidateAmt =>
buffer(0) = input.getAs[String](0)
buffer(1) = input.getAs[String](1)
buffer(2) = input.getAs[Double](2)
buffer(3) = input.getAs[Int](3)
case _ =>
}
}
// This is how to merge two objects with the bufferSchema type.
// Keep whichever buffer holds the larger amt, mirroring the logic in update.
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
  if (buffer2.getAs[Double](2) > buffer1.getAs[Double](2)) {
    buffer1(0) = buffer2.getAs[String](0)
    buffer1(1) = buffer2.getAs[String](1)
    buffer1(2) = buffer2.getAs[Double](2)
    buffer1(3) = buffer2.getAs[Int](3)
  }
}
// This is where you output the final value, given the final value of your bufferSchema.
override def evaluate(buffer: Row): Any = {
buffer
}
}

How to get all columns after groupby on Dataset<Row> in spark sql 2.1.0

First, I am very new to Spark.
I have millions of records in my Dataset and I want to group by the name column and find the names that have the maximum age. I am getting correct results, but I need all the columns in my result set.
Dataset<Row> resultset = studentDataSet.select("*").groupBy("name").max("age");
resultset.show(1000,false);
I am getting only name and max(age) in my resultset dataset.
For your solution you have to try a different approach. You were almost there, but let me help you understand.
Dataset<Row> resultset = studentDataSet.groupBy("name").max("age");
now what you can do is you can join the resultset with studentDataSet
Dataset<Row> joinedDS = studentDataset.join(resultset, "name");
The problem with groupBy is that after applying it you get a RelationalGroupedDataset, so the result depends on the next operation you perform (sum, min, mean, max, etc.); the result of that operation is joined with the groupBy columns.
In your case the name column is joined with the max of age, so it will return only two columns. Likewise, if you apply groupBy on 'age' and then apply max on the 'age' column, you will get two columns: one is age and the second is max(age).
Note: the code is not tested, so please make changes if needed.
Hope this clears up your query.
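Put together, a minimal sketch of that join-back idea (in Scala for brevity, with illustrative data and column names, plus an extra filter in case you only want the max-age rows) might look like this:
// assuming an existing SparkSession named spark
import org.apache.spark.sql.functions._
import spark.implicits._

val studentDataSet = Seq(("bob", 20, "blah"), ("bob", 40, "blah"), ("karen", 21, "hi"))
  .toDF("name", "age", "another_column")

val resultset = studentDataSet.groupBy("name").agg(max("age").as("max_age"))
val joinedDS  = studentDataSet.join(resultset, "name")  // every original column plus max_age
val onlyMax   = joinedDS.filter($"age" === $"max_age")  // keep only the rows that hold the max age
onlyMax.show(false)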
The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.
Let's create a sample data set and test the code:
val df = Seq(
("bob", 20, "blah"),
("bob", 40, "blah"),
("karen", 21, "hi"),
("monica", 43, "candy"),
("monica", 99, "water")
).toDF("name", "age", "another_column")
This code should run faster with large DataFrames.
df
.groupBy("name")
.agg(
max("name").as("name1_dup"),
max("another_column").as("another_column"),
max("age").as("age")
).drop(
"name1_dup"
).show()
+------+--------------+---+
| name|another_column|age|
+------+--------------+---+
|monica| water| 99|
| karen| hi| 21|
| bob| blah| 40|
+------+--------------+---+
What you're trying to achieve is:
group rows by name
reduce each group to one row with the maximum age
This alternative achieves that output without using an aggregate:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
object TestJob5 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
sc.setLogLevel("ERROR")
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("Moe", "Slap", 7.9, 118),
("Larry", "Spank", 8.0, 115),
("Curly", "Twist", 6.0, 113),
("Laurel", "Whimper", 7.53, 119),
("Hardy", "Laugh", 6.0, 118),
("Charley", "Ignore", 9.7, 115),
("Moe", "Spank", 6.8, 118),
("Larry", "Twist", 6.0, 115),
("Charley", "fall", 9.0, 115)
).toDF("name", "requisite", "funniness_of_requisite", "age")
rawDf.show(false)
rawDf.printSchema
val nameWindow = Window
.partitionBy("name")
val aggDf = rawDf
.withColumn("id", monotonically_increasing_id)
.withColumn("maxFun", max("funniness_of_requisite").over(nameWindow))
.withColumn("count", count("name").over(nameWindow))
.withColumn("minId", min("id").over(nameWindow))
.where(col("maxFun") === col("funniness_of_requisite") && col("minId") === col("id") )
.drop("maxFun")
.drop("minId")
.drop("id")
aggDf.printSchema
aggDf.show(false)
}
}
Bear in mind that a group could potentially have more than one row with the max age, so you need to pick one by some logic. In the example I assume it doesn't matter, so I just assign a unique number and keep the row with the smallest one.
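If you prefer a fully deterministic tie-break instead of monotonically_increasing_id, a small sketch using row_number over an explicitly ordered window (the secondary ordering column here is just an example) could look like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rank rows within each name, highest funniness first, ties broken by requisite
val rankedWindow = Window
  .partitionBy("name")
  .orderBy(col("funniness_of_requisite").desc, col("requisite"))

val topPerName = rawDf
  .withColumn("rn", row_number().over(rankedWindow))
  .where(col("rn") === 1)
  .drop("rn")

topPerName.show(false)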
Noting that a subsequent join means extra shuffling, and that some of the other solutions seem inaccurate in what they return or even turn the Dataset into a DataFrame, I sought a better solution. Here is mine:
case class People(name: String, age: Int, other: String)
val df = Seq(
People("Rob", 20, "cherry"),
People("Rob", 55, "banana"),
People("Rob", 40, "apple"),
People("Ariel", 55, "fox"),
People("Vera", 43, "zebra"),
People("Vera", 99, "horse")
).toDS
val oldestResults = df
.groupByKey(_.name)
.mapGroups{
case (nameKey, peopleIter) => {
var oldestPerson = peopleIter.next
while(peopleIter.hasNext) {
val nextPerson = peopleIter.next
if(nextPerson.age > oldestPerson.age) oldestPerson = nextPerson
}
oldestPerson
}
}
oldestResults.show
The following produces:
+-----+---+------+
| name|age| other|
+-----+---+------+
|Ariel| 55| fox|
| Rob| 55|banana|
| Vera| 99| horse|
+-----+---+------+
You need to remember that aggregate functions reduce the rows, and therefore you need to specify which row's age you want with a reducing function. If you want to retain all rows of a group (warning: this can cause explosions or skewed partitions), you can collect them as a list. You can then use a UDF (user defined function) to reduce them by your criteria, in this example funniness_of_requisite, and then expand the columns belonging to the reduced row from that single row with another UDF.
For the purpose of this answer I assume you wish to retain the age of the person who has the max funniness_of_requisite.
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType}
import scala.collection.mutable
object TestJob4 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
(1, "Moe", "Slap", 7.9, 118),
(2, "Larry", "Spank", 8.0, 115),
(3, "Curly", "Twist", 6.0, 113),
(4, "Laurel", "Whimper", 7.53, 119),
(5, "Hardy", "Laugh", 6.0, 18),
(6, "Charley", "Ignore", 9.7, 115),
(2, "Moe", "Spank", 6.8, 118),
(3, "Larry", "Twist", 6.0, 115),
(3, "Charley", "fall", 9.0, 115)
).toDF("id", "name", "requisite", "funniness_of_requisite", "age")
rawDf.show(false)
rawDf.printSchema
val rawSchema = rawDf.schema
val fUdf = udf(reduceByFunniness, rawSchema)
val nameUdf = udf(extractAge, IntegerType)
val aggDf = rawDf
.groupBy("name")
.agg(
count(struct("*")).as("count"),
max(col("funniness_of_requisite")),
collect_list(struct("*")).as("horizontal")
)
.withColumn("short", fUdf($"horizontal"))
.withColumn("age", nameUdf($"short"))
.drop("horizontal")
aggDf.printSchema
aggDf.show(false)
}
def reduceByFunniness= (x: Any) => {
val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
val red = d.reduce((r1, r2) => {
val funniness1 = r1.getAs[Double]("funniness_of_requisite")
val funniness2 = r2.getAs[Double]("funniness_of_requisite")
val r3 = funniness1 match {
case a if a >= funniness2 =>
r1
case _ =>
r2
}
r3
})
red
}
def extractAge = (x: Any) => {
val d = x.asInstanceOf[GenericRowWithSchema]
d.getAs[Int]("age")
}
}
Here is the output:
+-------+-----+---------------------------+-------------------------------+---+
|name   |count|max(funniness_of_requisite)|short                          |age|
+-------+-----+---------------------------+-------------------------------+---+
|Hardy  |1    |6.0                        |[5, Hardy, Laugh, 6.0, 18]     |18 |
|Moe    |2    |7.9                        |[1, Moe, Slap, 7.9, 118]       |118|
|Curly  |1    |6.0                        |[3, Curly, Twist, 6.0, 113]    |113|
|Larry  |2    |8.0                        |[2, Larry, Spank, 8.0, 115]    |115|
|Laurel |1    |7.53                       |[4, Laurel, Whimper, 7.53, 119]|119|
|Charley|2    |9.7                        |[6, Charley, Ignore, 9.7, 115] |115|
+-------+-----+---------------------------+-------------------------------+---+
