How to use Dataset to group by, but entire rows - apache-spark

Reading this post, I wonder how we can group a Dataset, but with multiple columns.
Like:
val test = Seq(("New York", "Jack", "jdhj"),
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
I would like to get
Chicago, [( "David", "ff"), ("Andrew", "ddd")]

Create a case class as below
case class TestData (location: String, name: String, value: String)
Dummy data:
import spark.implicits._
import org.apache.spark.sql.functions._
val test = Seq(("New York", "Jack", "jdhj"),
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
//change each row to TestData object
.map(x => TestData(x._1, x._2, x._3))
.toDS() // create dataset from above data
Output as you require
test.groupBy($"location")
.agg(collect_list(struct("name", "value")).as("data"))
.show(false)
Output:
+-----------+--------------------------------------------+
|location |data |
+-----------+--------------------------------------------+
|Los Angeles|[[Tom,ff]] |
|Detroit |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Chicago |[[David,ff], [Andrew,ddd]] |
|Houston |[[John,dd]] |
|New York |[[Jack,jdhj]] |
+-----------+--------------------------------------------+

I have suggested a case class way in the link that you have provided in the question. Here's something different.
RDD way
You can simply do the following
val rdd = sc.parallelize(test) //creating rdd from test
val resultRdd = rdd.groupBy(x => x._1) //grouping by the first element
.mapValues(x => x.map(y => (y._2, y._3))) //collecting the second and third elements in the grouped dataset
resultRdd.foreach(println) should give you
(New York,List((Jack,jdhj)))
(Houston,List((John,dd)))
(Chicago,List((David,ff), (Andrew,ddd)))
(Detroit,List((Michael,fff), (Peter,dd), (George,dkdjkd)))
(Los Angeles,List((Tom,ff)))
Converting rdd to dataframe
If you require output in table format, you can just call .toDF() after some manipulation:
val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF()
df.show(false) should give you
+-----------+--------------------------------------------+
|_1 |_2 |
+-----------+--------------------------------------------+
|New York |[[Jack,jdhj]] |
|Houston |[[John,dd]] |
|Chicago |[[David,ff], [Andrew,ddd]] |
|Detroit |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Los Angeles|[[Tom,ff]] |
+-----------+--------------------------------------------+

Related

Compare and check out differences between two dataframes using pySpark

Let's suppose that we have a dataframe with the following schema
root
|-- AUTHOR_ID: integer (nullable = false)
|-- NAME: string (nullable = true)
|-- Books: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- BOOK_ID: integer (nullable = false)
| | |-- Chapters: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- NAME: string (nullable = true)
| | | | |-- NUMBER_PAGES: integer (nullable = true)
As you can see, we have nested struct objects.
How can we compare two dataframes with the same schema and calculate or bring out the deltas (differences)?
Let's suppose that the following change occurred:
the name of the first chapter of the book with id=1 was changed, so we can imagine the following comparison output:
{
  "AUTHOR_ID": 1,
  "Books": [
    {
      "BOOK_ID": 1,
      "Chapters": [
        {
          "id": 1,
          "NAME": {
            "before": "Etranger",
            "after": "L'étranger"
          }
        }
      ]
    }
  ]
}
Note: we will show only the Ids and the changed values for the relevant items
Here is a sample code to join by authorId and then compare.
from pyspark.sql.functions import collect_list, struct, col, udf
from operator import itemgetter
from pyspark.sql.types import ArrayType, StructType, StringType, StructField
# construct data
data = [('AA_1', 'S1', "10", "1", "Introduction to Quadratic Equation"),
('AA_1', 'S1', "10", "2", "Fundamentals"),
('AA_1', 'S1', "11", "1", "Preface"),
('AA_1', 'S1', "11", "2", "Wading in to the waters"),
('AA_2', 'S2', "100", "1", "Introduction"),
('AA_2', 'S2', "100", "2", "Fundamentals"),
('AA_2', 'S2', "110", "1", "Prologue"),
('AA_2', 'S2', "110", "2", "Epilogue"),
]
data2 = [('AA_1', 'S1', "10", "1", "Introduction to Linear Algebra"),
('AA_1', 'S1', "10", "2", "Fundamentals"),
('AA_1', 'S1', "11", "1", "Preface"),
('AA_1', 'S1', "11", "2", "Wading in to the waters"),
('AA_2', 'S2', "100", "1", "Introduction"),
('AA_2', 'S2', "100", "2", "Fundamentals2"),
('AA_2', 'S2', "110", "1", "Prologue"),
('AA_2', 'S2', "110", "2", "Epilogue"),
]
df = spark.createDataFrame(data, ["authorId", "name", "bookId", "chapterId", "chapterName"]).groupBy(['authorId', 'name', 'bookId']).agg(collect_list(struct("chapterId", "chapterName")).alias("chapters")).groupBy(['authorId', 'name']).agg(collect_list(struct('bookId', 'chapters')).alias('books'))
df2 = spark.createDataFrame(data2, ["authorId", "name", "bookId", "chapterId", "chapterName"]).groupBy(['authorId', 'name', 'bookId']).agg(collect_list(struct("chapterId", "chapterName")).alias("chapters")).groupBy(['authorId', 'name']).agg(collect_list(struct('bookId', 'chapters')).alias('books'))
df2 = df2.select(col('authorId').alias('authorId2'),col('name').alias('name2'), col('books').alias('books2') )
# join on authorId
df3 = df.join(df2, [df.authorId == df2.authorId2])
# UDF to compare, needs additional checks on books and chapters lengths and Null checks
@udf(ArrayType(StructType([StructField("bookId", StringType()), StructField("chapters", ArrayType(StructType([StructField("chapterId", StringType()), StructField("name", StructType([StructField("before", StringType()), StructField("after", StringType())]))])))])))
def get_book_diff(b1, b2):
    if len(b1) != len(b2):
        return None
    b1.sort(key=itemgetter('bookId'))
    b2.sort(key=itemgetter('bookId'))
    list_data = []
    i = 0
    for book in b1:
        data = {}
        if book.bookId == b2[i].bookId:
            data['bookId'] = book.bookId
            book.chapters.sort(key=itemgetter('chapterId'))
            b2[i].chapters.sort(key=itemgetter('chapterId'))
            data['chapters'] = []
            j = 0
            for chap in book.chapters:
                if chap.chapterId == b2[i].chapters[j].chapterId:
                    if chap.chapterName != b2[i].chapters[j].chapterName:
                        data['chapters'].append({'chapterId': chap.chapterId, 'name': {"before": chap.chapterName, "after": b2[i].chapters[j].chapterName}})
                j += 1
        i += 1
        list_data.append(data)
    return list_data
df3 = df3.withColumn('book_diff', get_book_diff('books', 'books2'))
#df3.select('authorId', 'book_diff').show(truncate=False)
display(df3.select('authorId', 'book_diff'))
I think the unfortunate requirement here is that we need to flatten the struct into columns to allow comparison.
import pyspark.sql.functions as F
columns = ["AUTHOR_ID","NAME","Books"] # lazy partial naming
#Original
data = [(1, "James,,Smith",[(1,[(1,"The beggining", 12, "It was a great day")])]), (2, "Stephen King", [(2,[(1,"The start", 12, "It was a great day")])])]
#Update
# Bookid 1 --> added a chapter, fixed a typo in the first chapter.
# Bookid 2 --> Changed nothing
data_after = [(1, "James,,Smith",[(1,[(1,"The begining", 12, "It was a great day"),(2,"The end", 1, "It was a great night")])]), (2, "Stephen King", [(2,[(1,"The start", 12, "It was an a great day")])])]
df = spark.createDataFrame(data=data,schema=columns)
df2 = spark.createDataFrame(data=data_after,schema=columns)
# flatten the struct into columns. Could have used withColumn.
df_flat = df.select("*", F.posexplode(F.col("Books")).alias("pos","Book")).select( "*", F.col("Book._1").alias("BookId"), F.posexplode(F.col("Book._2")).alias("pos","Chapter") ).select("*", F.col("Chapter.*"), F.lit("Original").alias("source") );
df2_flat = df2.select("*", F.posexplode(F.col("Books")).alias("pos","Book")).select( "*", F.col("Book._1").alias("BookId"), F.posexplode(F.col("Book._2")).alias("pos","Chapter") ).select("*", F.col("Chapter.*"), F.lit("Update").alias("source") );
#use a union to pull all data together
all = df_flat.union(df2_flat).withColumnRenamed("_1", "Chapter_id")\
.withColumnRenamed("_2", "text")
#Find things that don't have a match these are the additions/updates/deletion
all.groupBy("AUTHOR_ID","BookId","Chapter_id","text").agg(F.first("source"),F.count("text").alias("count")).where(F.col("count") != 2).show()
+---------+------+----------+-------------+--------------------+-----+
|AUTHOR_ID|BookId|Chapter_id| text|first(source, false)|count|
+---------+------+----------+-------------+--------------------+-----+
| 1| 1| 2| The end| Update| 1|
| 1| 1| 1| The begining| Update| 1|
| 1| 1| 1|The beggining| Original| 1|
+---------+------+----------+-------------+--------------------+-----+
From here you need to do a little more work (think one more groupBy down to Author/BookId/ChapterId, count the chapter ids, then select with when/otherwise logic on source; a rough sketch of this step is shown below):
If a chapter exists in both the Update and the Original, it's an edit (count of 2).
If it only exists in the Update, it's an addition (count of 1).
If it only exists in the Original, it's a deletion (count of 1).
Building your struct back from here is up to you, but I think this demonstrates the idea of what's required. Using the page number of the chapter might actually be a good method to detect change; it's certainly cheaper than comparing strings, though likely not as accurate.
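As a rough sketch of that follow-up step (written in Scala to match the rest of this page; the equivalent first/count/when/otherwise functions exist in pyspark.sql.functions), assuming a DataFrame named all with the columns of the flattened union above (AUTHOR_ID, BookId, Chapter_id, text, source):
import org.apache.spark.sql.functions._

// keep only rows without a match in the other version
val unmatched = all
  .groupBy("AUTHOR_ID", "BookId", "Chapter_id", "text")
  .agg(first("source").as("source"), count("text").as("cnt"))
  .where(col("cnt") =!= 2) // identical chapters appear in both versions and are dropped

// classify each remaining chapter
val classified = unmatched
  .groupBy("AUTHOR_ID", "BookId", "Chapter_id")
  .agg(count("Chapter_id").as("n"), first("source").as("source"))
  .withColumn("change_type",
    when(col("n") === 2, "edit")                  // text differs between Original and Update
      .when(col("source") === "Update", "addition")
      .otherwise("deletion"))

classified.show(false)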
Give a try to:
from gresearch.spark.diff import *
left.diff(right)
See https://github.com/G-Research/spark-extension/blob/master/DIFF.md

Need to get fields from lookup file with calculated values using spark dataframe

I have a products file as below, and the formulas for cost price and discount are in a lookup file. We need to get the discount and cost_price fields from the lookup file.
product_id,product_name,marked_price,selling_price,profit
101,AAAA,5500,5400,500
102,ABCS,7000,6500,1000
103,GHMA,6500,5600,700
104,PSNLA,8450,8000,800
105,GNBC,1250,1200,600
lookup file:
key,value
cost_price,(selling_price+profit)
discount,(marked_price-selling_price)
Final output:
product_id,product_name,marked_price,selling_price,profit,cost_price,discount
101,AAAA,5500,5400,500,5900,100
102,ABCS,7000,6500,1000,7500,500
103,GHMA,6500,5600,700,6300,900
104,PSNLA,8450,8000,800,8800,450
105,GNBC,1250,1200,600,1800,50
First, you should make a Map out of your lookup-file, then you can add the columns using expr:
import spark.implicits._
import org.apache.spark.sql.functions.expr

val lookup = Map(
  "cost_price" -> "selling_price+profit",
  "discount" -> "(marked_price-selling_price)"
)
val df = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000)
)
.toDF("product_id", "product_name", "marked_price", "selling_price", "profit")
df
.withColumn("cost_price",expr(lookup("cost_price")))
.withColumn("discount",expr(lookup("discount")))
.show()
gives :
+----------+------------+------------+-------------+------+----------+--------+
|product_id|product_name|marked_price|selling_price|profit|cost_price|discount|
+----------+------------+------------+-------------+------+----------+--------+
| 101| AAAA| 5500| 5400| 500| 5900| 100|
| 102| ABCS| 7000| 6500| 1000| 7500| 500|
+----------+------------+------------+-------------+------+----------+--------+
You can also iterate over your lookup:
val finalDF = lookup.foldLeft(df){case (df,(k,v)) => df.withColumn(k,expr(v))}
import spark.implicits._
val dataheader = Seq("product_id", "product_name", "marked_price", "selling_price", "profit")
val data = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000),
(103, "GHMA", 6500, 5600, 700),
(104, "PSNLA", 8450, 8000, 800),
(105, "GNBC", 1250, 1200, 600))
val dataDF = data.toDF(dataheader: _*)
val lookupMap = Seq(
("cost_price", "(selling_price+profit)"),
("discount", "(marked_price-selling_price)")) //You can read your data from look up file and construct a dataframe
.toDF("key", "value").select("key", "value").as[(String, String)].collect.toMap
lookupMap.foldLeft(dataDF)((df, look) => {
df.withColumn(look._1, expr(look._2))
}).show()
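If the lookup really lives in a file, here is a minimal sketch of building the same Map from it (the path and the header option are illustrative assumptions), mirroring the toDF/as pattern above:
// read the lookup file (key,value CSV) and collect it into a Map
val lookupFromFile = spark.read
  .option("header", "true")
  .csv("/path/to/lookup_file.csv") // columns: key, value
  .as[(String, String)]
  .collect()
  .toMap

lookupFromFile.foldLeft(dataDF) { case (df, (k, v)) => df.withColumn(k, expr(v)) }.show()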

Convert dataframe into array of nested json object in pyspark

I have created a dataframe as follows:
+----+-------+-------+
| age| number|name |
+----+-------+-------+
| 16| 12|A |
| 16| 13|B |
| 17| 16|E |
| 17| 17|F |
+----+-------+-------+
How to convert it into the following JSON:
{
  'age': 16,
  'values': [{'number': '12', 'name': 'A'}, {'number': '13', 'name': 'B'}]
}, {
  'age': 17,
  'values': [{'number': '16', 'name': 'E'}, {'number': '17', 'name': 'F'}]
}
assuming df is your dataframe,
from pyspark.sql import functions as F
new_df = df.select(
"age",
F.struct(
F.col("number"),
F.col("name"),
).alias("values")
).groupBy(
"age"
).agg(
F.collect_list("values").alias("values")
)
new_df.toJSON()
# or
new_df.write.json(...)
You can convert the DF to an RDD and apply your transformations:
import json
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

NewSchema = StructType([StructField("age", IntegerType()),
                        StructField("values", StringType())
                        ])
res_df = df.rdd.map(lambda row: (row[0], [{'number': row[1], 'name': row[2]}]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda row: (row[0], json.dumps(row[1])))\
    .toDF(NewSchema)
res_df.show(20, False)
Show res_df:
+---+-----------------------------------------------------------+
|age|values                                                     |
+---+-----------------------------------------------------------+
|16 |[{"number": 12, "name": "A"}, {"number": 13, "name": "B"}] |
|17 |[{"number": 17, "name": "F"}, {"number": 16, "name": "E"}] |
+---+-----------------------------------------------------------+
Saving the DF as JSON File:
res_df.coalesce(1).write.format('json').save('output.json')

Spark aggregate on multiple columns within partition without shuffle

I'm trying to aggregate a dataframe on multiple columns. I know that everything I need for the aggregation is within the partition; that is, there's no need for a shuffle because all of the data for the aggregation is local to the partition.
Taking an example, if I have something like
val sales=sc.parallelize(List(
("West", "Apple", 2.0, 10),
("West", "Apple", 3.0, 15),
("West", "Orange", 5.0, 15),
("South", "Orange", 3.0, 9),
("South", "Orange", 6.0, 18),
("East", "Milk", 5.0, 5))).repartition(2)
val tdf = sales.map{ case (store, prod, amt, units) => ((store, prod), (amt, amt, amt, units)) }.
reduceByKey((x, y) => (x._1 + y._1, math.min(x._2, y._2), math.max(x._3, y._3), x._4 + y._4))
println(tdf.toDebugString)
I get a result like
(2) ShuffledRDD[12] at reduceByKey at Test.scala:59 []
+-(2) MapPartitionsRDD[11] at map at Test.scala:58 []
| MapPartitionsRDD[10] at repartition at Test.scala:57 []
| CoalescedRDD[9] at repartition at Test.scala:57 []
| ShuffledRDD[8] at repartition at Test.scala:57 []
+-(1) MapPartitionsRDD[7] at repartition at Test.scala:57 []
| ParallelCollectionRDD[6] at parallelize at Test.scala:51 []
You can see the MapPartitionsRDD, which is good. But then there's the ShuffledRDD, which I want to prevent because I want the per-partition summarization, grouped by column values within the partition.
zero323's suggestion is tantalizingly close, but I need the "group by columns" functionality.
Referring to my sample above, I'm looking for the result that would be produced by
select store, prod, sum(amt), avg(units) from sales group by partition_id, store, prod
(I don't really need the partition id; that's just to illustrate that I want per-partition results)
I've looked at lots of examples but every debug string I've produced has the Shuffle. I really hope to get rid of the shuffle. I guess I'm essentially looking for a groupByKeysWithinPartitions function.
The only way to achieve that is by using mapPartitions and having custom code for grouping and computing your values while iterating over the partition.
Since, as you mention, the data is already sorted by the grouping keys (store, prod), we can efficiently compute your aggregations in a pipelined fashion:
(1) Define helper classes:
:paste
case class MyRec(store: String, prod: String, amt: Double, units: Int)
case class MyResult(store: String, prod: String, total_amt: Double, min_amt: Double, max_amt: Double, total_units: Int)
object MyResult {
  def apply(rec: MyRec): MyResult = new MyResult(rec.store, rec.prod, rec.amt, rec.amt, rec.amt, rec.units)

  def aggregate(result: MyResult, rec: MyRec) = {
    new MyResult(result.store,
      result.prod,
      result.total_amt + rec.amt,
      math.min(result.min_amt, rec.amt),
      math.max(result.max_amt, rec.amt),
      result.total_units + rec.units
    )
  }
}
(2) Define pipelined aggregator:
:paste
def pipelinedAggregator(iter: Iterator[MyRec]): Iterator[Seq[MyResult]] = {
  var prev: MyResult = null
  var res: Seq[MyResult] = Nil
  for (crt <- iter) yield {
    if (prev == null) {
      prev = MyResult(crt)
    }
    else if (prev.prod != crt.prod || prev.store != crt.store) {
      res = Seq(prev)
      prev = MyResult(crt)
    }
    else {
      prev = MyResult.aggregate(prev, crt)
      res = Nil // reset so the previous group's result is not emitted again
    }
    if (!iter.hasNext) {
      res = res ++ Seq(prev)
    }
    res
  }
}
(3) Run aggregation:
:paste
val sales = sc.parallelize(
List(MyRec("West", "Apple", 2.0, 10),
MyRec("West", "Apple", 3.0, 15),
MyRec("West", "Orange", 5.0, 15),
MyRec("South", "Orange", 3.0, 9),
MyRec("South", "Orange", 6.0, 18),
MyRec("East", "Milk", 5.0, 5),
MyRec("West", "Apple", 7.0, 11)), 2).toDS
sales.mapPartitions(iter => Iterator(iter.toList)).show(false)
val result = sales
.mapPartitions(recIter => pipelinedAggregator(recIter))
.flatMap(identity)
result.show
result.explain
Output:
+-------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------+
|[[West,Apple,2.0,10], [West,Apple,3.0,15], [West,Orange,5.0,15]] |
|[[South,Orange,3.0,9], [South,Orange,6.0,18], [East,Milk,5.0,5], [West,Apple,7.0,11]]|
+-------------------------------------------------------------------------------------+
+-----+------+---------+-------+-------+-----------+
|store| prod|total_amt|min_amt|max_amt|total_units|
+-----+------+---------+-------+-------+-----------+
| West| Apple| 5.0| 2.0| 3.0| 25|
| West|Orange| 5.0| 5.0| 5.0| 15|
|South|Orange| 9.0| 3.0| 6.0| 27|
| East| Milk| 5.0| 5.0| 5.0| 5|
| West| Apple| 7.0| 7.0| 7.0| 11|
+-----+------+---------+-------+-------+-----------+
== Physical Plan ==
*SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).store, true) AS store#31, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).prod, true) AS prod#32, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).total_amt AS total_amt#33, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).min_amt AS min_amt#34, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).max_amt AS max_amt#35, assertnotnull(input[0, $line14.$read$$iw$$iw$MyResult, true]).total_units AS total_units#36]
+- MapPartitions <function1>, obj#30: $line14.$read$$iw$$iw$MyResult
+- MapPartitions <function1>, obj#20: scala.collection.Seq
+- Scan ExternalRDDScan[obj#4]
sales: org.apache.spark.sql.Dataset[MyRec] = [store: string, prod: string ... 2 more fields]
result: org.apache.spark.sql.Dataset[MyResult] = [store: string, prod: string ... 4 more fields]
If this is the output you're looking for:
+-----+------+--------+----------+
|store|prod |max(amt)|avg(units)|
+-----+------+--------+----------+
|South|Orange|6.0 |13.5 |
|West |Orange|5.0 |15.0 |
|East |Milk |5.0 |5.0 |
|West |Apple |3.0 |12.5 |
+-----+------+--------+----------+
The Spark DataFrame API has all the functionality you're asking for, with generic, concise shorthand syntax:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object TestJob2 {

  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder()
      .appName(this.getClass.getName.replace("$", ""))
      .master("local")
      .getOrCreate()

    val sc = sparkSession.sparkContext
    import sparkSession.sqlContext.implicits._

    val rawDf = Seq(
      ("West", "Apple", 2.0, 10),
      ("West", "Apple", 3.0, 15),
      ("West", "Orange", 5.0, 15),
      ("South", "Orange", 3.0, 9),
      ("South", "Orange", 6.0, 18),
      ("East", "Milk", 5.0, 5)
    ).toDF("store", "prod", "amt", "units")

    rawDf.show(false)
    rawDf.printSchema

    val aggDf = rawDf
      .groupBy("store", "prod")
      .agg(
        max(col("amt")),
        avg(col("units"))
        // in case you need to retain more info
        // , collect_list(struct("*")).as("horizontal")
      )

    aggDf.printSchema
    aggDf.show(false)
  }
}
Uncomment the collect_list line to aggregate everything:
+-----+------+--------+----------+---------------------------------------------------+
|store|prod  |max(amt)|avg(units)|horizontal                                         |
+-----+------+--------+----------+---------------------------------------------------+
|South|Orange|6.0     |13.5      |[[South, Orange, 3.0, 9], [South, Orange, 6.0, 18]]|
|West |Orange|5.0     |15.0      |[[West, Orange, 5.0, 15]]                          |
|East |Milk  |5.0     |5.0       |[[East, Milk, 5.0, 5]]                             |
|West |Apple |3.0     |12.5      |[[West, Apple, 2.0, 10], [West, Apple, 3.0, 15]]   |
+-----+------+--------+----------+---------------------------------------------------+
The maximum and average aggregations you specify are over multiple rows.
If you want to keep all the original rows, use a Window function, which will partition by the grouping columns.
If you want to reduce the rows in each partition, you must specify a reduction logic or filter.
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
object TestJob7 {

  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder()
      .appName(this.getClass.getName.replace("$", ""))
      .master("local")
      .getOrCreate()

    val sc = sparkSession.sparkContext
    sc.setLogLevel("ERROR")
    import sparkSession.sqlContext.implicits._

    val rawDf = Seq(
      ("West", "Apple", 2.0, 10),
      ("West", "Apple", 3.0, 15),
      ("West", "Orange", 5.0, 15),
      ("South", "Orange", 3.0, 9),
      ("South", "Orange", 6.0, 18),
      ("East", "Milk", 5.0, 5)
    ).toDF("store", "prod", "amt", "units")

    rawDf.show(false)
    rawDf.printSchema

    val storeProdWindow = Window
      .partitionBy("store", "prod")

    val aggDf = rawDf
      .withColumn("max(amt)", max("amt").over(storeProdWindow))
      .withColumn("avg(units)", avg("units").over(storeProdWindow))

    aggDf.printSchema
    aggDf.show(false)
  }
}
Here is the result. Take note that it's already grouped (the window shuffles into partitions):
+-----+------+---+-----+--------+----------+
|store|prod |amt|units|max(amt)|avg(units)|
+-----+------+---+-----+--------+----------+
|South|Orange|3.0|9 |6.0 |13.5 |
|South|Orange|6.0|18 |6.0 |13.5 |
|West |Orange|5.0|15 |5.0 |15.0 |
|East |Milk |5.0|5 |5.0 |5.0 |
|West |Apple |2.0|10 |3.0 |12.5 |
|West |Apple |3.0|15 |3.0 |12.5 |
+-----+------+---+-----+--------+----------+
Aggregate functions reduce the values of rows for specified columns within the group. You can perform multiple different aggregations, resulting in new columns with values taken from the input rows, in one iteration, exclusively using DataFrame functionality. If you wish to retain other row values, you need to implement reduction logic that specifies the row from which each value comes, for instance keeping all values of the row with the maximum amt. To this end you can use a UDAF (user-defined aggregate function) to reduce rows within the group. In the example I also aggregate the max amt and average units using standard aggregate functions in the same iteration.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object ReduceAggJob {

  def main(args: Array[String]): Unit = {

    val appName = this.getClass.getName.replace("$", "")
    println(s"appName: $appName")

    val sparkSession = SparkSession
      .builder()
      .appName(appName)
      .master("local")
      .getOrCreate()

    val sc = sparkSession.sparkContext
    sc.setLogLevel("ERROR")
    import sparkSession.sqlContext.implicits._

    val rawDf = Seq(
      ("West", "Apple", 2.0, 10),
      ("West", "Apple", 3.0, 15),
      ("West", "Orange", 5.0, 15),
      ("West", "Orange", 17.0, 15),
      ("South", "Orange", 3.0, 9),
      ("South", "Orange", 6.0, 18),
      ("East", "Milk", 5.0, 5)
    ).toDF("store", "prod", "amt", "units")

    rawDf.printSchema
    rawDf.show(false)

    // Create an instance of the UDAF KeepRowWithMaxAmt.
    val maxAmtUdaf = new KeepRowWithMaxAmt

    // Keep the row with the max amt
    val aggDf = rawDf
      .groupBy("store", "prod")
      .agg(
        max("amt"),
        avg("units"),
        maxAmtUdaf(
          col("store"),
          col("prod"),
          col("amt"),
          col("units")).as("KeepRowWithMaxAmt")
      )

    aggDf.printSchema
    aggDf.show(false)
  }
}
The UDAF
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class KeepRowWithMaxAmt extends UserDefinedAggregateFunction {

  // These are the input fields for your aggregate function.
  override def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(
      StructField("store", StringType) ::
      StructField("prod", StringType) ::
      StructField("amt", DoubleType) ::
      StructField("units", IntegerType) :: Nil
    )

  // These are the internal fields you keep for computing your aggregate.
  override def bufferSchema: StructType = StructType(
    StructField("store", StringType) ::
    StructField("prod", StringType) ::
    StructField("amt", DoubleType) ::
    StructField("units", IntegerType) :: Nil
  )

  // This is the output type of your aggregation function.
  override def dataType: DataType =
    StructType(Array(
      StructField("store", StringType),
      StructField("prod", StringType),
      StructField("amt", DoubleType),
      StructField("units", IntegerType)
    ))

  override def deterministic: Boolean = true

  // This is the initial value for your buffer schema.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = ""
    buffer(1) = ""
    buffer(2) = 0.0
    buffer(3) = 0
  }

  // This is how to update your buffer schema given an input row.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val amt = buffer.getAs[Double](2)
    val candidateAmt = input.getAs[Double](2)

    amt match {
      case a if a < candidateAmt =>
        buffer(0) = input.getAs[String](0)
        buffer(1) = input.getAs[String](1)
        buffer(2) = input.getAs[Double](2)
        buffer(3) = input.getAs[Int](3)
      case _ =>
    }
  }

  // This is how to merge two buffers with the bufferSchema type.
  // Keep whichever buffer holds the larger amt instead of blindly overwriting buffer1.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    if (buffer2.getAs[Double](2) > buffer1.getAs[Double](2)) {
      buffer1(0) = buffer2.getAs[String](0)
      buffer1(1) = buffer2.getAs[String](1)
      buffer1(2) = buffer2.getAs[Double](2)
      buffer1(3) = buffer2.getAs[Int](3)
    }
  }

  // This is where you output the final value, given the final value of your bufferSchema.
  override def evaluate(buffer: Row): Any = {
    buffer
  }
}

How to use dataset to groupby

I have a requirement, and with an RDD I can do it like this:
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)
The result is:
(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))
How can I do this with a Dataset in Spark 2.0?
I have a way using a custom function, but it feels so complicated. Isn't there a simple built-in method?
I would suggest you start by creating a case class as:
case class Monkey(city: String, firstName: String)
This case class should be defined outside the main class. Then you can just use the toDS function, and use groupBy with the aggregation function collect_list as below:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test)
.map(row => Monkey(row._1, row._2))
.toDS()
.groupBy("city")
.agg(collect_list("firstName") as "list")
.show(false)
You will have output as
+-----------+------------------------+
|city |list |
+-----------+------------------------+
|Los Angeles|[Tom] |
|Detroit |[Michael, Peter, George]|
|Chicago |[David, Andrew] |
|Houston |[John] |
|New York |[Jack] |
+-----------+------------------------+
You can always convert back to an RDD by just calling the .rdd function.
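For example, a minimal sketch reusing the pipeline above (the val names grouped and asRdd are just illustrative):
val grouped = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")

val asRdd = grouped.rdd // back to an RDD[org.apache.spark.sql.Row]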
To create a Dataset, first define a case class outside your class as:
case class Employee(city: String, name: String)
Then you can convert the list to a Dataset as:
val spark =
SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
).toDF("city", "name")
val data = test.as[Employee]
Or
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
val data = test.map(r => Employee(r._1, r._2)).toDS()
Now you can group by and perform any aggregation as:
data.groupBy("city").count().show
data.groupBy("city").agg(collect_list("name")).show
Hope this helps!
First I would turn your RDD into a Dataset:
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val testDs = test.toDS()
Here you get your column names :) Use them wisely!
testDs.schema.fields.foreach(x => println(x))
In the end you only need to use a groupBy:
testDs.groupBy("City?", "Name?")
RDDs are not really the Spark 2.0 way, I think.
If you have any question please just ask.
