Spark DataFrame Aggregation based on two or more Columns - apache-spark

I want to write a UDAF for some customized aggregation based on more than one column. A simple example would be a dataframe with two columns, c1 and c2. For each row, I take the max of c1 and c2 (let's call it cmax), then I take the sum of cmax.
When I call df.agg(), it does not look like I can pass two or more columns to any aggregation method including UDAF. 1st question, is it true?
For this simple example, I could create another column called cmax, and do the aggregation on cmax. But in reality, I would need to do aggregation based on N combinations of columns and the results would be a collection of size N. I would want to loop the combinations within the update method in my UDAF. Therefore it would require N intermediate columns, which does not seem to be a clean solution to me. 2nd question, I wonder if creating intermediate columns is the way to do it, or if there is a better solution.
I noticed in RDD, the problem is much easier. I can pass the entire record to my aggregation function and I have access to all the data fields.
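For example, a minimal RDD-based sketch of this toy aggregation (assuming a DataFrame df with double columns c1 and c2) might look like:
// Row-wise max of c1 and c2, then the sum over all rows
val sumOfMax = df.rdd
  .map(row => math.max(row.getAs[Double]("c1"), row.getAs[Double]("c2")))
  .sum()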

You can use as many columns in a UDAF as you want: the signature of its apply function accepts multiple Columns (from its source code):
def apply(exprs: Column*): Column
You just have to make sure that the inputSchema returns a StructType reflecting the columns that you want to consume as your UDAF input.
For the case of columns c1 and c2, your UDAF has to implement an inputSchema with the following schema:
def inputSchema: StructType = StructType(Array(StructField("c1", DoubleType), StructField("c2", DoubleType)))
However, if you want a more general solution, you can always initialize the custom UDAF with arguments that allow it to return the right inputSchema. The example below allows defining an arbitrary StructType at construction time (note that we don't verify that its fields are of DoubleType).
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyMaxUDAF(schema: StructType) extends UserDefinedAggregateFunction {
  def inputSchema: StructType = this.schema
  def bufferSchema: StructType = StructType(Array(StructField("maxSum", DoubleType)))
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  // For each input row, add the maximum of its columns to the running sum
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getDouble(0) + Array.range(0, input.length).map(input.getDouble).max
  }
  // Combine partial sums coming from different partitions
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = buffer2 match {
    case Row(buffer2Sum: Double) => buffer1(0) = buffer1.getDouble(0) + buffer2Sum
  }
  def evaluate(buffer: Row): Double = buffer match {
    case Row(totalSum: Double) => totalSum
  }
}
A DataFrame containing the values and a key to aggregate by (the Entry case class is assumed to match the schema shown in the output below):
case class Entry(groupMe: Int, c1: Double, c2: Double, c3: Double)

val df = spark.createDataFrame(Seq(
  Entry(0, 1.0, 2.0, 3.0), Entry(0, 3.0, 1.0, 2.0), Entry(1, 6.0, 2.0, 2.0)
))
df.show
+-------+---+---+---+
|groupMe| c1| c2| c3|
+-------+---+---+---+
| 0|1.0|2.0|3.0|
| 0|3.0|1.0|2.0|
| 1|6.0|2.0|2.0|
+-------+---+---+---+
Using the UDAF, we expect the sum of the row-wise maxima to be 6.0 for both groups:
val fields = Array("c1", "c2", "c3")
val struct = StructType(fields.map(StructField(_, DoubleType)))
val myMaxUDAF: MyMaxUDAF = new MyMaxUDAF(struct)
df.groupBy("groupMe").agg(myMaxUDAF(fields.map(df(_)):_*)).show
+-------+---------------------+
|groupMe|mymaxudaf(c1, c2, c3)|
+-------+---------------------+
| 0| 6.0|
| 1| 6.0|
+-------+---------------------+
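As a side note (not part of the original answer), for this particular max-then-sum shape a sketch using only built-in functions would also work, with no UDAF and no intermediate column:
import org.apache.spark.sql.functions.{greatest, sum}

df.groupBy("groupMe")
  .agg(sum(greatest(fields.map(df(_)): _*)).as("sumOfMax"))
  .show()
This only covers the simple example; the UDAF remains the flexible option for the general N-combination case described in the question.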
There is a nice tutorial on UDAFs; unfortunately, it doesn't cover multiple arguments:
https://ragrawal.wordpress.com/2015/11/03/spark-custom-udaf-example/

Related

Grouping by then applying custom function in Spark, using SparkSession in a Java stream?

Let's assume that I have a use case where I want to groupBy then apply a custom function to the grouped values. In Python, I could accomplish this through:
df.groupby("id").apply(custom_function)
and
#pandas_udf("id string, prediction double", PandasUDFType.GROUPED_MAP)
def custom_function(id, dataframe):
rf = RandomForestRegressor(n_estimators=25, random_state=42)
rf.fit(train_features, dataframe.quantity_sold)
prediction = rf.predict(test_features)
return pd.DataFrame({'id': id, 'prediction': prediction}, index=[0])
I could accomplish the same thing in Scala through:
input.rdd.groupBy(row => row.get(0)).collect().map(data => {
  val df = sparkSession.createDataFrame(sparkContext.parallelize(data._2.toSeq), input.schema)
  (data._1.toString, df)
}).foldLeft(sparkSession.createDataFrame(sparkContext.emptyRDD[Row], outputSchema))((acc, next) => {
  val assembler = new VectorAssembler()
    .setInputCols(modelColumns)
    .setOutputCol(features)
    .transform(next._2)
  val forest = oldForest
    .fit(assembler)
    .transform(testAssembler)
  acc.union(forest)
})
If we compare these two workarounds, the first one runs much faster than the second. I tried to do this without collect, but I get the error RDD transformations and actions can only be invoked by the driver, not inside of other transformations.
I am aware that collect returns the results to the driver as a list, that is why I am forced to use Scala collection API (map and flatMap) to further continue my processing.
My questions regarding this are: is the job not supposed to be spread to the executors again once collected to the driver (since I am continuing to use the Spark ML API)? Or is everything simply calculated in the driver (once collected), as the code goes back to where the main method is executed? Basically, why is the run so slow, and is there any approach to make this process better without using Python?
Thank you!
EDIT: Managed to solve this (an example as below); say we have this dataset:
+---+-----+------+-----+
|id |first|second|third|
+---+-----+------+-----+
|1 |1.0 |1.0 |1.0 |
|1 |1.0 |2.0 |2.0 |
|1 |1.0 |3.0 |3.0 |
|1 |1.0 |4.0 |4.0 |
|1 |1.0 |5.0 |5.0 |
+---+-----+------+-----+
Our goal is to group by id, then for the grouped columns (first, second and third), we want to train the model then predict something (with column third being our label).
To group and apply the UDAF (as suggested by werner):
val myAggFct = udaf(MyAgg).apply(array("first", "second", "third"))
df.groupBy("id").agg(myAggFct)
The MyAgg aggregator wrapped by myAggFct is implemented as below:
import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import spark.implicits._

// Weka classes used to train and query the per-group model
import weka.classifiers.trees.RandomForest
import weka.core.{Attribute, DenseInstance, Instances}

object MyAgg extends Aggregator[Seq[Double], Seq[Seq[Double]], String] {
  override def zero: Seq[Seq[Double]] = scala.collection.mutable.Seq[Seq[Double]]()
  override def reduce(b: Seq[Seq[Double]], a: Seq[Double]): Seq[Seq[Double]] = b :+ a
  override def merge(b1: Seq[Seq[Double]], b2: Seq[Seq[Double]]): Seq[Seq[Double]] = b1 ++ b2
  override def finish(allInts: Seq[Seq[Double]]): String = {
    // Defining the attributes (first, second and third, as in our dataset)
    val array: List[Attribute] = List() :+
      new Attribute("first") :+
      new Attribute("second") :+
      new Attribute("third")
    // Creation of the instances and defining what we want to predict, in our case the last
    // attribute, aka. `third`
    val dataRaw = new Instances("train", new util.ArrayList[Attribute](array.asJava), 0)
    dataRaw.setClassIndex(dataRaw.numAttributes() - 1)
    // Converting our Seq[Seq[Double]] to DenseInstances so we can add them to `dataRaw`,
    // i.e. our training data
    dataRaw.addAll(allInts.map(v => new DenseInstance(1.0, v.toArray)).asJava)
    // We create a RandomForest object and train it on `dataRaw`
    val mlp = new RandomForest()
    mlp.buildClassifier(dataRaw)
    // Give it a test case; here we want to see where first = 1.0 and second = 2.0 fall
    val testInstance = new DenseInstance(1.0, Seq(1.0, 2.0).toArray)
    testInstance.setDataset(dataRaw)
    // We classify the instance and add some content for clearer output
    mlp.classifyInstance(testInstance).toString + ": " + allInts.mkString(", ")
  }
  override def bufferEncoder: Encoder[Seq[Seq[Double]]] = newSequenceEncoder[Seq[Seq[Double]]]
  override def outputEncoder: Encoder[String] = Encoders.STRING
}
Final result:
+---+------------------------------------------------------------------------------------------------------------+
|id |myagg$(array(first, second, third)) |
+---+------------------------------------------------------------------------------------------------------------+
|1 |2.2: List(1.0, 1.0, 1.0), List(1.0, 2.0, 2.0), List(1.0, 3.0, 3.0), List(1.0, 4.0, 4.0), List(1.0, 5.0, 5.0)|
+---+------------------------------------------------------------------------------------------------------------+
In this case, there might be some overhead while converting from/to Java and Scala.
This is a good use case for a User-Defined Aggregate Function.
What is a User-Defined Aggregate Function?
After grouping a dataframe with groupBy, one or more aggregation functions like min, max or sum are usually used to aggregate all values that belong to one group of rows into a single value. If none of Spark's built-in functions suit your needs, you can write your own function that takes the data from one of the groups and aggregates it into a new value.
Like you can use
df.groupBy('myCol1).agg(sum('myCol2))
you can use
df.groupBy('myCol1).agg(customFunction('myCol2))
where customFunction does whatever you need it to do, for example applying a RandomForestRegressor to all elements of one group of data.
How to create a User-Defined Aggregate Function?
Here is an (arguably simplistic) example of a User-Defined Aggregate Function. This function collects all values of one group in a sequence and then concatenates all these values into a string.
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

// some test data: 1, 2, 3, ..., 10
val df = (1 to 10).toDF()

// create the user-defined aggregation function
object MyAgg extends Aggregator[Int, Seq[Int], String] {
  override def zero: Seq[Int] = scala.collection.mutable.Seq[Int]()
  override def reduce(b: Seq[Int], a: Int): Seq[Int] = b :+ a
  override def merge(b1: Seq[Int], b2: Seq[Int]): Seq[Int] = b1 ++ b2
  override def finish(allInts: Seq[Int]): String = allInts.foldLeft("START")((s, b) => s + "_" + b)
  override def bufferEncoder: Encoder[Seq[Int]] = newSequenceEncoder[Seq[Int]]
  override def outputEncoder: Encoder[String] = Encoders.STRING
}
val myAggFct = udaf(MyAgg).withName("myAgg")
//group the dataframe and apply myAggFct to each group separately
df.groupBy(expr("value % 3")).agg(myAggFct('value)).show
Output:
+-----------+--------------+
|(value % 3)| myagg(value)|
+-----------+--------------+
| 1|START_1_4_7_10|
| 2| START_2_5_8|
| 0| START_3_6_9|
+-----------+--------------+
How does the User-Defined Aggregate Function work?
The two functions reduce and merge combine all values of one group into a sequence, starting from the empty sequence created by the zero function.
The central function is finish. Here the sequence of all collected values (allInts) is transformed into the result of the aggregation operation. This would be the place to apply, for example, the RandomForestRegressor. As the finish function runs distributed on the executor nodes, all required additional data should be broadcast.
Note: the example above could also (better) be implemented using Dataset.reduce, because we do not need the values as a sequence; we could simply add the values to the string as soon as we see them. But for a regressor we need the complete list of values, so the User-Defined Aggregate Function is reasonable here.
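To make the broadcasting point concrete, here is a hedged sketch (reusing the imports from the example above, with hypothetical test data testFeatures that is not part of the original question):
// Hypothetical auxiliary data shipped to the executors once via a broadcast variable
val testFeatures = spark.sparkContext.broadcast(Seq(1.0, 2.0))

object MyAggWithBroadcast extends Aggregator[Double, Seq[Double], String] {
  override def zero: Seq[Double] = Seq.empty[Double]
  override def reduce(b: Seq[Double], a: Double): Seq[Double] = b :+ a
  override def merge(b1: Seq[Double], b2: Seq[Double]): Seq[Double] = b1 ++ b2
  // finish runs on the executors and reads the broadcast value instead of a driver-side variable
  override def finish(values: Seq[Double]): String =
    s"fitted on ${values.size} values, scored against ${testFeatures.value.mkString(", ")}"
  override def bufferEncoder: Encoder[Seq[Double]] = newSequenceEncoder[Seq[Double]]
  override def outputEncoder: Encoder[String] = Encoders.STRING
}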
The Python version actually runs in the executors and hence distributes the load. collect requires all executors to send their data to the driver for processing. This means you only use the threads provided by the driver. You are also likely suffering from lots of garbage collection, since you are creating a VectorAssembler over and over again (and immediately throwing it away).
If you want, you can do collect-like things inside an executor by using mapPartitions.
val df4 = df2.mapPartitions(iterator => { // Start executor code
  // Do the heavy initialization here,
  // like database connections etc.
  val util = new Util()
  val res = iterator.map(row => {
    val fullName = util.combine(row.getString(0), row.getString(1), row.getString(2))
    (fullName, row.getString(3), row.getInt(5))
  })
  res // End executor code
})
val df4part = df4.toDF("fullName","id","salary")
df4part.printSchema()
df4part.show(false)
The catch is that you cannot use any feature that uses the SparkContext, as that only lives inside the driver. Said another way: you can only use pure Scala/JVM features inside the executor code. But if you can find a Scala library for random forests, that would be the answer. The iterator used inside is very memory efficient and will run much faster than the collect that you are doing.
Likely you really want to use Spark's RandomForestRegressor, though?
It looks like you have a global oldForest, so I can't tell what you are using, but a global variable won't work with mapPartitions; initialize it once and use it many times (inside the executor code), as in the sketch below.
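A hedged sketch of that pattern (LocalModel is a made-up stand-in for any serializable, pure-JVM model; assumes spark.implicits._ is in scope for the output encoder):
// A stand-in "model": any plain, serializable JVM object that needs heavy one-time setup
class LocalModel extends Serializable {
  def predict(x: Double): Double = x * 2.0 // placeholder logic
}

val scored = df2.mapPartitions { rows =>
  val model = new LocalModel() // initialized once per partition, on the executor
  rows.map(row => (row.getString(0), model.predict(row.getDouble(1))))
}.toDF("id", "prediction")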

Cumulative product UDF for Spark SQL

I have seen this being done for dataframes in other posts: https://stackoverflow.com/a/52992212/4080521
But I am trying to figure out how I can write a UDF for a cumulative product.
Assuming I have a very basic table
Input data:
+----+
| val|
+----+
| 1 |
| 2 |
| 3 |
+----+
If I want to take the sum of this I can simply do something like
sparkSession.createOrReplaceTempView("table")
spark.sql("""Select SUM(table.val) from table""").show(100, false)
and this simply works because SUM is a predefined function.
How would I define something similar for multiplication (or even, how can I implement sum in a UDF myself)?
Trying the following
sparkSession.createOrReplaceTempView("_Period0")
val prod = udf((vals:Seq[Decimal]) => vals.reduce(_ * _))
spark.udf.register("prod",prod)
spark.sql("""Select prod(table.vals) from table""").show(100, false)
I get the following error:
Message: cannot resolve 'UDF(vals)' due to data type mismatch: argument 1 requires array<decimal(38,18)> type, however, 'table.vals' is of decimal(28,14)
Obviously each specific cell is not an array, but it seems the UDF needs to take in an array to perform the aggregation. Is it even possible with Spark SQL?
You can implement it through UserDefinedAggregateFunction
You need to define several functions to work with the input and the buffer values.
Quick example for the product function using just doubles as type:
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class myUDAF extends UserDefinedAggregateFunction {
  // inputSchema for the function
  override def inputSchema: StructType = {
    new StructType().add("val", DoubleType, nullable = true)
  }
  // Schema for the inner UDAF buffer; for the product you just need an accumulator
  override def bufferSchema: StructType = StructType(StructField("accumulated", DoubleType) :: Nil)
  // Output data type
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true
  // Initial buffer value: 1 for a product
  override def initialize(buffer: MutableAggregationBuffer) = buffer(0) = 1.0
  // How to update the buffer: for a product you just multiply the two elements (buffer & input)
  override def update(buffer: MutableAggregationBuffer, input: Row) = {
    buffer(0) = buffer.getAs[Double](0) * input.getAs[Double](0)
  }
  // Merge results with the previously buffered value (a product as well here)
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Double](0) * buffer2.getAs[Double](0)
  }
  // How to return the final value
  override def evaluate(buffer: Row) = buffer.getAs[Double](0)
}
Then you can register the function as you would do with any other UDF:
spark.udf.register("prod", new myUDAF)
RESULT
scala> spark.sql("Select prod(val) from table").show
+-----------+
|myudaf(val)|
+-----------+
| 6.0|
+-----------+
You can find further documentation here
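In Spark 3.0+, where UserDefinedAggregateFunction is deprecated, a sketch of the same product aggregation with the Aggregator API could look like this (assuming a numeric column named val, as above):
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

object ProductAgg extends Aggregator[Double, Double, Double] {
  def zero: Double = 1.0                               // neutral element of multiplication
  def reduce(acc: Double, x: Double): Double = acc * x // fold one input value into the buffer
  def merge(a: Double, b: Double): Double = a * b      // combine partial products
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

spark.udf.register("prod", udaf(ProductAgg))
spark.sql("select prod(val) from table").show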

How to use approxQuantile by group?

Spark has SQL function percentile_approx(), and its Scala counterpart is df.stat.approxQuantile().
However, the Scala counterpart cannot be used on grouped datasets, something like df.groupby("foo").stat.approxQuantile(), as answered here: https://stackoverflow.com/a/51933027.
But it's possible to do both grouping and percentiles in SQL syntax. So I'm wondering: maybe I can define a UDF from the SQL percentile_approx function and use it on my grouped dataset?
Spark >= 3.1
Corresponding SQL functions have been added in Spark 3.1 - see SPARK-30569.
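For example, a quick sketch with the DataFrame API on the grouped data used further below (Spark 3.1+, assuming spark.implicits._ for the $ syntax):
import org.apache.spark.sql.functions.{lit, percentile_approx}

df.groupBy($"group")
  .agg(percentile_approx($"value", lit(0.5), lit(10000)).as("approx_median"))
  .show()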
Spark < 3.1
While you cannot use approxQuantile in a UDF, and there is no Scala wrapper for percentile_approx, it is not hard to implement one yourself:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile

object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }
  def percentile_approx(col: Column, percentage: Column): Column = percentile_approx(
    col, percentage, lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)
  )
}
Example usage:
import PercentileApprox._
val df = (Seq.fill(100)("a") ++ Seq.fill(100)("b")).toDF("group").withColumn(
"value", when($"group" === "a", randn(1) + 10).otherwise(randn(3))
)
df.groupBy($"group").agg(percentile_approx($"value", lit(0.5))).show
+-----+------------------------------------+
|group|percentile_approx(value, 0.5, 10000)|
+-----+------------------------------------+
| b| -0.06336346702250675|
| a| 9.818985618591595|
+-----+------------------------------------+
df.groupBy($"group").agg(
percentile_approx($"value", typedLit(Seq(0.1, 0.25, 0.75, 0.9)))
).show(false)
+-----+----------------------------------------------------------------------------------+
|group|percentile_approx(value, [0.1,0.25,0.75,0.9], 10000) |
+-----+----------------------------------------------------------------------------------+
|b |[-1.2098351202406483, -0.6640768986666159, 0.6778253126144265, 1.3255676906697658]|
|a |[8.902067202468098, 9.290417382259626, 10.41767257153993, 11.067087075488068] |
+-----+----------------------------------------------------------------------------------+
Once this is on the JVM classpath you can also add PySpark wrapper, using logic similar to built-in functions.

Manipulating a dataframe within a Spark UDF

I have a UDF that filters and selects values from a dataframe, but it runs into an "object not serializable" error. Details below.
Suppose I have a dataframe df1 that has columns with names ("ID", "Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10"). I want to sum a subset of the "Y" columns based on the matching "ID" and "Value" from another dataframe df2. I tried the following:
val y_list = ("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(c => col(c))
def udf_test(ID: String, value: Int): Double = {
df1.filter($"ID" === ID).select(y_list:_*).first.toSeq.toList.take(value).foldLeft(0.0)(_+_)
}
sqlContext.udf.register("udf_test", udf_test _)
val df_result = df2.withColumn("Result", callUDF("udf_test", $"ID", $"Value"))
This gives me errors of the form:
java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: Y1)
I looked this up and realized that Spark Column is not serializable. I am wondering:
1) Is there any way to manipulate a dataframe within a UDF?
2) If not, what's the best way to achieve the type of operation above? My real case is more complicated than this. It requires me to select values from multiple small dataframes based on some columns in a big dataframe, and compute back a value to the big dataframe.
I am using Spark 1.6.3. Thanks!
You can't use Dataset operations inside UDFs. A UDF can only manipulate existing column values and produce one result column; it can't filter a Dataset or perform aggregations (although a UDAF can aggregate values), but it can be used inside filter.
Instead, you can use .as[SomeCaseClass] to make a Dataset from the DataFrame and use normal, strongly typed functions inside filter, map and reduce.
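A minimal sketch of that typed approach, with a hypothetical Record case class covering a subset of df1's columns and a made-up filter value:
case class Record(ID: String, Y1: Double, Y2: Double) // hypothetical; list the columns you actually need

import sqlContext.implicits._ // spark.implicits._ on Spark 2.x+
val total = df1.as[Record]    // DataFrame -> Dataset[Record]
  .filter(_.ID == "someId")   // plain Scala predicate instead of Column expressions
  .map(r => r.Y1 + r.Y2)      // strongly typed projection
  .reduce(_ + _)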
Edit: If you want to join your bigDF with every small DF in smallDFs List, you can do:
import org.apache.spark.sql.functions._
val bigDF = // some processing
val smallDFs = Seq(someSmallDF1, someSmallDF2)
val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))
broadcast is a function that adds a broadcast hint to the small DataFrame, so that the small DataFrame will use the more efficient broadcast join instead of a sort-merge join.
1) No, you can only use plain Scala code within UDFs.
2) If I interpreted your code correctly, you can achieve your goal with:
df2
  .join(
    df1.select($"ID", y_list.foldLeft(lit(0))(_ + _).as("Result")), Seq("ID")
  )
import org.apache.spark.sql.functions._

val events = Seq(
  (1, 1, 2, 3, 4),
  (2, 1, 2, 3, 4),
  (3, 1, 2, 3, 4),
  (4, 1, 2, 3, 4),
  (5, 1, 2, 3, 4)).toDF("ID", "amt1", "amt2", "amt3", "amt4")

var prev_amt5 = 0
var i = 1

def getamt5value(ID: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int): Int = {
  if (i == 1) {
    i = i + 1
    prev_amt5 = 0
  } else {
    i = i + 1
  }
  if (ID == 0) {
    if (amt1 == 0) {
      val cur_amt5 = 1
      prev_amt5 = cur_amt5
      cur_amt5
    } else {
      val cur_amt5 = 1 * (amt2 + amt3)
      prev_amt5 = cur_amt5
      cur_amt5
    }
  } else if (amt4 == 0 || (prev_amt5 == 0 & amt1 == 0)) {
    val cur_amt5 = 0
    prev_amt5 = cur_amt5
    cur_amt5
  } else {
    val cur_amt5 = prev_amt5 + amt2 + amt3 + amt4
    prev_amt5 = cur_amt5
    cur_amt5
  }
}

val getamt5 = udf { (ID: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int) =>
  getamt5value(ID, amt1, amt2, amt3, amt4)
}

myDF.withColumn("amnt5", getamt5(myDF.col("ID"), myDF.col("amt1"), myDF.col("amt2"), myDF.col("amt3"), myDF.col("amt4"))).show()

Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?
Before you proceed: this operation is yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.
Not exactly concise or efficient solution but you can use UserDefinedAggregateFunction introduced in Spark 1.5.0:
object GroupConcat extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("x", StringType)
def bufferSchema = new StructType().add("buff", ArrayType(StringType))
def dataType = StringType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[String])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
}
def evaluate(buffer: Row) = UTF8String.fromString(
buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend")).show
## +---------+---------------+
## | username| friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract RDD, groupByKey, mkString and rebuild DataFrame.
You can get a similar effect by combining collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{collect_list, concat_ws}
df.groupBy($"username")
.agg(concat_ws(",", collect_list($"friend")).alias("friends"))
You can try the collect_list function
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A
Or you can regieter a UDF something like
sqlContext.udf.register("myzip",(a:Long,b:Long)=>(a+","+b))
and you can use this function in the query
sqlConttext.sql("select A,collect_list(myzip(B,C)) from tbl group by A")
In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list
friends = spark.createDataFrame(
[
('jacques', 'nicolas'),
('jacques', 'georges'),
('jacques', 'francois'),
('bob', 'amelie'),
('bob', 'zoe'),
],
schema=['username', 'friend'],
)
(
friends
.orderBy('friend', ascending=False)
.groupBy('username')
.agg(
array_join(
collect_list('friend'),
delimiter=', ',
).alias('friends')
)
.show(truncate=False)
)
In Spark SQL the solution is likewise:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
The output:
+--------+--------------------------+
|username|friends |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob |zoe, amelie |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().
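Since the code is noted to be very similar in Scala, here is a sketch of the Scala equivalent (assuming an analogous friends DataFrame and spark.implicits._ in scope):
import org.apache.spark.sql.functions.{array_join, collect_list}

friends
  .orderBy($"friend".desc)
  .groupBy($"username")
  .agg(array_join(collect_list($"friend"), ", ").as("friends"))
  .show(false)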
Here is a function you can use in PySpark:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def group_concat(col, distinct=False, sep=','):
    if distinct:
        collect = F.collect_set(col.cast(StringType()))
    else:
        collect = F.collect_list(col.cast(StringType()))
    return F.concat_ws(sep, collect)

table.groupby('username').agg(group_concat(F.col('friend')).alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username
-- the Spark SQL solution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id
One way to do it with pyspark < 1.6, which unfortunately doesn't support user-defined aggregate functions:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend").alias("friends"))
Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using udfs but, unfortunately, this has led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD then grouping by and manipulating the data in the desired way and then converting the RDD back to a DF as follows:
val df = sc
.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")))
.toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df.map(row => (row(0).toString, row(1).toString))
  .groupByKey()
  .map { case (username: String, groupOfFriends: Iterable[String]) => (username, groupOfFriends.mkString(",")) }
  .toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+
Below is Python-based code that achieves the group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F

spark = SparkSession.builder.master('yarn').getOrCreate()

# UDF to join all list elements with "|"
def combine_cars(car_list, sep='|'):
    collect = sep.join(car_list)
    return collect

test_udf = udf(combine_cars, StringType())

car_list_per_customer.groupBy("Cust_No").agg(F.collect_list("Cust_Cars").alias("car_list")).select("Cust_No", test_udf("car_list").alias("Final_List")).show(20, False)
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
You can also use the Spark SQL function collect_list; afterwards you will need to cast the result to string and use regexp_replace to replace the special characters.
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
It's an easier way.
The functions concat_ws() and collect_list() can be a good alternative, along with groupBy():
import pyspark.sql.functions as F
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+
