How to use approxQuantile by group? - apache-spark

Spark has SQL function percentile_approx(), and its Scala counterpart is df.stat.approxQuantile().
However, the Scala counterpart cannot be used on grouped datasets, something like df.groupby("foo").stat.approxQuantile(), as answered here: https://stackoverflow.com/a/51933027.
But it's possible to do both grouping and percentiles in SQL syntax. So I'm wondering: can I define a UDF from the SQL percentile_approx function and use it on my grouped dataset?
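(For reference, the "SQL syntax" route can also be reached from the DataFrame API through expr(), wherever the SQL function itself is available. A minimal sketch in PySpark, where "foo" and "value" are placeholder column names:)
from pyspark.sql import functions as F
# percentile_approx evaluated as a SQL expression inside agg()
df.groupBy("foo").agg(F.expr("percentile_approx(value, 0.5)").alias("median"))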

Spark >= 3.1
Corresponding SQL functions have been added in Spark 3.1 - see SPARK-30569.
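In other words, with Spark >= 3.1 you can call percentile_approx directly from the DataFrame API. A minimal sketch in PySpark, reusing the "group"/"value" columns from the example further below (a corresponding percentile_approx is also exposed in the Scala functions object):
from pyspark.sql import functions as F

df.groupBy("group").agg(
    F.percentile_approx("value", 0.5).alias("median"),
    F.percentile_approx("value", [0.1, 0.25, 0.75, 0.9]).alias("quantiles"),
)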
Spark < 3.1
While you cannot use approxQuantile in a UDF, and there is no Scala wrapper for percentile_approx, it is not hard to implement one yourself:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile

object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column): Column = percentile_approx(
    col, percentage, lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)
  )
}
Example usage:
import PercentileApprox._
val df = (Seq.fill(100)("a") ++ Seq.fill(100)("b")).toDF("group").withColumn(
"value", when($"group" === "a", randn(1) + 10).otherwise(randn(3))
)
df.groupBy($"group").agg(percentile_approx($"value", lit(0.5))).show
+-----+------------------------------------+
|group|percentile_approx(value, 0.5, 10000)|
+-----+------------------------------------+
| b| -0.06336346702250675|
| a| 9.818985618591595|
+-----+------------------------------------+
df.groupBy($"group").agg(
percentile_approx($"value", typedLit(Seq(0.1, 0.25, 0.75, 0.9)))
).show(false)
+-----+----------------------------------------------------------------------------------+
|group|percentile_approx(value, [0.1,0.25,0.75,0.9], 10000) |
+-----+----------------------------------------------------------------------------------+
|b |[-1.2098351202406483, -0.6640768986666159, 0.6778253126144265, 1.3255676906697658]|
|a |[8.902067202468098, 9.290417382259626, 10.41767257153993, 11.067087075488068] |
+-----+----------------------------------------------------------------------------------+
Once this is on the JVM classpath, you can also add a PySpark wrapper using logic similar to the built-in functions.
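A minimal sketch of such a wrapper, assuming the PercentileApprox object above is compiled into the default package and sits on the driver's classpath; the Column conversion goes through PySpark's internal _to_java_column helper, so treat this as illustrative rather than official API:
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import lit

def percentile_approx(col, percentage, accuracy=10000):
    sc = SparkContext._active_spark_context
    # wrap plain Python values as literal Columns before crossing into the JVM
    percentage = percentage if isinstance(percentage, Column) else lit(percentage)
    accuracy = accuracy if isinstance(accuracy, Column) else lit(accuracy)
    jc = sc._jvm.PercentileApprox.percentile_approx(
        _to_java_column(col), _to_java_column(percentage), _to_java_column(accuracy)
    )
    return Column(jc)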

Related

Spark DataFrame Aggregation based on two or more Columns

I want to write a UDAF for some customized aggregation based on more than one column. A simple example would be a dataframe with two columns, c1 and c2. For each row, I take the max of c1 and c2 (let's call it cmax), then I take the sum of cmax.
When I call df.agg(), it does not look like I can pass two or more columns to any aggregation method, including a UDAF. First question: is that true?
For this simple example, I could create another column called cmax and do the aggregation on cmax. But in reality, I would need to do the aggregation based on N combinations of columns, and the result would be a collection of size N. I would want to loop over the combinations within the update method of my UDAF, so it would require N intermediate columns, which does not seem like a clean solution to me. Second question: is creating intermediate columns the way to do it, or is there a better solution?
I noticed that with RDDs the problem is much easier: I can pass the entire record to my aggregation function and have access to all the data fields.
You can use as many columns in a UDAF as you like, since the signature of its apply method accepts multiple Columns (from its source code):
def apply(exprs: Column*): Column
You just have to make sure that the inputSchema returns a StructType reflecting the columns that you want to consume as your UDAF input.
For the case of columns c1 and c2, your UDAF has to implement an inputSchema with the following schema:
def inputSchema: StructType = StructType(Array(StructField("c1", DoubleType), StructField("c2", DoubleType)))
However, if you want a more general solution, you can always initialize the custom UDAF with arguments that allow returning the right inputSchema. See the example below, which allows defining an arbitrary StructType at construction time (note that we don't verify that the fields of the StructType are of DoubleType).
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyMaxUDAF(schema: StructType) extends UserDefinedAggregateFunction {
  def inputSchema: StructType = this.schema
  def bufferSchema: StructType = StructType(Array(StructField("maxSum", DoubleType)))
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getDouble(0) + Array.range(0, input.length).map(input.getDouble).max
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = buffer2 match {
    case Row(buffer2Sum: Double) => buffer1(0) = buffer1.getDouble(0) + buffer2Sum
  }
  def evaluate(buffer: Row): Double = buffer match {
    case Row(totalSum: Double) => totalSum
  }
}
Your DataFrame contains the values and a key for aggregation:
// case class inferred from the column names in the output below
case class Entry(groupMe: Int, c1: Double, c2: Double, c3: Double)

val df = spark.createDataFrame(Seq(
  Entry(0, 1.0, 2.0, 3.0), Entry(0, 3.0, 1.0, 2.0), Entry(1, 6.0, 2.0, 2.0)
))
df.show
+-------+---+---+---+
|groupMe| c1| c2| c3|
+-------+---+---+---+
| 0|1.0|2.0|3.0|
| 0|3.0|1.0|2.0|
| 1|6.0|2.0|2.0|
+-------+---+---+---+
Using the UDAF, we expect the sum of the row-wise maxima to be 6.0 for each group:
val fields = Array("c1", "c2", "c3")
val struct = StructType(fields.map(StructField(_, DoubleType)))
val myMaxUDAF: MyMaxUDAF = new MyMaxUDAF(struct)
df.groupBy("groupMe").agg(myMaxUDAF(fields.map(df(_)):_*)).show
+-------+---------------------+
|groupMe|mymaxudaf(c1, c2, c3)|
+-------+---------------------+
| 0| 6.0|
| 1| 6.0|
+-------+---------------------+
There is a nice tutorial on UDAFs; unfortunately, it doesn't cover multiple arguments:
https://ragrawal.wordpress.com/2015/11/03/spark-custom-udaf-example/

Using a UDF in Spark data frame for text mining

I have the function below
def tokenize(text: String): Array[String] = {
  // Lowercase each word and remove punctuation.
  text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
This needs to be applied to the column "title" in the data frame df_article.
How can I achieve that in Spark using a UDF?
Sample Data
+--------------------+
| title|
+--------------------+
|A new relictual a...|
|A new relictual a...|
|A new relictual a...|
+--------------------+
You can define your UDF as such:
import org.apache.spark.sql.functions.udf
val myToken = udf((xs: String) => xs.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+"))
and create a new dataframe with an additional column with:
df_article.withColumn("newTitle", myToken(df_article("title")))
Alternatively, you may also register your tokenize function with:
val tk = sqlContext.udf.register("tk", tokenize _)
and get the new dataframe by applying:
df_article.withColumn("newTitle", tk(df_article("title")))
I wouldn't use UDFs here at all. You can easily compose the same function using built-in expressions, in a safer and more efficient manner:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, regexp_replace, split}
def tokenize(c: Column) = split(
  regexp_replace(lower(c), "[^a-zA-Z0-9\\s]", ""), "\\s+"
)
df.select(tokenize($"title"))
There are also ml.feature.Tokenizer and ml.feature.RegexTokenizer, which you may find useful.
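A rough PySpark sketch of the RegexTokenizer route (it lowercases by default; note that punctuation becomes a separator here rather than being deleted, so the output is close to, but not identical to, the regex above):
from pyspark.ml.feature import RegexTokenizer

tokenizer = RegexTokenizer(
    inputCol="title",
    outputCol="newTitle",
    pattern="[^a-zA-Z0-9]+",  # runs of non-alphanumerics act as separators
    gaps=True,
)
tokenized = tokenizer.transform(df_article)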
Why a UDF? You can use built-in functions.
Here is an example in PySpark:
from pyspark.sql.functions import regexp_replace, lower
df_article.withColumn("title_cleaned", lower((regexp_replace('title', '([^a-zA-Z0-9\&\b]+)', " "))))
Check the built-in functions:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.first

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show()
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 53173|
| mean|104.76128862392568|
| stddev|3577.8184333911513|
| min| 1|
| max| 558407|
+-------+------------------+
I would like now to get the standard deviation and the mean from statsDF, but when I am trying to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I am getting Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by the describe() function?
Consider a toy dataset containing some health disease data:
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option is to use selectExpr(), which is also optimized, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
myDataFrame.describe().filter($"summary"==="stddev").show()
This worked quite nicely on Spark 2.3.0
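If you just need the numbers as plain Python values, a hedged alternative is to collect the describe() output and index the rows by their summary label (the statistics come back as strings, hence the float conversion; "count" is the data column's name in this question):
stats = {row["summary"]: row for row in myDataFrame.describe().collect()}
mean = float(stats["mean"]["count"])
stddev = float(stats["stddev"]["count"])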

Averaging over window function leads to StackOverflowError

I am trying to determine the average timespan between dates in a Dataframe column by using a window-function. Materializing the Dataframe however throws a Java exception.
Consider the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext, Window, functions
from datetime import datetime
sc = SparkContext()
sq = HiveContext(sc)
data = [
    [datetime(2014, 1, 1)],
    [datetime(2014, 2, 1)],
    [datetime(2014, 3, 1)],
    [datetime(2014, 3, 6)],
    [datetime(2014, 8, 23)],
    [datetime(2014, 10, 1)],
]
df = sq.createDataFrame(data, schema=['ts'])
ts = functions.col('ts')
w = Window.orderBy(ts)
diff = functions.datediff(
    ts,
    functions.lag(ts, count=1).over(w)
)
avg_diff = functions.avg(diff)
While df.select(diff.alias('diff')).show() correctly renders as
+----+
|diff|
+----+
|null|
| 31|
| 28|
| 5|
| 170|
| 39|
+----+
doing df.select(avg_diff).show() gives a java.lang.StackOverflowError.
Am I wrong to assume that this should work? And if so, what am I doing wrong and what could I do instead?
I am using the Python API on Spark 1.6
When I do df2 = df.select(diff.alias('diff')) and then do
df2.select(functions.avg('diff'))
there's no error. Unfortunately that is not an option in my current setup.
It looks like a bug in Catalyst, but chaining the methods should work just fine:
df.select(diff.alias('diff')).agg(functions.avg('diff'))
Nevertheless, I would be careful here. Window functions shouldn't be used to perform global (no PARTITION BY clause) operations: they move all data to a single partition and perform a sequential scan. Using RDDs could be a better choice here.
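If all you need is the mean gap, here is a sketch of an alternative that avoids the global window entirely: the average of consecutive differences telescopes to (max - min) / (count - 1), so a plain aggregation returns the same 54.6 for the data above:
from pyspark.sql import functions as F

df.agg(
    (F.datediff(F.max("ts"), F.min("ts")) / (F.count("ts") - 1)).alias("avg_diff")
).show()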

Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?
Before you proceed: this operation is yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.
Not an exactly concise or efficient solution, but you can use the UserDefinedAggregateFunction introduced in Spark 1.5.0:
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

object GroupConcat extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", StringType)
  def bufferSchema = new StructType().add("buff", ArrayType(StringType))
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, ArrayBuffer.empty[String])
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0))
      buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
  }

  def evaluate(buffer: Row) = UTF8String.fromString(
    buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
  ("username1", "friend1"),
  ("username1", "friend2"),
  ("username2", "friend1"),
  ("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend")).show
## +---------+---------------+
## | username| friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract RDD, groupByKey, mkString and rebuild DataFrame.
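A sketch of that RDD route in PySpark, assuming the (username, friend) layout of the example above:
grouped = (
    df.rdd
    .map(lambda row: (row[0], row[1]))
    .groupByKey()
    .mapValues(lambda friends: ",".join(friends))
)
result = sqlContext.createDataFrame(grouped, ["username", "friends"])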
You can get a similar effect by combining the collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{collect_list, concat_ws}

df.groupBy($"username")
  .agg(concat_ws(",", collect_list($"friend")).alias("friends"))
You can try the collect_list function:
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A")
Or you can register a UDF, something like
sqlContext.udf.register("myzip", (a: Long, b: Long) => (a + "," + b))
and use this function in the query:
sqlContext.sql("select A, collect_list(myzip(B, C)) from tbl group by A")
In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list
friends = spark.createDataFrame(
    [
        ('jacques', 'nicolas'),
        ('jacques', 'georges'),
        ('jacques', 'francois'),
        ('bob', 'amelie'),
        ('bob', 'zoe'),
    ],
    schema=['username', 'friend'],
)

(
    friends
    .orderBy('friend', ascending=False)
    .groupBy('username')
    .agg(
        array_join(
            collect_list('friend'),
            delimiter=', ',
        ).alias('friends')
    )
    .show(truncate=False)
)
In Spark SQL the solution is likewise:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
The output:
+--------+--------------------------+
|username|friends |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob |zoe, amelie |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().
Here is a function you can use in PySpark:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def group_concat(col, distinct=False, sep=','):
    if distinct:
        collect = F.collect_set(col.cast(StringType()))
    else:
        collect = F.collect_list(col.cast(StringType()))
    return F.concat_ws(sep, collect)

table.groupby('username').agg(group_concat(F.col('friends')).alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username
-- the Spark SQL solution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id
One way to do it with pyspark < 1.6, which unfortunately doesn't support user-defined aggregate functions:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend").alias("friends"))
Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using UDFs, but unfortunately this led to more problems later in the code due to type inconsistencies. I was able to work around this by first converting the DF to an RDD, grouping by key and manipulating the data in the desired way, and then converting the RDD back to a DF, as follows:
val df = sc
  .parallelize(Seq(
    ("username1", "friend1"),
    ("username1", "friend2"),
    ("username2", "friend1"),
    ("username2", "friend3")))
  .toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df.map(row => (row.getString(0), row.getString(1)))
  .groupByKey()
  .map { case (username: String, groupOfFriends: Iterable[String]) => (username, groupOfFriends.mkString(",")) }
  .toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+
Below is Python-based code that achieves group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F

spark = SparkSession.builder.master('yarn').getOrCreate()

# UDF to join all list elements with "|"
def combine_cars(car_list, sep='|'):
    collect = sep.join(car_list)
    return collect

test_udf = udf(combine_cars, StringType())

# car_list_per_customer is the DataFrame loaded from the input data above
(car_list_per_customer
    .groupBy("Cust_No")
    .agg(F.collect_list("Cust_Cars").alias("car_list"))
    .select("Cust_No", test_udf("car_list").alias("Final_List"))
    .show(20, False))
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
You can also use the Spark SQL function collect_list; afterwards you will need to cast the result to string and use regexp_replace to replace the special characters.
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
It's an easier way.
The built-in functions concat_ws() and collect_list() can be a good alternative, along with groupBy():
import pyspark.sql.functions as F
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+
