Conditional aggregation Spark DataFrame - apache-spark

I would like to understand the best way to do an aggregation in Spark in this scenario:
import java.sql.Timestamp
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
case class Person(name:String, acc:Int, logDate:String)
val dateFormat = "dd/MM/yyyy"
val filterType = // Could hold "MIN" or "MAX" depending on a run parameter
val filterDate = new Timestamp(System.currentTimeMillis)
val df = sc.parallelize(List(Person("Giorgio",20,"31/12/9999"),
Person("Giorgio",30,"12/10/2009")
Person("Diego", 10,"12/10/2010"),
Person("Diego", 20,"12/10/2010"),
Person("Diego", 30,"22/11/2011"),
Person("Giorgio",10,"31/12/9999"),
Person("Giorgio",30,"31/12/9999"))).toDF()
val df2 = df.withColumn("logDate",unix_timestamp($"logDate",dateFormat).cast(TimestampType))
val df3 = df2.groupBy("name").agg(/*conditional aggregation*/)
df3.show /* Expected output shown below */
Basically I want to group all records by the name column, then, based on the filterType parameter, filter the valid records for each Person, and finally sum the acc values, obtaining a final
DataFrame with name and totalAcc columns.
For example:
filterType = MIN: I want to take all records having min(logDate) (there could be many of them), so in this case I completely ignore the filterDate param:
Diego,10,12/10/2010
Diego,20,12/10/2010
Giorgio,30,12/10/2009
Final result expected from aggregation is: (Diego, 30),(Giorgio,30)
filterType = MAX: I want to take all records with logDate > filterDate; if for a key I don't have any records respecting this condition, I need to take the records with min(logDate) as in the MIN scenario, so:
Diego, 10, 12/10/2010
Diego, 20, 12/10/2010
Giorgio, 20, 31/12/9999
Giorgio, 10, 31/12/9999
Giorgio, 30, 31/12/9999
Final result expected from aggregation is: (Diego,30),(Giorgio,60)
In this case, for Diego I didn't have any records with logDate > filterDate, so I fall back to the MIN scenario, taking for Diego only the records with min(logDate).

You can write your conditional aggregation using when/otherwise as
df2.groupBy("name")
  .agg(sum(when(lit(filterType) === "MIN" && $"logDate" < filterDate, $"acc")
    .otherwise(when(lit(filterType) === "MAX" && $"logDate" > filterDate, $"acc"))).as("sum"))
  .filter($"sum".isNotNull)
which would give you your desired output according to filterType
But
Eventually you may require both aggregated dataframes, so I would suggest avoiding the filterType field and instead creating an additional grouping column with the when/otherwise function, so that you have both aggregated values in one dataframe:
df2.withColumn("additionalGrouping", when($"logDate" < filterDate, "less").otherwise("more"))
.groupBy("name", "additionalGrouping").agg(sum($"acc"))
.drop("additionalGrouping")
.show(false)
which would output as
+-------+--------+
|name |sum(acc)|
+-------+--------+
|Diego |10 |
|Giorgio|60 |
+-------+--------+
Updated
Since the question has been updated with changed logic, here is the idea and solution for the new scenario:
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("name").orderBy($"logDate".asc)

val minDF = df2
  .withColumn("minLogDate", first("logDate").over(windowSpec))
  .filter($"minLogDate" === $"logDate")
  .groupBy("name")
  .agg(sum($"acc").as("sum"))

val finalDF =
  if (filterType == "MIN") {
    minDF
  } else if (filterType == "MAX") {
    val tempMaxDF = df2
      .groupBy("name")
      .agg(sum(when($"logDate" > filterDate, $"acc")).as("sum"))
    tempMaxDF.filter($"sum".isNull).drop("sum")
      .join(minDF, Seq("name"), "left")
      .union(tempMaxDF.filter($"sum".isNotNull))
  } else {
    df2
  }
so for filterType = MIN you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|30 |
+-------+---+
and for filterType = MAX you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|60 |
+-------+---+
If filterType is neither MAX nor MIN, the original dataframe is returned.
I hope the answer is helpful
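As a side note, the same MAX-with-fallback logic can also be sketched as a single aggregation plus a coalesce; this is a minimal alternative sketch, assuming the df2, filterType and filterDate definitions from the question (and its implicits import for the $ syntax):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Per name: the sum over rows carrying the minimum logDate, and the sum over
// rows with logDate > filterDate (null when no row qualifies).
val byName = Window.partitionBy("name")
val sums = df2
  .withColumn("minLogDate", min($"logDate").over(byName))
  .groupBy("name")
  .agg(
    sum(when($"logDate" === $"minLogDate", $"acc")).as("minSum"),
    sum(when($"logDate" > filterDate, $"acc")).as("maxSum")
  )

// coalesce implements the "fall back to MIN when nothing passes the MAX filter" rule.
val resultDF = filterType match {
  case "MIN" => sums.select($"name", $"minSum".as("sum"))
  case "MAX" => sums.select($"name", coalesce($"maxSum", $"minSum").as("sum"))
  case _     => df2
}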

You don't need conditional aggregation. Just filter:
df
.where(if (filterType == "MAX") $"logDate" < filterDate else $"logDate" > filterDate)
.groupBy("name").agg(sum($"acc")

Related

withColumn is adding values only to the first row in the dataframe in pyspark

from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window as W

columns = ["language", "users_count"]
data = [(" ", 20000), ("Python", 100000), ("Scala", 3000)]

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.printSchema()

df.filter((F.trim(df.language) == '') | (df.users_count >= 1000)) \
    .withColumn("errors", F.when(F.trim(F.col("language")) == '', F.concat(F.lit("Invalid Language;")))) \
    .withColumn("errors", F.when(F.col("users_count") > 1000, F.concat(F.col("errors"), F.lit("Invalid Users_Count;")))) \
    .show(truncate=False)
I am trying to filter rows based on the filter criteria: if a row has whitespace in the 'language' column or a value in the 'users_count' column greater than 1000, the application should add an appropriate error message in the 'errors' column, like so: Invalid Language; Invalid Users_Count.
But this error message in 'errors' column appears only for first row, all the other rows have 'null' value in 'errors' column.
This is the output in databricks for the above code:
+--------+-----------+-------------------------------------+
|language|users_count|errors |
+--------+-----------+-------------------------------------+
| |20000 |Invalid Language;Invalid Users_Count;|
|Python |100000 |null |
|Scala |3000 |null |
+--------+-----------+-------------------------------------+
.concat() does not work with a null column; you would have to use .concat_ws().
df.filter((F.trim(df.language) == '') | (df.users_count >= 1000)) \
    .withColumn("errors", F.when(F.trim(F.col("language")) == '', F.concat(F.lit("Invalid Language;")))) \
    .withColumn("errors", F.when(F.col("users_count") > 1000, F.concat_ws("", F.col("errors"), F.lit("Invalid Users_Count;")))) \
    .show(truncate=False)

How does Spark SQL implement the group by aggregate

How does Spark SQL implement the group by aggregate? I want to group by the name field and get the latest salary based on the latest date. How do I write the SQL?
The data is:
+----+------+-------+
|name|salary|date   |
+----+------+-------+
|AA  |  3000|2022-01|
|AA  |  4500|2022-02|
|BB  |  3500|2022-01|
|BB  |  4000|2022-02|
+----+------+-------+
The expected result is:
+----+------+
|name|salary|
+----+------+
|AA  |  4500|
|BB  |  4000|
+----+------+
Assuming that the dataframe is registered as a temporary view named tmp: first use the row_number window function to assign a row number (rn) within each group (name), ordered by date in descending order, then take all rows with rn = 1.
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
First convert your string to a date.
Convert the date to a Unix timestamp (a numeric representation of a date, so you can use max).
Use "first" as the aggregate function that retrieves a value from your aggregated results (the first result, so if there is a date tie, it could pull either one):
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
df.withColumn(
"dateUpdated",
unix_timestamp(
to_date(
col("updated") ,
"yyyy-MM-dd"
)
)
).groupBy("name")
.agg(
max("dateUpdated"),
first("salary").alias("Salary")
).show()
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
My usual trick is to "zip" date and salary together (depending on what you want to sort by first):
from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+

Extract Numeric data from the Column in Spark Dataframe

I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried; it works only when the numbers are contiguous. Could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I can do this by creating a UDF, but I am wondering whether it is possible with Spark's built-in functions. My Spark version is 2.2.0.
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
("This_is_111_222_444_test",Array(3296)),
("This_is_555_666_test",null.asInstanceOf[Array[Int]]),
("This_is_999_test",null.asInstanceOf[Array[Int]]))
.toDF("col1","col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
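The same UDF can also be applied through the DataFrame API instead of SQL; a minimal sketch, assuming the data DataFrame above (size on a null array yields -1, or null in newer Spark versions, either of which fails the > 0 test, so the when falls through to the UDF):
import org.apache.spark.sql.functions.{col, size, udf, when}

// Wrap getNums as a column function and use it wherever col2 is null/empty.
val getNumsUdf = udf(getNums)
data
  .withColumn("new_col", when(size(col("col2")) > 0, col("col2")).otherwise(getNumsUdf(col("col1"))))
  .select("col1", "new_col")
  .show(false)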
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
/// OP solution:
bigData.withColumn("col2", when($"col2".isNull,
    regexp_replace(regexp_replace(regexp_extract($"col1", "([0-9]+_)+", 0), "_", ","), ".$", ""))
  .otherwise($"col2"))
  .count
Here is another way, please check the performance.
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+

How to pick latest record in spark structured streaming join

I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have currency rates metadata, a sample of which is below:
val ratesMetaDataDf = Seq(
("EUR","5/10/2019","1.130657","USD"),
("EUR","5/9/2019","1.13088","USD")
).toDF("base_code", "rate_date","rate_value","target_code")
.withColumn("rate_date", to_date($"rate_date" ,"MM/dd/yyyy").cast(DateType))
.withColumn("rate_value", $"rate_value".cast(DoubleType))
Sales records which I received from the Kafka topic are as below (sample):
val kafkaDf = Seq((15,2016, 4, 100.5,"USD","2021-01-20","EUR",221.4)
).toDF("companyId", "year","quarter","sales","code","calc_date","c_code","prev_sales")
To calculate "prev_sales" , I need get its "c_code" 's respective "rate_value" which is nearest to the "calc_date" i.e. rate_date"
Which i am doing as following
val w2 = Window.orderBy(col("rate_date") desc)
val rateJoinResultDf = kafkaDf.as("k").join(ratesMetaDataDf.as("e"))
.where( ($"k.c_code" === $"e.base_code") &&
($"rate_date" < $"calc_date")
).orderBy($"rate_date" desc)
.withColumn("row",row_number.over(w2))
.where($"row" === 1).drop("row")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast(DoubleType))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
In the above, to get the nearest record (i.e. "5/10/2019" from ratesMetaDataDf) for the given "rate_date", I am using a window and the row_number function and sorting the records in descending order.
But in spark-sql streaming it causes the error below:
"Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;"
So how do I fetch the first record for the join in the above?
Replace your last code part with the code below. This code does a left join and calculates the date difference between calc_date and rate_date. Next, with a Window function we pick the nearest date and calculate prev_sales using your same calculation.
Please note I have added one filter condition, filter(col("diff") >= 0),
which handles the case of calc_date < rate_date. I have added a few
more records for better understanding of this case.
scala> ratesMetaDataDf.show
+---------+----------+----------+-----------+
|base_code| rate_date|rate_value|target_code|
+---------+----------+----------+-----------+
| EUR|2019-05-10| 1.130657| USD|
| EUR|2019-05-09| 1.12088| USD|
| EUR|2019-12-20| 1.1584| USD|
+---------+----------+----------+-----------+
scala> kafkaDf.show
+---------+----+-------+-----+----+----------+------+----------+
|companyId|year|quarter|sales|code| calc_date|c_code|prev_sales|
+---------+----+-------+-----+----+----------+------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| EUR| 221.4|
| 15|2016| 4|100.5| USD|2019-06-20| EUR| 221.4|
+---------+----+-------+-----+----+----------+------+----------+
scala> val W = Window.partitionBy("companyId","year","quarter","sales","code","calc_date","c_code","prev_sales").orderBy(col("diff"))
scala> val rateJoinResultDf= kafkaDf.alias("k").join(ratesMetaDataDf.alias("r"), col("k.c_code") === col("r.base_code"), "left")
.withColumn("diff",datediff(col("calc_date"), col("rate_date")))
.filter(col("diff") >= 0)
.withColumn("closedate", row_number.over(W))
.filter(col("closedate") === 1)
.drop("diff", "closedate")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast("Decimal(14,5)"))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
scala> rateJoinResultDf.show
+---------+----+-------+-----+----+----------+----------+
|companyId|year|quarter|sales|code| calc_date|prev_sales|
+---------+----+-------+-----+----+----------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| 256.46976|
| 15|2016| 4|100.5| USD|2019-06-20| 250.32746|
+---------+----+-------+-----+----+----------+----------+

Order Spark SQL Dataframe with nested values / complex data types

My goal is to collect an ordered list of nested values. It should be ordered based on an element in the nested list. I tried out different approaches but have concerns in terms of performance and correctness.
Order globally
case class Payment(Id: String, Date: String, Paid: Double)
val payments = Seq(
Payment("mk", "10:00 AM", 8.6D),
Payment("mk", "06:00 AM", 12.6D),
Payment("yc", "07:00 AM", 16.6D),
Payment("yc", "09:00 AM", 2.6D),
Payment("mk", "11:00 AM", 5.6D)
)
val df = spark.createDataFrame(payments)
// order globally
df.orderBy(col("Paid").desc)
.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
.withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
.show(false)
+---+-------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------+--------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[06:00 AM,12.6], [10:00 AM,8.6], [11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------+--------------+------------------+
This is a naive and straight-forward approach, but I have concerns in terms of correctness. Will the list really be ordered globally or only within a partition?
Window function
// use Window
val window = Window.partitionBy(col("Id")).orderBy(col("Paid").desc)
df.withColumn("rank", rank().over(window))
.groupBy(col("Id"))
.agg(
collect_list(struct(col("rank"), col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
.withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
.show(false)
+---+-------------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------------+--------------+------------------+
|yc |[[1,07:00 AM,16.6], [2,09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[1,06:00 AM,12.6], [2,10:00 AM,8.6], [3,11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------------+--------------+------------------+
This should work, or am I missing something?
Order in UDF on-the-fly
// order in UDF
val largestPaymentDate = udf((lr: Seq[Row]) => {
lr.max(Ordering.by((l: Row) => l.getAs[Double]("Paid"))).getAs[String]("Date")
})
df.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPaymentDate", largestPaymentDate(col("UserPayments")))
.show(false)
+---+-------------------------------------------------+------------------+
|Id |UserPayments |LargestPaymentDate|
+---+-------------------------------------------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |07:00 AM |
|mk |[[10:00 AM,8.6], [06:00 AM,12.6], [11:00 AM,5.6]]|06:00 AM |
+---+-------------------------------------------------+------------------+
I guess there is nothing to complain about here in terms of correctness. But for the following operations, I'd prefer that the list is ordered and I don't have to do it every time explicitly.
I tried to write a UDF which takes the list as an input and returns the ordered list - but returning a list was too painful and I gave it up.
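For reference, returning a sorted list from such a UDF is workable if the structs are mapped to plain tuples; a minimal sketch, assuming the df and Payment schema above (note the struct fields come back named _1/_2):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, collect_list, struct, udf}

// Sort the collected (Date, Paid) structs by Paid descending inside the UDF;
// Spark encodes the returned Seq of tuples back into an array of structs.
val sortPayments = udf((payments: Seq[Row]) =>
  payments
    .map(r => (r.getAs[String]("Date"), r.getAs[Double]("Paid")))
    .sortBy(-_._2)
)

df.groupBy(col("Id"))
  .agg(collect_list(struct(col("Date"), col("Paid"))).as("UserPayments"))
  .withColumn("UserPayments", sortPayments(col("UserPayments")))
  .show(false)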
I'd reverse the order of the struct and aggregate with max:
val result = df
.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))) as "UserPayments",
max(struct(col("Paid"), col("Date"))) as "MaxPayment"
)
result.show
// +---+--------------------+---------------+
// | Id| UserPayments| MaxPayment|
// +---+--------------------+---------------+
// | yc|[[07:00 AM,16.6],...|[16.6,07:00 AM]|
// | mk|[[10:00 AM,8.6], ...|[12.6,06:00 AM]|
// +---+--------------------+---------------+
You can later flatten the struct:
result.select($"id", $"UserPayments", $"MaxPayment.*").show
// +---+--------------------+----+--------+
// | id| UserPayments|Paid| Date|
// +---+--------------------+----+--------+
// | yc|[[07:00 AM,16.6],...|16.6|07:00 AM|
// | mk|[[10:00 AM,8.6], ...|12.6|06:00 AM|
// +---+--------------------+----+--------+
In the same way, you can sort_array the reordered structs:
df
.groupBy(col("Id"))
.agg(
sort_array(collect_list(struct(col("Paid"), col("Date")))) as "UserPayments"
)
.show(false)
// +---+-------------------------------------------------+
// |Id |UserPayments |
// +---+-------------------------------------------------+
// |yc |[[2.6,09:00 AM], [16.6,07:00 AM]] |
// |mk |[[5.6,11:00 AM], [8.6,10:00 AM], [12.6,06:00 AM]]|
// +---+-------------------------------------------------+
Finally:
This is a naive and straight-forward approach, but I have concerns in terms of correctness. Will the list really be ordered globally or only within a partition?
Data will be ordered globally, but the order will be destroyed by groupBy, so this is not a solution and can work only accidentally.
repartition (by Id) and sortWithinPartitions (by Id and Paid) should be a reliable replacement.
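A minimal sketch of that replacement, as I read the suggestion (assuming the df from the question; since the data is already hash-partitioned by the grouping key, the groupBy should not introduce another shuffle that would destroy the order):
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Repartition by the grouping key, sort within each partition, then collect.
df.repartition(col("Id"))
  .sortWithinPartitions(col("Id"), col("Paid").desc)
  .groupBy(col("Id"))
  .agg(collect_list(struct(col("Date"), col("Paid"))).as("UserPayments"))
  .show(false)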
