Spark SQL secondary filtering and grouping - apache-spark

Problem: I have a data set A {field1, field2, field3, ...}. I would like to first group A by, say, field1, and then run a bunch of subqueries within each resulting group: for example, count the number of rows that have field2 == true, or count the number of distinct field3 values among rows that have field4 == "some_value" and field5 == false.
Some alternatives I can think of: I could write a custom user-defined aggregate function that takes a function computing the filter condition, but then I have to create an instance of it for every query condition. I've also looked at the countDistinct function, which can achieve some of these operations, but I can't figure out how to use it to implement the filter-distinct-count semantics.
In Pig, I can do:
FOREACH (GROUP A by field1) {
field_a = FILTER A by field2 == TRUE;
field_b = FILTER A by field4 == 'some_value' AND field5 == FALSE;
field_c = DISTINCT field_b.field3;
GENERATE FLATTEN(group),
COUNT(field_a) as fa,
COUNT(field_b) as fb,
COUNT(field_c) as fc;
}
Is there a way to do this in Spark SQL?

Excluding the distinct count, this can be solved with a simple sum over a condition:
import org.apache.spark.sql.functions.sum
val df = sc.parallelize(Seq(
(1L, true, "x", "foo", true), (1L, true, "y", "bar", false),
(1L, true, "z", "foo", true), (2L, false, "y", "bar", false),
(2L, true, "x", "foo", false)
)).toDF("field1", "field2", "field3", "field4", "field5")
val left = df.groupBy($"field1").agg(
sum($"field2".cast("int")).alias("fa"),
sum(($"field4" === "foo" && ! $"field5").cast("int")).alias("fb")
)
left.show
// +------+---+---+
// |field1| fa| fb|
// +------+---+---+
// | 1| 3| 0|
// | 2| 1| 1|
// +------+---+---+
Unfortunately, the distinct count is much trickier. The GROUP BY clause in Spark SQL doesn't physically group data, not to mention that finding distinct elements is quite expensive. Probably the best thing you can do is to compute the distinct counts separately and simply join the results:
val right = df.where($"field4" === "foo" && ! $"field5")
.select($"field1".alias("field1_"), $"field3")
.distinct
.groupBy($"field1_")
.agg(count("*").alias("fc"))
val joined = left
.join(right, $"field1" === $"field1_", "leftouter")
.na.fill(0)
Using a UDAF to count distinct values per condition is definitely an option, but an efficient implementation will be rather tricky. Converting from the internal representation is rather expensive, and implementing a fast UDAF with collection storage is not cheap either. If you can accept an approximate solution, you can use a Bloom filter there.
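As a reference for what the three counts mean, here is a minimal plain-Python sketch (no Spark involved) that applies the Pig semantics to the same sample rows. The function name grouped_counts and the dictionary layout are made up for this illustration; it only shows the expected numbers and says nothing about how Spark executes the query.

```python
# Same sample rows as the Scala example: (field1, field2, field3, field4, field5)
rows = [
    (1, True, "x", "foo", True), (1, True, "y", "bar", False),
    (1, True, "z", "foo", True), (2, False, "y", "bar", False),
    (2, True, "x", "foo", False),
]


def grouped_counts(rows):
    """Return {field1: (fa, fb, fc)} for the given rows (hypothetical helper)."""
    groups = {}
    for f1, f2, f3, f4, f5 in rows:
        g = groups.setdefault(f1, {"fa": 0, "fb": 0, "seen": set()})
        # A boolean contributes 1 or 0, like sum(col.cast("int")).
        g["fa"] += int(f2)
        if f4 == "foo" and not f5:
            g["fb"] += 1
            # Distinct field3 values among the rows that pass the filter.
            g["seen"].add(f3)
    return {k: (g["fa"], g["fb"], len(g["seen"])) for k, g in groups.items()}


result = grouped_counts(rows)
print(result)
```

The fa and fb values agree with the left.show output above; fc is the number the join with right is meant to add.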

Related

Spark dataframe to nested JSON

I have a dataframe joinDf created from joining the following four dataframes on userId:
val detailsDf = Seq((123,"first123","xyz"))
.toDF("userId","firstName","address")
val emailDf = Seq((123,"abc#gmail.com"),
(123,"def#gmail.com"))
.toDF("userId","email")
val foodDf = Seq((123,"food2",false,"Italian",2),
(123,"food3",true,"American",3),
(123,"food1",true,"Mediterranean",1))
.toDF("userId","foodName","isFavFood","cuisine","score")
val gameDf = Seq((123,"chess",false,2),
(123,"football",true,1))
.toDF("userId","gameName","isOutdoor","score")
val joinDf = detailsDf
.join(emailDf, Seq("userId"))
.join(foodDf, Seq("userId"))
.join(gameDf, Seq("userId"))
User's food and game favorites should be ordered by score in the ascending order.
I am trying to create a result from this joinDf where the JSON looks like the following:
[
{
"userId": "123",
"firstName": "first123",
"address": "xyz",
"UserFoodFavourites": [
{
"foodName": "food1",
"isFavFood": "true",
"cuisine": "Mediterranean"
},
{
"foodName": "food2",
"isFavFood": "false",
"cuisine": "Italian"
},
{
"foodName": "food3",
"isFavFood": "true",
"cuisine": "American"
}
],
"UserEmail": [
"abc#gmail.com",
"def#gmail.com"
],
"UserGameFavourites": [
{
"gameName": "football",
"isOutdoor": "true"
},
{
"gameName": "chess",
"isOutdoor": "false"
}
]
}
]
Should I use joinDf.groupBy().agg(collect_set())?
Any help would be appreciated.
My solution is based on the answers found here and here
It uses the Window function. It shows how to create a nested list of food preferences for a given userid based on the food score. Here we are creating a struct of FoodDetails from the columns we have
val foodModifiedDf = foodDf.withColumn("FoodDetails",
struct("foodName","isFavFood", "cuisine","score"))
.drop("foodName","isFavFood", "cuisine","score")
println("Just printing the food details here")
foodModifiedDf.show(10, truncate = false)
Here we are creating a Windowing function which will accumulate the list for a userId based on the FoodDetails.score in descending order. The windowing function when applied goes on accumulating the list as it encounters new rows with same userId. After we have done accumulating, we have to do a groupBy over the userId to select the largest list.
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("userId").orderBy( desc("FoodDetails.score"))
val userAndFood = detailsDf.join(foodModifiedDf, "userId")
val newUF = userAndFood.select($"*", collect_list("FoodDetails").over(window) as "FDNew")
println(" UserAndFood dataframe after windowing function applied")
newUF.show(10, truncate = false)
val resultUF = newUF.groupBy("userId")
.agg(max("FDNew"))
println("Final result after select the maximum length list")
resultUF.show(10, truncate = false)
This is what the final result looks like:
+------+-----------------------------------------------------------------------------------------+
|userId|max(FDNew) |
+------+-----------------------------------------------------------------------------------------+
|123 |[[food3, true, American, 3], [food2, false, Italian, 2], [food1, true, Mediterranean, 1]]|
+------+-----------------------------------------------------------------------------------------+
Given this dataframe, it should be easier to write out the nested json.
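Why taking the maximum of the accumulated lists returns the complete list can be seen in a small plain-Python sketch. It assumes that Spark compares arrays the way Python compares lists (element by element, with a strict prefix ordered before the longer list); the variable names are made up for this illustration.

```python
# Food rows for one user: (foodName, isFavFood, cuisine, score)
foods = [
    ("food2", False, "Italian", 2),
    ("food3", True, "American", 3),
    ("food1", True, "Mediterranean", 1),
]

# Order by score descending, as the window's orderBy does.
ordered = sorted(foods, key=lambda f: f[3], reverse=True)

# A running collect_list over the window yields one growing prefix per row.
prefixes = [ordered[: i + 1] for i in range(len(ordered))]

# A list that is a strict prefix of another compares as smaller,
# so max() picks the complete list.
full = max(prefixes)
print(full)
```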
The main problem with joining before grouping and collecting lists is that the join produces a lot of records for the groupBy to collapse. In your example there are 12 records after the join and before the groupBy. You would also have to pick firstName and address from detailsDf out of 12 duplicates. To avoid both problems, you could pre-process the food, email and game dataframes using struct and groupBy, and then join them to detailsDf with no risk of exploding your data due to multiple records with the same userId in the joined tables.
val detailsDf = Seq((123,"first123","xyz"))
.toDF("userId","firstName","address")
val emailDf = Seq((123,"abc#gmail.com"),
(123,"def#gmail.com"))
.toDF("userId","email")
val foodDf = Seq((123,"food2",false,"Italian",2),
(123,"food3",true,"American",3),
(123,"food1",true,"Mediterranean",1))
.toDF("userId","foodName","isFavFood","cuisine","score")
val gameDf = Seq((123,"chess",false,2),
(123,"football",true,1))
.toDF("userId","gameName","isOutdoor","score")
val emailGrp = emailDf.groupBy("userId").agg(collect_list("email").as("UserEmail"))
val foodGrp = foodDf
.select($"userId", struct("score", "foodName","isFavFood","cuisine").as("UserFoodFavourites"))
.groupBy("userId").agg(sort_array(collect_list("UserFoodFavourites")).as("UserFoodFavourites"))
val gameGrp = gameDf
.select($"userId", struct("gameName","isOutdoor","score").as("UserGameFavourites"))
.groupBy("userId").agg(collect_list("UserGameFavourites").as("UserGameFavourites"))
val result = detailsDf.join(emailGrp, Seq("userId"))
.join(foodGrp, Seq("userId"))
.join(gameGrp, Seq("userId"))
result.show(100, false)
Output:
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|userId|firstName|address|UserEmail |UserFoodFavourites |UserGameFavourites |
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|123 |first123 |xyz |[abc#gmail.com, def#gmail.com]|[[1, food1, true, Mediterranean], [2, food2, false, Italian], [3, food3, true, American]]|[[chess, false, 2], [football, true, 1]]|
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
As all the groupBy operations and the joins are done on userId, Spark will optimise it quite well.
UPDATE 1: After @user238607 pointed out that I had missed the original requirement that food preferences be sorted by score, I made a quick fix: I placed the score column first in the UserFoodFavourites struct and used the sort_array function to arrange the data in the desired order without forcing an extra shuffle. The code and its output are updated.
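The "aggregate each child table first, then attach it to the user" idea can be sketched without Spark in a few lines of plain Python. The helper collect and the variable names are hypothetical, and booleans and the userId are emitted with their native JSON types rather than as the strings shown in the question.

```python
import json

details = [(123, "first123", "xyz")]
emails = [(123, "abc#gmail.com"), (123, "def#gmail.com")]
foods = [(123, "food2", False, "Italian", 2),
         (123, "food3", True, "American", 3),
         (123, "food1", True, "Mediterranean", 1)]
games = [(123, "chess", False, 2), (123, "football", True, 1)]


def collect(rows, to_item, sort_key=None):
    """Group rows by userId (first column) into lists, optionally sorted first."""
    if sort_key is not None:
        rows = sorted(rows, key=sort_key)
    grouped = {}
    for row in rows:
        grouped.setdefault(row[0], []).append(to_item(row))
    return grouped


email_grp = collect(emails, lambda r: r[1])
food_grp = collect(
    foods,
    lambda r: {"foodName": r[1], "isFavFood": r[2], "cuisine": r[3]},
    sort_key=lambda r: r[4],  # ascending score
)
game_grp = collect(
    games,
    lambda r: {"gameName": r[1], "isOutdoor": r[2]},
    sort_key=lambda r: r[3],  # ascending score
)

# One output record per user, so nothing has to be de-duplicated afterwards.
users = [
    {"userId": uid, "firstName": first, "address": addr,
     "UserFoodFavourites": food_grp.get(uid, []),
     "UserEmail": email_grp.get(uid, []),
     "UserGameFavourites": game_grp.get(uid, [])}
    for uid, first, addr in details
]
print(json.dumps(users, indent=2))
```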

Conditional aggregation Spark DataFrame

I would like to understand the best way to do an aggregation in Spark in this scenario:
import java.sql.Timestamp
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
case class Person(name:String, acc:Int, logDate:String)
val dateFormat = "dd/MM/yyyy"
val filterType = "MIN" // or "MAX"; chosen by a run parameter
val filterDate = new Timestamp(System.currentTimeMillis)
val df = sc.parallelize(List(Person("Giorgio",20,"31/12/9999"),
Person("Giorgio",30,"12/10/2009"),
Person("Diego", 10,"12/10/2010"),
Person("Diego", 20,"12/10/2010"),
Person("Diego", 30,"22/11/2011"),
Person("Giorgio",10,"31/12/9999"),
Person("Giorgio",30,"31/12/9999"))).toDF()
val df2 = df.withColumn("logDate",unix_timestamp($"logDate",dateFormat).cast(TimestampType))
val df3 = df2.groupBy("name").agg(/*conditional aggregation*/)
df3.show /*Expected output show below */
Basically, I want to group all records by the name column and then, based on the filterType parameter, filter all valid records for a Person. After filtering, I want to sum all acc values, obtaining a final DataFrame with name and totalAcc columns.
For example:
filterType = MIN: I want to take all records having min(logDate) (there could be many of them), so in this case I completely ignore the filterDate param:
Diego,10,12/10/2010
Diego,20,12/10/2010
Giorgio,30,12/10/2009
Final result expected from aggregation is: (Diego, 30),(Giorgio,30)
filterType = MAX: I want to take all records with logDate > filterDate. If for a key I don't have any records satisfying this condition, I need to take the records with min(logDate), as in the MIN scenario, so:
Diego, 10, 12/10/2010
Diego, 20, 12/10/2010
Giorgio, 20, 31/12/9999
Giorgio, 10, 31/12/9999
Giorgio, 30, 31/12/9999
Final result expected from aggregation is: (Diego,30),(Giorgio,60)
In this case, Diego has no records with logDate > filterDate, so I fall back to the MIN scenario, taking just for Diego all records with the min logDate.
You can write your conditional aggregation using when/otherwise as
df2.groupBy("name").agg(sum(when(lit(filterType) === "MIN" && $"logDate" < filterDate, $"acc").otherwise(when(lit(filterType) === "MAX" && $"logDate" > filterDate, $"acc"))).as("sum"))
.filter($"sum".isNotNull)
which would give you your desired output according to filterType
However, you will eventually need both aggregated dataframes, so I would suggest dropping the filterType field and instead creating an additional grouping column with when/otherwise. That way you can have both aggregated values in one dataframe:
df2.withColumn("additionalGrouping", when($"logDate" < filterDate, "less").otherwise("more"))
.groupBy("name", "additionalGrouping").agg(sum($"acc"))
.drop("additionalGrouping")
.show(false)
which would output as
+-------+--------+
|name |sum(acc)|
+-------+--------+
|Diego |10 |
|Giorgio|60 |
+-------+--------+
Updated
Since the question is updated with the logic changed, here is the idea and solution to the changed scenario
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("name").orderBy($"logDate".asc)
val minDF = df2.withColumn("minLogDate", first("logDate").over(windowSpec)).filter($"minLogDate" === $"logDate")
.groupBy("name")
.agg(sum($"acc").as("sum"))
val finalDF =
if(filterType == "MIN") {
minDF
}
else if(filterType == "MAX"){
val tempMaxDF = df2
.groupBy("name")
.agg(sum(when($"logDate" > filterDate,$"acc")).as("sum"))
tempMaxDF.filter($"sum".isNull).drop("sum").join(minDF, Seq("name"), "left").union(tempMaxDF.filter($"sum".isNotNull))
}
else {
df2
}
so for filterType = MIN you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|30 |
+-------+---+
and for filterType = MAX you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|60 |
+-------+---+
If filterType is neither MAX nor MIN, the original dataframe is returned.
I hope the answer is helpful
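To check the MIN and MAX rules against the expected totals, the selection logic can be written out in plain Python, independent of Spark. This is only a sketch: total_acc is a hypothetical helper, filter_date is fixed to 1 January 2020 as a stand-in for "now", and unknown filter types raise an error here instead of returning the input.

```python
from collections import defaultdict
from datetime import datetime

records = [
    ("Giorgio", 20, "31/12/9999"), ("Giorgio", 30, "12/10/2009"),
    ("Diego", 10, "12/10/2010"), ("Diego", 20, "12/10/2010"),
    ("Diego", 30, "22/11/2011"), ("Giorgio", 10, "31/12/9999"),
    ("Giorgio", 30, "31/12/9999"),
]


def total_acc(records, filter_type, filter_date):
    """Sum acc per name following the MIN / MAX-with-fallback rules."""
    if filter_type not in ("MIN", "MAX"):
        raise ValueError("filter_type must be 'MIN' or 'MAX'")
    by_name = defaultdict(list)
    for name, acc, log_date in records:
        by_name[name].append((datetime.strptime(log_date, "%d/%m/%Y"), acc))
    totals = {}
    for name, rows in by_name.items():
        selected = []
        if filter_type == "MAX":
            selected = [acc for d, acc in rows if d > filter_date]
        if not selected:
            # MIN, or MAX with no record after filter_date: use the earliest date.
            earliest = min(d for d, _ in rows)
            selected = [acc for d, acc in rows if d == earliest]
        totals[name] = sum(selected)
    return totals


filter_date = datetime(2020, 1, 1)  # assumed value, not from the question
print(total_acc(records, "MIN", filter_date))
print(total_acc(records, "MAX", filter_date))
```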
You don't need conditional aggregation. Just filter:
df2
.where(if (filterType == "MAX") $"logDate" > filterDate else $"logDate" < filterDate)
.groupBy("name").agg(sum($"acc"))

Order Spark SQL Dataframe with nested values / complex data types

My goal is to collect an ordered list of nested values. It should be ordered based on an element in the nested list. I tried out different approaches but have concerns in terms of performance and correctness.
Order globally
case class Payment(Id: String, Date: String, Paid: Double)
val payments = Seq(
Payment("mk", "10:00 AM", 8.6D),
Payment("mk", "06:00 AM", 12.6D),
Payment("yc", "07:00 AM", 16.6D),
Payment("yc", "09:00 AM", 2.6D),
Payment("mk", "11:00 AM", 5.6D)
)
val df = spark.createDataFrame(payments)
// order globally
df.orderBy(col("Paid").desc)
.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
.withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
.show(false)
+---+-------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------+--------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[06:00 AM,12.6], [10:00 AM,8.6], [11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------+--------------+------------------+
This is a naive and straight-forward approach, but I have concerns in terms of correctness. Will the list really be ordered globally or only within a partition?
Window function
// use Window
val window = Window.partitionBy(col("Id")).orderBy(col("Paid").desc)
df.withColumn("rank", rank().over(window))
.groupBy(col("Id"))
.agg(
collect_list(struct(col("rank"), col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
.withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
.show(false)
+---+-------------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------------+--------------+------------------+
|yc |[[1,07:00 AM,16.6], [2,09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[1,06:00 AM,12.6], [2,10:00 AM,8.6], [3,11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------------+--------------+------------------+
This should work, or am I missing something?
Order in UDF on-the-fly
// order in UDF
val largestPaymentDate = udf((lr: Seq[Row]) => {
lr.max(Ordering.by((l: Row) => l.getAs[Double]("Paid"))).getAs[String]("Date")
})
df.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPaymentDate", largestPaymentDate(col("UserPayments")))
.show(false)
+---+-------------------------------------------------+------------------+
|Id |UserPayments |LargestPaymentDate|
+---+-------------------------------------------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |07:00 AM |
|mk |[[10:00 AM,8.6], [06:00 AM,12.6], [11:00 AM,5.6]]|06:00 AM |
+---+-------------------------------------------------+------------------+
I guess there is nothing to complain about here in terms of correctness. But for the following operations, I'd prefer the list to be ordered so that I don't have to order it explicitly every time.
I tried to write a UDF which takes the list as input and returns the ordered list, but returning a list was too painful and I gave up.
I'd reverse the order of the struct and aggregate with max:
val result = df
.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))) as "UserPayments",
max(struct(col("Paid"), col("Date"))) as "MaxPayment"
)
result.show
// +---+--------------------+---------------+
// | Id| UserPayments| MaxPayment|
// +---+--------------------+---------------+
// | yc|[[07:00 AM,16.6],...|[16.6,07:00 AM]|
// | mk|[[10:00 AM,8.6], ...|[12.6,06:00 AM]|
// +---+--------------------+---------------+
You can later flatten the struct:
result.select($"id", $"UserPayments", $"MaxPayment.*").show
// +---+--------------------+----+--------+
// | id| UserPayments|Paid| Date|
// +---+--------------------+----+--------+
// | yc|[[07:00 AM,16.6],...|16.6|07:00 AM|
// | mk|[[10:00 AM,8.6], ...|12.6|06:00 AM|
// +---+--------------------+----+--------+
In the same way, you can apply sort_array to the reordered structs:
df
.groupBy(col("Id"))
.agg(
sort_array(collect_list(struct(col("Paid"), col("Date")))) as "UserPayments"
)
.show(false)
// +---+-------------------------------------------------+
// |Id |UserPayments |
// +---+-------------------------------------------------+
// |yc |[[2.6,09:00 AM], [16.6,07:00 AM]] |
// |mk |[[5.6,11:00 AM], [8.6,10:00 AM], [12.6,06:00 AM]]|
// +---+-------------------------------------------------+
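The reason for putting Paid first is that structs are compared field by field. Python tuples follow the same rule, so the effect can be sketched without Spark; the dictionary names below are made up for this illustration.

```python
payments = [
    ("mk", "10:00 AM", 8.6), ("mk", "06:00 AM", 12.6),
    ("yc", "07:00 AM", 16.6), ("yc", "09:00 AM", 2.6),
    ("mk", "11:00 AM", 5.6),
]

# Collect (Paid, Date) pairs per Id, with the sort field first.
by_id = {}
for pid, date, paid in payments:
    by_id.setdefault(pid, []).append((paid, date))

# Tuples compare on the first field, then the second, like structs.
largest = {pid: max(items) for pid, items in by_id.items()}
ordered = {pid: sorted(items) for pid, items in by_id.items()}

print(largest)
print(ordered)
```

With (Date, Paid) ordering instead, max would return the latest time string rather than the largest payment.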
Finally, regarding the question "Will the list really be ordered globally or only within a partition?":
The data will be ordered globally, but groupBy destroys that order, so this is not a solution; it can work only by accident.
repartition (by Id) followed by sortWithinPartitions (by Id and Paid) should be a reliable replacement.

Find all nulls with SQL query over pyspark dataframe

I have a dataframe of StructField with a mixed schema (DoubleType, StringType, LongType, etc.).
I want to 'iterate' over all columns to return summary statistics. For instance:
set_min = df.select([
fn.min(self.df[c]).alias(c) for c in self.df.columns
]).collect()
This is what I'm using to find the minimum value in each column, and it works fine. But when I try something similar to find nulls:
set_null = df.filter(
(lambda x: self.df[x]).isNull().count()
).collect()
I get TypeError: condition should be string or Column, which makes sense, since I'm passing a function.
Or with a list comprehension:
set_null = self.df[c].alias(c).isNull() for c in self.df.columns
Then I try to pass it a SQL query as a string:
set_null = df.filter('SELECT fields FROM table WHERE column = NUL').collect()
I get:
ParseException: "\nmismatched input 'FROM' expecting <EOF>(line 1, pos 14)\n\n== SQL ==\nSELECT fields FROM table WHERE column = NULL\n--------------^^^\n"
How can I pass my function as a 'string or Column' so that I can use filter or where? Alternatively, why won't the pure SQL statement work?
There are things wrong in several parts of your attempts:
You are missing square brackets in your list comprehension example
You missed an L in NUL
Your pure SQL doesn't work, because filter/where expects a where clause, not a full SQL statement; they are just aliases and I prefer to use where so it is clearer you just need to give such a clause
In the end you don't need to use where, like karlson also shows. But subtracting from the total count means you have to evaluate the dataframe twice (which can be alleviated by caching, but still not ideal). There is a more direct way:
>>> df.select([fn.sum(fn.isnull(c).cast('int')).alias(c) for c in df.columns]).show()
+---+---+
| A| B|
+---+---+
| 2| 3|
+---+---+
This works because casting a boolean value to an integer gives 1 for True and 0 for False. If you prefer SQL, the equivalent is:
df.select([fn.expr('SUM(CAST(({c} IS NULL) AS INT)) AS {c}'.format(c=c)) for c in df.columns]).show()
or nicer, without a cast:
df.select([fn.expr('SUM(IF({c} IS NULL, 1, 0)) AS {c}'.format(c=c)) for c in df.columns]).show()
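The "sum of booleans cast to int" trick is not specific to Spark. A minimal plain-Python sketch over six illustrative rows with columns A and B (the names columns, rows and null_counts are made up here) shows the same arithmetic in a single pass over the data:

```python
columns = ("A", "B")
rows = [(1, None), (1, 1), (None, None), (1, 1), (None, 1), (1, None)]

# int(True) == 1 and int(False) == 0, so summing the flags counts the nulls.
null_counts = {
    name: sum(int(row[i] is None) for row in rows)
    for i, name in enumerate(columns)
}
print(null_counts)
```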
If you want a count of NULL values per column you could count the non-null values and subtract from the total.
For example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame(
data=[
(1, None),
(1, 1),
(None, None),
(1, 1),
(None, 1),
(1, None),
],
schema=("A", "B")
)
total = df.count()
missing_counts = df.select(
*[(total - fn.count(col)).alias("missing(%s)" % col) for col in df.columns]
)
missing_counts.show()
>>> +----------+----------+
... |missing(A)|missing(B)|
... +----------+----------+
... | 2| 3|
... +----------+----------+

Spark Dataframe groupBy and sort results into a list

I have a Spark Dataframe and I would like to group the elements by a key and have the results as a sorted list
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I make the items in the list sorted in ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to the answer of Daniel de Paula regarding sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields:
the sort by field
the result field
Since structs are sorted field by field, you'll get the order you want, all you need is to get rid of the sort by column in each element of the resulting list.
The same approach can be applied with several sort by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
Employee("JSMITH", "A", 20.0),
Employee("AJOHNSON", "A", 650.0),
Employee("CBAKER", "A", 650.2),
Employee("TGREEN", "A", 13.0),
Employee("CHORTON", "B", 111.0),
Employee("AIVANOV", "B", 233.0),
Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
val getNames = udf { salaryNames: Seq[Row] =>
salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
.groupBy($"department")
.agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
.withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
.show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
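The same "sort on a leading field, then drop it" pattern can be sketched in plain Python, where tuples play the role of the structs. This is only an illustration of the ordering logic, with made-up variable names, not of how Spark evaluates it.

```python
employees = [
    ("JSMITH", "A", 20.0), ("AJOHNSON", "A", 650.0), ("CBAKER", "A", 650.2),
    ("TGREEN", "A", 13.0), ("CHORTON", "B", 111.0), ("AIVANOV", "B", 233.0),
    ("VSMIRNOV", "B", 11.0),
]

# Collect (salary, name) pairs per department, sort field first.
by_dept = {}
for name, dept, salary in employees:
    by_dept.setdefault(dept, []).append((salary, name))

# Sort descending on salary, then keep only the name from each pair.
names_sorted_by_salary = {
    dept: [name for _, name in sorted(pairs, reverse=True)]
    for dept, pairs in by_dept.items()
}
print(names_sorted_by_salary)
```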

Resources