I have a dataframe joinDf created from joining the following four dataframes on userId:
val detailsDf = Seq((123,"first123","xyz"))
.toDF("userId","firstName","address")
val emailDf = Seq((123,"abc#gmail.com"),
(123,"def#gmail.com"))
.toDF("userId","email")
val foodDf = Seq((123,"food2",false,"Italian",2),
(123,"food3",true,"American",3),
(123,"food1",true,"Mediterranean",1))
.toDF("userId","foodName","isFavFood","cuisine","score")
val gameDf = Seq((123,"chess",false,2),
(123,"football",true,1))
.toDF("userId","gameName","isOutdoor","score")
val joinDf = detailsDf
.join(emailDf, Seq("userId"))
.join(foodDf, Seq("userId"))
.join(gameDf, Seq("userId"))
The user's food and game favorites should be ordered by score in ascending order.
I am trying to create a result from this joinDf where the JSON looks like the following:
[
  {
    "userId": "123",
    "firstName": "first123",
    "address": "xyz",
    "UserFoodFavourites": [
      {
        "foodName": "food1",
        "isFavFood": "true",
        "cuisine": "Mediterranean"
      },
      {
        "foodName": "food2",
        "isFavFood": "false",
        "cuisine": "Italian"
      },
      {
        "foodName": "food3",
        "isFavFood": "true",
        "cuisine": "American"
      }
    ],
    "UserEmail": [
      "abc#gmail.com",
      "def#gmail.com"
    ],
    "UserGameFavourites": [
      {
        "gameName": "football",
        "isOutdoor": "true"
      },
      {
        "gameName": "chess",
        "isOutdoor": "false"
      }
    ]
  }
]
Should I use joinDf.groupBy().agg(collect_set())?
Any help would be appreciated.
My solution is based on answers to two related questions and uses a Window function. It shows how to create a nested list of food preferences for a given userId based on the food score. First, we create a FoodDetails struct from the columns we have:
import org.apache.spark.sql.functions._
import spark.implicits._

val foodModifiedDf = foodDf.withColumn("FoodDetails",
struct("foodName","isFavFood", "cuisine","score"))
.drop("foodName","isFavFood", "cuisine","score")
println("Just printing the food details here")
foodModifiedDf.show(10, truncate = false)
Here we create a window specification that accumulates the list for a userId, ordered by FoodDetails.score in descending order. As the window function is applied, it keeps growing the list as it encounters new rows with the same userId. Once the accumulation is done, we group by userId and select the largest list.
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("userId").orderBy(desc("FoodDetails.score"))
val userAndFood = detailsDf.join(foodModifiedDf, "userId")
val newUF = userAndFood.select($"*", collect_list("FoodDetails").over(window) as "FDNew")
println(" UserAndFood dataframe after windowing function applied")
newUF.show(10, truncate = false)
val resultUF = newUF.groupBy("userId")
.agg(max("FDNew"))
println("Final result after select the maximum length list")
resultUF.show(10, truncate = false)
This is what the final result looks like:
+------+-----------------------------------------------------------------------------------------+
|userId|max(FDNew) |
+------+-----------------------------------------------------------------------------------------+
|123 |[[food3, true, American, 3], [food2, false, Italian, 2], [food1, true, Mediterranean, 1]]|
+------+-----------------------------------------------------------------------------------------+
Given this dataframe, it should be easier to write out the nested JSON.
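For example, a minimal sketch (not part of the original answer; it assumes the aggregated column keeps its default name max(FDNew)) that renames the column and serialises each row to a JSON string:
resultUF
  .withColumnRenamed("max(FDNew)", "UserFoodFavourites")
  .toJSON
  .show(truncate = false)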
The main problem with joining before grouping and collecting lists is that the join produces many records for the group by to collapse; in your example there are 12 records after the join and before the groupBy, and you would also need to worry about picking "firstName" and "address" for detailsDf out of those 12 duplicates. To avoid both problems, you can pre-process the food, email and game dataframes using struct and groupBy, and then join them to detailsDf with no risk of exploding your data due to multiple records with the same userId in the joined tables.
import org.apache.spark.sql.functions._
import spark.implicits._

val detailsDf = Seq((123,"first123","xyz"))
.toDF("userId","firstName","address")
val emailDf = Seq((123,"abc#gmail.com"),
(123,"def#gmail.com"))
.toDF("userId","email")
val foodDf = Seq((123,"food2",false,"Italian",2),
(123,"food3",true,"American",3),
(123,"food1",true,"Mediterranean",1))
.toDF("userId","foodName","isFavFood","cuisine","score")
val gameDf = Seq((123,"chess",false,2),
(123,"football",true,1))
.toDF("userId","gameName","isOutdoor","score")
val emailGrp = emailDf.groupBy("userId").agg(collect_list("email").as("UserEmail"))
val foodGrp = foodDf
.select($"userId", struct("score", "foodName","isFavFood","cuisine").as("UserFoodFavourites"))
.groupBy("userId").agg(sort_array(collect_list("UserFoodFavourites")).as("UserFoodFavourites"))
val gameGrp = gameDf
.select($"userId", struct("gameName","isOutdoor","score").as("UserGameFavourites"))
.groupBy("userId").agg(collect_list("UserGameFavourites").as("UserGameFavourites"))
val result = detailsDf.join(emailGrp, Seq("userId"))
.join(foodGrp, Seq("userId"))
.join(gameGrp, Seq("userId"))
result.show(100, false)
Output:
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|userId|firstName|address|UserEmail |UserFoodFavourites |UserGameFavourites |
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|123 |first123 |xyz |[abc#gmail.com, def#gmail.com]|[[1, food1, true, Mediterranean], [2, food2, false, Italian], [3, food3, true, American]]|[[chess, false, 2], [football, true, 1]]|
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
Since all the groupBy operations and the joins are keyed on userId, Spark will optimise this quite well.
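If you want to check that, a quick way (not part of the original answer) is to inspect the physical plan:
result.explain()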
UPDATE 1: After @user238607 pointed out that I had missed the original requirement of the food preferences being sorted by score, I did a quick fix: I placed the score column as the first element of the UserFoodFavourites struct and used the sort_array function to arrange the data in the desired order without forcing an extra shuffle. The code and its output have been updated accordingly.
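As a side note, the JSON in the question does not include score inside UserFoodFavourites. One hedged way to strip it after sorting (my own sketch, assuming Spark 2.4+ for the transform higher-order function) is:
val foodGrpNoScore = foodGrp.withColumn(
  "UserFoodFavourites",
  expr("transform(UserFoodFavourites, f -> named_struct('foodName', f.foodName, 'isFavFood', f.isFavFood, 'cuisine', f.cuisine))")
)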
Related
I have a json file, which contains data that looks like this:
"Url": "https://sample.com", "Method": "POST", "Headers": [{"Key": "accesstoken", "Value": ["123"]}, {"Key": "id", "Value": ["abc"]}, {"Key": "context", "Value": ["sample"]}]
When reading the json, I explicitly define the schema as:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType(
[
StructField('Url', StringType(), True),
StructField('Method', StringType(), True),
StructField("Headers",ArrayType(StructType([
StructField('Key', StringType(), True),
StructField("Value",ArrayType(StringType()),True),
]),True),True)
]
)
The goal is to read the Key-Value data as columns instead of rows.
+------------------+------+-----------+---+-------+
|Url               |Method|accesstoken|id |context|
+------------------+------+-----------+---+-------+
|https://sample.com|POST  |123        |abc|sample |
+------------------+------+-----------+---+-------+
Exploding the "Headers" column only transforms it into multiple rows. Another problem with the data is that, instead of having a literal key-value pair (e.g. "accesstoken": "123"), my key value pair value is stored in 2 separate pairs!
I tried to iterate over the values to create a map first, but I am not able to iterate through the "Headers" column.
df_map = df.withColumn('map', to_json(array(*[create_map(element.Key, element.Value) for element in df.Headers])))
I also tried to read the "Headers" column as MapType(StringType, ArrayType(StringType)), but it could not read the values; it showed null when I did that.
Is there any way to achieve this? Do I have to read the data as plain text and pre-process it instead of using the dataframe?
You were on the right track, but to concatenate the maps you must use a REDUCE expression:
from pyspark.sql.types import *
import pyspark.sql.functions as f
# [...] Your dataframe initialization
df = df.select('Url', 'Method', f.explode(f.expr('REDUCE(Headers, cast(map() as map<string, array<string>>), (acc, el) -> map_concat(acc, map(el.Key, el.Value)))')))
# Transform key:value into columns
df_pivot = (df
.groupBy('Url', 'Method')
.pivot('key')
.agg(f.first('value')))
array_columns = [column for column, _type in df_pivot.dtypes if _type.startswith('array')]
df_pivot = (df_pivot
.withColumn('zip', f.explode(f.arrays_zip(*array_columns)))
.select('Url', 'Method', 'zip.*'))
df_pivot.show(truncate=False)
Output
+------------------+------+-----------+-------+---+
|Url |Method|accesstoken|context|id |
+------------------+------+-----------+-------+---+
|https://sample.com|POST |123 |sample |abc|
+------------------+------+-----------+-------+---+
I read the following blog and found the API very useful:
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
The blog has lots of data selection examples. For example, given the input
{
"a": {
"b": 1
}
}
Applying events.select("a.b") in Scala, the output would be
{
"b": 1
}
But data type conversions are not mentioned in the blog. Say I have the following input:
{
"timestampInSec": "1514917353",
"ip": "123.39.76.112",
"money": "USD256",
"countInString": "6"
}
The expected output is:
{
"timestamp": "2018-01-02 11:22:33",
"ip_long": 2066173040,
"currency": "USD",
"money_amount": 256,
"count": 6
}
There are some transformations that are not included in org.apache.spark.sql.functions._:
Timestamp is in seconds and is a string type
Convert IP to long
Split USD256 into two columns and convert one of the columns to a number
Convert string to number
Another thing is error handling and default values. If there is an invalid input like:
{
"timestampInSec": "N/A",
"money": "999",
"countInString": "Number-Six"
}
It is expected that the output can be
{
"timestamp": "1970-01-01 00:00:00",
"ip_long": 0,
"currency": "NA",
"money_amount": 999,
"count": -1
}
The input timestampInSec is not a number. It is expected to use 0 and create a timestamp string as the return value.
ip is missing in the input. It is expected to use a default value of 0.
The money field is not complete. It has the money amount but is missing the currency. It is expected to use NA as the default currency and correctly translate the money_amount.
countInString is not a number. It is expected to use -1 (not 0) as the default value.
These requirements are not common and need some self-defined business logic code.
I did check some functions like to_timestamp. There is some codegen involved and it does not seem very easy to add new functions. Is there a guide or document about writing self-defined transformation functions? Is there an easy way to meet these requirements?
For all:
import org.apache.spark.sql.functions._
import spark.implicits._
Timestamp is in seconds and is a string type
val timestamp = coalesce(
$"timestampInSec".cast("long").cast("timestamp"),
lit(0).cast("timestamp")
).alias("timestamp")
Split USD256 into two columns and convert one of the columns to a number
val currencyPattern = "^([A-Z]+)?([0-9]+)$"
val currencyColumn = trim(regexp_extract($"money", currencyPattern, 1))
val currency = when(length(currencyColumn) === 0, "NA")
  .otherwise(currencyColumn)
  .alias("currency")
val amount = regexp_extract($"money", currencyPattern, 2)
.cast("decimal(38, 0)").alias("money_amount")
Convert string to number
val count = coalesce($"countInString".cast("long"), lit(-1)).alias("count")
Convert IP to long
val ipPattern = "^([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})"
val ip_long = coalesce(Seq((1, 24), (2, 16), (3, 8), (4, 0)).map {
case (group, numBits) => shiftLeft(
regexp_extract($"ip", ipPattern, group).cast("long"),
numBits
)
}.reduce(_ + _), lit(0)).alias("ip_long")
Result
val df = Seq(
("1514917353", "123.39.76.112", "USD256", "6"),
("N/A", null, "999", null)
).toDF("timestampInSec", "ip", "money", "countInString")
df.select(timestamp, currency, amount, count, ip_long).show
// +-------------------+--------+------------+-----+----------+
// | timestamp|currency|money_amount|count| ip_long|
// +-------------------+--------+------------+-----+----------+
// |2018-01-02 18:22:33| USD| 256| 6|2066173040|
// |1970-01-01 00:00:00| NA| 999| -1| 0|
// +-------------------+--------+------------+-----+----------+
Hi, I have a Spark dataframe which prints like this (single row):
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),1487530800317]
So inside the row I have wrapped arrays. I want to flatten it and create a dataframe which takes one value from each array per row; for example, the row above should be transformed into something like this:
[abc,11918,46734,1487530800317]
[abc,1233,1234,1487530800317]
So I get a dataframe with 2 rows instead of 1, and each corresponding element from the wrapped arrays goes into a new row.
Edit 1 after 1st answer:
What if I have 3 arrays in my input?
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),WrappedArray(1,2),1487530800317]
my output should be
[abc,11918,46734,1,1487530800317]
[abc,1233,1234,2,1487530800317]
Definitely not the best solution, but this would work:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)
val data = Seq(TestFormat("abc", Seq("11918","1233"),
Seq("46734","1234"), "1487530800317")).toDS
val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)
data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
.select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
.show
The problem is that, by default, you cannot be sure that both sequences are of equal length.
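If the lengths can differ, one possible workaround (my own sketch, not part of this answer) is to pad the shorter sequence before zipping, for example with empty strings:
val zipThemSafe: (Seq[String], Seq[String]) => Seq[(String, String)] =
  (b, c) => b.zipAll(c, "", "")
val udfZipSafe = udf(zipThemSafe)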
A probably better solution would be to reformat the whole dataframe into a structure that models the data, e.g. (a sketch follows the schema outline below):
root
-- a
-- d
-- records
---- b
---- c
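A minimal sketch of that nested layout, reusing the udfZip helper defined above (the record fields keep the tuple names _1/_2; this is just an illustration, not a full solution):
data
  .select($"a", $"d", udfZip($"b", $"c") as "records")
  .printSchema()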
Thanks for answering, @swebbo, your answer helped me get this done:
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
case ((a,b),c) => (a,b,c)
})
val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
$"col1", $"col2",
$"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
flattened.show
Hope that is understandable :)
I have a Spark dataframe and I would like to group the elements by a key and have the results as a sorted list.
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I make the items in the list sorted in ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to Daniel de Paula's answer regarding the sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields:
the sort by field
the result field
Since structs are sorted field by field, you'll get the order you want; all you need to do afterwards is get rid of the sort-by column in each element of the resulting list.
The same approach can be applied with several sort-by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
Employee("JSMITH", "A", 20.0),
Employee("AJOHNSON", "A", 650.0),
Employee("CBAKER", "A", 650.2),
Employee("TGREEN", "A", 13.0),
Employee("CHORTON", "B", 111.0),
Employee("AIVANOV", "B", 233.0),
Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
val getNames = udf { salaryNames: Seq[Row] =>
salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
.groupBy($"department")
.agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
.withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
.show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
Problem: I have a data set A {field1, field2, field3...}, and I would like to first group A by, say, field1, and then in each of the resulting groups run a bunch of subqueries, for example, count the number of rows that have field2 == true, or count the number of distinct field3 values that have field4 == "some_value" and field5 == false, etc.
Some alternatives I can think of: I can write a customized user-defined aggregate function that takes a function computing the filter condition, but that way I have to create an instance of it for every query condition. I've also looked at the countDistinct function, which can achieve some of these operations, but I can't figure out how to use it to implement the filter-distinct-count semantics.
In Pig, I can do:
FOREACH (GROUP A by field1) {
field_a = FILTER A by field2 == TRUE;
field_b = FILTER A by field4 == 'some_value' AND field5 == FALSE;
field_c = DISTINCT field_b.field3;
GENERATE FLATTEN(group),
COUNT(field_a) as fa,
COUNT(field_b) as fb,
COUNT(field_c) as fc;
}
Is there a way to do this in Spark SQL?
Excluding the distinct count, this can be solved by a simple sum over a condition:
import org.apache.spark.sql.functions.sum
import spark.implicits._
val df = sc.parallelize(Seq(
(1L, true, "x", "foo", true), (1L, true, "y", "bar", false),
(1L, true, "z", "foo", true), (2L, false, "y", "bar", false),
(2L, true, "x", "foo", false)
)).toDF("field1", "field2", "field3", "field4", "field5")
val left = df.groupBy($"field1").agg(
sum($"field2".cast("int")).alias("fa"),
sum(($"field4" === "foo" && ! $"field5").cast("int")).alias("fb")
)
left.show
// +------+---+---+
// |field1| fa| fb|
// +------+---+---+
// | 1| 3| 0|
// | 2| 1| 1|
// +------+---+---+
Unfortunately, the distinct count is much more tricky. The GROUP BY clause in Spark SQL doesn't physically group the data, and finding distinct elements is quite expensive. Probably the best thing you can do is to compute the distinct counts separately and simply join the results:
val right = df.where($"field4" === "foo" && ! $"field5")
.select($"field1".alias("field1_"), $"field3")
.distinct
.groupBy($"field1_")
.agg(count("*").alias("fc"))
val joined = left
.join(right, $"field1" === $"field1_", "leftouter")
.na.fill(0)
Using a UDAF to count distinct values per condition is definitely an option, but an efficient implementation will be rather tricky. Converting from the internal representation is rather expensive, and implementing a fast UDAF with collection storage is not cheap either. If you can accept an approximate solution, you could use a Bloom filter there.
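If an approximation is acceptable, another option (a sketch using approx_count_distinct, i.e. HyperLogLog++, rather than the Bloom filter mentioned above) is to null out the rows that fail the condition, since aggregate functions ignore nulls:
import org.apache.spark.sql.functions.{approx_count_distinct, sum, when}

val approx = df.groupBy($"field1").agg(
  sum($"field2".cast("int")).alias("fa"),
  sum(($"field4" === "foo" && ! $"field5").cast("int")).alias("fb"),
  approx_count_distinct(when($"field4" === "foo" && ! $"field5", $"field3")).alias("fc_approx")
)
approx.show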