SparkSQL Windows: Creating Frame Based On Array Column - apache-spark

I am looking to use SparkSQL's window function, but with a custom condition on the frame specification.
The dataframe being operated on is as follows:
+--------------------+--------------------+--------------------+-----+
| userid| elementid| prerequisites|score|
+--------------------+--------------------+--------------------+-----+
|a |1 |[] | 1 |
|a |2 |[] | 1 |
|a |3 |[] | 1 |
|b |1 |[] | 1 |
|a |4 |[1, 2] | 1 |
+--------------------+--------------------+--------------------+-----+
Every element in the prerequisites column is a value in another row's elementid column.
I would like to create a window where I partition by userid, and then grab all preceding rows where elementid is contained in the present row's prerequisites column.
Once I attain this window, I want to perform a sum on the score column.
Desired output for the above example:
+--------------------+--------------------+--------------------+-----+
| userid| elementid| prerequisites|sum |
+--------------------+--------------------+--------------------+-----+
|a |1 |[] | 0 |
|a |2 |[] | 0 |
|a |3 |[] | 0 |
|b |1 |[] | 0 |
|a |4 |[1, 2] | 2 |
+--------------------+--------------------+--------------------+-----+
Notice how because user a is the only user with the prerequisites of its element preceding it, its the only one with > 0 sum.
The closest question I saw was this question, which utilises collect_list.
However, that doesn't construct a window so much as collect a potential list of IDs. Anyone have any ideas on how to construct the aforementioned window?

scala> import org.apache.spark.sql.expressions.{Window,UserDefinedFunction}
scala> df.show()
+------+---------+-------------+-----+
|userid|elementid|prerequisites|score|
+------+---------+-------------+-----+
| a| 1| []| 1|
| a| 2| []| 1|
| a| 3| []| 1|
| b| 1| []| 1|
| a| 4| [1, 2]| 1|
+------+---------+-------------+-----+
scala> df.printSchema
root
|-- userid: string (nullable = true)
|-- elementid: string (nullable = true)
|-- prerequisites: array (nullable = true)
| |-- element: string (containsNull = true)
|-- score: string (nullable = true)
scala> val W = Window.partitionBy("userid")
scala> val df1 = df.withColumn("elementidList", collect_set(col("elementid")).over(W))
.withColumn("elementidScoreMap", map_from_arrays(col("elementidList"), collect_list(col("score").cast("long")).over(W)))
.withColumn("common", array_intersect(col("prerequisites"), col("elementidList")))
.drop("elementidList", "score")
scala> def getSumUDF:UserDefinedFunction = udf((Score:Map[String,Long], Id:String) => {
| var out:Long = 0
| Id.split(",").foreach{ x => out = Score(x.toString) + out}
| out})
scala> df1.withColumn("sum", when(size(col("common")) =!= 0 ,getSumUDF(col("elementidScoreMap"), concat_ws(",",col("prerequisites")))).otherwise(lit(0)))
.drop("elementidScoreMap", "common")
.show()
+------+---------+-------------+---+
|userid|elementid|prerequisites|sum|
+------+---------+-------------+---+
| b| 1| []| 0|
| a| 1| []| 0|
| a| 2| []| 0|
| a| 3| []| 0|
| a| 4| [1, 2]| 2|
+------+---------+-------------+---+

Related

Spark-Scala : Create split rows based on the value of other column

I have an Input as below
id
size
1
4
2
2
output - If input is 4 (size column) split 4 times(1-4) and if input size column value is 2 split it
1-2 times.
id
size
1
1
1
2
1
3
1
4
2
1
2
2
You can create an array of sequence from 1 to size using sequence function and then to explode it:
import org.apache.spark.sql.functions._
val df = Seq((1,4), (2,2)).toDF("id", "size")
df
.withColumn("size", explode(sequence(lit(1), col("size"))))
.show(false)
The output would be:
+---+----+
|id |size|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|2 |1 |
|2 |2 |
+---+----+
You can use first use sequence function to create sequence from 1 to size and then explode it.
val df = input.withColumn("seq", sequence(lit(1), $"size"))
df.show()
+---+----+------------+
| id|size| seq|
+---+----+------------+
| 1| 4|[1, 2, 3, 4]|
| 2| 2| [1, 2]|
+---+----+------------+
df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+----+
You could turn your size column into an incrementing sequence using Seq.range and then explode the arrays. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}
// Original dataframe
val df = Seq((1,4), (2,2)).toDF("id", "size")
// Mapping over this dataframe: turning each row into (idx, array)
val df_with_array = df
.map(row => {
(row.getInt(0), Seq.range(1, row.getInt(1) + 1))
})
.toDF("id", "array")
.select(col("id"), explode(col("array")))
output.show()
+---+---+
| id|col|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+---+

Set new column value based on latest record

I have a dataframe which is similar to below
+-------+-------+----------+
|dept_id|user_id|entry_date|
+-------+-------+----------+
| 3| 1|2020-06-03|
| 3| 2|2020-06-03|
| 3| 3|2020-06-03|
| 3| 4|2020-06-03|
| 3| 1|2020-06-04|
| 3| 1|2020-06-05|
+-------+-------+----------+
Now I need to add a new column which should indicate the latest entry date of the user. 1 means latest, 0 means old
+-------+-------+----------+----------
|dept_id|user_id|entry_date|latest_rec
+-------+-------+----------+----------
| 3| 1|2020-06-03|0
| 3| 2|2020-06-03|1
| 3| 3|2020-06-03|1
| 3| 4|2020-06-03|1
| 3| 1|2020-06-04|0
| 3| 1|2020-06-05|1
+-------+-------+----------+---------
I tried by finding rank of the user
val win = Window.partitionBy("dept_id", "user_id").orderBy(asc("entry_date"))
someDF.withColumn("rank_num",rank().over(win))
Now stuck with how to populate the latest_rec column based on the rank_num column. How should I proceed with the next step?
I'd use row_number to find the max date, and then derive your indicator based on that.
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("dept_id", "user_id").orderBy("entry_date")
val win = <your df>.withColumn("der_rank",row_number().over(windowSpec))
val final = win.withColumn("latest_rec",when("der_rank" === 1,1).otherwise(0))
Instead of using rank, get the last when you partitionBy dept_id, user_id and orderBy entry_date, range from currentRow to unboundedFollowingRow as latest_entry_date. Then compare entry_date with latest_entry_date and set the latest_rec values accordingly.
scala> df.show+-------+-------+----------+
|dept_id|user_id|entry_date|
+-------+-------+----------+
| 3| 1|2020-06-03|
| 3| 2|2020-06-03|
| 3| 3|2020-06-03|
| 3| 4|2020-06-03|
| 3| 1|2020-06-04|
| 3| 1|2020-06-05|
+-------+-------+----------+
scala> val win = Window.partitionBy("dept_id","user_id").orderBy("entry_date").rowsBetween(Window.currentRow, Window.unboundedFollowing)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec#b3f21c2
scala> df.withColumn("latest_entry_date", last($"entry_date", true).over(win)).show+-------+-------+----------+-----------------+
|dept_id|user_id|entry_date|latest_entry_date|
+-------+-------+----------+-----------------+
| 3| 1|2020-06-03| 2020-06-05|
| 3| 1|2020-06-04| 2020-06-05|
| 3| 1|2020-06-05| 2020-06-05|
| 3| 3|2020-06-03| 2020-06-03|
| 3| 2|2020-06-03| 2020-06-03|
| 3| 4|2020-06-03| 2020-06-03|
+-------+-------+----------+-----------------+
scala> df.withColumn("latest_entry_date", last($"entry_date", true).over(win)).withColumn("latest_rec", when($"entry_date" === $"latest_entry_date", 1).otherwise(0)).show
+-------+-------+----------+-----------------+----------+
|dept_id|user_id|entry_date|latest_entry_date|latest_rec|
+-------+-------+----------+-----------------+----------+
| 3| 1|2020-06-03| 2020-06-05| 0|
| 3| 1|2020-06-04| 2020-06-05| 0|
| 3| 1|2020-06-05| 2020-06-05| 1|
| 3| 3|2020-06-03| 2020-06-03| 1|
| 3| 2|2020-06-03| 2020-06-03| 1|
| 3| 4|2020-06-03| 2020-06-03| 1|
+-------+-------+----------+-----------------+----------+
Another alternative approach:
Load the test data provided
val data =
"""
|dept_id|user_id|entry_date
| 3| 1|2020-06-03
| 3| 2|2020-06-03
| 3| 3|2020-06-03
| 3| 4|2020-06-03
| 3| 1|2020-06-04
| 3| 1|2020-06-05
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df1 = spark.read
.option("sep", ",")
// .option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +-------+-------+----------+
* |dept_id|user_id|entry_date|
* +-------+-------+----------+
* |3 |1 |2020-06-03|
* |3 |2 |2020-06-03|
* |3 |3 |2020-06-03|
* |3 |4 |2020-06-03|
* |3 |1 |2020-06-04|
* |3 |1 |2020-06-05|
* +-------+-------+----------+
*
* root
* |-- dept_id: string (nullable = true)
* |-- user_id: string (nullable = true)
* |-- entry_date: string (nullable = true)
*/
Use max(entry_date) over(partition by 'dept_id', 'user_id')
val w = Window.partitionBy("dept_id", "user_id")
val latestRec = when(datediff(max(to_date($"entry_date")).over(w), to_date($"entry_date")) =!= lit(0), 0)
.otherwise(1)
df1.withColumn("latest_rec", latestRec)
.orderBy("dept_id", "user_id", "entry_date")
.show(false)
/**
* +-------+-------+----------+----------+
* |dept_id|user_id|entry_date|latest_rec|
* +-------+-------+----------+----------+
* |3 |1 |2020-06-03|0 |
* |3 |1 |2020-06-04|0 |
* |3 |1 |2020-06-05|1 |
* |3 |2 |2020-06-03|1 |
* |3 |3 |2020-06-03|1 |
* |3 |4 |2020-06-03|1 |
* +-------+-------+----------+----------+
*/

how to execute many expressions in the selectExpr

it is possible to apply many expression in the same selectExpr,
for example If I have this DF:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
how to multiply by 2 and rename the column as this :
df.selectExpr("i*2 as multiplication")
def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have many expressions you have to pass them comma separated strings. Please check below code.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+
Yes, you can pass multiple expressions inside the df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, lanme: String)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala > personDs.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Or else you can also use withColumn to achieve the same results
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+

Spark DataFrame select null value

I have a spark dataframe with few columns as null. I need to create a new dataframe , adding a new column "error_desc" which will mention all the columns with null values for every row. I need to do this dynamically without mentioning each column name.
eg: if my dataframe is below
+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
| 1| a| b|
| 2| null| d|
| 3| m| null|
+-----+------+------+
my final dataframe should be
+-----+------+-----+--------------+
|Rowid|Record|Value| error_desc|
+-----+------+-----+--------------+
| 1| a| b| null|
| 2| null| d|record is null|
| 3| m| null| value is null|
+-----+------+-----+--------------+
I have added few more rows in Input DataFrame to cover more cases. You do not required to hard code any column. Use below UDF, it will give your desire output.
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
| 1| a| b|
| 2| null| d|
| 3| m| null|
| 4| null| d|
| 5| null| null|
| null| e| null|
| 7| e| r|
+-----+------+-----+
scala> def CheckNull:UserDefinedFunction = udf((Column:String,r:Row) => {
| var check:String = ""
| val ColList = Column.split(",").toList
| ColList.foreach{ x =>
| if (r.getAs(x) == null)
| {
| check = check + x.toString + " is null. "
| }}
| check
| })
scala> df.withColumn("error_desc",CheckNull(lit(df.columns.mkString(",")),struct(df.columns map col: _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc |
+-----+------+-----+-------------------------------+
|1 |a |b | |
|2 |null |d |Record is null. |
|3 |m |null |Value is null. |
|4 |null |d |Record is null. |
|5 |null |null |Record is null. Value is null. |
|null |e |null |Rowid is null. Value is null. |
|7 |e |r | |
+-----+------+-----+-------------------------------+

splitting content of a pyspark dataframe column and aggregating them into new columns

I am trying to extract and split the data within pyspark dataframe column, following which, aggregate it into a new columns.
Input Table.
+--+-----------+
|id|description|
+--+-----------+
|1 | 3:2,3|2:1|
|2 | 2 |
|3 | 2:12,16 |
|4 | 3:2,4,6 |
|5 | 2 |
|6 | 2:3,7|2:3|
+--------------+
Desired Output.
+--+-----------+-------+-----------+
|id|description|sum_emp|org_changed|
+--+-----------+-------+-----------+
|1 | 3:2,3|2:1| 5 | 3 |
|2 | 2 | 2 | 0 |
|3 | 2:12,16 | 2 | 2 |
|4 | 3:2,4,6 | 3 | 3 |
|5 | 2 | 2 | 0 |
|6 | 2:3,7|2:3| 4 | 3 |
+--------------+-------+-----------+
Before the ":", values ought to be added. The values post the ":" are to be counted. The | marks the shift in the record(can be ignored)
Some data points are as long as 2:3,4,5|3:4,6,3|4:3,7,8
Any help would be greatly appreciated
Scenario Explained:
Considering the 6th id for example. The 6 refers to a biz unit id. The 'Description' column describes the team within that given unit.
Now for the meaning of the values 2:3,7|2:3 are as follows:
1)Fist 2 followed by 3&7 = there are 2 folks in the team and one of them has been in another org for 3 years and for 7 years (perhaps its the second guys first company)
2)Second 2 followed by 3 = there are 2 folks again in a sub team, and 1 person has spent 3 years in another org.
Desired output:
sum_emp = total number of employees in that given biz unit.
org_changed = total number of organizations folks in that biz unit have changed.
First let's create our dataframe:
df = spark.createDataFrame(
sc.parallelize([[1,"3:2,3|2:1"],
[2,"2"],
[3,"2:12,16"],
[4,"3:2,4,6"],
[5,"2"],
[6,"2:3,7|2:3"]]),
["id","description"])
+---+-----------+
| id|description|
+---+-----------+
| 1| 3:2,3|2:1|
| 2| 2|
| 3| 2:12,16|
| 4| 3:2,4,6|
| 5| 2|
| 6| 2:3,7|2:3|
+---+-----------+
First we'll split the records and explode the resulting array so we only have one record per line:
import pyspark.sql.functions as psf
df = df.withColumn(
"record",
psf.explode(psf.split("description", '\|'))
)
+---+-----------+-------+
| id|description| record|
+---+-----------+-------+
| 1| 3:2,3|2:1| 3:2,3|
| 1| 3:2,3|2:1| 2:1|
| 2| 2| 2|
| 3| 2:12,16|2:12,16|
| 4| 3:2,4,6|3:2,4,6|
| 5| 2| 2|
| 6| 2:3,7|2:3| 2:3,7|
| 6| 2:3,7|2:3| 2:3|
+---+-----------+-------+
Now we'll split records into the number of players and a list of years:
df = df.withColumn(
"record",
psf.split("record", ':')
).withColumn(
"nb_players",
psf.col("record")[0]
).withColumn(
"years",
psf.split(psf.col("record")[1], ',')
)
+---+-----------+----------+----------+---------+
| id|description| record|nb_players| years|
+---+-----------+----------+----------+---------+
| 1| 3:2,3|2:1| [3, 2,3]| 3| [2, 3]|
| 1| 3:2,3|2:1| [2, 1]| 2| [1]|
| 2| 2| [2]| 2| null|
| 3| 2:12,16|[2, 12,16]| 2| [12, 16]|
| 4| 3:2,4,6|[3, 2,4,6]| 3|[2, 4, 6]|
| 5| 2| [2]| 2| null|
| 6| 2:3,7|2:3| [2, 3,7]| 2| [3, 7]|
| 6| 2:3,7|2:3| [2, 3]| 2| [3]|
+---+-----------+----------+----------+---------+
Finally, we want to sum for each id the number of players and the length of years:
df = df.withColumn(
"years_size",
psf.when(psf.size("years") > 0, psf.size("years")).otherwise(0)
).groupby("id").agg(
psf.first("description").alias("description"),
psf.sum("nb_players").alias("sum_emp"),
psf.sum("years_size").alias("org_changed")
).sort("id").show()
+---+-----------+-------+-----------+
| id|description|sum_emp|org_changed|
+---+-----------+-------+-----------+
| 1| 3:2,3|2:1| 5.0| 3|
| 2| 2| 2.0| 0|
| 3| 2:12,16| 2.0| 2|
| 4| 3:2,4,6| 3.0| 3|
| 5| 2| 2.0| 0|
| 6| 2:3,7|2:3| 4.0| 3|
+---+-----------+-------+-----------+

Resources