Set new column value based on latest record - apache-spark

I have a dataframe similar to the one below
+-------+-------+----------+
|dept_id|user_id|entry_date|
+-------+-------+----------+
| 3| 1|2020-06-03|
| 3| 2|2020-06-03|
| 3| 3|2020-06-03|
| 3| 4|2020-06-03|
| 3| 1|2020-06-04|
| 3| 1|2020-06-05|
+-------+-------+----------+
Now I need to add a new column which should indicate whether the row is the user's latest entry: 1 means latest, 0 means old.
+-------+-------+----------+----------+
|dept_id|user_id|entry_date|latest_rec|
+-------+-------+----------+----------+
| 3| 1|2020-06-03| 0|
| 3| 2|2020-06-03| 1|
| 3| 3|2020-06-03| 1|
| 3| 4|2020-06-03| 1|
| 3| 1|2020-06-04| 0|
| 3| 1|2020-06-05| 1|
+-------+-------+----------+----------+
I tried finding the rank of the user:
val win = Window.partitionBy("dept_id", "user_id").orderBy(asc("entry_date"))
someDF.withColumn("rank_num",rank().over(win))
Now I am stuck on how to populate the latest_rec column based on the rank_num column. How should I proceed with the next step?

I'd use row_number to find the max date, and then derive your indicator based on that.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

val windowSpec = Window.partitionBy("dept_id", "user_id").orderBy(col("entry_date").desc)
val win = <your df>.withColumn("der_rank", row_number().over(windowSpec))
val finalDf = win.withColumn("latest_rec", when(col("der_rank") === 1, 1).otherwise(0))
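For completeness, here is a self-contained sketch of the same row_number approach using the sample data from the question (the dataframe name someDF and an active spark session with implicits are assumptions):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}
import spark.implicits._

val someDF = Seq(
  (3, 1, "2020-06-03"), (3, 2, "2020-06-03"), (3, 3, "2020-06-03"),
  (3, 4, "2020-06-03"), (3, 1, "2020-06-04"), (3, 1, "2020-06-05")
).toDF("dept_id", "user_id", "entry_date")

// The latest entry per (dept_id, user_id) gets row_number 1 when ordering by entry_date descending
val windowSpec = Window.partitionBy("dept_id", "user_id").orderBy(col("entry_date").desc)

someDF
  .withColumn("der_rank", row_number().over(windowSpec))
  .withColumn("latest_rec", when(col("der_rank") === 1, 1).otherwise(0))
  .drop("der_rank")
  .show()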

Instead of using rank, take the last entry_date over a window partitioned by dept_id, user_id and ordered by entry_date, with a frame ranging from the current row to unbounded following, as latest_entry_date. Then compare entry_date with latest_entry_date and set the latest_rec values accordingly.
scala> df.show
+-------+-------+----------+
|dept_id|user_id|entry_date|
+-------+-------+----------+
| 3| 1|2020-06-03|
| 3| 2|2020-06-03|
| 3| 3|2020-06-03|
| 3| 4|2020-06-03|
| 3| 1|2020-06-04|
| 3| 1|2020-06-05|
+-------+-------+----------+
scala> val win = Window.partitionBy("dept_id","user_id").orderBy("entry_date").rowsBetween(Window.currentRow, Window.unboundedFollowing)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@b3f21c2
scala> df.withColumn("latest_entry_date", last($"entry_date", true).over(win)).show+-------+-------+----------+-----------------+
|dept_id|user_id|entry_date|latest_entry_date|
+-------+-------+----------+-----------------+
| 3| 1|2020-06-03| 2020-06-05|
| 3| 1|2020-06-04| 2020-06-05|
| 3| 1|2020-06-05| 2020-06-05|
| 3| 3|2020-06-03| 2020-06-03|
| 3| 2|2020-06-03| 2020-06-03|
| 3| 4|2020-06-03| 2020-06-03|
+-------+-------+----------+-----------------+
scala> df.withColumn("latest_entry_date", last($"entry_date", true).over(win)).withColumn("latest_rec", when($"entry_date" === $"latest_entry_date", 1).otherwise(0)).show
+-------+-------+----------+-----------------+----------+
|dept_id|user_id|entry_date|latest_entry_date|latest_rec|
+-------+-------+----------+-----------------+----------+
| 3| 1|2020-06-03| 2020-06-05| 0|
| 3| 1|2020-06-04| 2020-06-05| 0|
| 3| 1|2020-06-05| 2020-06-05| 1|
| 3| 3|2020-06-03| 2020-06-03| 1|
| 3| 2|2020-06-03| 2020-06-03| 1|
| 3| 4|2020-06-03| 2020-06-03| 1|
+-------+-------+----------+-----------------+----------+

Another alternative approach:
Load the test data provided
val data =
"""
|dept_id|user_id|entry_date
| 3| 1|2020-06-03
| 3| 2|2020-06-03
| 3| 3|2020-06-03
| 3| 4|2020-06-03
| 3| 1|2020-06-04
| 3| 1|2020-06-05
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()

val df1 = spark.read
  .option("sep", ",")
  // .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +-------+-------+----------+
* |dept_id|user_id|entry_date|
* +-------+-------+----------+
* |3 |1 |2020-06-03|
* |3 |2 |2020-06-03|
* |3 |3 |2020-06-03|
* |3 |4 |2020-06-03|
* |3 |1 |2020-06-04|
* |3 |1 |2020-06-05|
* +-------+-------+----------+
*
* root
* |-- dept_id: string (nullable = true)
* |-- user_id: string (nullable = true)
* |-- entry_date: string (nullable = true)
*/
Use max(entry_date) over(partition by 'dept_id', 'user_id')
val w = Window.partitionBy("dept_id", "user_id")
val latestRec = when(datediff(max(to_date($"entry_date")).over(w), to_date($"entry_date")) =!= lit(0), 0)
.otherwise(1)
df1.withColumn("latest_rec", latestRec)
.orderBy("dept_id", "user_id", "entry_date")
.show(false)
/**
* +-------+-------+----------+----------+
* |dept_id|user_id|entry_date|latest_rec|
* +-------+-------+----------+----------+
* |3 |1 |2020-06-03|0 |
* |3 |1 |2020-06-04|0 |
* |3 |1 |2020-06-05|1 |
* |3 |2 |2020-06-03|1 |
* |3 |3 |2020-06-03|1 |
* |3 |4 |2020-06-03|1 |
* +-------+-------+----------+----------+
*/
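As a side note, the datediff step is not strictly required; the same indicator can be derived by comparing entry_date with the windowed max directly. A minimal sketch against the same df1 (same imports and implicits as above assumed):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, to_date, when}

val w2 = Window.partitionBy("dept_id", "user_id")
df1.withColumn("latest_rec",
    when(to_date($"entry_date") === max(to_date($"entry_date")).over(w2), 1).otherwise(0))
  .orderBy("dept_id", "user_id", "entry_date")
  .show(false)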

Related

Spark-Scala : Create split rows based on the value of other column

I have an input as below:
+---+----+
|id |size|
+---+----+
|1  |4   |
|2  |2   |
+---+----+
Output: if the size column value is 4, split the row into 4 rows (1 to 4); if the size value is 2, split it into 2 rows (1 to 2).
+---+----+
|id |size|
+---+----+
|1  |1   |
|1  |2   |
|1  |3   |
|1  |4   |
|2  |1   |
|2  |2   |
+---+----+
You can create an array with the sequence function (from 1 to size) and then explode it:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 4), (2, 2)).toDF("id", "size")
df
  .withColumn("size", explode(sequence(lit(1), col("size"))))
  .show(false)
The output would be:
+---+----+
|id |size|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|2 |1 |
|2 |2 |
+---+----+
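Note that the sequence function is available from Spark 2.4 onwards. If you are on an older version, a small UDF can build the same 1-to-size array (a sketch over the same df; the UDF name is illustrative):
import org.apache.spark.sql.functions.{col, explode, udf}

// Build the 1..n array with a UDF instead of the built-in sequence function
val rangeUdf = udf((n: Int) => (1 to n).toArray)

df.withColumn("size", explode(rangeUdf(col("size")))).show(false)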
You can first use the sequence function to create a sequence from 1 to size and then explode it.
val df = input.withColumn("seq", sequence(lit(1), $"size"))
df.show()
+---+----+------------+
| id|size| seq|
+---+----+------------+
| 1| 4|[1, 2, 3, 4]|
| 2| 2| [1, 2]|
+---+----+------------+
df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+----+
You could turn your size column into an incrementing sequence using Seq.range and then explode the arrays. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}

// Original dataframe
val df = Seq((1, 4), (2, 2)).toDF("id", "size")

// Mapping over this dataframe: turning each row into (id, array of 1..size)
val df_with_array = df
  .map(row => (row.getInt(0), Seq.range(1, row.getInt(1) + 1)))
  .toDF("id", "array")
  .select(col("id"), explode(col("array")))

df_with_array.show()
+---+---+
| id|col|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+---+

how to execute many expressions in the selectExpr

Is it possible to apply many expressions in the same selectExpr? For example, if I have this DF:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
how can I multiply by 2 and rename the column, like this:
df.selectExpr("i*2 as multiplication")
def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have many expressions, pass them as separate comma-separated string arguments. Please check the code below.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+
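Each of those strings is just a separate SQL expression, so the equivalent can also be written with select and expr (a small sketch on the same df):
import org.apache.spark.sql.functions.expr

df.select(expr("id * 2 as twotimes"), expr("id * 3 as threetimes")).show()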
Yes, you can pass multiple expressions inside the df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, age: Int)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala> personDS.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Alternatively, you can use withColumn to achieve the same result:
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+

SparkSQL Windows: Creating Frame Based On Array Column

I am looking to use SparkSQL's window function, but with a custom condition on the frame specification.
The dataframe being operated on is as follows:
+--------------------+--------------------+--------------------+-----+
| userid| elementid| prerequisites|score|
+--------------------+--------------------+--------------------+-----+
|a |1 |[] | 1 |
|a |2 |[] | 1 |
|a |3 |[] | 1 |
|b |1 |[] | 1 |
|a |4 |[1, 2] | 1 |
+--------------------+--------------------+--------------------+-----+
Every element in the prerequisites column is a value in another row's elementid column.
I would like to create a window where I partition by userid, and then grab all preceding rows where elementid is contained in the present row's prerequisites column.
Once I attain this window, I want to perform a sum on the score column.
Desired output for the above example:
+--------------------+--------------------+--------------------+-----+
| userid| elementid| prerequisites|sum |
+--------------------+--------------------+--------------------+-----+
|a |1 |[] | 0 |
|a |2 |[] | 0 |
|a |3 |[] | 0 |
|b |1 |[] | 0 |
|a |4 |[1, 2] | 2 |
+--------------------+--------------------+--------------------+-----+
Notice that because user a's element 4 is the only one whose prerequisites appear in preceding rows, it is the only row with a sum > 0.
The closest question I saw was this question, which utilises collect_list.
However, that doesn't construct a window so much as collect a potential list of IDs. Anyone have any ideas on how to construct the aforementioned window?
scala> import org.apache.spark.sql.expressions.{Window,UserDefinedFunction}
scala> df.show()
+------+---------+-------------+-----+
|userid|elementid|prerequisites|score|
+------+---------+-------------+-----+
| a| 1| []| 1|
| a| 2| []| 1|
| a| 3| []| 1|
| b| 1| []| 1|
| a| 4| [1, 2]| 1|
+------+---------+-------------+-----+
scala> df.printSchema
root
|-- userid: string (nullable = true)
|-- elementid: string (nullable = true)
|-- prerequisites: array (nullable = true)
| |-- element: string (containsNull = true)
|-- score: string (nullable = true)
scala> val W = Window.partitionBy("userid")
scala> val df1 = df.withColumn("elementidList", collect_set(col("elementid")).over(W))
.withColumn("elementidScoreMap", map_from_arrays(col("elementidList"), collect_list(col("score").cast("long")).over(W)))
.withColumn("common", array_intersect(col("prerequisites"), col("elementidList")))
.drop("elementidList", "score")
scala> def getSumUDF:UserDefinedFunction = udf((Score:Map[String,Long], Id:String) => {
| var out:Long = 0
| Id.split(",").foreach{ x => out = Score(x.toString) + out}
| out})
scala> df1.withColumn("sum", when(size(col("common")) =!= 0 ,getSumUDF(col("elementidScoreMap"), concat_ws(",",col("prerequisites")))).otherwise(lit(0)))
.drop("elementidScoreMap", "common")
.show()
+------+---------+-------------+---+
|userid|elementid|prerequisites|sum|
+------+---------+-------------+---+
| b| 1| []| 0|
| a| 1| []| 0|
| a| 2| []| 0|
| a| 3| []| 0|
| a| 4| [1, 2]| 2|
+------+---------+-------------+---+
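For reference, a UDF-free sketch of the same idea: explode the prerequisites, join them back to the element rows to look up their scores, sum per element, and re-attach the result (assumes the same df and that score can be cast to long):
import org.apache.spark.sql.functions._

// One row per (userid, elementid, prerequisite)
val prereqRows = df.select($"userid", $"elementid", explode($"prerequisites").as("prereq"))

// Look up each prerequisite's score for the same user, then sum per element
val prereqSums = prereqRows
  .join(df.select($"userid", $"elementid".as("prereq"), $"score".cast("long").as("prereq_score")),
    Seq("userid", "prereq"))
  .groupBy("userid", "elementid")
  .agg(sum($"prereq_score").as("sum"))

// Rows with no prerequisites drop out of the join, so left-join and fill their sum with 0
df.join(prereqSums, Seq("userid", "elementid"), "left")
  .na.fill(0, Seq("sum"))
  .drop("score")
  .show()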

Spark DataFrame select null value

I have a Spark dataframe in which a few columns are null. I need to create a new dataframe, adding a new column "error_desc" which will list all the columns with null values for every row. I need to do this dynamically without mentioning each column name.
eg: if my dataframe is below
+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
| 1| a| b|
| 2| null| d|
| 3| m| null|
+-----+------+------+
my final dataframe should be
+-----+------+-----+--------------+
|Rowid|Record|Value| error_desc|
+-----+------+-----+--------------+
| 1| a| b| null|
| 2| null| d|record is null|
| 3| m| null| value is null|
+-----+------+-----+--------------+
I have added a few more rows to the input DataFrame to cover more cases. You are not required to hard-code any column names. Use the UDF below; it will give your desired output.
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
| 1| a| b|
| 2| null| d|
| 3| m| null|
| 4| null| d|
| 5| null| null|
| null| e| null|
| 7| e| r|
+-----+------+-----+
scala> def CheckNull:UserDefinedFunction = udf((Column:String,r:Row) => {
| var check:String = ""
| val ColList = Column.split(",").toList
| ColList.foreach{ x =>
| if (r.getAs(x) == null)
| {
| check = check + x.toString + " is null. "
| }}
| check
| })
scala> df.withColumn("error_desc",CheckNull(lit(df.columns.mkString(",")),struct(df.columns map col: _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc |
+-----+------+-----+-------------------------------+
|1 |a |b | |
|2 |null |d |Record is null. |
|3 |m |null |Value is null. |
|4 |null |d |Record is null. |
|5 |null |null |Record is null. Value is null. |
|null |e |null |Rowid is null. Value is null. |
|7 |e |r | |
+-----+------+-----+-------------------------------+
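If you prefer to avoid the UDF, the same dynamic check can be built from the column list with when and concat_ws, which drops the nulls produced for non-null columns (a sketch over the same df):
import org.apache.spark.sql.functions.{col, concat_ws, when}

// One `when` per column; non-null columns contribute null, which concat_ws skips
val errorDesc = concat_ws(" ", df.columns.map(c => when(col(c).isNull, s"$c is null.")): _*)

df.withColumn("error_desc", errorDesc).show(false)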

spark table manipulation - Column values to rows and row values transposed

I have the following dataset (the cust_id / 100x / 200x / 300x table reproduced in the answers below), and I want to convert it to the key/value form shown there using Spark. Any pointers would be helpful.
With Spark 2.4.3 you can use map_from_arrays; it is pretty straightforward and a built-in function.
scala> val df = Seq((1,40,60,10), (2,34,10,20), (3,87,29,62) ).toDF("cust_id","100x","200x","300x")
scala> df.show
+-------+----+----+----+
|cust_id|100x|200x|300x|
+-------+----+----+----+
| 1| 40| 60| 10|
| 2| 34| 10| 20|
| 3| 87| 29| 62|
+-------+----+----+----+
Apply map_from_arrays and explode; it will give your desired result:
df.select(array('*).as("v"), lit(df.columns).as("k"))
  .select('v.getItem(0).as("cust_id"), map_from_arrays('k, 'v).as("map"))
  .select('cust_id, explode('map))
  .show(false)
+-------+-------+-----+
|cust_id|key |value|
+-------+-------+-----+
|1 |cust_id|1 |
|1 |100x |40 |
|1 |200x |60 |
|1 |300x |10 |
|2 |cust_id|2 |
|2 |100x |34 |
|2 |200x |10 |
|2 |300x |20 |
|3 |cust_id|3 |
|3 |100x |87 |
|3 |200x |29 |
|3 |300x |62 |
+-------+-------+-----+
I think the built-in functions will give better performance compared to a UDF.
I wrote a method some time ago to do this:
/**
 * Transforms (reshapes) a dataframe by transforming columns into rows.
 *
 * Note that the datatype of all columns to be transposed to rows must be the same!
 *
 * @param df The input dataframe
 * @param remain The columns which should remain unchanged
 * @param keyName The name of the new key-column
 * @param valueName The name of the new value-column
 * @return The transformed dataframe having (remain.size + 2) columns
 */
def colsToRows(df: DataFrame, remain: Seq[String], keyName: String = "key", valueName: String = "value"): DataFrame = {
  // cols: all columns to be transformed to rows
  val (cols, types) = df.dtypes.filter { case (c, _) => !remain.contains(c) }.unzip
  assert(types.distinct.size == 1, s"All columns need to have same type, but found ${types.distinct}")
  // make an array of the values in the columns and then explode it to generate rows
  val kvs = explode(array(
    cols.map(c => struct(lit(c).alias(keyName), col(c).alias(valueName))): _*
  ))
  // columns which should remain
  val byExprs = remain.map(col(_))
  // construct final dataframe
  df
    .select(byExprs :+ kvs.alias("_kvs"): _*)
    .select(byExprs ++ Seq(col(s"_kvs.$keyName"), col(s"_kvs.$valueName")): _*)
}
You can use it like this:
val df = Seq(
  (1, 40, 60, 10),
  (2, 34, 10, 20),
  (3, 87, 29, 62)
).toDF("cust_id", "100x", "200x", "300x")

colsToRows(df, remain = Seq("cust_id"), keyName = "sid")
  .show()
gives
+-------+----+-----+
|cust_id| sid|value|
+-------+----+-----+
| 1|100x| 40|
| 1|200x| 60|
| 1|300x| 10|
| 2|100x| 34|
| 2|200x| 10|
| 2|300x| 20|
| 3|100x| 87|
| 3|200x| 29|
| 3|300x| 62|
+-------+----+-----+
You can do it by using the stack function too.
Here is an example code to try it out.
val df = Seq((1,40,60,10), (2,34,10,20), (3,87,29,62) ).toDF("cust_id","100x","200x","300x")
df.show()
scala> df.show()
+-------+----+----+----+
|cust_id|100x|200x|300x|
+-------+----+----+----+
| 1| 40| 60| 10|
| 2| 34| 10| 20|
| 3| 87| 29| 62|
+-------+----+----+----+
val skipColumn = "cust_id"
var columnCount = df.schema.size -1
df.columns
var columnsStr = ""
var counter = 0
for ( col <- df.columns ) {
counter = counter + 1
if(col != skipColumn) {
if(counter == df.schema.size) {
columnsStr = columnsStr + s"'$col', $col"
}
else {
columnsStr = columnsStr + s"'$col', $col,"
}
}
}
val unPivotDF = df.select($"cust_id",
expr(s"stack($columnCount, $columnsStr) as (Sid,Value)"))
unPivotDF.show()
scala> unPivotDF.show()
+-------+----+-----+
|cust_id| Sid|Value|
+-------+----+-----+
| 1|100x| 40|
| 1|200x| 60|
| 1|300x| 10|
| 2|100x| 34|
| 2|200x| 10|
| 2|300x| 20|
| 3|100x| 87|
| 3|200x| 29|
| 3|300x| 62|
+-------+----+-----+
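The stack expression can also be built without the explicit loop; backticks guard column names that start with a digit (a sketch over the same df, same implicits assumed):
import org.apache.spark.sql.functions.expr

val toUnpivot = df.columns.filterNot(_ == "cust_id")
val stackCols = toUnpivot.map(c => s"'$c', `$c`").mkString(", ")

df.select($"cust_id", expr(s"stack(${toUnpivot.size}, $stackCols) as (Sid, Value)")).show()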
