spark table manipulation - Column values to rows and row values transposed - apache-spark

I have the following dataset and I want to convert it to the following using Spark. Any pointers would be helpful.

With Spark 2.4.3 you can use map_from_arrays; it is a built-in function and pretty straightforward.
scala> val df = Seq((1,40,60,10), (2,34,10,20), (3,87,29,62) ).toDF("cust_id","100x","200x","300x")
scala> df.show
+-------+----+----+----+
|cust_id|100x|200x|300x|
+-------+----+----+----+
| 1| 40| 60| 10|
| 2| 34| 10| 20|
| 3| 87| 29| 62|
+-------+----+----+----+
Apply map_from_arrays and explode; it will give the desired result:
df.select(array('*).as("v"), lit(df.columns).as("k")).select('v.getItem(0).as("cust_id"), map_from_arrays('k,'v).as("map")).select('cust_id, explode('map)).show(false)
+-------+-------+-----+
|cust_id|key |value|
+-------+-------+-----+
|1 |cust_id|1 |
|1 |100x |40 |
|1 |200x |60 |
|1 |300x |10 |
|2 |cust_id|2 |
|2 |100x |34 |
|2 |200x |10 |
|2 |300x |20 |
|3 |cust_id|3 |
|3 |100x |87 |
|3 |200x |29 |
|3 |300x |62 |
+-------+-------+-----+
I think the built-in functions will give better performance compared to a UDF.
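For comparison, a UDF-based version of the same unpivot might look roughly like the sketch below (an assumption for illustration: all value columns are Int, and the toPairs helper is hypothetical, not part of this answer). The built-in route avoids the per-row serialization a UDF entails.
import spark.implicits._
import org.apache.spark.sql.functions._

// hypothetical UDF alternative: build (columnName, value) pairs in a UDF, then explode
val colNames = df.columns
val toPairs = udf((values: Seq[Int]) => colNames.zip(values))

df.select($"cust_id", explode(toPairs(array(df.columns.map(col): _*))).as("kv"))
  .select($"cust_id", $"kv._1".as("key"), $"kv._2".as("value"))
  .show(false)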

I wrote a method some time ago to do this:
/**
 * Transforms (reshapes) a dataframe by turning columns into rows.
 *
 * Note that the datatype of all columns to be transposed to rows must be the same!
 *
 * @param df The input dataframe
 * @param remain The columns which should remain unchanged
 * @param keyName The name of the new key column
 * @param valueName The name of the new value column
 * @return The transformed dataframe having (remain.size + 2) columns
 */
def colsToRows(df: DataFrame, remain: Seq[String], keyName: String = "key", valueName: String = "value"): DataFrame = {
  // cols: all columns to be transformed to rows
  val (cols, types) = df.dtypes.filter { case (c, _) => !remain.contains(c) }.unzip
  assert(types.distinct.size == 1, s"All columns need to have the same type, but found ${types.distinct}")

  // make an array of structs from the column values, then explode it to generate rows
  val kvs = explode(array(
    cols.map(c => struct(lit(c).alias(keyName), col(c).alias(valueName))): _*
  ))

  // columns which should remain
  val byExprs = remain.map(col(_))

  // construct the final dataframe
  df
    .select(byExprs :+ kvs.alias("_kvs"): _*)
    .select(byExprs ++ Seq(col(s"_kvs.$keyName"), col(s"_kvs.$valueName")): _*)
}
You can use it like this:
val df = Seq(
  (1,40,60,10),
  (2,34,10,20),
  (3,87,29,62)
).toDF("cust_id","100x","200x","300x")

colsToRows(df, remain = Seq("cust_id"), keyName = "sid")
  .show()
gives
+-------+----+-----+
|cust_id| sid|value|
+-------+----+-----+
| 1|100x| 40|
| 1|200x| 60|
| 1|300x| 10|
| 2|100x| 34|
| 2|200x| 10|
| 2|300x| 20|
| 3|100x| 87|
| 3|200x| 29|
| 3|300x| 62|
+-------+----+-----+

You can do this by using the stack function too. Here is some example code to try it out.
val df = Seq((1,40,60,10), (2,34,10,20), (3,87,29,62) ).toDF("cust_id","100x","200x","300x")
df.show()
+-------+----+----+----+
|cust_id|100x|200x|300x|
+-------+----+----+----+
| 1| 40| 60| 10|
| 2| 34| 10| 20|
| 3| 87| 29| 62|
+-------+----+----+----+
val skipColumn = "cust_id"
var columnCount = df.schema.size -1
df.columns
var columnsStr = ""
var counter = 0
for ( col <- df.columns ) {
counter = counter + 1
if(col != skipColumn) {
if(counter == df.schema.size) {
columnsStr = columnsStr + s"'$col', $col"
}
else {
columnsStr = columnsStr + s"'$col', $col,"
}
}
}
val unPivotDF = df.select($"cust_id",
expr(s"stack($columnCount, $columnsStr) as (Sid,Value)"))
unPivotDF.show()
+-------+----+-----+
|cust_id| Sid|Value|
+-------+----+-----+
| 1|100x| 40|
| 1|200x| 60|
| 1|300x| 10|
| 2|100x| 34|
| 2|200x| 10|
| 2|300x| 20|
| 3|100x| 87|
| 3|200x| 29|
| 3|300x| 62|
+-------+----+-----+
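If you are on Spark 3.4 or later (an assumption to verify for your environment), the built-in Dataset.unpivot method expresses the same thing directly:
// Spark 3.4+ only: unpivot the value columns into (Sid, Value) rows
val unpivoted = df.unpivot(
  Array(col("cust_id")),                        // id columns kept as-is
  Array(col("100x"), col("200x"), col("300x")), // columns turned into rows
  "Sid",                                        // name of the key column
  "Value"                                       // name of the value column
)
unpivoted.show()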

Related

Spark-Scala : Create split rows based on the value of other column

I have an input as below:
+---+----+
| id|size|
+---+----+
|  1|   4|
|  2|   2|
+---+----+
Output: if size is 4, split the row into 4 rows (size values 1-4); if size is 2, split it into 2 rows (size values 1-2):
+---+----+
| id|size|
+---+----+
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   4|
|  2|   1|
|  2|   2|
+---+----+
You can create an array from 1 to size using the sequence function and then explode it:
import org.apache.spark.sql.functions._
val df = Seq((1,4), (2,2)).toDF("id", "size")
df
  .withColumn("size", explode(sequence(lit(1), col("size"))))
  .show(false)
The output would be:
+---+----+
|id |size|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|2 |1 |
|2 |2 |
+---+----+
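As a side note, sequence is a Spark SQL built-in (available from Spark 2.4, as far as I know), so the same thing can also be written with a SQL expression string:
// equivalent form using selectExpr and the SQL sequence/explode functions
df.selectExpr("id", "explode(sequence(1, size)) as size").show(false)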
You can first use the sequence function to create a sequence from 1 to size and then explode it.
val df = input.withColumn("seq", sequence(lit(1), $"size"))
df.show()
+---+----+------------+
| id|size| seq|
+---+----+------------+
| 1| 4|[1, 2, 3, 4]|
| 2| 2| [1, 2]|
+---+----+------------+
df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+----+
You could turn your size column into an incrementing sequence using Seq.range and then explode the arrays. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}
// Original dataframe
val df = Seq((1,4), (2,2)).toDF("id", "size")

// Map over this dataframe: turn each row into (id, range 1..size), then explode
val output = df
  .map(row => (row.getInt(0), Seq.range(1, row.getInt(1) + 1)))
  .toDF("id", "array")
  .select(col("id"), explode(col("array")))

output.show()
+---+---+
| id|col|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+---+

Set new column value based on latest record

I have a dataframe which is similar to below
+-------+-------+----------+
|dept_id|user_id|entry_date|
+-------+-------+----------+
| 3| 1|2020-06-03|
| 3| 2|2020-06-03|
| 3| 3|2020-06-03|
| 3| 4|2020-06-03|
| 3| 1|2020-06-04|
| 3| 1|2020-06-05|
+-------+-------+----------+
Now I need to add a new column which should indicate the latest entry date of the user. 1 means latest, 0 means old
+-------+-------+----------+----------+
|dept_id|user_id|entry_date|latest_rec|
+-------+-------+----------+----------+
|      3|      1|2020-06-03|         0|
|      3|      2|2020-06-03|         1|
|      3|      3|2020-06-03|         1|
|      3|      4|2020-06-03|         1|
|      3|      1|2020-06-04|         0|
|      3|      1|2020-06-05|         1|
+-------+-------+----------+----------+
I tried finding the rank of the user:
val win = Window.partitionBy("dept_id", "user_id").orderBy(asc("entry_date"))
someDF.withColumn("rank_num",rank().over(win))
Now I am stuck on how to populate the latest_rec column based on the rank_num column. How should I proceed with the next step?
I'd use row_number to find the max date, and then derive your indicator based on that.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// order descending so that row_number = 1 marks the latest entry_date per user
val windowSpec = Window.partitionBy("dept_id", "user_id").orderBy(desc("entry_date"))
val ranked = someDF.withColumn("der_rank", row_number().over(windowSpec))
val result = ranked.withColumn("latest_rec", when($"der_rank" === 1, 1).otherwise(0))
Instead of using rank, take the last entry_date over a window partitioned by dept_id, user_id and ordered by entry_date, with rowsBetween(currentRow, unboundedFollowing), as latest_entry_date. Then compare entry_date with latest_entry_date and set the latest_rec value accordingly.
scala> df.show
+-------+-------+----------+
|dept_id|user_id|entry_date|
+-------+-------+----------+
| 3| 1|2020-06-03|
| 3| 2|2020-06-03|
| 3| 3|2020-06-03|
| 3| 4|2020-06-03|
| 3| 1|2020-06-04|
| 3| 1|2020-06-05|
+-------+-------+----------+
scala> val win = Window.partitionBy("dept_id","user_id").orderBy("entry_date").rowsBetween(Window.currentRow, Window.unboundedFollowing)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@b3f21c2
scala> df.withColumn("latest_entry_date", last($"entry_date", true).over(win)).show+-------+-------+----------+-----------------+
|dept_id|user_id|entry_date|latest_entry_date|
+-------+-------+----------+-----------------+
| 3| 1|2020-06-03| 2020-06-05|
| 3| 1|2020-06-04| 2020-06-05|
| 3| 1|2020-06-05| 2020-06-05|
| 3| 3|2020-06-03| 2020-06-03|
| 3| 2|2020-06-03| 2020-06-03|
| 3| 4|2020-06-03| 2020-06-03|
+-------+-------+----------+-----------------+
scala> df.withColumn("latest_entry_date", last($"entry_date", true).over(win)).withColumn("latest_rec", when($"entry_date" === $"latest_entry_date", 1).otherwise(0)).show
+-------+-------+----------+-----------------+----------+
|dept_id|user_id|entry_date|latest_entry_date|latest_rec|
+-------+-------+----------+-----------------+----------+
| 3| 1|2020-06-03| 2020-06-05| 0|
| 3| 1|2020-06-04| 2020-06-05| 0|
| 3| 1|2020-06-05| 2020-06-05| 1|
| 3| 3|2020-06-03| 2020-06-03| 1|
| 3| 2|2020-06-03| 2020-06-03| 1|
| 3| 4|2020-06-03| 2020-06-03| 1|
+-------+-------+----------+-----------------+----------+
Another alternative approach:
Load the test data provided
val data =
"""
|dept_id|user_id|entry_date
| 3| 1|2020-06-03
| 3| 2|2020-06-03
| 3| 3|2020-06-03
| 3| 4|2020-06-03
| 3| 1|2020-06-04
| 3| 1|2020-06-05
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()

val df1 = spark.read
  .option("sep", ",")
  // .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS1)

df1.show(false)
df1.printSchema()
/**
* +-------+-------+----------+
* |dept_id|user_id|entry_date|
* +-------+-------+----------+
* |3 |1 |2020-06-03|
* |3 |2 |2020-06-03|
* |3 |3 |2020-06-03|
* |3 |4 |2020-06-03|
* |3 |1 |2020-06-04|
* |3 |1 |2020-06-05|
* +-------+-------+----------+
*
* root
* |-- dept_id: string (nullable = true)
* |-- user_id: string (nullable = true)
* |-- entry_date: string (nullable = true)
*/
Use max(entry_date) over(partition by 'dept_id', 'user_id')
val w = Window.partitionBy("dept_id", "user_id")
val latestRec = when(datediff(max(to_date($"entry_date")).over(w), to_date($"entry_date")) =!= lit(0), 0)
.otherwise(1)
df1.withColumn("latest_rec", latestRec)
.orderBy("dept_id", "user_id", "entry_date")
.show(false)
/**
* +-------+-------+----------+----------+
* |dept_id|user_id|entry_date|latest_rec|
* +-------+-------+----------+----------+
* |3 |1 |2020-06-03|0 |
* |3 |1 |2020-06-04|0 |
* |3 |1 |2020-06-05|1 |
* |3 |2 |2020-06-03|1 |
* |3 |3 |2020-06-03|1 |
* |3 |4 |2020-06-03|1 |
* +-------+-------+----------+----------+
*/
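A slightly simpler variant of the same idea, just as a sketch: compare entry_date directly with the per-user maximum instead of going through datediff.
// reuse the same window `w`; flag rows whose date equals the per-user maximum
df1.withColumn("latest_rec",
    when(to_date($"entry_date") === max(to_date($"entry_date")).over(w), 1).otherwise(0))
  .orderBy("dept_id", "user_id", "entry_date")
  .show(false)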

How to compute the numerical difference between columns of different dataframes?

Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).
For instance let us have the following datasets
DataFrame A:
+----+---+
| A | B |
+----+---+
| 1| 0|
| 1| 0|
+----+---+
DataFrame B:
+----+---+
| A | B |
+----+---+
| 1| 0 |
| 0| 0 |
+----+---+
How to obtain B-A, i.e
+----+---+
| c1 | c2|
+----+---+
| 0| 0 |
| -1| 0 |
+----+---+
In practice the real dataframes have a large number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing it?
I was able to solve this by using the approach below. This code can work with any number of columns. You just have to change the input DFs accordingly.
import org.apache.spark.sql.Row
val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")
val columns = df0.columns
val rdd = df0.rdd.zip(df1.rdd).map { x =>
  val arr = columns.map(column => x._2.getAs[Int](column) - x._1.getAs[Int](column))
  Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+
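One caveat with this approach: RDD.zip requires both sides to have the same number of partitions and the same number of elements per partition. If that is not guaranteed, a workaround (only a sketch, not part of the answer above; withRowIndex is a hypothetical helper) is to attach an explicit row index and join on it:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField}

// attach a row index so the two frames can be aligned with an ordinary join
def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  spark.createDataFrame(indexed, df.schema.add(StructField("row_idx", LongType, nullable = false)))
}

val diff = withRowIndex(df0).as("a")
  .join(withRowIndex(df1).as("b"), "row_idx")
  .select(columns.map(c => (col(s"b.$c") - col(s"a.$c")).as(c)): _*)
diff.show(false)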
If your df A has the same row order and size as df B, you can try the approach below. I don't know if this will work correctly for large datasets; it would be better to already have an id column for joining, instead of creating one with monotonically_increasing_id().
import spark.implicits._
import org.apache.spark.sql.functions._
val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")
// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)
// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
.withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
.withColumn("id", monotonically_increasing_id())
dfA.show()
dfB.show()
// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))
// diff between cols
val calcCols = (dfACols zip dfBCols).map(s=>s._2-s._1)
// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols:_*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+

How to define spark dataframe join match priority

I have two dataframes.
dataDF
+---+
| tt|
+---+
| a|
| b|
| c|
| ab|
+---+
alter
+----+-----+------+
|name|alter|profit|
+----+-----+------+
| a| aa| 1|
| b| a| 5|
| c| ab| 8|
+----+-----+------+
The task is to search col "tt" in dataframe alter col("name"), if found it join them, if not found it, then search col "tt" in col("alter"). The priority of col ("name") is high than col("alter"). That means if row of col("tt") is matched to col("name"), I do not want to match it to other row which only matches col("alter"). How can I achieve this task?
I tried to write a join, but it does not work.
dataDF = dataDF.select("*")
.join(broadcast(alterDF),
col("tt") === col("Name") || col("tt") === col("alter"),
"left")
The result is:
+---+----+-----+------+
| tt|name|alter|profit|
+---+----+-----+------+
| a| a| aa| 1|
| a| b| a| 5| // this row is not expected.
| b| b| a| 5|
| c| c| ab| 8|
| ab| c| ab| 8|
+---+----+-----+------+
You can try joining twice: first on the name column, then filter out the tt values that did not match and join those on the alter column, and finally union both results. Please find the code below. I hope it is helpful.
//Creating Test Data
val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alter = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
.toDF("name", "alter", "profit")
val join1 = dataDF.join(alter, col("tt") === col("name"), "left")
val join2 = join1.filter( col("name").isNull).select("tt")
.join(alter, col("tt") === col("alter"), "left")
val joinDF = join1.filter( col("name").isNotNull).union(join2)
joinDF.show(false)
+---+----+-----+------+
|tt |name|alter|profit|
+---+----+-----+------+
|a |a |aa |1 |
|b |b |a |5 |
|c |c |ab |8 |
|ab |c |ab |8 |
+---+----+-----+------+
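An alternative sketch (not from the answer above): join once on either column, tag each match with a priority so that a name match beats an alter match, and keep only the best match per tt value.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// single join, then keep the highest-priority match per tt value
val joinedAll = dataDF.join(broadcast(alter),
    col("tt") === col("name") || col("tt") === col("alter"), "left")
  .withColumn("prio", when(col("tt") === col("name"), 1).otherwise(2))

val best = joinedAll
  .withColumn("rn", row_number().over(Window.partitionBy("tt").orderBy("prio")))
  .where(col("rn") === 1)
  .drop("prio", "rn")
best.show(false)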

Given primary key, compare other columns of two data frames and output diff columns in the vertical way

I want to compare two dataframes that have the same schema, and have a primary key column.
For each primary key, if any of the other columns differ (there could be multiple such columns, so I need a dynamic way to scan all the other columns), I want to output the column name and the values from both dataframes.
Also, I want to output a row if a primary key exists in only one dataframe (so a "full outer join" will be used). Here is an example:
dataframe1:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book2 | 2 |
|3 |book3 | 3 |
|4 |book4 | 4 |
+-----------+------+------+
dataframe2:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book8 | 8 |
|3 |book3 | 7 |
|5 |book5 | 5 |
+-----------+------+------+
The result would be:
+-----------+----------------+----------+----------+
|primary_key|diff_column_name|dataframe1|dataframe2|
+-----------+----------------+----------+----------+
|          2|            book|     book2|     book8|
|          2|          number|         2|         8|
|          3|          number|         3|         7|
|          4|            book|     book4|      null|
|          4|          number|         4|      null|
|          5|            book|      null|     book5|
|          5|          number|      null|         5|
+-----------+----------------+----------+----------+
I know the first step is to join both dataframes on the primary key:
// joining the two DFs on primary_key
val result = df1.as("l")
.join(df2.as("r"), "primary_key", "fullouter")
But I am not sure how to proceed. Can someone give me some advice? Thanks
Data:
val df1 = Seq(
(1, "book1", 1), (2, "book2", 2), (3, "book3", 3), (4, "book4", 4)
).toDF("primary_key", "book", "number")
val df2 = Seq(
(1, "book1", 1), (2, "book8", 8), (3, "book3", 7), (5, "book5", 5)
).toDF("primary_key", "book", "number")
Imports
import org.apache.spark.sql.functions._
Define list of columns:
val cols = Seq("book", "number")
Join as you do right now:
val joined = df1.as("l").join(df2.as("r"), Seq("primary_key"), "fullouter")
Define:
val comp = explode(array(cols.map(c => struct(
lit(c).alias("diff_column_name"),
// Value left
col(s"l.${c}").cast("string").alias("dataframe1"),
// Value right
col(s"r.${c}").cast("string").alias("dataframe2"),
// Differs
not(col(s"l.${c}") <=> col(s"r.${c}")).alias("diff")
)): _*))
Select and filter:
joined
.withColumn("comp", comp)
.select($"primary_key", $"comp.*")
// Filter out mismatches and get rid of obsolete diff
.where($"diff").drop("diff")
.orderBy("primary_key").show
// +-----------+----------------+----------+----------+
// |primary_key|diff_column_name|dataframe1|dataframe2|
// +-----------+----------------+----------+----------+
// |          2|            book|     book2|     book8|
// |          2|          number|         2|         8|
// |          3|          number|         3|         7|
// |          4|            book|     book4|      null|
// |          4|          number|         4|      null|
// |          5|            book|      null|     book5|
// |          5|          number|      null|         5|
// +-----------+----------------+----------+----------+
