Spark-Scala: Create split rows based on the value of another column - apache-spark

I have an input as below:
+---+----+
| id|size|
+---+----+
|  1|   4|
|  2|   2|
+---+----+
Expected output - if the size column value is 4, split the row into 4 rows (1-4); if the size value is 2, split it into 2 rows (1-2):
+---+----+
| id|size|
+---+----+
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   4|
|  2|   1|
|  2|   2|
+---+----+

You can create an array from 1 to size using the sequence function and then explode it:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 4), (2, 2)).toDF("id", "size")

df
  .withColumn("size", explode(sequence(lit(1), col("size"))))
  .show(false)
The output would be:
+---+----+
|id |size|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|2 |1 |
|2 |2 |
+---+----+
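Note that the sequence function requires Spark 2.4+. On an older version, a minimal fallback sketch (not part of the answer above) is a small UDF that builds the 1..size range explicitly:

import org.apache.spark.sql.functions._

// Hypothetical fallback for Spark < 2.4, where sequence() is not available:
// build the 1..n range with a UDF and explode it (assumes the same df as above).
val rangeUdf = udf((n: Int) => (1 to n).toArray)

df
  .withColumn("size", explode(rangeUdf(col("size"))))
  .show(false)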

You can first use the sequence function to create a sequence from 1 to size and then explode it.
val df = input.withColumn("seq", sequence(lit(1), $"size"))
df.show()
+---+----+------------+
| id|size| seq|
+---+----+------------+
| 1| 4|[1, 2, 3, 4]|
| 2| 2| [1, 2]|
+---+----+------------+
df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+----+

You could turn your size column into an incrementing sequence using Seq.range and then explode the arrays. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}

// Original dataframe
val df = Seq((1, 4), (2, 2)).toDF("id", "size")

// Map over the dataframe, turning each row into (id, [1..size]), then explode the array
val output = df
  .map(row => (row.getInt(0), Seq.range(1, row.getInt(1) + 1)))
  .toDF("id", "array")
  .select(col("id"), explode(col("array")))

output.show()
+---+---+
| id|col|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+---+

Related

SparkSQL Windows: Creating Frame Based On Array Column

I am looking to use SparkSQL's window function, but with a custom condition on the frame specification.
The dataframe being operated on is as follows:
+------+---------+-------------+-----+
|userid|elementid|prerequisites|score|
+------+---------+-------------+-----+
|     a|        1|           []|    1|
|     a|        2|           []|    1|
|     a|        3|           []|    1|
|     b|        1|           []|    1|
|     a|        4|       [1, 2]|    1|
+------+---------+-------------+-----+
Every element in the prerequisites column is a value in another row's elementid column.
I would like to create a window where I partition by userid, and then grab all preceding rows where elementid is contained in the present row's prerequisites column.
Once I attain this window, I want to perform a sum on the score column.
Desired output for the above example:
+------+---------+-------------+---+
|userid|elementid|prerequisites|sum|
+------+---------+-------------+---+
|     a|        1|           []|  0|
|     a|        2|           []|  0|
|     a|        3|           []|  0|
|     b|        1|           []|  0|
|     a|        4|       [1, 2]|  2|
+------+---------+-------------+---+
Notice that element 4 of user a is the only row whose prerequisites precede it, so it is the only one with a sum > 0.
The closest question I saw was this question, which utilises collect_list.
However, that doesn't construct a window so much as collect a potential list of IDs. Anyone have any ideas on how to construct the aforementioned window?
scala> import org.apache.spark.sql.expressions.{Window, UserDefinedFunction}
scala> import org.apache.spark.sql.functions._
scala> df.show()
+------+---------+-------------+-----+
|userid|elementid|prerequisites|score|
+------+---------+-------------+-----+
| a| 1| []| 1|
| a| 2| []| 1|
| a| 3| []| 1|
| b| 1| []| 1|
| a| 4| [1, 2]| 1|
+------+---------+-------------+-----+
scala> df.printSchema
root
|-- userid: string (nullable = true)
|-- elementid: string (nullable = true)
|-- prerequisites: array (nullable = true)
| |-- element: string (containsNull = true)
|-- score: string (nullable = true)
scala> val W = Window.partitionBy("userid")
scala> val df1 = df.withColumn("elementidList", collect_set(col("elementid")).over(W))
.withColumn("elementidScoreMap", map_from_arrays(col("elementidList"), collect_list(col("score").cast("long")).over(W)))
.withColumn("common", array_intersect(col("prerequisites"), col("elementidList")))
.drop("elementidList", "score")
scala> def getSumUDF:UserDefinedFunction = udf((Score:Map[String,Long], Id:String) => {
| var out:Long = 0
| Id.split(",").foreach{ x => out = Score(x.toString) + out}
| out})
scala> df1.withColumn("sum", when(size(col("common")) =!= 0 ,getSumUDF(col("elementidScoreMap"), concat_ws(",",col("prerequisites")))).otherwise(lit(0)))
.drop("elementidScoreMap", "common")
.show()
+------+---------+-------------+---+
|userid|elementid|prerequisites|sum|
+------+---------+-------------+---+
| b| 1| []| 0|
| a| 1| []| 0|
| a| 2| []| 0|
| a| 3| []| 0|
| a| 4| [1, 2]| 2|
+------+---------+-------------+---+
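For reference, a UDF-free sketch that produces the same result (my own variation, not part of the answer above): explode the prerequisites, join each prerequisite back to its own (userid, elementid) row to pick up its score, and sum per row. Like the answer above, it matches prerequisites anywhere within the same userid rather than only among preceding rows.

import org.apache.spark.sql.functions._

// One row per (userid, elementid, single prerequisite); empty arrays yield a null prereq
val prereqs = df.select(
  col("userid"), col("elementid"),
  explode_outer(col("prerequisites")).alias("prereq"))

// The score each prerequisite should contribute
val scores = df.select(
  col("userid").alias("p_userid"),
  col("elementid").alias("p_elementid"),
  col("score").cast("long").alias("p_score"))

// Join prerequisites to their own rows and sum the matched scores per element
val sums = prereqs
  .join(scores,
    prereqs("userid") === scores("p_userid") && prereqs("prereq") === scores("p_elementid"),
    "left")
  .groupBy("userid", "elementid")
  .agg(coalesce(sum("p_score"), lit(0L)).alias("sum"))

df.join(sums, Seq("userid", "elementid"))
  .select("userid", "elementid", "prerequisites", "sum")
  .show()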

How to compute the numerical difference between columns of different dataframes?

Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).
For instance let us have the following datasets
DataFrame A:
+----+---+
| A | B |
+----+---+
| 1| 0|
| 1| 0|
+----+---+
DataFrame B:
+----+---+
| A | B |
+----+---+
| 1| 0 |
| 0| 0 |
+----+---+
How do I obtain B - A, i.e.
+----+---+
| c1 | c2|
+----+---+
| 0| 0 |
| -1| 0 |
+----+---+
In practice the real dataframes have a substantial number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing it?
I was able to solve this by using the approach below. This code can work with any number of columns. You just have to change the input DFs accordingly.
import org.apache.spark.sql.Row
val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")
val columns = df0.columns
val rdd = df0.rdd.zip(df1.rdd).map { x =>
  val arr = columns.map(column =>
    x._2.getAs[Int](column) - x._1.getAs[Int](column))
  Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+
If your df A has the same size and row order as df B, you can try the approach below. I don't know if this will work correctly for large datasets; it would be better to already have an id to join on instead of creating one with monotonically_increasing_id().
import spark.implicits._
import org.apache.spark.sql.functions._
val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")
// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)
// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
  .withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
  .withColumn("id", monotonically_increasing_id())
dfA.show()
dfB.show()
// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))
// diff between cols
val calcCols = (dfACols zip dfBCols).map { case (a, b) => b - a }
// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols:_*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+
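To address the monotonically_increasing_id() caveat above, one hedged option for a stable 0..n-1 join key is to go through rdd.zipWithIndex. A rough sketch, assuming a SparkSession named spark as in the examples above:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Adds a sequential "id" column via zipWithIndex (a stable 0..n-1 index, unlike
// monotonically_increasing_id, whose values depend on partitioning).
def withRowIndex(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  spark.createDataFrame(indexed, schema)
}

// Then reuse the join-and-subtract logic above:
val dfA = withRowIndex(df0.toDF(colNamesA: _*))
val dfB = withRowIndex(df1.toDF(colNamesB: _*))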

How do I calculate the start/end of an interval (set of rows) containing identical values?

Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each consecutive group and use it to group by and compute your min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window
df = df.withColumn(
    "fg",
    F.when(
        F.lag("value").over(Window.orderBy("time")) == F.col("value"),
        0
    ).otherwise(1)
)
df = df.withColumn(
    "rn",
    F.sum("fg").over(
        Window
        .orderBy("time")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
    "value",
    "rn"
).agg(
    F.min("time").alias("start"),
    F.max("time").alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
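Since the rest of this thread is in Scala, here is a rough Scala translation of the same gaps-and-islands approach (my sketch, assuming a DataFrame df with columns "time" and "value"):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy("time")

// Flag the start of each run of identical values, then turn the flags into a running group id
val grouped = df
  .withColumn("fg", when(lag("value", 1).over(w) === col("value"), 0).otherwise(1))
  .withColumn("rn", sum("fg").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))

grouped
  .groupBy("value", "rn")
  .agg(min("time").alias("start"), max("time").alias("end"))
  .drop("rn")
  .show()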

How to define spark dataframe join match priority

I have two dataframes.
dataDF
+---+
| tt|
+---+
| a|
| b|
| c|
| ab|
+---+
alter
+----+-----+------+
|name|alter|profit|
+----+-----+------+
| a| aa| 1|
| b| a| 5|
| c| ab| 8|
+----+-----+------+
The task is to look up col("tt") in the alter dataframe's col("name"); if it is found, join them; if not, then look up col("tt") in col("alter"). col("name") has a higher priority than col("alter"): if a row's col("tt") matches col("name"), I do not want to also match it to another row that only matches col("alter"). How can I achieve this?
I tried to write a join, but it does not work.
dataDF = dataDF.select("*")
  .join(broadcast(alterDF),
    col("tt") === col("Name") || col("tt") === col("alter"),
    "left")
The result is:
+---+----+-----+------+
| tt|name|alter|profit|
+---+----+-----+------+
| a| a| aa| 1|
| a| b| a| 5| // this row is not expected.
| b| b| a| 5|
| c| c| ab| 8|
| ab| c| ab| 8|
+---+----+-----+------+
You can try joining twice: first on the name column, then filter out the tt values that did not match and join them on the alter column. Union both results. Please find the code below; I hope it is helpful.
// Creating test data
val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alter = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
  .toDF("name", "alter", "profit")

val join1 = dataDF.join(alter, col("tt") === col("name"), "left")
val join2 = join1.filter(col("name").isNull).select("tt")
  .join(alter, col("tt") === col("alter"), "left")
val joinDF = join1.filter(col("name").isNotNull).union(join2)

joinDF.show(false)
+---+----+-----+------+
|tt |name|alter|profit|
+---+----+-----+------+
|a |a |aa |1 |
|b |b |a |5 |
|c |c |ab |8 |
|ab |c |ab |8 |
+---+----+-----+------+
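An alternative sketch (my own variation, not part of the answer above) is a single join on either column followed by a priority-based de-duplication per tt; note that ties among alter-only matches would be broken arbitrarily here:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Join on name OR alter, tag each match with a priority (a name match wins),
// then keep only the highest-priority match per tt value.
val joined = dataDF.join(
    broadcast(alter),
    col("tt") === col("name") || col("tt") === col("alter"),
    "left")
  .withColumn("priority", when(col("tt") === col("name"), 1).otherwise(2))

joined
  .withColumn("rank", row_number().over(Window.partitionBy("tt").orderBy("priority")))
  .filter(col("rank") === 1)
  .drop("priority", "rank")
  .show(false)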

Splitting the content of a pyspark dataframe column and aggregating it into new columns

I am trying to extract and split the data within a pyspark dataframe column, and then aggregate it into new columns.
Input Table.
+--+-----------+
|id|description|
+--+-----------+
|1 |  3:2,3|2:1|
|2 |          2|
|3 |    2:12,16|
|4 |    3:2,4,6|
|5 |          2|
|6 |  2:3,7|2:3|
+--+-----------+
Desired Output.
+--+-----------+-------+-----------+
|id|description|sum_emp|org_changed|
+--+-----------+-------+-----------+
|1 |  3:2,3|2:1|      5|          3|
|2 |          2|      2|          0|
|3 |    2:12,16|      2|          2|
|4 |    3:2,4,6|      3|          3|
|5 |          2|      2|          0|
|6 |  2:3,7|2:3|      4|          3|
+--+-----------+-------+-----------+
Before the ":", values ought to be added. The values post the ":" are to be counted. The | marks the shift in the record(can be ignored)
Some data points are as long as 2:3,4,5|3:4,6,3|4:3,7,8
Any help would be greatly appreciated
Scenario explained:
Consider the 6th id, for example. The 6 refers to a biz unit id. The description column describes the team within that given unit.
The meaning of the value 2:3,7|2:3 is as follows:
1) First 2 followed by 3,7 = there are 2 folks in the team, and one of them has been in another org for 3 years and for 7 years (perhaps it's the second guy's first company).
2) Second 2 followed by 3 = there are 2 folks again in a sub-team, and 1 person has spent 3 years in another org.
Desired output:
sum_emp = total number of employees in that given biz unit.
org_changed = total number of organizations that folks in that biz unit have changed.
First let's create our dataframe:
df = spark.createDataFrame(
    sc.parallelize([[1, "3:2,3|2:1"],
                    [2, "2"],
                    [3, "2:12,16"],
                    [4, "3:2,4,6"],
                    [5, "2"],
                    [6, "2:3,7|2:3"]]),
    ["id", "description"])
+---+-----------+
| id|description|
+---+-----------+
| 1| 3:2,3|2:1|
| 2| 2|
| 3| 2:12,16|
| 4| 3:2,4,6|
| 5| 2|
| 6| 2:3,7|2:3|
+---+-----------+
First we'll split the records and explode the resulting array so we only have one record per line:
import pyspark.sql.functions as psf

df = df.withColumn(
    "record",
    psf.explode(psf.split("description", r'\|'))
)
+---+-----------+-------+
| id|description| record|
+---+-----------+-------+
| 1| 3:2,3|2:1| 3:2,3|
| 1| 3:2,3|2:1| 2:1|
| 2| 2| 2|
| 3| 2:12,16|2:12,16|
| 4| 3:2,4,6|3:2,4,6|
| 5| 2| 2|
| 6| 2:3,7|2:3| 2:3,7|
| 6| 2:3,7|2:3| 2:3|
+---+-----------+-------+
Now we'll split records into the number of players and a list of years:
df = df.withColumn(
    "record",
    psf.split("record", ':')
).withColumn(
    "nb_players",
    psf.col("record")[0]
).withColumn(
    "years",
    psf.split(psf.col("record")[1], ',')
)
+---+-----------+----------+----------+---------+
| id|description| record|nb_players| years|
+---+-----------+----------+----------+---------+
| 1| 3:2,3|2:1| [3, 2,3]| 3| [2, 3]|
| 1| 3:2,3|2:1| [2, 1]| 2| [1]|
| 2| 2| [2]| 2| null|
| 3| 2:12,16|[2, 12,16]| 2| [12, 16]|
| 4| 3:2,4,6|[3, 2,4,6]| 3|[2, 4, 6]|
| 5| 2| [2]| 2| null|
| 6| 2:3,7|2:3| [2, 3,7]| 2| [3, 7]|
| 6| 2:3,7|2:3| [2, 3]| 2| [3]|
+---+-----------+----------+----------+---------+
Finally, for each id we want to sum the number of players and the lengths of the years lists:
df.withColumn(
    "years_size",
    psf.when(psf.size("years") > 0, psf.size("years")).otherwise(0)
).groupby("id").agg(
    psf.first("description").alias("description"),
    psf.sum("nb_players").alias("sum_emp"),
    psf.sum("years_size").alias("org_changed")
).sort("id").show()
+---+-----------+-------+-----------+
| id|description|sum_emp|org_changed|
+---+-----------+-------+-----------+
| 1| 3:2,3|2:1| 5.0| 3|
| 2| 2| 2.0| 0|
| 3| 2:12,16| 2.0| 2|
| 4| 3:2,4,6| 3.0| 3|
| 5| 2| 2.0| 0|
| 6| 2:3,7|2:3| 4.0| 3|
+---+-----------+-------+-----------+
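For the Spark-Scala readers of this thread, a rough Scala sketch of the same logic using only built-in functions (my own translation, not part of the original answer):

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  (1, "3:2,3|2:1"), (2, "2"), (3, "2:12,16"),
  (4, "3:2,4,6"), (5, "2"), (6, "2:3,7|2:3")
).toDF("id", "description")

df
  // one row per "|"-separated record
  .withColumn("record", explode(split(col("description"), "\\|")))
  // head count before the ":", list of years after it
  .withColumn("nb_players", split(col("record"), ":").getItem(0).cast("int"))
  .withColumn("years", split(split(col("record"), ":").getItem(1), ","))
  .withColumn("years_size", when(size(col("years")) > 0, size(col("years"))).otherwise(0))
  .groupBy("id")
  .agg(
    first("description").alias("description"),
    sum("nb_players").alias("sum_emp"),
    sum("years_size").alias("org_changed"))
  .orderBy("id")
  .show(false)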
