Conditionally remove duplicated rows in a Spark Dataset - apache-spark

I would like to achieve the following thing:
val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 20|
| z| x| 10|
| b| y| 7|
| z| y| 5|
| c| w| 1|
| z| w| 2|
+---+---+---+
should be reduced to
val df2 = Seq(("a","x",30),("b","y",12),("c","w",3)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 30|
| b| y| 12|
| c| w| 3|
+---+---+---+
I am aware of the dropDuplicates() command and its options, but for what I would like to achieve it does not work. Somehow one has to detect the duplicates according to column _2, then always remove the row with the z entry in _1 and add its _3 value to the _3 column of the row that one keeps.
Thank you in advance.

As per your question this is what you are looking for
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
val resultDf = df1.groupBy("_2").agg(collect_list("_1")(0).as("_1"), sum("_3").as("_3"))
Output:
+---+---+---+
| _2| _1| _3|
+---+---+---+
| x| a| 30|
| w| c| 3|
| y| b| 12|
+---+---+---+
You will get the result but the order is not guaranteed.
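Note also that collect_list does not guarantee which element ends up at index 0, so the kept _1 value is not guaranteed to be the non-z one. Below is a minimal sketch of a variant that skips the z rows explicitly when picking the label (same column names as above; resultDf2 is just an illustrative name):
import org.apache.spark.sql.functions._
// Sum _3 over the whole group, but only consider non-"z" labels for _1:
// the when(...) turns the "z" rows into nulls, which first(..., ignoreNulls = true) skips.
val resultDf2 = df1
  .groupBy("_2")
  .agg(
    first(when(col("_1") =!= "z", col("_1")), ignoreNulls = true).as("_1"),
    sum("_3").as("_3")
  )
  .select("_1", "_2", "_3")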

Related

How can I achieve the following Spark behaviour using the replaceWhere clause

I want to write data to Delta tables incrementally while replacing (overwriting) partitions already present in the sink. Example:
Consider this data inside my Delta table, already partitioned by the id column:
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
Now, I would like to insert the following dataframe:
+---+---------+
| id| x|
+---+---------+
| 2| NEW|
| 2| NEW|
| 4| D|
| 5| E|
+---+---------+
The desired output is this
+---+---------+
| id| x|
+---+---------+
| 1| A|
| 2| NEW|
| 2| NEW|
| 3| C|
| 4| D|
| 5| E|
+---+---------+
What I did is the following:
df = spark.read.format("csv").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")
Ids=[x.id for x in df.select("id").distinct().collect()]
for Id in Ids:
df.filter(df.id==Id).write.format("delta").option("mergeSchema", "true").partitionBy("id").option("replaceWhere", "id == '$i'".format(i=Id)).mode("append").save("/mnt/blob/datafinance/bronze/simba/test/res/")
spark.read.format("delta").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/res/").show()
And this is the result:
+---+---------+
| id| x|
+---+---------+
| 2| B|
| 1| A|
| 5| E|
| 2| NEW|
| 2|NEW AUSSI|
| 3| C|
| 4| D|
+---+---------+
As you can see, it appended all values without replacing the partition id=2 which was already present in the table.
I think it is because of mode("append").
But changing it to mode("overwrite") throws the following error:
Data written out does not match replaceWhere 'id == '$i''.
Can anyone tell me how to achieve what I want, please?
Thank you.
I actually had an error in the code. I replaced
.option("replaceWhere", "id == '$i'".format(i=Id))
with
.option("replaceWhere", "id == '{i}'".format(i=Id))
and it worked.
Thanks to @ggordon, who pointed out the error on another question.
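For reference, a minimal sketch of the corrected loop, assuming the same paths and ids as above and mode("overwrite"), which is what replaceWhere applies to (a Python f-string would work just as well as str.format):
for Id in Ids:
    (df.filter(df.id == Id)
       .write.format("delta")
       .option("mergeSchema", "true")
       .partitionBy("id")
       # the braces are now real str.format placeholders, so the predicate
       # contains the actual id value instead of the literal string '$i'
       .option("replaceWhere", "id == '{i}'".format(i=Id))
       .mode("overwrite")
       .save("/mnt/blob/datafinance/bronze/simba/test/res/"))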

Conditions in Spark window function

I have a dataframe like
+---+---+---+---+
| q| w| e| r|
+---+---+---+---+
| a| 1| 20| y|
| a| 2| 22| z|
| b| 3| 10| y|
| b| 4| 12| y|
+---+---+---+---+
I want to mark the row with the minimum e among the rows where r = z. If there are no rows with r = z, I want the row with the minimum e, even if r = y.
Essentially, something like
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 1| 20| y| 0|
| a| 2| 22| z| 1|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
I can do it using a number of joins, but that would be too expensive.
So I was looking for a window-based solution.
You can calculate the minimum per group once for rows with r = z and then for all rows within a group. The first non-null value can then be compared to e:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = ...
w = Window.partitionBy("q")
# When ordering is not defined, an unbounded window frame is used by default.
df.withColumn("min_e_with_r_eq_z", F.expr("min(case when r='z' then e else null end)").over(w)) \
    .withColumn("min_e_overall", F.min("e").over(w)) \
    .withColumn("t", F.coalesce("min_e_with_r_eq_z", "min_e_overall") == F.col("e")) \
    .orderBy("w") \
    .show()
Output:
+---+---+---+---+-----------------+-------------+-----+
| q| w| e| r|min_e_with_r_eq_z|min_e_overall| t|
+---+---+---+---+-----------------+-------------+-----+
| a| 1| 20| y| 22| 20|false|
| a| 2| 22| z| 22| 20| true|
| b| 3| 10| y| null| 10| true|
| b| 4| 12| y| null| 10|false|
+---+---+---+---+-----------------+-------------+-----+
Note: I assume that q is the grouping column for the window.
You can assign row numbers based on whether r = z and the value of column e:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    't',
    F.when(
        F.row_number().over(
            Window.partitionBy('q')
                  .orderBy((F.col('r') == 'z').desc(), 'e')
        ) == 1,
        1
    ).otherwise(0)
)
df2.show()
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 2| 22| z| 1|
| a| 1| 20| y| 0|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
Adding the Spark Scala version of @werner's accepted answer:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val w = Window.partitionBy("q")
df.withColumn("min_e_with_r_eq_z", min(when($"r" === "z", $"e").otherwise(null)).over(w))
  .withColumn("min_e_overall", min("e").over(w))
  .withColumn("t", coalesce($"min_e_with_r_eq_z", $"min_e_overall") === $"e")
  .orderBy("w")
  .show()

Pyspark: Stitching multiple event rows in windows

I am trying to stitch a few event rows in a dataframe together based on the time difference between them. I have created a new column in the dataframe which represents the time difference to the previous row, using lag. The dataframe looks as follows:
sc = spark.sparkContext
df = spark.createDataFrame(
    sc.parallelize(
        [['x', 1, "9999"], ['x', 2, "120"], ['x', 3, "102"], ['x', 4, "3000"], ['x', 5, "299"], ['x', 6, "100"]]
    ),
    ['id', "row_number", "time_diff"]
)
I want to stitch the rows if the time_diff with the previous event is less than 160.
For this, I was planning to assign a new row number to all the events which are within 160 time units of each other, and then group by the new row number.
For the above dataframe I wanted the output as:
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             1|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
I wrote a program as follows:
from pyspark.sql import functions as f
from pyspark.sql.functions import when, col
from pyspark.sql.window import Window
window = Window.partitionBy('id').orderBy('row_number')
df2 = df.withColumn('new_row_number', col('id'))
df3 = df2.withColumn('new_row_number', when(col('time_diff') >= 160, col('id'))
                     .otherwise(f.lag(col('new_row_number')).over(window)))
but the output I got was as follows:
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             2|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
Can someone help me out in resolving this?
Thanks
What you want is the previous value of the column that is currently being populated, which is not possible. To achieve this we can do the following:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
window = Window.partitionBy('id').orderBy('row_number')
df3 = df.withColumn('new_row_number', f.when(f.col('time_diff') >= 160, f.col('row_number')))\
    .withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
df3.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
To explain:
First we keep the row_number for every row whose time_diff is at least 160, and null otherwise:
df2 = df.withColumn('new_row_number', f.when(f.col('time_diff') >= 160, f.col('row_number')))
df2.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| null|
| x| 3| 102| null|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| null|
+---+----------+---------+--------------+
Then we forward-fill the nulls with the last non-null value over the window:
df3=df2.withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
df3.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
Hope it solves your question.
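As a follow-up, the grouping step mentioned in the question ("then group by the new row number") could look roughly like the sketch below; the aggregations are just illustrative placeholders for whatever you need per stitched event:
stitched = (df3.groupBy("id", "new_row_number")
               .agg(f.count("*").alias("events_in_group"),
                    f.min("row_number").alias("first_event"),
                    f.max("row_number").alias("last_event")))
stitched.show()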

PySpark window over: Is there a better way to find row numbers for two-column partitions

I have a dataset with 3 columns (T, S, and A). I need to filter records in such a way that the T and S columns have a one-to-one match.
E.g. if T1 is matched with S1, then the T2 row with the S1 value should be filtered out.
I am able to achieve it using window over twice, but it causes a lot of shuffling in the cluster during the second window function (the first window's shuffling I can control with df.sort/repartition).
from pyspark.sql import functions as f
from pyspark.sql.window import Window as w
l = [('T1', 'S1', 10), ('T2', 'S1', 10), ('T1', 'S2', 10), ('T2', 'S2', 10)]
df = spark.createDataFrame(l).toDF('T', 'S', 'A')
df.show()
+---+---+---+
| T| S| A|
+---+---+---+
| T1| S1| 10|
| T2| S1| 10|
| T1| S2| 10|
| T2| S2| 10|
+---+---+---+
w1 = w.partitionBy('T').orderBy('A')
w2 = w.partitionBy('S').orderBy('A','T')
df.withColumn('r1', f.row_number().over(w1)).withColumn('r2',f.row_number().over(w2)).show()
It gives the result below, so I can filter records where r1 == r2 and get the expected output (that filter step is sketched after the expected result).
+---+---+---+---+---+
| T| S| A| r1| r2|
+---+---+---+---+---+
| T1| S2| 10| 2| 1|
| T2| S2| 10| 2| 2|
| T1| S1| 10| 1| 1|
| T2| S1| 10| 1| 2|
+---+---+---+---+---+
Expected result:
+---+---+---+---+---+
| T| S| A| r1| r2|
+---+---+---+---+---+
| T2| S2| 10| 2| 2|
| T1| S1| 10| 1| 1|
+---+---+---+---+---+
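For completeness, a minimal sketch of the filter step described above, reusing w1 and w2 from the snippet:
# keep only the rows where both row numbers agree, i.e. the one-to-one matches
result = (df.withColumn('r1', f.row_number().over(w1))
            .withColumn('r2', f.row_number().over(w2))
            .filter(f.col('r1') == f.col('r2')))
result.show()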

Spark pairwise differences within groups

I have a Spark dataframe; for the sake of argument, let's take it to be:
val df = sc.parallelize(
Seq(("a",1,2),("a",1,4),("b",5,6),("b",10,2),("c",1,1))
).toDF("id","x","y")
+---+---+---+
| id| x| y|
+---+---+---+
| a| 1| 2|
| a| 1| 4|
| b| 5| 6|
| b| 10| 2|
| c| 1| 1|
+---+---+---+
I would like to compute all pairwise differences between entries in the dataframe with the same id and output the result to another dataframe. For a small dataframe I can accomplish this by:
df.crossJoin(
  df.select(df.columns.map(x => col(x).as("_" + x)): _*)
).where(
  col("id") === col("_id")
).select(
  col("id"),
  (col("x") - col("_x")).as("dx"),
  (col("y") - col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| c| 0| 0|
| b| 0| 0|
| b| -5| 4|
| b| 5| -4|
| b| 0| 0|
| a| 0| 0|
| a| 0| -2|
| a| 0| 2|
| a| 0| 0|
+---+---+---+
However, for large dataframes this isn't a reasonable approach as the crossJoin will mostly produce data that will be discarded by the subsequent where clause.
I'm still pretty new to Spark, and groupBy seemed like a natural place to start looking, but I can't figure out how to accomplish this using groupBy. Any help would be welcome.
I would eventually like to remove redundancy, for instance in:
val df1 = df.withColumn("idx", monotonically_increasing_id())
df1.crossJoin(
  df1.select(df1.columns.map(x => col(x).as("_" + x)): _*)
).where(
  col("id") === col("_id") && col("idx") < col("_idx")
).select(
  col("id"),
  (col("x") - col("_x")).as("dx"),
  (col("y") - col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| b| -5| 4|
| a| 0| -2|
+---+---+---+
But if it's easier to accomplish this with redundancy, then I can live with that.
This is not an uncommon transformation to perform in ML so I thought something out of MLlib might be appropriate, but again I haven't found anything there either.
This can be achieved via a self join on id; the result is the same as with the crossJoin approach above:
df.alias("left").join(df.alias("right"), "id")
  .select($"id",
    ($"left.x" - $"right.x").alias("dx"),
    ($"left.y" - $"right.y").alias("dy"))
