How to select distinct rows from a Spark Window partition - apache-spark

I have a sample DF with duplicate rows like this:
+-------------------+--------------------+---+----------+---+----------+
|ID                 |CL_ID               |NBR|DT        |TYP|KEY       |
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477|10 |2019-04-24|N  |0000000000|
|1000031075_20190422|10017157594301072477|10 |2019-04-24|N  |0000000000|
|1006473016_20190421|10577157412800147475|11 |2019-04-21|N  |0000000000|
|1006473016_20190421|10577157412800147475|11 |2019-04-21|N  |0000000000|
+-------------------+--------------------+---+----------+---+----------+
val w = Window.partitionBy($"ENCOUNTER_ID")
Using the above Spark Window partition, is it possible to select distinct rows? I am expecting the output DF as:
+-------------------+--------------------+---+----------+---+----------+
|ID                 |CL_ID               |NBR|DT        |TYP|KEY       |
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477|10 |2019-04-24|N  |0000000000|
|1006473016_20190421|10577157412800147475|11 |2019-04-21|N  |0000000000|
+-------------------+--------------------+---+----------+---+----------+
I don't want to use df.distinct() or df.dropDuplicates() as they involve shuffling.
I prefer not to use lag or lead because, in real time, the order of rows can't be guaranteed.

Window functions also shuffle data. So if all of your columns are duplicated, df.dropDuplicates is the better option to use. If your use case requires a window function, you can use the approach below.
scala> df.show()
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
//Put the columns that need to be checked for duplicates in partitionBy, and use an appropriate orderBy; here I use a sample Window
scala> val W = Window.partitionBy(col("ID"),col("CL_ID"),col("NBR"),col("DT"), col("TYP"), col("KEY")).orderBy(lit(1))
scala> df.withColumn("duplicate", when(row_number.over(W) === lit(1), lit("Y")).otherwise(lit("N")))
.filter(col("duplicate") === lit("Y"))
.drop("duplicate")
.show()
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
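
For what it's worth, the same result can be reached with a plain row_number helper column instead of the Y/N flag. A minimal sketch, reusing the df, W and imports from the snippet above (a window expression can't go directly inside filter, hence the helper column):
df.withColumn("rn", row_number().over(W))   // rank rows inside each group of identical rows
  .filter(col("rn") === 1)                  // keep one row per group
  .drop("rn")
  .show()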

An answer to your question that scales up well with big data:
df.dropDuplicates(['ID'])  # pass your key columns here; ID in this case
Window functions shuffle data too, but if you have duplicate entries and want to choose which one to keep, for example, or want to sum the values of the duplicates, then a window function is the way to go:
from pyspark.sql import Window
from pyspark.sql.functions import first, collect_list

w = Window.partitionBy('id')
df.withColumn('value_col', first('value_col').over(w))  # value_col is a placeholder; use max, min, sum, first or last depending on how you want to treat duplicates
An interesting third possibility, if you want to keep the values of the duplicates for the record, is to add the following before dropping the rows:
df.withColumn('dup_values', collect_list('value_col').over(w))
This creates an extra column with an array per row, so the duplicate values are kept even after you've got rid of the duplicate rows.
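
Pulling these ideas together, here is a minimal end-to-end sketch in Scala (to match the original question). It keeps one row per full-column key and also records how many copies existed; the session setup and the dup_count column name are illustrative, not part of the original answers:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("dedupSketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("1000031075_20190422", "10017157594301072477", 10, "2019-04-24", "N", "0000000000"),
  ("1000031075_20190422", "10017157594301072477", 10, "2019-04-24", "N", "0000000000"),
  ("1006473016_20190421", "10577157412800147475", 11, "2019-04-21", "N", "0000000000"),
  ("1006473016_20190421", "10577157412800147475", 11, "2019-04-21", "N", "0000000000")
).toDF("ID", "CL_ID", "NBR", "DT", "TYP", "KEY")

val keyCols = df.columns.map(col)                     // every column is part of the duplicate key here
val byKey   = Window.partitionBy(keyCols: _*)
val ordered = byKey.orderBy(lit(1))                   // arbitrary order; any row of a group will do

df.withColumn("dup_count", count(lit(1)).over(byKey)) // how many copies of this row existed
  .withColumn("rn", row_number().over(ordered))
  .filter($"rn" === 1)                                // keep one row per group of identical rows
  .drop("rn")
  .show(false)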

Related

Eliminate null value rows for a specific column while doing partitionBy column in pyspark

I have a pyspark dataframe like this:
+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
|333| null|   CT|
+---+-----+-----+
For a given id, I would like to keep that record even if the "name" column is null, as long as the id is not repeated. But if the id is repeated, then I would like to check the name column, make sure it does not contain duplicates within that id, and also remove rows where "name" is null, ONLY for repeated ids. Below is the desired output:
+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
+---+-----+-----+
How can I achieve this in PySpark?
You can do this by grouping by the id column and counting the number of names in each group. Null values are ignored by default in Spark, so any group that has a count of 0 should be kept. We can then filter away any nulls in groups with a count larger than 0.
In Scala this can be done with a window function as follows:
val w = Window.partitionBy("id")
val df2 = df.withColumn("gCount", count($"name").over(w))
.filter($"name".isNotNull or $"gCount" === 0)
.drop("gCount")
The PySpark equivalent:
from pyspark.sql import Window
from pyspark.sql.functions import col, count

w = Window.partitionBy("id")
df.withColumn("gCount", count("name").over(w)) \
    .filter((col("name").isNotNull()) | (col("gCount") == 0)) \
    .drop("gCount")
The above will not remove rows that have multiple nulls for the same id (all these will be kept).
If these should be removed as well, keeping only a single row with name==null, an easy way would be to use .dropDuplicates(['id','name']) before or after running the above code. Note that this also will remove any other duplicates (in which case .dropDuplicates(['id','name', 'state']) could be preferable).
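A sketch of that combination in Scala, against the same df and column names as above (the or-filter plus a dropDuplicates to collapse repeated nulls):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val w = Window.partitionBy("id")

// Keep named rows; keep null-name rows only for ids that have no named rows at all;
// then collapse repeated (id, name) pairs -- including several nulls for one id -- into a single row.
val deduped = df
  .withColumn("gCount", count(col("name")).over(w))
  .filter(col("name").isNotNull or (col("gCount") === 0))
  .drop("gCount")
  .dropDuplicates("id", "name")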
I think you can do that in two steps. First, count values by id
import pyspark.sql.functions as psf
import pyspark.sql.window as psw

w = psw.Window.partitionBy("id")
df = df.withColumn("n", psf.sum(psf.lit(1)).over(w))
Then filter to remove null names when n > 1:
df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))
Edit
As mentioned by @Shubham Jain, if you have several null values for name (duplicates), the above filter will keep them. In that case, the solution proposed by @Shaido is useful: add a post-treatment using .dropDuplicates(['id','name']), or .dropDuplicates(['id','name','state']), depending on your preference.

GroupBy dataframe column without aggregation and set not null values

I have a dataframe having records like below:
+---+----+----+
|id |L1  |L2  |
+---+----+----+
|101|202 |null|
|101|null|303 |
+---+----+----+
Is their a simple way to groupBy and get result like below in Spark SQL:
+---+----+----+
|id |L1  |L2  |
+---+----+----+
|101|202 |303 |
+---+----+----+
Thanks.
Use max or min to aggregate the data. Since you only have a single valid value, this is the one that will be selected. Note that it's not possible to use first here (which is faster) since that can still return null values.
When the columns are of numeric types it can be solved as follows:
df.groupBy("id").agg(max($"L1").as("L1"), max($"L2").as("L2"))
However, if you are dealing with strings, you can collect all values as a list (collect_list skips nulls) and then take the first element:
df.groupBy("id")
.agg(collect_list($"L1")(0).as("L1"), collect_list($"L2")(0).as("L2"))
Of course, this assumes that the nulls are not strings but actual nulls.
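For the numeric case, a small self-contained sketch of the max approach on the sample above (assuming an existing SparkSession named spark):
import spark.implicits._
import org.apache.spark.sql.functions.max

// Option[Int] gives nullable integer columns, matching the nulls in the sample.
val df = Seq(
  (101, Option(202), Option.empty[Int]),
  (101, Option.empty[Int], Option(303))
).toDF("id", "L1", "L2")

df.groupBy("id")
  .agg(max($"L1").as("L1"), max($"L2").as("L2"))
  .show()   // 101 | 202 | 303, as in the desired output above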

Spark SQL alternatives to groupby/pivot/agg/collect_list using foldLeft & withColumn so as to improve performance

I have a Spark DataFrame consisting of three columns:
id | col1 | col2
-----------------
x | p1 | a1
-----------------
x | p2 | b1
-----------------
y | p2 | b2
-----------------
y | p2 | b3
-----------------
y | p3 | c1
After applying df.groupBy("id").pivot("col1").agg(collect_list("col2")) I am getting the following dataframe (aggDF):
+---+----+--------+----+
| id|  p1|      p2|  p3|
+---+----+--------+----+
|  x|[a1]|    [b1]|  []|
|  y|  []|[b2, b3]|[c1]|
+---+----+--------+----+
Then I find the names of the columns other than the id column.
val cols = aggDF.columns.filter(x => x != "id")
After that I am using cols.foldLeft(aggDF)((df, x) => df.withColumn(x, when(size(col(x)) > 0, col(x)).otherwise(lit(null)))) to replace empty arrays with null. The performance of this code becomes poor when the number of columns increases. Additionally, I have the names of the string columns, val stringColumns = Array("p1","p3"). I want to get the following final dataframe:
+---+----+--------+----+
| id|  p1|      p2|  p3|
+---+----+--------+----+
|  x|  a1|    [b1]|null|
|  y|null|[b2, b3]|  c1|
+---+----+--------+----+
Is there any better solution to this problem in order to achieve the final dataframe?
Your current code pays two performance costs as structured:
As mentioned by Alexandros, you pay one Catalyst analysis per DataFrame transform, so if you loop over a few hundred or thousand columns you'll notice some time spent on the driver before the job is actually submitted. If this is a critical issue for you, you can use a single select statement instead of your foldLeft over withColumn, but this won't really change the execution time much, because of the next point.
When you use an expression such as when().otherwise() on columns in what can be optimized as a single select statement, the code generator will produce a single large method processing all the columns. If you have more than a couple hundred columns, it's likely that the resulting method won't be JIT-compiled by default by the JVM, resulting in very slow execution performance (the maximum JIT-able method is 8k of bytecode in HotSpot).
You can detect whether you hit the second issue by inspecting the executor logs and checking for a WARNING about a method too large to be JITed.
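For reference, "a single select statement" in the first point means building all the when/otherwise expressions up front and applying them in one select rather than chaining withColumn. A sketch of that shape (the second answer below shows a fuller variant):
import org.apache.spark.sql.functions._

// One select carrying every column expression, instead of one withColumn (and one analysis pass) per column.
val nonIdCols = aggDF.columns.filter(_ != "id")
val exprs = col("id") +: nonIdCols.map(c =>
  when(size(col(c)) > 0, col(c)).otherwise(lit(null)).as(c))
val cleaned = aggDF.select(exprs: _*)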
How to try and solve this?
1 - Changing the logic
You can filter the empty cells before the pivot by using a window transform
import org.apache.spark.sql.expressions.Window
val finalDf = df
.withColumn("count", count('col2) over Window.partitionBy('id,'col1))
.filter('count > 0)
.groupBy("id").pivot("col1").agg(collect_list("col2"))
This may or may not be faster depending on the actual dataset, as the pivot also generates a large select expression by itself, so it may hit the large-method threshold if you encounter more than approximately 500 distinct values for col1.
You may want to combine this with option 2 as well.
2 - Try and finesse the JVM
You can add an extraJavaOption on your executors to ask the JVM to try and JIT hot methods larger than 8k.
For example, add the option
--conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods"
on your spark-submit and see how it impacts the pivot execution time.
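If you would rather set it in code than on the command line, the same config key can be passed when building the session. A sketch (the app name is made up, and the option has to be in place before executors start):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pivot-jit-experiment")  // hypothetical name
  .config("spark.executor.extraJavaOptions", "-XX:-DontCompileHugeMethods")
  .getOrCreate()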
It's difficult to guarantee a substantial speed increase without more details on your real dataset but it's definitely worth a shot.
If you look at https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015, you'll see that withColumn combined with foldLeft has known performance issues. select is an alternative, as shown below, using varargs.
I'm not convinced collect_list is an issue. I kept the first set of logic as well. pivot kicks off a job to get the distinct values for pivoting. That is an accepted approach imo. Trying to roll your own seems pointless to me, but the other answers may prove me wrong, or Spark 2.4 may have improved things.
import spark.implicits._
import org.apache.spark.sql.functions._
// Your code, assuming id is the only col of interest as in THIS question. More elegant than the 1st posting.
val df = Seq( ("x","p1","a1"), ("x","p2","b1"), ("y","p2","b2"), ("y","p2","b3"), ("y","p3","c1")).toDF("id", "col1", "col2")
val aggDF = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
//aggDF.show(false)
val colsToSelect = aggDF.columns // All in this case, 1st col id handled by head & tail
val aggDF2 = aggDF.select((col(colsToSelect.head) +: colsToSelect.tail.map
(col => when(size(aggDF(col)) === 0,lit(null)).otherwise(aggDF(col)).as(s"$col"))):_*)
aggDF2.show(false)
returns:
+---+----+--------+----+
|id |p1  |p2      |p3  |
+---+----+--------+----+
|x  |[a1]|[b1]    |null|
|y  |null|[b2, b3]|[c1]|
+---+----+--------+----+
Also a nice read BTW: https://lansalo.com/2018/05/13/spark-how-to-add-multiple-columns-in-dataframes-and-how-not-to/. The effects become more noticeable with a higher number of columns. At the end, a reader makes a relevant point:
I think that performance is better with the select approach when a higher number of columns prevails.
UPD: Over the holidays I trialed both approaches with Spark 2.4.x with little observable difference up to 1000 columns. That has puzzled me.

how to get first value and last value from dataframe column in pyspark?

I have a DataFrame, and I want to get the first value and the last value from a DataFrame column.
+----+-----+--------------------+
|test|count|             support|
+----+-----+--------------------+
|   A|    5| 0.23809523809523808|
|   B|    5| 0.23809523809523808|
|   C|    4| 0.19047619047619047|
|   G|    2| 0.09523809523809523|
|   K|    2| 0.09523809523809523|
|   D|    1|0.047619047619047616|
+----+-----+--------------------+
The expected output is the first and last value of the support column, i.e. x = [0.23809523809523808, 0.047619047619047616].
You may use collect but the performance is going to be terrible since the driver will collect all the data, just to keep the first and last items. Worse than that, it will most likely cause an OOM error and thus not work at all if you have a big dataframe.
Another idea would be to use agg with the first and last aggregation function. This does not work! (because the reducers do not necessarily get the records in the order of the dataframe)
Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer any last function. A straightforward approach would be to sort the dataframe backward and use the head function again.
first=df.head().support
import pyspark.sql.functions as F
last=df.orderBy(F.monotonically_increasing_id().desc()).head().support
Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and only keep the first and the last elements.
size = df.count()
df.rdd.zipWithIndex()\
.filter(lambda x : x[1] == 0 or x[1] == size-1)\
.map(lambda x : x[0].support)\
.collect()
You can try indexing the data frame, see the example below:
df = <your dataframe>
first_record = df.collect()[0]
last_record = df.collect()[-1]
EDIT:
You have to pass the column name as well.
df = <your dataframe>
first_record = df.collect()[0]['column_name']
last_record = df.collect()[-1]['column_name']
Since version 3.0.0, Spark also has a DataFrame function called .tail() to get the last rows.
This returns a list of Row objects:
last=df.tail(1)[0].support
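For what it's worth, the same head/tail pair exists on the Scala Dataset API since 3.0 as well. A sketch, with the column name taken from the question's frame:
// head() gives the first Row, tail(1) the last; getAs pulls the support value out of each Row.
val first = df.head().getAs[Double]("support")
val last  = df.tail(1)(0).getAs[Double]("support")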

spark groupby on several columns at same time

I'm using Spark 2.0.
I have a dataframe with several columns, like id, latitude, longitude, time.
I want to do a groupBy and keep ["latitude", "longitude"] always together.
Could I do the following?
df.groupBy('id', ["latitude", "longitude"], 'time')
I want to calculate the number of records for each user, at each different time, with each different location ["latitude", "longitude"].
You can combine the "latitude" and "longitude" columns and then use groupBy. The sample below uses Scala.
val df = Seq(("1","33.33","35.35","8:00"),("2","31.33","39.35","9:00"),("1","33.33","35.35","8:00")).toDF("id","latitude","longitude","time")
df.show()
val df1 = df.withColumn("lat-long",array($"latitude",$"longitude"))
df1.show()
val df2 = df1.groupBy("id","lat-long","time").count()
df2.show()
Output will be like below.
+---+--------------+----+-----+
| id|      lat-long|time|count|
+---+--------------+----+-----+
|  2|[31.33, 39.35]|9:00|    1|
|  1|[33.33, 35.35]|8:00|    2|
+---+--------------+----+-----+
You can just use:
df.groupBy('id', 'latitude', 'longitude','time').agg(...)
This will work as expected without any additional steps.
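For comparison, the direct multi-column grouping on the Scala sample frame from the first answer gives the same counts without the extra array column. A quick sketch:
// Same counts as df2 above, computed straight from the original columns.
val df3 = df.groupBy("id", "latitude", "longitude", "time").count()
df3.show()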
