Delete duplicate records based on another column in PySpark - apache-spark

I have a data frame in pyspark like below.
df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 1| N|
| 2| Y|
| 3| N|
+---+----+
I want to delete a record when the id is duplicated and its test value is N.
After removing those records, new_df should look like this:
new_df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 2| Y|
| 3| N|
+---+----+
I am unable to figure out how to do this.
I have tried a groupBy on id with a count, but that only gives me the id column and the count:
grouped_df = new_df.groupBy("id").count()
How can I achieve my desired result?
Edit
My actual data frame looks like this:
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| N|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+
When I do the following
new_df = df.groupBy("sn").agg(max("attribute").alias("attribute"))
I get a "'str' object has no attribute 'alias'" error.
The expected result should be like below
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+

Not the most generic solution but should fit here nicely:
from pyspark.sql.functions import max  # Spark's max, not Python's builtin max (using the builtin is what raises "'str' object has no attribute 'alias'")
df = spark.createDataFrame(
    [(1, "Y"), (1, "N"), (2, "Y"), (3, "N")], ("id", "test")
)
df.groupBy("id").agg(max("test").alias("test")).show()
# +---+----+
# | id|test|
# +---+----+
# | 1| Y|
# | 3| N|
# | 2| Y|
# +---+----+
More generic one:
from pyspark.sql.functions import col, count, when
test = when(count(when(col("test") == "Y", "Y")) > 0, "Y").otherwise("N")
df.groupBy("id").agg(test.alias("test")).show()
# +---+----+
# | id|test|
# +---+----+
# | 1| Y|
# | 3| N|
# | 2| Y|
# +---+----+
This can be generalized to accommodate more classes and non-trivial ordering. For example, if you had three classes Y, ?, N evaluated in this order, you could use:
(when(count(when(col("test") == "Y", True)) > 0, "Y")
.when(count(when(col("test") == "?", True)) > 0, "?")
.otherwise("N"))
If there are other columns you need to preserve, these methods won't work; you'll need something like the approach shown in Find maximum row per group in Spark DataFrame.
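For instance, a minimal sketch of that max-row-per-group idea applied to the edited sn/device/attribute example, keeping the device column (my own illustration; it relies on "Y" sorting after "N", so descending order puts Y first):
from pyspark.sql import Window
from pyspark.sql import functions as F

# keep, per sn, the first row when ordered by attribute descending (Y before N)
w = Window.partitionBy("sn").orderBy(F.col("attribute").desc())
result = (df
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))
result.show()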

Another option using row_number:
df.selectExpr(
    '*',
    'row_number() over (partition by id order by test desc) as rn'
).filter('rn=1 or test="Y"').drop('rn').show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 3| N|
| 2| Y|
+---+----+
This method doesn't aggregate; it only removes the rows with a duplicated id whose test is N.

Using Spark SQL temporary views (I used a Databricks notebook):
case class T(id:Int,test:String)
val df=spark.createDataset(Seq(T(1, "Y"), T(1, "N"), T(2, "Y"), T(3, "N")))
df.createOrReplaceTempView("df")
%sql select id, max(test) from df group by id
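Outside a notebook that supports the %sql cell magic, the same query can be issued through the SparkSession; a minimal PySpark sketch against the temp view registered above:
# same aggregation, issued via spark.sql instead of a %sql cell
spark.sql("select id, max(test) as test from df group by id").show()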

You can use the code below:
# register the dataframe as a temp table
df.registerTempTable("df")
# keep a single row per id (the Y row, if present)
newDF = sqlc.sql("""
WITH dfCte AS
(
    select *, row_number() over (partition by id order by test desc) as RowNumber
    from df
)
select * from dfCte where RowNumber = 1
""")
# drop the row number column and show the new df
newDF.drop('RowNumber').show()
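Side note (my addition): registerTempTable is deprecated since Spark 2.0; on newer versions the equivalent call would be:
# same effect as registerTempTable on Spark 2.x+
df.createOrReplaceTempView("df")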

Related

Self join on different columns in pyspark?

I have pyspark dataframe like this
df = sqlContext.createDataFrame([
    Row(a=1, b=3),
    Row(a=3, b=2),
])
+---+---+
| a| b|
+---+---+
| 1| 3|
| 3| 2|
+---+---+
I tried a self-join on it like this:
df1 = df.alias("df1")
df2 = df.alias("df2")
cond = [df1.a == df2.b]
df1.join(df2, cond).show()
But it gives me an error.
Ideally I want to find all pairs where one neighbor is common (3 is common to both 1 and 2):
+---+---+
| c1| c2|
+---+---+
| 1| 2|
+---+---+
You can rename the columns accordingly before the self-join.
from pyspark.sql.functions import *
df_as1 = df.alias("df_as1").selectExpr("a as c1", "b")
df_as2 = df.alias("df_as2").selectExpr("a", "b as c2")
joined_df = df_as1.join(df_as2, col("df_as1.b") == col("df_as2.a"), 'inner').select("c1", "c2")
joined_df.show()
Output will be:
+---+---+
| c1| c2|
+---+---+
| 1| 2|
+---+---+
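As a variant (my own sketch, not part of the original answer), you can keep the aliases and qualify the columns with col instead of renaming up front:
from pyspark.sql import functions as F

df1 = df.alias("df1")
df2 = df.alias("df2")
# join row (a, b) to row (a', b') where b == a', i.e. a and b' share the neighbor b
pairs = (df1.join(df2, F.col("df1.b") == F.col("df2.a"))
            .select(F.col("df1.a").alias("c1"), F.col("df2.b").alias("c2")))
pairs.show()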

Spark: First group by a column then remove the group if specific column is null

Pandas code
df=df.groupby('col1').filter(lambda g: ~ (g.col2.isnull()).all())
First group by col1, then remove a group if all of its col2 values are null.
I did try the following in PySpark:
df.groupBy("col1").filter(~df.col2.isNotNull().all())
You can do a non-null count over each group, and use filter to remove the rows where the count is 0:
# example dataframe
df.show()
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 1|null|
| 2| 1|
| 2|null|
| 3| 1|
+----+----+
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'not_null',
    F.count('col2').over(Window.partitionBy('col1'))
).filter('not_null != 0').drop('not_null')
df2.show()
+----+----+
|col1|col2|
+----+----+
| 3| 1|
| 2| 1|
| 2|null|
+----+----+
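An alternative sketch of the same idea (my own addition, not from the original answer), using an aggregation plus a left-semi join instead of a window:
from pyspark.sql import functions as F

# F.count('col2') ignores nulls, so groups where every col2 value is null get a count of 0
keep = (df.groupBy('col1')
          .agg(F.count('col2').alias('non_null'))
          .filter('non_null > 0')
          .select('col1'))
df2 = df.join(keep, on='col1', how='left_semi')
df2.show()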

How to set the value of a Pyspark column based on two conditions of the value of another column

Say I have a dataframe:
+-----+-----+-----+
|id |foo. |bar. |
+-----+-----+-----+
| 1| baz| 0|
| 2| baz| 0|
| 3| 333| 2|
| 4| 444| 1|
+-----+-----+-----+
I want to set the 'foo' column to a value depending on the value of bar.
If bar is 2: set the value of foo for that row to 'X',
else if bar is 1: set the value of foo for that row to 'Y'
And if neither condition is met, leave the foo value as it is.
pyspark.when seems like the closest method, but it doesn't seem to work based on another column's value.
when can work with other columns. You can use F.col to get the value of the other column and provide an appropriate condition:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'foo',
    F.when(F.col('bar') == 2, 'X')
     .when(F.col('bar') == 1, 'Y')
     .otherwise(F.col('foo'))
)
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
We can solve this using when or a UDF in Spark to set the new column value based on a condition.
Create Sample DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AddConditionalColumn').getOrCreate()
data = [(1,"baz",0),(2,"baz",0),(3,"333",2),(4,"444",1)]
columns = ["id","foo","bar"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
Using When:
from pyspark.sql.functions import when
df2 = df.withColumn("foo", when(df.bar == 2,"X")
.when(df.bar == 1,"Y")
.otherwise(df.foo))
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
Using UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def executeRule(bar, foo):
    if bar == 2:
        return 'X'
    elif bar == 1:
        return 'Y'
    else:
        return foo  # keep the original foo value when neither condition matches

# Converting the function to a UDF (it takes both columns so foo can be preserved)
ruleUDF = F.udf(executeRule, StringType())
df3 = df.withColumn("foo", ruleUDF("bar", "foo"))
df3.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+

Adding a group count column to a PySpark dataframe

I am coming from R and the tidyverse to PySpark due to its superior Spark handling, and I am struggling to map certain concepts from one context to the other.
In particular, suppose that I had a dataset like the following
x | y
--+--
a | 5
a | 8
a | 7
b | 1
and I wanted to add a column containing the number of rows for each x value, like so:
x | y | n
--+---+---
a | 5 | 3
a | 8 | 3
a | 7 | 3
b | 1 | 1
In dplyr, I would just say:
library(tidyverse)
df <- read_csv("...")
df %>%
  group_by(x) %>%
  mutate(n = n()) %>%
  ungroup()
and that would be that. I can do something almost as simple in PySpark if I'm looking to summarize by number of rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.count() \
.show()
And I thought I understood that withColumn was equivalent to dplyr's mutate. However, when I do the following, PySpark tells me that withColumn is not defined for groupBy data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.withColumn("n", count("x")) \
.show()
In the short run, I can simply create a second dataframe containing the counts and join it to the original dataframe. However, it seems like this could become inefficient in the case of large tables. What is the canonical way to accomplish this?
When you do a groupBy(), you have to specify the aggregation before you can display the results. For example:
import pyspark.sql.functions as f
data = [
    ('a', 5),
    ('a', 8),
    ('a', 7),
    ('b', 1),
]
df = sqlCtx.createDataFrame(data, ["x", "y"])
df.groupBy('x').count().select('x', f.col('count').alias('n')).show()
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
Here I used alias() to rename the column. But this only returns one row per group. If you want all rows with the count appended, you can do this with a Window:
from pyspark.sql import Window
w = Window.partitionBy('x')
df.select('x', 'y', f.count('x').over(w).alias('n')).sort('x', 'y').show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+
Or if you're more comfortable with SQL, you can register the dataframe as a temporary table and take advantage of pyspark-sql to do the same thing:
df.registerTempTable('table')
sqlCtx.sql(
'SELECT x, y, COUNT(x) OVER (PARTITION BY x) AS n FROM table ORDER BY x, y'
).show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+
As an addendum to @pault's answer:
import pyspark.sql.functions as F
...
(df
    .groupBy(F.col('x'))
    .agg(F.count('x').alias('n'))
    .show())
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
enjoy
I found we can get even closer to the tidyverse example:
from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy('x')
df.withColumn('n', f.count('x').over(w)).sort('x', 'y').show()

How to aggregate on one column and take maximum of others in pyspark?

I have columns X (string), Y (string), and Z (float).
And I want to
aggregate on X
take the maximum of column Z
report ALL the values for columns X, Y, and Z
If there are multiple values for column Y that correspond to the maximum for column Z, then take the maximum of those values in column Y.
For example, my table is like: table1:
col X col Y col Z
A 1 5
A 2 10
A 3 10
B 5 15
resulting in:
A 3 10
B 5 15
If I were using SQL, I would do it like this:
select X, Y, Z
from table1
join (select max(Z) as max_Z from table1 group by X) table2
on table1.Z = table2.max_Z
However, how do I do this when 1) column Z is a float, and 2) I'm using PySpark SQL?
The following two solutions are in Scala, but I honestly could not resist posting them to promote my beloved window aggregate functions. Sorry.
The only question is which structured query is more performant/effective?
Window Aggregate Function: rank
val df = Seq(
  ("A", 1, 5),
  ("A", 2, 10),
  ("A", 3, 10),
  ("B", 5, 15)
).toDF("x", "y", "z")
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
// describe window specification
import org.apache.spark.sql.expressions.Window
val byX = Window.partitionBy("x").orderBy($"z".desc, $"y".desc)
// use rank to calculate the best X
scala> df.withColumn("rank", rank over byX)
         .select("x", "y", "z")
         .where($"rank" === 1) // <-- take the first row
         .orderBy("x")
         .show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and dropDuplicates
I've always been thinking about alternatives to the rank function, and first usually springs to mind.
// use first and dropDuplicates
scala> df.
         withColumn("y", first("y") over byX).
         withColumn("z", first("z") over byX).
         dropDuplicates.
         orderBy("x").
         show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
You can consider using a window function. My approach here is to create a window function that partitions the dataframe by X first and then orders the rows by Z and Y, descending.
We can then simply select the rows where rank == 1.
Alternatively, we can use first and drop_duplicates to achieve the same task.
PS. Thanks to Jacek Laskowski for the comments and the Scala solution that led to this one.
Create toy example dataset
from pyspark.sql.window import Window
import pyspark.sql.functions as func
data = [('A', 1, 5),
        ('A', 2, 10),
        ('A', 3, 10),
        ('B', 5, 15)]
df = spark.createDataFrame(data, schema=['X', 'Y', 'Z'])
Window Aggregate Function: rank
Apply the window specification together with the rank function
w = Window.partitionBy(df['X']).orderBy([func.col('Z').desc(), func.col('Y').desc()])
df_max = df.select('X', 'Y', 'Z', func.rank().over(w).alias("rank"))
df_final = df_max.where(func.col('rank') == 1).select('X', 'Y', 'Z').orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and drop_duplicates
This task can also be achieved by using first and drop_duplicates as follows
df_final = df.select('X', func.first('Y').over(w).alias('Y'), func.first('Z').over(w).alias('Z'))\
    .drop_duplicates()\
    .orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Let's create a dataframe from your sample data:
data = [('A', 1, 5),
        ('A', 2, 10),
        ('A', 3, 10),
        ('B', 5, 15)]
df = spark.createDataFrame(data, schema=['X', 'Y', 'Z'])
df.show()
output:
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
from pyspark.sql.functions import col

# create an intermediate dataframe that finds the max of Z per X
df1 = df.groupby('X').max('Z').toDF('X2', 'max_Z')

# create a 2nd intermediate dataframe that finds the max of Y where Z equals the max of Z
df2 = df.join(df1, df.X == df1.X2)\
    .where(col('Z') == col('max_Z'))\
    .groupBy('X')\
    .max('Y').toDF('X', 'max_Y')

# join the two above to form the final result
result = df1.join(df2, df1.X2 == df2.X)\
    .select('X', 'max_Y', 'max_Z')\
    .orderBy('X')
result.show()
+---+-----+-----+
| X|max_Y|max_Z|
+---+-----+-----+
| A| 3| 10|
| B| 5| 15|
+---+-----+-----+
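For completeness, a join-free sketch of the same result (my own addition): Spark compares structs field by field, so taking the max of a (Z, Y) struct per group picks the maximum Z and, among ties on Z, the maximum Y:
from pyspark.sql import functions as F

# max over the struct gives, per X, the row-wise (Z, Y) pair with the largest Z, then largest Y
result = (df.groupBy('X')
            .agg(F.max(F.struct('Z', 'Y')).alias('m'))
            .select('X', F.col('m.Y').alias('max_Y'), F.col('m.Z').alias('max_Z'))
            .orderBy('X'))
result.show()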
