Conditional replacement of values in pyspark dataframe - apache-spark

I have the spark dataframe below:
+----------+-------------+--------------+------------+----------+-------------------+
| part| company| country| city| price| date|
+----------+-------------+--------------+------------+----------+-------------------+
| 52125-136| Brainsphere| null| Braga| 493.94€|2016-05-10 11:13:43|
| 70253-307|Chatterbridge| Spain| Barcelona| 969.29€|2016-05-10 13:06:30|
| 50563-113| Kanoodle| Japan| Niihama| ¥72909.95|2016-05-10 13:11:57|
|52380-1102| Flipstorm| France| Nanterre| 794.84€|2016-05-10 13:19:12|
| 54473-578| Twitterbeat| France| Annecy| 167.48€|2016-05-10 15:09:46|
| 76335-006| Ntags| Portugal| Lisbon| 373.07€|2016-05-10 15:20:22|
| 49999-737| Buzzbean| Germany| Düsseldorf| 861.2€|2016-05-10 15:21:51|
| 68233-011| Flipstorm| Greece| Athens| 512.89€|2016-05-10 15:22:03|
| 36800-952| Eimbee| France| Amiens| 219.74€|2016-05-10 21:22:46|
| 16714-295| Teklist| null| Arnhem| 624.4€|2016-05-10 21:57:15|
| 42254-213| Thoughtmix| Portugal| Amadora| 257.99€|2016-05-10 22:01:04|
+----------+-------------+--------------+------------+----------+-------------------+
Of these columns, only the country column has null values. I want to fill the null values with the country that corresponds to the city in the adjacent column. The dataframe is large, and for a given city (Braga, for example) some rows carry the correct country while others have null.
So, how can I fill those null values in the country column based on the city column, while still taking advantage of Spark's parallel computation?

You can use a window function for that.
from pyspark.sql import functions as F, Window

df.withColumn(
    "country",
    F.coalesce(
        F.col("country"),
        # borrow a non-null country from another row with the same city
        F.first("country", ignorenulls=True).over(Window.partitionBy("city")),
    ),
).show()
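If a city never appears with a non-null country anywhere in the dataframe, its rows will still be null after this fill. A quick sanity check (a minimal sketch, assuming the filled result has been assigned to a dataframe called df_filled):
# count rows whose country is still null after the fill
df_filled.filter(F.col("country").isNull()).count()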

Use the coalesce function in Spark to get the first non-null value from a list of columns.
Example:
df.show()
#+--------+---------+
#| country| city|
#+--------+---------+
#| null| Braga|
#| Spain|Barcelona|
#| null| Arnhem|
#|portugal| Amadora|
#+--------+---------+
from pyspark.sql.functions import coalesce, col
df.withColumn("country", coalesce(col("country"), col("city"))).show()
#+--------+---------+
#| country| city|
#+--------+---------+
#| Braga| Braga|
#| Spain|Barcelona|
#| Arnhem| Arnhem|
#|portugal| Amadora|
#+--------+---------+
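Note that coalescing with the city fills the gap with the city name itself, which is not exactly what the original question asks for. To fill the country from other rows that share the same city, an alternative to the window approach is to build a lookup of the known (city, country) pairs and join it back. A minimal sketch (lookup, known_country and filled are names chosen here for illustration; dropDuplicates simply picks one country per city if several appear):
from pyspark.sql import functions as F

# distinct non-null (city, country) pairs seen anywhere in the dataframe
lookup = (df.filter(F.col("country").isNotNull())
            .select("city", F.col("country").alias("known_country"))
            .dropDuplicates(["city"]))

# left-join the lookup back and fill only the missing countries
filled = (df.join(lookup, on="city", how="left")
            .withColumn("country", F.coalesce(F.col("country"), F.col("known_country")))
            .drop("known_country"))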

Related

Return null in SUM if some values are null

I have a case where there may be null values in the column that needs to be summed up within a group.
If I encounter a null in a group, I want the sum of that group to be null. But PySpark by default seems to ignore the null rows and sum up the rest of the non-null values.
For example:
from pyspark.sql import functions as f

dataframe = dataframe.groupBy('dataframe.product', 'dataframe.price') \
    .agg(f.sum('price'))
The expected output would have a null sum for any group that contains a null price, but instead I am getting the sum of the non-null values in each group.
The sum function returns NULL only if all values in the group are null; otherwise nulls are simply ignored.
You can use conditional aggregation: if count(price) == count(*) there are no nulls and we return sum(price); else, null is returned:
from pyspark.sql import functions as F
df.groupby("product").agg(
    F.when(F.count("price") == F.count("*"), F.sum("price")).alias("sum_price")
).show()
#+-------+---------+
#|product|sum_price|
#+-------+---------+
#| B| 200|
#| C| null|
#| A| 250|
#+-------+---------+
Since Spark 3.0, one can also use the any aggregate function:
df.groupby("product").agg(
F.when(~F.expr("any(price is null)"), F.sum("price")).alias("sum_price")
).show()
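This yields the same result as the count-based version above, since any(price is null) is true exactly when count(price) < count(*).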
You can replace nulls with NaNs using coalesce:
df2 = df.groupBy('product').agg(
    F.sum(
        F.coalesce(F.col('price'), F.lit(float('nan')))
    ).alias('sum(price)')
).orderBy('product')
df2.show()
+-------+----------+
|product|sum(price)|
+-------+----------+
| A| 250.0|
| B| 200.0|
| C| NaN|
+-------+----------+
If you want to keep integer type, you can convert NaNs back to nulls using nanvl:
df2 = df.groupBy('product').agg(
    F.nanvl(
        F.sum(
            F.coalesce(F.col('price'), F.lit(float('nan')))
        ),
        F.lit(None)
    ).cast('int').alias('sum(price)')
).orderBy('product')
df2.show()
+-------+----------+
|product|sum(price)|
+-------+----------+
| A| 250|
| B| 200|
| C| null|
+-------+----------+

Python Spark join two dataframes and fill column

I have two dataframes that need to be joined in a particular way I am struggling with.
dataframe 1:
+--------------------+---------+----------------+
| asset_domain| eid| oid|
+--------------------+---------+----------------+
| test-domain...| 126656| 126656|
| nebraska.aaa.com| 335660| 335660|
| netflix.com| 460| 460|
+--------------------+---------+----------------+
dataframe 2:
+--------------------+--------------------+---------+--------------+----+----+------------+
| asset| asset_domain|dns_count| ip| ev|post|form_present|
+--------------------+--------------------+---------+--------------+----+----+------------+
| sub1.test-domain...| test-domain...| 6354| 11.11.111.111| 1| 1| null|
| netflix.com| netflix.com| 3836| 22.22.222.222|null|null| null|
+--------------------+--------------------+---------+--------------+----+----+------------+
desired result:
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| asset|dns_count| ip| ev|post|form_present| eid| oid|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| netflix.com| 3836|22.22.222.222|null|null| null| 460| 460|
| sub1.test-domain...| 5924|111.11.111.11| 1| 1| null| 126656| 126656|
| nebraska.aaa.com| null| null|null|null| null| 335660| 335660|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
Basically – it should join df1 and df2 on asset_domain but if that doesn't exist in df2, then the resulting asset should be the asset_domain from df1.
I tried df = df2.join(df1, ["asset_domain"], "right").drop("asset_domain") but that obviously leaves null in the asset column for nebraska.aaa.com since it does not have a matching domain in df2. How do I go about adding those to the asset column for this particular case?
You can use the coalesce function after the join to create the asset column.
from pyspark.sql.functions import coalesce

df2.join(df1, ["asset_domain"], "right").select(coalesce("asset", "asset_domain").alias("asset"), "dns_count", "ip", "ev", "post", "form_present", "eid", "oid").orderBy("asset").show()
#+----------------+---------+-------------+----+----+------------+------+------+
#| asset|dns_count| ip| ev|post|form_present| eid| oid|
#+----------------+---------+-------------+----+----+------------+------+------+
#|nebraska.aaa.com| null| null|null|null| null|335660|335660|
#| netflix.com| 3836|22.22.222.222|null|null| null| 460| 460|
#|sub1.test-domain| 6354|11.11.111.111| 1| 1| null|126656|126656|
#+----------------+---------+-------------+----+----+------------+------+------+
After the join you can use the isNull() function
import pyspark.sql.functions as F

tst1 = sqlContext.createDataFrame([('netflix', 1), ('amazon', 2)], schema=("asset_domain", 'xtra1'))
tst2 = sqlContext.createDataFrame([('netflix', 'yahoo', 1), ('amazon', 'yahoo', 2), ('flipkart', None, 2)], schema=("asset_domain", "asset", 'xtra'))
tst_j = tst1.join(tst2, on='asset_domain', how='right')
# fill asset from asset_domain wherever the join left it null
tst_res = tst_j.withColumn("asset", F.when(F.col('asset').isNull(), F.col('asset_domain')).otherwise(F.col('asset')))
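In tst_res, the flipkart row (whose asset was null in tst2 and which has no match in tst1) ends up with its asset filled from asset_domain, while the matched rows keep their original asset values.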

How to add column with alternate values in PySpark dataframe?

I have the following sample dataframe
df = spark.createDataFrame([('start', 'end'), ('start1', 'end1')], ["start", "end"])
and I want to explode the values in each row and associate alternating 1-0 values in the generated rows. This way I can identify the start/end entries in each row.
I am able to achieve the desired result this way
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

w = Window().orderBy(fn.lit('A'))
df = (df.withColumn('start_end', fn.array('start', 'end'))
        .withColumn('date', fn.explode('start_end'))
        .withColumn('row_num', fn.row_number().over(w)))
df = (df.withColumn('is_start', fn.when(fn.col('row_num') % 2 == 0, 0).otherwise(1))
        .select('date', 'is_start'))
which gives
| date | is_start |
|--------|----------|
| start | 1 |
| end | 0 |
| start1 | 1 |
| end1 | 0 |
but it seems overly complicated for such a simple task.
Is there any better/cleaner way without using UDFs?
You can use pyspark.sql.functions.posexplode along with pyspark.sql.functions.array.
First create an array out of your start and end columns, then explode this with the position:
from pyspark.sql.functions import array, posexplode
df.select(posexplode(array("end", "start")).alias("is_start", "date")).show()
#+--------+------+
#|is_start| date|
#+--------+------+
#| 0| end|
#| 1| start|
#| 0| end1|
#| 1|start1|
#+--------+------+
You can try union:
from pyspark.sql import functions as F

df = spark.createDataFrame([('start', 'end'), ('start1', 'end1')], ["start", "end"])
df = df.withColumn('startv', F.lit(1))
df = df.withColumn('endv', F.lit(0))
df = df.select(['start', 'startv']).union(df.select(['end', 'endv']))
df.show()
+------+------+
| start|startv|
+------+------+
| start| 1|
|start1| 1|
| end| 0|
| end1| 0|
+------+------+
From here you can rename the columns and, if needed, re-order the rows, as sketched below.
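A minimal sketch of the renaming step (restoring a strict start/end alternation of the rows would additionally require carrying an explicit ordering key, e.g. one added with monotonically_increasing_id() before the union):
df = df.toDF('date', 'is_start')  # positional rename of start/startv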
I had a similar situation in my use case. My dataset was huge (~50 GB), and any self join or heavy transformation resulted in more memory pressure and unstable execution.
I went one level lower, to the RDD, and used flatMap. This uses a map-side transformation and is cost effective in terms of shuffle, CPU and memory.
df = spark.createDataFrame([('start', 'end'), ('start1', 'end1')], ["start", "end"])
df.show()
+------+----+
| start| end|
+------+----+
| start| end|
|start1|end1|
+------+----+
# emit (start, 1) and (end, 0) for every input row
final_df = df.rdd.flatMap(lambda row: [(row.start, 1), (row.end, 0)]).toDF(['date', 'is_start'])
final_df.show()
+------+--------+
| date|is_start|
+------+--------+
| start| 1|
| end| 0|
|start1| 1|
| end1| 0|
+------+--------+

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation based on the value of type.
df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
.withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
.groupBy("ID") \
.agg(F.sum("amount").alias("sales"),
F.sum("sales_a").alias("sales_a"),
F.sum("sales_b").alias("sales_b"))
from pyspark.sql import functions as F

dfSales = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = dfSales.join(dfPivot, on="ID", how="left")
Then replace null with 0 (a sketch follows below).
This is a generic solution that works irrespective of the values in the type column; if a type c is later added to the dataframe, the pivot will simply create an additional column for it.
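A minimal sketch of that last step, assuming the joined result is held in res as above and that the pivot produced columns named a and b (the renames only serve to match the desired output):
res = (res.na.fill(0)
          .withColumnRenamed('a', 'sales_a')
          .withColumnRenamed('b', 'sales_b'))
res.show()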

pyspark join two Dataframe and keep row by the recent date

I have two Dataframes A and B.
A
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 5|2018-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
B
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
and I must create a new dataframe where the score is updated by looking at the date.
result
+---+------+-----+----------+
|id |player|score|date |
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
You can join the two dataframes, and use pyspark.sql.functions.when() to pick the values for the score and date columns.
from pyspark.sql.functions import col, when

df_A.alias("a").join(df_B.alias("b"), on=["id", "player"], how="inner")\
    .select(
        "id",
        "player",
        when(
            col("b.date") > col("a.date"),
            col("b.score")
        ).otherwise(col("a.score")).alias("score"),
        when(
            col("b.date") > col("a.date"),
            col("b.date")
        ).otherwise(col("a.date")).alias("date")
    )\
    .show()
#+---+------+-----+----------+
#| id|player|score| date|
#+---+------+-----+----------+
#| 1| alpha| 100|2019-02-13|
#| 2| beta| 6|2018-02-13|
#+---+------+-----+----------+
Read more on when: Spark Equivalent of IF Then ELSE
I am making the assumption that every player is allocated an id and that it doesn't change. The OP wants the resulting dataframe to contain the score from the most recent date.
# Creating both the DataFrames.
from pyspark.sql.functions import col, to_date

df_A = sqlContext.createDataFrame([(1, 'alpha', 5, '2018-02-13'), (2, 'beta', 6, '2018-02-13')], ('id', 'player', 'score', 'date'))
df_A = df_A.withColumn('date', to_date(col('date'), 'yyyy-MM-dd'))
df_B = sqlContext.createDataFrame([(1, 'alpha', 100, '2019-02-13'), (2, 'beta', 6, '2018-02-13')], ('id', 'player', 'score', 'date'))
df_B = df_B.withColumn('date', to_date(col('date'), 'yyyy-MM-dd'))
The idea is to take the union() of these two dataframes and then keep the distinct rows. The reason for taking distinct rows afterwards is the following: suppose there was no update for a player; then in dataframe B its corresponding values will be the same as in dataframe A, so we remove such duplicates.
# Importing the requisite packages.
from pyspark.sql.functions import col, max
from pyspark.sql import Window
df = df_A.union(df_B).distinct()
df.show()
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 5|2018-02-13|
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
Now, as a final step, use a Window() function over the unioned dataframe df to find the latestDate per (id, player) and keep only those rows where the date equals the latestDate. That way, the stale rows are removed for players that had an update (manifested by a newer date in dataframe B).
w = Window.partitionBy('id','player')
df = df.withColumn('latestDate', max('date').over(w))\
.where(col('date') == col('latestDate')).drop('latestDate')
df.show()
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
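Note that if a player had two rows sharing the same latest date but with different scores, both rows would survive this filter; an extra tie-break (for example a row_number() over a window ordered by date and then score) would be needed in that case.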
