Replace values in a PySpark Dataframe group with max row values - apache-spark

We have this PySpark Dataframe:
+---+--------+-----------+
| id|language|    summary|
+---+--------+-----------+
|  2|    Java|      Great|
|  4|  Python|    Awesome|
|  7|  Python|    Amazing|
|  9|  Python| Incredible|
|  3|   Scala|       Good|
|  6|   Scala|  Fantastic|
+---+--------+-----------+
This is a bit convoluted, so please bear with me. For rows sharing the same language value, I want to adjust the summary column using id as the tie breaker: within each language group, find the row with the maximum id and set every row's summary in that group to that row's summary. For example, for Python all summaries should become "Incredible", since the row with "Incredible" has the highest id among the Python rows. Same for Scala. So the result should be:
+---+--------+-----------+
| id|language|    summary|
+---+--------+-----------+
|  2|    Java|      Great|
|  4|  Python| Incredible|
|  7|  Python| Incredible|
|  9|  Python| Incredible|
|  3|   Scala|  Fantastic|
|  6|   Scala|  Fantastic|
+---+--------+-----------+
We can assume that the ids are always going to be unique for each language group. We will never have the same id twice for one language although we may see the same id for different languages.
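If you want to reproduce the examples below, here is a minimal sketch that builds the sample DataFrame shown above (the SparkSession setup is an assumption added for completeness, not part of the original question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above
df = spark.createDataFrame(
    [(2, "Java", "Great"), (4, "Python", "Awesome"), (7, "Python", "Amazing"),
     (9, "Python", "Incredible"), (3, "Scala", "Good"), (6, "Scala", "Fantastic")],
    ["id", "language", "summary"],
)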

Another way, using a window with last():
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Order each language partition by id and use an unbounded frame so that
# last() always sees the whole partition and returns the max-id row's summary.
win = (Window.partitionBy('language')
             .orderBy('id')
             .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df = df.withColumn('summary', F.last('summary').over(win))

You can get the summary corresponding to the maximum id for each language using a window function. Structs are compared field by field, so taking the max of struct('id', 'summary') within each language partition picks out the row with the highest id, and its summary field is then extracted:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'summary',
    F.max(F.struct('id', 'summary')).over(Window.partitionBy('language'))['summary']
)
df2.show()
+---+--------+----------+
| id|language|   summary|
+---+--------+----------+
|  3|   Scala| Fantastic|
|  6|   Scala| Fantastic|
|  4|  Python|Incredible|
|  7|  Python|Incredible|
|  9|  Python|Incredible|
|  2|    Java|     Great|
+---+--------+----------+

Related

Calculate Spark column value depending on another row value on the same column

I'm working on Apache Spark 2.3.0 cloudera4 and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1|  d2|
+---+---+----+
|  1|   | 2.0|
|  2|   |-4.0|
|  3|   | 6.0|
|  4|3.0|    |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1|  d2|   r|
+---+---+----+----+
|  1|   | 2.0| 7.0|
|  2|   |-4.0| 5.0|
|  3|   | 6.0| 9.0|
|  4|3.0|    | 3.0|
+---+---+----+----+
In iterative terms: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? I need a value computed for one row in order to calculate the value for another row.
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of the column. Other than that, I'm creating a couple of helper columns to pick out the max-id row and to get the ordering right.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val result = df
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show
+---+----+----+---+
| id|  d1|  d2|  r|
+---+----+----+---+
|  1|null| 2.0|7.0|
|  2|null|-4.0|5.0|
|  3|null| 6.0|9.0|
|  4| 3.0|null|3.0|
+---+----+----+---+
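For readers working in PySpark rather than Scala, a rough equivalent of the above answer might look like this (a sketch, assuming the same id, d1 and d2 columns):
from pyspark.sql import functions as F, Window

# whole-frame window to find the max id, then a cumulative sum over descending id
w_all = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
result = (df
    .withColumn("max_id", F.max("id").over(w_all))
    .withColumn("r1", F.when(F.col("id") == F.col("max_id"), F.col("d1")).otherwise(F.col("d2")))
    .withColumn("r", F.sum("r1").over(Window.orderBy(F.col("id").desc())))
    .drop("max_id", "r1")
    .orderBy("id"))
result.show()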

how to combine rows in a data frame by id

I have a data frame:
+---------+---------------------+
| id| Name|
+---------+---------------------+
| 1| 'Gary'|
| 1| 'Danny'|
| 2| 'Christopher'|
| 2| 'Kevin'|
+---------+---------------------+
I need to combine all the Name values for each id into a single list. Please tell me how to get from it to this:
+---------+------------------------+
| id| Name|
+---------+------------------------+
| 1| ['Gary', 'Danny']|
| 2| ['Kevin','Christopher']|
+---------+------------------------+
You can use groupBy with the collect functions. Depending on what you need, use collect_list or collect_set:
df.groupBy(col("id")).agg(collect_list(col("Name")))
in case you want to keep duplicate values, or
df.groupBy(col("id")).agg(collect_set(col("Name")))
if you want unique values.
Use groupBy and collect_list functions for this case.
from pyspark.sql.functions import *
df.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(10,False)
#+---+------------------------+
#|id |Name |
#+---+------------------------+
#|1 |['Gary', 'Danny'] |
#|2 |['Kevin', 'Christopher']|
#+---+------------------------+
For comparison, the equivalent in pandas (not PySpark) would be:
df.groupby('id')['Name'].apply(list)

How to use spark window function as cascading changes of previous row to next row

I tried to use a window function to calculate the current value based on the previous value in a dynamic way:
rowID | value
------------------
1 | 5
2 | 7
3 | 6
Logic:
If value > previous value, then value becomes the previous value.
So in row 2, since 7 > 5, the value becomes 5.
The final result should be
rowID | value
------------------
1 | 5
2 | 5
3 | 5
However, using lag().over(w) gave this result:
rowID | value
------------------
1 | 5
2 | 5
3 | 6
It compares the third row's value 6 against the original "7", not the updated value "5".
Any suggestion how to achieve this?
df.show()
#exampledataframe
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 7|
| 3| 6|
| 4| 9|
| 5| 4|
| 6| 3|
+-----+-----+
Your required logic is too dynamic for window functions, so we have to go row by row updating the values. One solution is to apply a normal Python UDF to the collected lists and then explode once the UDF has been applied. If you have relatively small data, this should be fine (Spark 2.4+ only, because of arrays_zip).
from pyspark.sql import functions as F
from pyspark.sql.types import *
def add_one(a):
    # replace any value that is larger than its (already updated) predecessor
    for i in range(1, len(a)):
        if a[i] > a[i-1]:
            a[i] = a[i-1]
    return a

udf1 = F.udf(add_one, ArrayType(IntegerType()))

df.agg(F.collect_list("rowID").alias("rowID"), F.collect_list("value").alias("value"))\
    .withColumn("value", udf1("value"))\
    .withColumn("zipped", F.explode(F.arrays_zip("rowID", "value"))).select("zipped.*").show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
UPDATE:
Better yet, since you have groups of 5000, using a Pandas vectorized UDF (grouped map) should help a lot with processing, and you do not have to collect_list 5000 integers and explode, or use a pivot. I think this should be the optimal solution. Grouped-map Pandas UDFs are available for Spark 2.3+.
The groupby below is empty, but you can add your grouping column to it.
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def grouped_map(df1):
    # df1 is a pandas DataFrame for one group; cascade the cap row by row
    for i in range(1, len(df1)):
        if df1.loc[i, 'value'] > df1.loc[i-1, 'value']:
            df1.loc[i, 'value'] = df1.loc[i-1, 'value']
    return df1

df.groupby().apply(grouped_map).show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
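As a side note, on Spark 3.x the same grouped-map idea is usually written with applyInPandas rather than the GROUPED_MAP decorator; a minimal sketch under that assumption, reusing the DataFrame above:
def cap_to_previous(pdf):
    # pdf is a pandas DataFrame for one group; carry the cap forward row by row
    for i in range(1, len(pdf)):
        if pdf.loc[i, 'value'] > pdf.loc[i - 1, 'value']:
            pdf.loc[i, 'value'] = pdf.loc[i - 1, 'value']
    return pdf

# an empty groupby treats the whole DataFrame as a single group, as above
df.groupby().applyInPandas(cap_to_previous, schema=df.schema).show()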

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation, based on the value of type:
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
  .withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
  .groupBy("ID") \
  .agg(F.sum("amount").alias("sales"),
       F.sum("sales_a").alias("sales_a"),
       F.sum("sales_b").alias("sales_b"))
Note that IDs with no rows of a given type will get null for that sum; a fillna(0) on the result (see the sketch after the next answer) produces the zeros shown in the desired output.
Another option is to pivot on the type column:
from pyspark.sql import functions as F

dfSales = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = dfSales.join(dfPivot, dfSales.ID == dfPivot.ID, how='left')
Then replace null with 0.
This is a generic solution that works irrespective of the values in the type column: if a type c is added to the dataframe, a column for c will be created as well.
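A minimal sketch of the "replace null with 0" step, assuming the joined result is named res as above:
# fill the per-type sums (and any other numeric nulls) with 0
res = res.fillna(0)
res.show()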

How to rename duplicated columns after join? [duplicate]

This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names with the other dataframes, so I want to drop or rename some columns, like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
                  .join(cc_df, 'id', 'left')
                  .withColumnRenamed(bb_df.status, 'user_status'))
Please note that the status column is in two dataframes, i.e. aa_df and bb_df.
The above doesn't work. I also tried to use withColumn, but the new column is created and the old column still exists.
If you are trying to rename the status column of bb_df dataframe then you can do so while joining as
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names with the other dataframes
That's a fine use case for aliasing a Dataset using alias or as operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set.
(And honestly, I only now noticed the Symbol-based variants.)
NOTE There are two as operators, as for aliasing and as for type mapping. Consult the Dataset API.
After you've aliased a Dataset, you can reference columns using the [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how to reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns like below
My general recommendation is not to drop columns, but to select what you want to include in the result. That makes life more predictable, as you know what you get (not what you don't). I was told that our brains work in positives, which could also be a point in favour of select.
So, as you asked and I showed in the above example, the result has two columns of the same name id. The question is how to have only one.
There are at least two answers using the variant of the join operator that takes the join columns or condition (as you showed in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select (over drop), I'd do the following to have a single id column:
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
  .select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question of how to use withColumnRenamed when there are two matching columns (after a join).
Let's assume you ended up with the following query and so you've got two id columns (per join side).
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Please see the docs : withColumnRenamed()
You need to pass the name of the existing column and the new name to the function. Both of these should be strings.
result_df = aa_df.join(bb_df,'id', 'left').join(cc_df, 'id', 'left').withColumnRenamed('status', 'user_status')
If you have a 'status' column in both dataframes, you can include it in the join, e.g. aa_df.join(bb_df, ['id', 'status'], 'left'), assuming aa_df and bb_df share that column. This way you will not end up with two 'status' columns.
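Putting that suggestion together for the three dataframes in the question, a minimal sketch (assuming aa_df and bb_df really do share both id and status, and cc_df joins on id only):
# joining on both common columns keeps a single 'status' column,
# then the third dataframe is joined on 'id'
result_df = (aa_df.join(bb_df, ['id', 'status'], 'left')
                  .join(cc_df, 'id', 'left'))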
