Reverse Group By function in pyspark? - apache-spark

Sample Data:
+-----------+------------+---------+
| City      | Continent  |   Price |
+-----------+------------+---------+
| A         | Asia       |     100 |
| B         | Asia       |     110 |
| C         | Africa     |      60 |
| D         | Europe     |     170 |
| E         | Europe     |      90 |
| F         | Africa     |     100 |
+-----------+------------+---------+
For the second column I know we can just use
df.groupby("Continent").agg({'Price':'avg'})
But how can we calculate the third column? The third column groups the cities that do not belong to each continent and then calculates the average price.
Expected output:
+-----------+--------------+-----------------------------------------------+
| Continent | Average Price| Average Price for cities not in this continent|
+-----------+--------------+-----------------------------------------------+
| Asia      |           105|                                            105|
| Africa    |            80|                                          117.5|
| Europe    |           130|                                           92.5|
+-----------+--------------+-----------------------------------------------+

>>> from pyspark.sql.functions import col,avg
>>> df.show()
+----+---------+-----+
|City|Continent|Price|
+----+---------+-----+
| A| Asia| 100|
| B| Asia| 110|
| C| Africa| 60|
| D| Europe| 170|
| E| Europe| 90|
| F| Africa| 100|
+----+---------+-----+
>>> df1 = df.alias("a").join(df.alias("b"), col("a.Continent") != col("b.Continent"),"left").select(col("a.*"), col("b.price").alias("b_price"))
>>> df1.groupBy("Continent").agg(avg(col("Price")).alias("Average Price"), avg(col("b_price")).alias("Average Price for cities not in this continent")).show()
+---------+-------------+----------------------------------------------+
|Continent|Average Price|Average Price for cities not in this continent|
+---------+-------------+----------------------------------------------+
| Europe| 130.0| 92.5|
| Africa| 80.0| 117.5|
| Asia| 105.0| 105.0|
+---------+-------------+----------------------------------------------+
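If you prefer to avoid the self-join, here is a lighter sketch (assuming the same df as above) that computes the global sum and count once and subtracts each continent's contribution; it should give the same three rows:
>>> from pyspark.sql.functions import sum as _sum, count, col
>>> totals = df.agg(_sum("Price").alias("tot_sum"), count("Price").alias("tot_cnt")).collect()[0]
>>> (df.groupBy("Continent")
...    .agg(_sum("Price").alias("grp_sum"), count("Price").alias("grp_cnt"))
...    .select(col("Continent"),
...            (col("grp_sum") / col("grp_cnt")).alias("Average Price"),
...            ((totals["tot_sum"] - col("grp_sum")) / (totals["tot_cnt"] - col("grp_cnt")))
...                .alias("Average Price for cities not in this continent"))
...    .show())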

Related

Using withColumn, assign value to only the last row in a group in pyspark

I have a table with groups that can be ordered by time.
Group | Time | Food
-------------------------
Fruits | 1 | Apples
Fruits | 3 | Ketchup
Fruits | 5 | Bananas
Veggies | 2 | Broccoli
Veggies | 4 | Peas
Veggies | 8 | Carrots
As part of a more complicated when().otherwise() clause inside of withColumn() I need to assign a value into that new column for the last row of each group. I suspect I should use row number so I have something like this:
windowSpec = Window.partitionBy("Group").orderBy("Time")
my_table \
    .withColumn("group_row", F.row_number().over(windowSpec)) \
    .withColumn("is_window_last",
                F.when(F.max("group_row").over(windowSpec) == F.col("group_row"), "Last")
                 .otherwise("Not Last")) \
    .show()
I would expect the result to be
Group | Time | Food | group_row | is_window_last
------------------------------------------------
Fruits | 1 | Apples | 1 | Not Last
Fruits | 3 | Ketchup | 2 | Not Last
Fruits | 5 | Bananas | 3 | Last
Veggies | 2 | Broccoli | 1 | Not Last
Veggies | 4 | Peas | 2 | Not Last
Veggies | 8 | Carrots | 3 | Last
But instead I get
Group | Time | Food | group_row | is_window_last
------------------------------------------------
Fruits | 1 | Apples | 1 | Last
Fruits | 3 | Ketchup | 2 | Last
Fruits | 5 | Bananas | 3 | Last
Veggies | 2 | Broccoli | 1 | Last
Veggies | 4 | Peas | 2 | Last
Veggies | 8 | Carrots | 3 | Last
I've tried
my_table \
    .withColumn("group_row", F.row_number().over(windowSpec)) \
    .withColumn("is_window_last",
                F.when(F.max("group_row").over(windowSpec) == F.col("group_row").over(windowSpec), "Last")
                 .otherwise("Not Last")) \
    .show()
and
my_table \
    .withColumn("group_row", F.row_number().over(windowSpec)) \
    .withColumn("is_window_last",
                F.when((F.max("group_row") == F.col("group_row")).over(windowSpec), "Last")
                 .otherwise("Not Last")) \
    .show()
but neither did what I expected.
That's because you use the same windowSpec for both window functions:
from pyspark.sql import functions as F, Window as W

windowSpec = W.partitionBy('g').orderBy('t')

(df
    .withColumn('group_row', F.row_number().over(windowSpec))
    .withColumn('max_group', F.max('group_row').over(W.partitionBy('g')))  # DON'T order by 't' (time) here
    .withColumn('is_window_last', F
        .when(F.col('max_group') == F.col('group_row'), 'Last')
        .otherwise('Not Last')
    )
    .show()
)
+-------+---+--------+---------+---------+--------------+
| g| t| f|group_row|max_group|is_window_last|
+-------+---+--------+---------+---------+--------------+
| Fruits| 1| Apples| 1| 3| Not Last|
| Fruits| 3| Ketchup| 2| 3| Not Last|
| Fruits| 5| Bananas| 3| 3| Last|
|Veggies| 2|Broccoli| 1| 3| Not Last|
|Veggies| 4| Peas| 2| 3| Not Last|
|Veggies| 8| Carrots| 3| 3| Last|
+-------+---+--------+---------+---------+--------------+
This is because F.max("group_row").over(windowSpec) with an orderBy uses the default frame (unbounded preceding up to the current row), so it computes a running max of group_row, which always equals the current row's group_row. If you remove orderBy("Time") from that window, the frame covers the whole partition and you get the max of group_row per group.
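Alternatively, you can keep the ordered windowSpec and just make the frame explicit, so the max is taken over the whole partition. A minimal sketch, assuming my_table and the column names from the question:
from pyspark.sql import functions as F, Window

windowSpec = Window.partitionBy("Group").orderBy("Time")
# with an orderBy, the default frame stops at the current row, so widen it explicitly
fullWindow = windowSpec.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

my_table \
    .withColumn("group_row", F.row_number().over(windowSpec)) \
    .withColumn("is_window_last",
                F.when(F.max("group_row").over(fullWindow) == F.col("group_row"), "Last")
                 .otherwise("Not Last")) \
    .show()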
I would use last() instead; row_number() incurs more compute effort, since you also have to compute the max row number to achieve your objective.
from pyspark.sql.functions import col, last, when
from pyspark.sql import Window

# full-partition frame, so last() sees the whole group instead of only rows up to the current one
w = Window.partitionBy('Group').orderBy('Time') \
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn('c', last('Food').over(w)) \
  .withColumn('c', when(col('Food') == col('c'), 'Last').otherwise('Not Last')) \
  .show()
+-------+----+--------+--------+
| Group|Time| Food| c|
+-------+----+--------+--------+
| Fruits| 1| Apples|Not Last|
| Fruits| 3| Ketchup|Not Last|
| Fruits| 5| Bananas| Last|
|Veggies| 2|Broccoli|Not Last|
|Veggies| 4| Peas|Not Last|
|Veggies| 8| Carrots| Last|
+-------+----+--------+--------+

Can I select 2 field as index in Pivot Table in Excel?

I am trying to create a pivot table in Excel that takes 2 fields as columns and uses them as the key for grouping the data.
Example:
Original Table:
| Fruit  | Country  | Sold |
| ------ | -------- | ---- |
| Apple  | USA      | 10   |
| Apple  | JAPAN    | 20   |
| Orange | JAPAN    | 5    |
| Orange | USA      | 3    |
| Orange | JAPAN    | 100  |
| Orange | THAILAND | 30   |
| Banana | THAILAND | 20   |
| Banana | THAILAND | 10   |
Pivot Table I want:
| Fruit  | Country  | TotalSold |
| ------ | -------- | --------- |
| Apple  | USA      | 10        |
| Apple  | JAPAN    | 20        |
| Orange | JAPAN    | 105       |
| Orange | USA      | 3         |
| Orange | THAILAND | 30        |
| Banana | THAILAND | 30        |
Basically, I want to use 2 columns as the key to group the Sold amount. I have played around in Excel for a while and still cannot find a way to group the data this way.

PySpark percentile for multiple columns

I want to convert multiple numeric columns of a PySpark dataframe into their percentile values using PySpark, without changing the order of the rows.
E.g., given an array of column names arr = [Salary, Age, Bonus], convert those columns into percentiles.
Input
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 1000 | 45 | 100 |
| 2 | Sales | 596 | 500 | 30 | 50 |
| 3 | Manufacture | 895 | 600 | 50 | 400 |
| 4 | HR | 212 | 700 | 26 | 60 |
| 5 | Business | 754 | 350 | 18 | 22 |
+----------+-------------+---------+--------+-----+-------+
Expected output
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 100 | 80 | 80 |
| 2 | Sales | 596 | 40 | 60 | 40 |
| 3 | Manufacture | 895 | 60 | 100 | 100 |
| 4 | HR | 212 | 80 | 40 | 60 |
| 5 | Business | 754 | 20 | 20 | 20 |
+----------+-------------+---------+--------+-----+-------+
The formula for percentile for a given element 'x' in the list = (Number of elements less than 'x'/Total number of elements) *100.
You can use percentile_approx for this, in conjunction with groupBy on the desired columns for which you want the percentile to be calculated.
Built in for Spark 3.x and above:
from pyspark.sql import SparkSession, functions as F

sql = SparkSession.builder.getOrCreate()

input_list = [
    (1, "HR", 111, 1000, 45, 100),
    (2, "Sales", 112, 500, 30, 50),
    (3, "Manufacture", 127, 600, 50, 500),
    (4, "Hr", 821, 700, 26, 60),
    (5, "Business", 754, 350, 18, 22),
]

sparkDF = sql.createDataFrame(input_list, ['emp_no', 'dept', 'pincode', 'salary', 'age', 'bonus'])

sparkDF.groupBy(['emp_no', 'dept']).agg(
    F.first(F.col('pincode')).alias('pincode'),
    *[F.percentile_approx(F.col(c), 0.95).alias(c) for c in ['salary', 'age', 'bonus']]
).show()
+------+-----------+-------+------+---+-----+
|emp_no| dept|pincode|salary|age|bonus|
+------+-----------+-------+------+---+-----+
| 3|Manufacture| 127| 600| 50| 500|
| 1| HR| 111| 1000| 45| 100|
| 2| Sales| 112| 500| 30| 50|
| 5| Business| 754| 350| 18| 22|
| 4| Hr| 821| 700| 26| 60|
+------+-----------+-------+------+---+-----+
Spark has a window function for calculating percentiles which is called percent_rank.
Test df:
from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "HR", 111, 1000, 45, 100),
     (2, "Sales", 596, 500, 30, 50),
     (3, "Manufacture", 895, 600, 50, 400),
     (4, "HR", 212, 700, 26, 60),
     (5, "Business", 754, 350, 18, 22)],
    ['Empl_No', 'Dept', 'Pincode', 'Salary', 'Age', 'Bonus'])
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1000| 45| 100|
# | 2| Sales| 596| 500| 30| 50|
# | 3|Manufacture| 895| 600| 50| 400|
# | 4| HR| 212| 700| 26| 60|
# | 5| Business| 754| 350| 18| 22|
# +-------+-----------+-------+------+---+-----+
percent_rank works in a way that the smallest value gets percentile 0 and the biggest value gets 1.
arr = ['Salary', 'Age', 'Bonus']
df = df.select(
    *[c for c in df.columns if c not in arr],
    *[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
df.show()
# +-------+-----------+-------+------+----+-----+
# |Empl_No| Dept|Pincode|Salary| Age|Bonus|
# +-------+-----------+-------+------+----+-----+
# | 1| HR| 111| 1.0|0.75| 0.75|
# | 2| Sales| 596| 0.25| 0.5| 0.25|
# | 3|Manufacture| 895| 0.5| 1.0| 1.0|
# | 4| HR| 212| 0.75|0.25| 0.5|
# | 5| Business| 754| 0.0| 0.0| 0.0|
# +-------+-----------+-------+------+----+-----+
However, your expectation is somewhat different. You expect it to assume 0 as the smallest value even though it does not exist in the columns.
To solve this I will add a row with 0 values and later it will be deleted.
arr = ['Salary', 'Age', 'Bonus']
# Adding a row containing 0 values
df = df.limit(1).withColumn('Dept', F.lit('_tmp')).select(
    *[c for c in df.columns if c not in arr],
    *[F.lit(0).alias(c) for c in arr]
).union(df)
# Calculating percentiles
df = df.select(
    *[c for c in df.columns if c not in arr],
    *[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
# Removing the fake row
df = df.filter("Dept != '_tmp'")
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1.0|0.8| 0.8|
# | 2| Sales| 596| 0.4|0.6| 0.4|
# | 3|Manufacture| 895| 0.6|1.0| 1.0|
# | 4| HR| 212| 0.8|0.4| 0.6|
# | 5| Business| 754| 0.2|0.2| 0.2|
# +-------+-----------+-------+------+---+-----+
You can multiply the percentile by 100 if you like:
*[(100 * F.percent_rank().over(W.orderBy(c))).alias(c) for c in arr]
Then you get...
+-------+-----------+-------+------+-----+-----+
|Empl_No| Dept|Pincode|Salary| Age|Bonus|
+-------+-----------+-------+------+-----+-----+
| 1| HR| 111| 100.0| 80.0| 80.0|
| 2| Sales| 596| 40.0| 60.0| 40.0|
| 3|Manufacture| 895| 60.0|100.0|100.0|
| 4| HR| 212| 80.0| 40.0| 60.0|
| 5| Business| 754| 20.0| 20.0| 20.0|
+-------+-----------+-------+------+-----+-----+
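If you want the exact numbers from the expected output without adding and removing a helper row, here is a minimal sketch using rank() divided by the row count. It assumes the original test df defined above (before the percentile transformations) and distinct values per column, since ties would share a rank:
from pyspark.sql import functions as F, Window as W

arr = ['Salary', 'Age', 'Bonus']
n = df.count()  # total number of rows

# rank(x) * 100 / n equals (number of elements <= x) / total * 100 when values are distinct
df_pct = df.select(
    *[c for c in df.columns if c not in arr],
    *[(F.rank().over(W.orderBy(c)) * 100.0 / n).alias(c) for c in arr]
).sort('Empl_No')
df_pct.show()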

pyspark pivot without aggregation

I am looking to essentially pivot without requiring an aggregation at the end, so that the dataframe stays intact and I don't create a grouped object.
As an example, I have this:
+---------+------------+--------+-----+
| country | code       | Value  | ids |
+---------+------------+--------+-----+
| Mexico  | food_1_3   | apple  | 1   |
| Mexico  | food_1_3   | banana | 2   |
| Canada  | beverage_2 | milk   | 1   |
| Mexico  | beverage_2 | water  | 2   |
+---------+------------+--------+-----+
Need this:
+---------+-----+----------+------------+
| country | id  | food_1_3 | beverage_2 |
+---------+-----+----------+------------+
| Mexico  | 1   | apple    |            |
| Mexico  | 2   | banana   | water      |
| Canada  | 1   |          | milk       |
+---------+-----+----------+------------+
I have tried
(df.groupby(df.country, df.id).pivot("code").agg(first('Value').alias('Value')))
but essentially just get a top 1. In my real case I have 20 columns, some with just integers and others with strings... so sums, counts, collect_list: none of those aggs have worked out...
That's because your 'id' is not unique. Add a unique index column and that should work:
import pyspark.sql.functions as F
pivoted = (df
    .groupby(df.country, df.ids, F.monotonically_increasing_id().alias('index'))
    .pivot("code")
    .agg(F.first('Value').alias('Value'))
    .drop('index'))
pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico| 1| null| apple|
| Mexico| 2| water| null|
| Canada| 1| milk| null|
| Mexico| 2| null| banana|
+-------+---+----------+--------+

PySpark - Select rows where the column has non-consecutive values after grouping

I have a dataframe of the form:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
| ha42a | AB | 3 |
| ha42a | AB | 4 |
| ha42a | AB | 5 |
I want to filter out users that are only seen on consecutive days, i.e. keep only users that are seen on at least one non-consecutive day. The resulting dataframe should be:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
where the last user has been removed, since they appeared only on consecutive days.
Does anyone know how this can be done in Spark?
Using Spark SQL window functions and without any UDFs. The df construction is done in Scala, but the SQL part will be the same in Python. Check this out:
val df = Seq(("d25as","AB",2),("d25as","AB",3),("d25as","AB",5),("m3562","AB",1),("m3562","AB",7),("m3562","AB",9),("ha42a","AB",3),("ha42a","AB",4),("ha42a","AB",5)).toDF("user_id","action","day")
df.createOrReplaceTempView("qubix")
spark.sql(
""" with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
select user_id, action, day from t2 where size(diff2) > 1
""").show(false)
Results:
+-------+------+---+
|user_id|action|day|
+-------+------+---+
|d25as |AB |2 |
|d25as |AB |3 |
|d25as |AB |5 |
|m3562 |AB |1 |
|m3562 |AB |7 |
|m3562 |AB |9 |
+-------+------+---+
pyspark version
>>> from pyspark.sql.functions import *
>>> values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
... ('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
... ('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
>>> df = spark.createDataFrame(values,['user_id','action','day'])
>>> df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
>>> df.createOrReplaceTempView("qubix")
>>> spark.sql(
... """ with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
... t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
... select user_id, action, day from t2 where size(diff2) > 1
... """).show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
+-------+------+---+
>>>
Read the comments in between; the code will be self-explanatory then.
from pyspark.sql.functions import udf, collect_list, explode, col
from pyspark.sql.types import BooleanType

# Creating the DataFrame
values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
          ('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
          ('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
df = sqlContext.createDataFrame(values, ['user_id','action','day'])
df.show()
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
# Grouping together the days in one list.
df = df.groupby(['user_id','action']).agg(collect_list('day'))
df.show()
+-------+------+-----------------+
|user_id|action|collect_list(day)|
+-------+------+-----------------+
| ha42a| AB| [3, 4, 5]|
| m3562| AB| [1, 7, 9]|
| d25as| AB| [2, 3, 5]|
+-------+------+-----------------+
# Creating a UDF to check if the days are consecutive or not. Only keep the False ones.
check_consecutive = udf(lambda row: sorted(row) == list(range(min(row), max(row) + 1)), BooleanType())
df = df.withColumn('consecutive', check_consecutive(col('collect_list(day)'))) \
       .where(col('consecutive') == False)
df.show()
+-------+------+-----------------+-----------+
|user_id|action|collect_list(day)|consecutive|
+-------+------+-----------------+-----------+
| m3562| AB| [1, 7, 9]| false|
| d25as| AB| [2, 3, 5]| false|
+-------+------+-----------------+-----------+
# Finally, exploding the DataFrame from above to get the result.
df = df.withColumn("day", explode(col('collect_list(day)'))) \
       .drop('consecutive', 'collect_list(day)')
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
+-------+------+---+
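For reference, the same row_number() - day trick from the SQL answer can be expressed with the DataFrame API alone, without SQL strings or UDFs: the difference row_number() - day stays constant within a run of consecutive days, so a user with more than one distinct difference has at least one gap. A minimal sketch, assuming the original df with columns user_id, action, day:
from pyspark.sql import functions as F, Window as W

w = W.partitionBy('user_id').orderBy('day')

result = (df
    .withColumn('diff', F.row_number().over(w) - F.col('day'))
    # number of distinct runs of consecutive days per user
    .withColumn('n_runs', F.size(F.collect_set('diff').over(W.partitionBy('user_id'))))
    .filter(F.col('n_runs') > 1)
    .drop('diff', 'n_runs'))

result.show()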
