I'm having trouble using the agg function and renaming the results properly. So far I have a table in the following format:
+-----+---------+-------+----+------+------+
|sheet|equipment|chamber|time|value1|value2|
+-----+---------+-------+----+------+------+
|a    |E1       |C1     |1   |11    |21    |
|a    |E1       |C1     |2   |12    |22    |
|a    |E1       |C1     |3   |13    |23    |
|b    |E1       |C1     |1   |14    |24    |
|b    |E1       |C1     |2   |15    |25    |
|b    |E1       |C1     |3   |16    |26    |
+-----+---------+-------+----+------+------+
I would like to create a statistical table like this:
+-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
|sheet|E1_C1_value1_mean|E1_C1_value1_min|E1_C1_value1_max|E1_C1_value2_mean|E1_C1_value2_min|E1_C1_value2_max|
+-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
|a    |12               |11              |13              |22               |21              |23              |
|b    |15               |14              |16              |25               |24              |26              |
+-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
I would like to group by "sheet", "equipment", and "chamber" and aggregate the mean, min, and max values.
I also need to rename the resulting columns by the rule: equipment + chamber + aggregation function.
There are multiple "equipment" and "chamber" names.
Since pivot in Spark only accepts a single column, you have to concatenate the columns you want to pivot on:
df = spark.createDataFrame(
    [
        ('a', 'E1', 'C1', 1, 11, 21),
        ('a', 'E1', 'C1', 2, 12, 22),
        ('a', 'E1', 'C1', 3, 13, 23),
        ('b', 'E1', 'C1', 1, 14, 24),
        ('b', 'E1', 'C1', 2, 15, 25),
        ('b', 'E1', 'C1', 3, 16, 26),
    ],
    schema=['sheet', 'equipment', 'chamber', 'time', 'value1', 'value2']
)
df.printSchema()
df.show(10, False)
+-----+---------+-------+----+------+------+
|sheet|equipment|chamber|time|value1|value2|
+-----+---------+-------+----+------+------+
|a |E1 |C1 |1 |11 |21 |
|a |E1 |C1 |2 |12 |22 |
|a |E1 |C1 |3 |13 |23 |
|b |E1 |C1 |1 |14 |24 |
|b |E1 |C1 |2 |15 |25 |
|b |E1 |C1 |3 |16 |26 |
+-----+---------+-------+----+------+------+
If there are lots of columns you want to aggregate, you can build the aggregation list in a loop to avoid bulky code:
from pyspark.sql import functions as func

aggregation = []
for col in df.columns[-2:]:
    aggregation += [func.min(col).alias(f"{col}_min"), func.max(col).alias(f"{col}_max"), func.avg(col).alias(f"{col}_mean")]
df.withColumn('new_col', func.concat_ws('_', func.col('equipment'), func.col('chamber')))\
    .groupby('sheet')\
    .pivot('new_col')\
    .agg(*aggregation)\
    .orderBy('sheet')\
    .show(100, False)
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
|sheet|E1_C1_value1_min|E1_C1_value1_max|E1_C1_value1_mean|E1_C1_value2_min|E1_C1_value2_max|E1_C1_value2_mean|
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
|a |11 |13 |12.0 |21 |23 |22.0 |
|b |14 |16 |15.0 |24 |26 |25.0 |
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
First, create a single column out of the ones you want to pivot on.
Then pivot and aggregate as usual.
Input dataframe:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('a', 'E1', 'C1', 1, 11, 21),
     ('a', 'E1', 'C1', 2, 12, 22),
     ('a', 'E1', 'C1', 3, 13, 23),
     ('b', 'E1', 'C1', 1, 14, 24),
     ('b', 'E1', 'C1', 2, 15, 25),
     ('b', 'E1', 'C1', 3, 16, 26)],
    ['sheet', 'equipment', 'chamber', 'time', 'value1', 'value2'])
Script:
df = df.withColumn('_temp', F.concat_ws('_', 'equipment', 'chamber'))
df = (df
    .groupBy('sheet')
    .pivot('_temp')
    .agg(
        F.mean('value1').alias('value1_mean'),
        F.min('value1').alias('value1_min'),
        F.max('value1').alias('value1_max'),
        F.mean('value2').alias('value2_mean'),
        F.min('value2').alias('value2_min'),
        F.max('value2').alias('value2_max'),
    )
)
df.show()
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
# |sheet|E1_C1_value1_mean|E1_C1_value1_min|E1_C1_value1_max|E1_C1_value2_mean|E1_C1_value2_min|E1_C1_value2_max|
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
# | b| 15.0| 14| 16| 25.0| 24| 26|
# | a| 12.0| 11| 13| 22.0| 21| 23|
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
I have a PySpark dataframe df:
+-----+------------+---------------+-----------+--------------+-------------+----------------+------------+---------------+
|STORE|COL_APPLE_BB|COL_APPLE_NONBB|COL_PEAR_BB|COL_PEAR_NONBB|COL_ORANGE_BB|COL_ORANGE_NONBB|COL_GRAPE_BB|COL_GRAPE_NONBB|
+-----+------------+---------------+-----------+--------------+-------------+----------------+------------+---------------+
|1    |28          |24             |24         |32            |26           |54              |60          |36             |
|2    |19          |12             |24         |13            |10           |24              |29          |10             |
+-----+------------+---------------+-----------+--------------+-------------+----------------+------------+---------------+
I have another PySpark dataframe df2:
+-----+---+------+-----+
|STORE|PDT|FRUIT |TYPE |
+-----+---+------+-----+
|1    |1  |APPLE |BB   |
|1    |2  |ORANGE|NONBB|
|1    |3  |PEAR  |BB   |
|1    |4  |GRAPE |BB   |
|1    |5  |APPLE |BB   |
|1    |6  |ORANGE|BB   |
|2    |1  |PEAR  |NONBB|
|2    |2  |ORANGE|NONBB|
|2    |3  |APPLE |NONBB|
+-----+---+------+-----+
Expected PySpark df2 with a COL_VALUE column for the respective store, fruit, and type:
+-----+---+------+-----+---------+
|STORE|PDT|FRUIT |TYPE |COL_VALUE|
+-----+---+------+-----+---------+
|1    |1  |APPLE |BB   |28       |
|1    |2  |ORANGE|NONBB|54       |
|1    |3  |PEAR  |BB   |24       |
|1    |4  |GRAPE |BB   |60       |
|1    |5  |APPLE |BB   |28       |
|1    |6  |ORANGE|BB   |26       |
|2    |1  |PEAR  |NONBB|13       |
|2    |2  |ORANGE|NONBB|24       |
|2    |3  |APPLE |NONBB|12       |
+-----+---+------+-----+---------+
from pyspark.sql.functions import *
df = spark.createDataFrame(
    [
        (1, 28, 24, 24, 32, 26, 54, 60, 36),
        (2, 19, 12, 24, 13, 10, 24, 29, 10)
    ],
    ["STORE", "COL_APPLE_BB", "COL_APPLE_NONBB", "COL_PEAR_BB", "COL_PEAR_NONBB", "COL_ORANGE_BB", "COL_ORANGE_NONBB", "COL_GRAPE_BB", "COL_GRAPE_NONBB"]
)

df2 = spark.createDataFrame(
    [
        (1, 1, "APPLE", "BB"),
        (1, 2, "ORANGE", "NONBB"),
        (1, 3, "PEAR", "BB"),
        (1, 4, "GRAPE", "BB"),
        (1, 5, "APPLE", "BB"),
        (1, 6, "ORANGE", "BB"),
        (2, 1, "PEAR", "NONBB"),
        (2, 2, "ORANGE", "NONBB"),
        (2, 3, "APPLE", "NONBB")
    ],
    ["STORE", "PDT", "FRUIT", "TYPE"]
)

# Unpivot the wide df into one (STORE, Appended, COL_VALUE) row per fruit/type column.
unPivot_df = df.select("STORE", expr("""stack(8,
    'APPLE_BB', COL_APPLE_BB,
    'APPLE_NONBB', COL_APPLE_NONBB,
    'PEAR_BB', COL_PEAR_BB,
    'PEAR_NONBB', COL_PEAR_NONBB,
    'ORANGE_BB', COL_ORANGE_BB,
    'ORANGE_NONBB', COL_ORANGE_NONBB,
    'GRAPE_BB', COL_GRAPE_BB,
    'GRAPE_NONBB', COL_GRAPE_NONBB) as (Appended, COL_VALUE)"""))
df2 = df2.withColumn("Appended",concat_ws('_',col("FRUIT"),col("TYPE")))
df2 = df2.join(unPivot_df,['STORE',"Appended"],"left")
df2.show()
+-----+------------+---+------+-----+---------+
|STORE| Appended|PDT| FRUIT| TYPE|COL_VALUE|
+-----+------------+---+------+-----+---------+
| 1|ORANGE_NONBB| 2|ORANGE|NONBB| 54|
| 1| PEAR_BB| 3| PEAR| BB| 24|
| 1| GRAPE_BB| 4| GRAPE| BB| 60|
| 1| APPLE_BB| 1| APPLE| BB| 28|
| 2|ORANGE_NONBB| 2|ORANGE|NONBB| 24|
| 2| APPLE_NONBB| 3| APPLE|NONBB| 12|
| 1| ORANGE_BB| 6|ORANGE| BB| 26|
| 1| APPLE_BB| 5| APPLE| BB| 28|
| 2| PEAR_NONBB| 1| PEAR|NONBB| 13|
+-----+------------+---+------+-----+---------+
If you have Spark 3.2 or higher you could use something like:
data = data.melt(
    id_vars=['STORE'],
    value_vars=data.columns[1:],
    var_name="variable",
    value_name="value"
)
to get a "long" form of the dataset, and then use regexp_extract twice to get the required information from the variable column.
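A minimal sketch of that extraction step (assuming the melted result is, or has been converted to, a plain Spark DataFrame, and assuming the column names from the example above):

from pyspark.sql import functions as F

# "variable" holds strings like "COL_APPLE_BB"; pull FRUIT and TYPE out with two regexp_extract calls.
long_df = (data
    .withColumn("FRUIT", F.regexp_extract("variable", r"^COL_([^_]+)_([^_]+)$", 1))
    .withColumn("TYPE", F.regexp_extract("variable", r"^COL_([^_]+)_([^_]+)$", 2))
    .withColumnRenamed("value", "COL_VALUE")
    .drop("variable"))

# Attach COL_VALUE to each (STORE, FRUIT, TYPE) row of df2.
result = df2.join(long_df, ["STORE", "FRUIT", "TYPE"], "left")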
For earlier versions of Spark, use the following:
def process_row(row):
    output = []
    for index, key in enumerate(row.asDict()):
        if key == "STORE":
            store = row[key]
        else:
            _, fruit, type = key.split("_")
            output.append((store, index, fruit, type, row[key]))
    return output

data = data.rdd.flatMap(process_row).toDF(
    schema=["STORE", "PDT", "FRUIT", "TYPE", "COLUMN_VALUE"]
)
As an alternative to melt, you can use stack in earlier Spark versions:
df = spark.createDataFrame(
    [
        (1, 28, 24),
        (2, 19, 12),
    ],
    ["STORE", "COL_APPLE_BB", "COL_APPLE_NONBB"]
)
df2 = spark.createDataFrame(
    [
        (1, 1, "APPLE", "BB"),
        (1, 2, "ORANGE", "NONBB"),
        (1, 2, "APPLE", "NONBB"),
        (2, 3, "APPLE", "NONBB")
    ],
    ["STORE", "PDT", "FRUIT", "TYPE"]
)
Create a column that matches the "COL_FRUIT_TYPE" in df:
from pyspark.sql import functions as F

df3 = df2.withColumn("fruit_type", F.concat(F.lit("COL_"), F.col("FRUIT"), F.lit("_"), F.col("TYPE")))
df3.show(10, False)
gives:
+-----+---+------+-----+----------------+
|STORE|PDT|FRUIT |TYPE |fruit_type |
+-----+---+------+-----+----------------+
|1 |1 |APPLE |BB |COL_APPLE_BB |
|1 |2 |ORANGE|NONBB|COL_ORANGE_NONBB|
|1 |2 |APPLE |NONBB|COL_APPLE_NONBB |
+-----+---+------+-----+----------------+
Then "unpivot" the first df:
from pyspark.sql.functions import expr
unpivotExpr = "stack({}, {}) as (fruit_type, COL_VALUE)".format(len(df.columns) - 1, ','.join( [("'{}', {}".format(c, str(c))) for c in df.columns[1:]] ) )
print(unpivotExpr)
unPivotDF = df.select("STORE", expr(unpivotExpr)) \
.where("STORE is not null")
unPivotDF.show(truncate=False)
The stack function takes as arguments: first, the number of "columns" it will be unpivoting (here derived as len(df.columns) - 1, since we skip the STORE column); then, for simple label/value pairs, a list of these in the form 'col_name', col_name. The
[("'{}', {}".format(c, str(c))) for c in df.columns[1:]] comprehension takes the columns of df, skipping the first one (STORE), and returns such a pair for each remaining column, e.g. 'COL_APPLE_BB', COL_APPLE_BB. These pairs are then joined into a comma-separated string (",".join()) and substituted for the {} placeholder.
Example of how the stack function is usually called:
"stack(2, 'COL_APPLE_BB', COL_APPLE_BB, 'COL_APPLE_NONBB', COL_APPLE_NONBB) as (fruit_type, COL_VALUE)"
The unPivotDF.show(truncate=False) outputs:
+-----+---------------+---------+
|STORE|fruit_type |COL_VALUE|
+-----+---------------+---------+
|1 |COL_APPLE_BB |28 |
|1 |COL_APPLE_NONBB|24 |
|2 |COL_APPLE_BB |19 |
|2 |COL_APPLE_NONBB|12 |
+-----+---------------+---------+
and join the two:
df3.join(unPivotDF, ["fruit_type", "STORE"], "left")\
.select("STORE", "PDT", "FRUIT", "TYPE", "COL_VALUE").show(40, False)
result:
+-----+---+------+-----+---------+
|STORE|PDT|FRUIT |TYPE |COL_VALUE|
+-----+---+------+-----+---------+
|1 |2 |ORANGE|NONBB|null |
|1 |2 |APPLE |NONBB|24 |
|1 |1 |APPLE |BB |28 |
|2 |3 |APPLE |NONBB|12 |
+-----+---+------+-----+---------+
The drawback is that you need to enumerate the column names in stack; if I figure out a way to do this automatically, I will update the answer.
EDIT: I have updated the use of the stack function, so that it can derive the columns by itself.
I have the below table:
df = spark.createDataFrame(
    [('a', 1, 11, 44),
     ('b', 2, 21, 33),
     ('a', 2, 10, 40),
     ('c', 5, 55, 45),
     ('b', 4, 22, 35),
     ('a', 3, 9, 45)],
    ['id', 'left', 'right', 'centre'])
I need to find and display only the max values as shown below:
(The expected output is shown as a screenshot in the original question: https://i.stack.imgur.com/q8bGq.png)
Simple groupBy and agg:
from pyspark.sql import functions as F
df = df.groupBy('id').agg(
    F.max('left').alias('max_left'),
    F.max('right').alias('max_right'),
    F.max('centre').alias('max_centre'),
)
df.show()
# +---+--------+---------+----------+
# | id|max_left|max_right|max_centre|
# +---+--------+---------+----------+
# | b| 4| 22| 35|
# | a| 3| 11| 45|
# | c| 5| 55| 45|
# +---+--------+---------+----------+
Or slightly more advanced:
df = df.groupBy('id').agg(
    *[F.max(c).alias(f'max_{c}') for c in df.columns if c != 'id']
)
I'm working with PySpark. I have a dataset like this:
I want to count the rows of my dataset grouped by my "Column3" column.
For example, here I want to get this dataset:
pyspark.sql.functions.count(col):
Aggregate function: returns the number of items in a group.
from pyspark.sql.functions import count

temp = spark.createDataFrame([
    (0, 11, 'A'),
    (1, 12, 'B'),
    (2, 13, 'B'),
    (0, 14, 'A'),
    (1, 15, 'c'),
    (2, 16, 'A'),
], ["column1", "column2", 'column3'])

temp.groupBy('column3').agg(count('*').alias('count')).sort('column3').show(10, False)
# +-------+-----+
# |column3|count|
# +-------+-----+
# |A |3 |
# |B |2 |
# |c |1 |
# +-------+-----+
df.groupBy('column_3').count()
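As a quick illustration of this shortcut (assuming the temp DataFrame and the 'column3' name from the example above in place of 'column_3'), groupBy(...).count() adds a count column per group, equivalent to agg(count('*')):

temp.groupBy('column3').count().sort('column3').show()
# +-------+-----+
# |column3|count|
# +-------+-----+
# |      A|    3|
# |      B|    2|
# |      c|    1|
# +-------+-----+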
Below is the sales data available to calculate max_price .
Logic for max_price: max(last 3 weeks' price).
For the first 3 weeks, where the previous weeks' data is not available, the max price will be the max of (week 1, week 2, week 3), i.e. max(rank 5, 6, 7) in the example below.
How can I implement this using a window function in Spark?
Here is a solution using a PySpark Window with lead and a UDF.
Please note that I changed the prices for ranks 5, 6, 7 to 1, 2, 3 to differentiate them from the other values and show that this logic picks exactly what you described.
from pyspark.sql import Window
from pyspark.sql.functions import udf, col, array, coalesce, lead
from pyspark.sql.types import IntegerType

max_price_udf = udf(lambda prices_list: max(prices_list), IntegerType())

df = spark.createDataFrame([(1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
                            (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
                            (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
                            (7, 51, 2018, 7, 3)], ["product_id", "week", "year", "rank", "price"])

window = Window.orderBy(col("year").desc(), col("week").desc())
df = df.withColumn("prices_list", array([coalesce(lead(col("price"), x, None).over(window), lead(col("price"), x - 3, None).over(window)) for x in range(1, 4)]))
df = df.withColumn("max_price", max_price_udf(col("prices_list")))
df.show()
which results
+----------+----+----+----+-----+------------+---------+
|product_id|week|year|rank|price| prices_list|max_price|
+----------+----+----+----+-----+------------+---------+
| 1| 5|2019| 1| 20|[18, 21, 20]| 21|
| 2| 4|2019| 2| 18| [21, 20, 1]| 21|
| 3| 3|2019| 3| 21| [20, 1, 2]| 20|
| 4| 2|2019| 4| 20| [1, 2, 3]| 3|
| 5| 1|2019| 5| 1| [2, 3, 1]| 3|
| 6| 52|2018| 6| 2| [3, 1, 2]| 3|
| 7| 51|2018| 7| 3| [1, 2, 3]| 3|
+----------+----+----+----+-----+------------+---------+
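As a side note, the Python UDF is not strictly needed here: on Spark 2.4+ the built-in array_max can take the maximum of prices_list, or greatest can be applied directly to the three lead/coalesce expressions (which is what the Scala version below does). A minimal sketch, assuming the same df and window as above:

from pyspark.sql.functions import array_max, greatest, coalesce, lead, col

# Option 1 (Spark 2.4+): built-in array_max over the already-built prices_list.
df = df.withColumn("max_price", array_max(col("prices_list")))

# Option 2: skip the intermediate array and apply greatest to the three columns.
df = df.withColumn("max_price", greatest(*[
    coalesce(lead(col("price"), x, None).over(window),
             lead(col("price"), x - 3, None).over(window))
    for x in range(1, 4)
]))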
Here is the solution in Scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

var df = Seq((1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
             (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
             (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
             (7, 51, 2018, 7, 3)).toDF("product_id", "week", "year", "rank", "price")

val window = Window.orderBy($"year".desc, $"week".desc)
df = df.withColumn("max_price", greatest((for (x <- 1 to 3) yield coalesce(lead(col("price"), x, null).over(window), lead(col("price"), x - 3, null).over(window))): _*))
df.show()
You can use SQL window functions combined with greatest(). When the window has fewer than 3 following rows, you need to fall back on the current row and even prior rows, so lag1_price and lag2_price are calculated in the inner sub-query. In the outer query you check the count_row value and call greatest() with the appropriate combination of lag1_price, lag2_price, the current price and max_price1 for count_row values of 2, 1 and 0 to get the maximum value.
Check this out:
val df = Seq((1, 5, 2019,1,20),(2, 4, 2019,2,18),
(3, 3, 2019,3,21),(4, 2, 2019,4,20),
(5, 1, 2019,5,1),(6, 52, 2018,6,2),
(7, 51, 2018,7,3)).toDF("product_id", "week", "year","rank","price")
df.createOrReplaceTempView("sales")
val df2 = spark.sql("""
select product_id, week, year, price,
count(*) over(order by year desc, week desc rows between 1 following and 3 following ) as count_row,
lag(price) over(order by year desc, week desc ) as lag1_price,
sum(price) over(order by year desc, week desc rows between 2 preceding and 2 preceding ) as lag2_price,
max(price) over(order by year desc, week desc rows between 1 following and 3 following ) as max_price1 from sales
""")
df2.show(false)
df2.createOrReplaceTempView("sales_inner")
spark.sql("""
select product_id, week, year, price,
case
when count_row=2 then greatest(price,max_price1)
when count_row=1 then greatest(price,lag1_price,max_price1)
when count_row=0 then greatest(price,lag1_price,lag2_price)
else max_price1
end as max_price
from sales_inner
""").show(false)
Results:
+----------+----+----+-----+---------+----------+----------+----------+
|product_id|week|year|price|count_row|lag1_price|lag2_price|max_price1|
+----------+----+----+-----+---------+----------+----------+----------+
|1 |5 |2019|20 |3 |null |null |21 |
|2 |4 |2019|18 |3 |20 |null |21 |
|3 |3 |2019|21 |3 |18 |20 |20 |
|4 |2 |2019|20 |3 |21 |18 |3 |
|5 |1 |2019|1 |2 |20 |21 |3 |
|6 |52 |2018|2 |1 |1 |20 |3 |
|7 |51 |2018|3 |0 |2 |1 |null |
+----------+----+----+-----+---------+----------+----------+----------+
+----------+----+----+-----+---------+
|product_id|week|year|price|max_price|
+----------+----+----+-----+---------+
|1 |5 |2019|20 |21 |
|2 |4 |2019|18 |21 |
|3 |3 |2019|21 |20 |
|4 |2 |2019|20 |3 |
|5 |1 |2019|1 |3 |
|6 |52 |2018|2 |3 |
|7 |51 |2018|3 |3 |
+----------+----+----+-----+---------+
I need to aggregate rows in a DataFrame by collecting the values in a certain column in each group into a set. pyspark.sql.functions.collect_set does exactly what I need.
However, I need to do this for two columns in turn, because I need to group the input by one column, divide each group into subgroups by another column, and do some aggregation on each subgroup. I don't see how to get collect_set to create a set for each group.
Example:
from pyspark.sql.functions import collect_set, count

df = spark.createDataFrame([('a', 'x', 11, 22), ('a', 'y', 33, 44), ('b', 'x', 55, 66), ('b', 'y', 77, 88), ('a', 'x', 12, 23), ('a', 'y', 34, 45), ('b', 'x', 56, 67), ('b', 'y', 78, 89)], ('col1', 'col2', 'col3', 'col4'))
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| x| 11| 22|
| a| y| 33| 44|
| b| x| 55| 66|
| b| y| 77| 88|
| a| x| 12| 23|
| a| y| 34| 45|
| b| x| 56| 67|
| b| y| 78| 89|
+----+----+----+----+
g1 = df.groupBy('col1', 'col2').agg(collect_set('col3'),collect_set('col4'))
g1.show()
+----+----+-----------------+-----------------+
|col1|col2|collect_set(col3)|collect_set(col4)|
+----+----+-----------------+-----------------+
| a| x| [12, 11]| [22, 23]|
| b| y| [78, 77]| [88, 89]|
| a| y| [33, 34]| [45, 44]|
| b| x| [56, 55]| [66, 67]|
+----+----+-----------------+-----------------+
g2 = g1.groupBy('col1').agg(collect_set('collect_set(col3)'),collect_set('collect_set(col4)'),count('col2'))
g2.show(truncate=False)
+----+--------------------------------------------+--------------------------------------------+-----------+
|col1|collect_set(collect_set(col3)) |collect_set(collect_set(col4)) |count(col2)|
+----+--------------------------------------------+--------------------------------------------+-----------+
|b |[WrappedArray(56, 55), WrappedArray(78, 77)]|[WrappedArray(66, 67), WrappedArray(88, 89)]|2 |
|a |[WrappedArray(33, 34), WrappedArray(12, 11)]|[WrappedArray(22, 23), WrappedArray(45, 44)]|2 |
+----+--------------------------------------------+--------------------------------------------+-----------+
I'd like the result to look more like
+----+----------------+----------------+-----------+
|col1| ...col3... | ...col4... |count(col2)|
+----+----------------+----------------+-----------+
|b |[56, 55, 78, 77]|[66, 67, 88, 89]|2 |
|a |[33, 34, 12, 11]|[22, 23, 45, 44]|2 |
+----+----------------+----------------+-----------+
but I don't see an aggregate function to take the union of two or more sets, or a pyspark operation to flatten the "array of arrays" structure that shows up in g2.
Does pyspark provide a simple way to accomplish this? Or is there a totally different approach I should be taking?
In PySpark 2.4+, you can use the now built-in flatten function.
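For illustration, a minimal sketch of that approach (assuming the g2 DataFrame from the question; array_distinct, also built in since 2.4, merges duplicates across the sub-sets):

from pyspark.sql import functions as F

# Flatten the nested arrays produced by the two-level collect_set,
# then drop duplicates across the merged sub-sets.
g2.select(
    F.col('col1'),
    F.array_distinct(F.flatten(F.col('collect_set(collect_set(col3))'))).alias('col3_vals'),
    F.array_distinct(F.flatten(F.col('collect_set(collect_set(col4))'))).alias('col4_vals'),
    F.col('count(col2)'),
).show(truncate=False)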
You can flatten the columns with a UDF afterwards:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

flatten = udf(lambda l: [x for i in l for x in i], ArrayType(IntegerType()))
I took the liberty of renaming the columns of g2 to col3 and col4 to save typing. This gives:
g3 = g2.withColumn('col3flat', flatten('col3'))
g3.show()
+----+--------------------+--------------------+-----+----------------+
|col1| col3| col4|count| col3flat|
+----+--------------------+--------------------+-----+----------------+
| b|[[78, 77], [56, 55]]|[[66, 67], [88, 89]]| 2|[78, 77, 56, 55]|
| a|[[12, 11], [33, 34]]|[[22, 23], [45, 44]]| 2|[12, 11, 33, 34]|
+----+--------------------+--------------------+-----+----------------+
You can accomplish the same with
from pyspark.sql.functions import collect_set, countDistinct
(
    df.
    groupby('col1').
    agg(
        collect_set('col3').alias('col3_vals'),
        collect_set('col4').alias('col4_vals'),
        countDistinct('col2').alias('num_grps')
    ).
    show(truncate=False)
)
+----+----------------+----------------+--------+
|col1|col3_vals |col4_vals |num_grps|
+----+----------------+----------------+--------+
|b |[78, 56, 55, 77]|[66, 88, 67, 89]|2 |
|a |[33, 12, 34, 11]|[45, 22, 44, 23]|2 |
+----+----------------+----------------+--------+