pyspark pivot without aggregation - apache-spark

I am essentially looking to pivot without requiring an aggregation at the end, so that the dataframe stays intact rather than becoming a grouped object.
As an example, I have this:
+--------+-----------+-------+-----+
| country| code      | Value | ids |
+--------+-----------+-------+-----+
| Mexico | food_1_3  | apple |  1  |
| Mexico | food_1_3  | banana|  2  |
| Canada | beverage_2| milk  |  1  |
| Mexico | beverage_2| water |  2  |
+--------+-----------+-------+-----+
Need this:
+--------+----+----------+-----------+
| country| id | food_1_3 | beverage_2|
+--------+----+----------+-----------+
| Mexico |  1 | apple    |           |
| Mexico |  2 | banana   | water     |
| Canada |  1 |          | milk      |
+--------+----+----------+-----------+
I have tried
(df.groupby(df.country, df.id).pivot("code").agg(first('Value').alias('Value')))
but that essentially just gives me a top 1. In my real case I have 20 columns, some with just integers and others with strings... so sums, counts, collect_list, none of those aggregations have worked out...

That's because your 'id' is not unique. Add a unique index column and that should work:
import pyspark.sql.functions as F

pivoted = (df.groupby(df.country, df.id, F.monotonically_increasing_id().alias('index'))
           .pivot("code")
           .agg(F.first('Value').alias('Value'))
           .drop('index'))
pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico| 1| null| apple|
| Mexico| 2| water| null|
| Canada| 1| milk| null|
| Mexico| 2| null| banana|
+-------+---+----------+--------+
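For reference, a minimal self-contained sketch of the same approach against the sample data (the sample input names the id column ids, so that name is used here):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Mexico", "food_1_3", "apple", 1),
     ("Mexico", "food_1_3", "banana", 2),
     ("Canada", "beverage_2", "milk", 1),
     ("Mexico", "beverage_2", "water", 2)],
    ["country", "code", "Value", "ids"],
)
# group on country + ids plus a per-row index so first() never has more than one value to pick from
pivoted = (df.groupby("country", "ids", F.monotonically_increasing_id().alias("index"))
           .pivot("code")
           .agg(F.first("Value").alias("Value"))
           .drop("index"))
pivoted.show()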

Related

Can I select 2 field as index in Pivot Table in Excel?

I am trying to create a pivot table in Excel that takes 2 fields as columns and uses them together as the key for grouping data.
Example:
Original Table:
| Fruit | Country | Sold |
| -------- | ---- | --|
| Apple | USA| 10|
| Apple | JAPAN| 20|
| Orange| JAPAN|5|
| Orange| USA|3|
| Orange| JAPAN|100|
| Orange| THAILAND|30|
| Banana| THAILAND|20|
| Banana| THAILAND|10|
Pivot Table I want:
| Fruit | Country | TotalSold |
| ------| ---- | --|
| Apple | USA | 10|
| Apple | JAPAN| 20|
| Orange| JAPAN|105|
| Orange| USA |3|
| Orange| THAILAND|30|
| Banana| THAILAND|30|
Basically, I want to use 2 columns as the key to group the Sold amount. I have played around in Excel for a while and still cannot find a way to group the data in this way.
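For comparison, the same two-key grouping expressed in PySpark (the tool used by the other questions on this page); this is only a sketch of the grouping logic, not an Excel answer:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
fruit_df = spark.createDataFrame(
    [("Apple", "USA", 10), ("Apple", "JAPAN", 20), ("Orange", "JAPAN", 5),
     ("Orange", "USA", 3), ("Orange", "JAPAN", 100), ("Orange", "THAILAND", 30),
     ("Banana", "THAILAND", 20), ("Banana", "THAILAND", 10)],
    ["Fruit", "Country", "Sold"],
)
# both columns act as the grouping key; Sold is summed within each (Fruit, Country) pair
fruit_df.groupBy("Fruit", "Country").agg(F.sum("Sold").alias("TotalSold")).show()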

populate master table from daily table for updated and new inserted records

I have two tables, each with a few records.
name is the column on which I can apply the join condition.
Table A (master table):
#+-------------+----------+---------------------------+----------+
#| name        | Value    | date                      | city     |
#+-------------+----------+---------------------------+----------+
#| RHDM        | 123      | 2-07-2020 12:00:55:842    | New York |
#| Rohit       | 345      | 1-05-2021 11:50:55:222    | Berlin   |
#| kerry       | 785      | 3-04-2020 11:60:55:840    | Landon   |
#+-------------+----------+---------------------------+----------+
I have another table with almost the same columns, but the Value and date columns change daily.
TableB
#+-------------+----------+---------------------------+----------+
#| name        | Value    | date                      | city     |
#+-------------+----------+---------------------------+----------+
#| Rohit       | 350      | 12-07-2021 12:00:55:842   | Berlin   |  <- value and date changed
#| Bob         | 985      | 23-04-2020 10:00:55:842   | India    |  <- new record
#| kerry       | 785      | 13-04-2020 12:00:55:842   | Landon   |  <- only date changed
#+-------------+----------+---------------------------+----------+
I need the output as Table3, which needs to have all records from Table A plus the updated records from Table B. If there is any change in the Value or date column, that value has to be picked from TableB into Table A:
#+-------------+----------+----------------------------+----------+
#| name        | Value    | date                       | City     |
#+-------------+----------+----------------------------+----------+
#| RHDM        | 123      | 2-07-2020 12:00:55:842     | New York |
#| Rohit       | 350      | 12-07-2021 12:00:55:842    | Berlin   |
#| kerry       | 785      | 13-04-2020 12:00:55:842    | Landon   |
#| Bob         | 985      | 23-04-2020 10:00:55:842    | India    |
#+-------------+----------+----------------------------+----------+
In pandas I would have done this by creating two DataFrames, dfA and dfB, and then:
result = pd.merge(dfA, dfB, on=['name'], how='outer', indicator=True)
and applying further logic. Can anyone suggest how to do it in pyspark / spark-sql?
Simply do a join:
from pyspark.sql import functions as F

df3 = dfA.join(
    dfB,
    on="name",
    how="full",  # Or "outer", same thing
).select(
    F.col("name"),
    F.coalesce(dfB["Value"], dfA["Value"]).alias("Value"),
    F.coalesce(dfB["date"], dfA["date"]).alias("date"),
)
df3.show()
+-----+-----+----------+
| name|Value| date|
+-----+-----+----------+
|kerry| 785|13-04-2020|
| Bob| 985|23-04-2020|
| RHDM| 123| 2-07-2020|
|Rohit| 350|12-07-2021|
+-----+-----+----------+
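The question's expected output also carries the city column; assuming it should follow the same rule (take Table B's value when the name exists there, otherwise keep Table A's), the select can be extended with one more coalesce:
from pyspark.sql import functions as F

# dfA / dfB are the question's DataFrames; treating city like Value and date is an assumption
df3 = dfA.join(dfB, on="name", how="full").select(
    F.col("name"),
    F.coalesce(dfB["Value"], dfA["Value"]).alias("Value"),
    F.coalesce(dfB["date"], dfA["date"]).alias("date"),
    F.coalesce(dfB["city"], dfA["city"]).alias("city"),
)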

Reverse Group By function in pyspark?

Sample Data:
+-----------+------------+---------+
|City |Continent | Price|
+-----------+------------+---------+
| A | Asia | 100|
| B | Asia | 110|
| C | Africa | 60|
| D | Europe | 170|
| E | Europe | 90|
| F | Africa | 100|
+-----------+------------+---------+
Output:
For the second column I know we can just use
df.groupby("Continent").agg({'Price':'avg'})
But how can we calculate the third column? The third column groups the cities that do not belong to each continent and then calculates the average price.
Expected output:
+-----------+--------------+----------------------------------------------+
| Continent | Average Price|Average Price for cities not in this continent|
+-----------+--------------+----------------------------------------------+
| Asia      | 105          | 105                                          |
| Africa    | 80           | 117.5                                        |
| Europe    | 130          | 92.5                                         |
+-----------+--------------+----------------------------------------------+
>>> from pyspark.sql.functions import col,avg
>>> df.show()
+----+---------+-----+
|City|Continent|Price|
+----+---------+-----+
| A| Asia| 100|
| B| Asia| 110|
| C| Africa| 60|
| D| Europe| 170|
| E| Europe| 90|
| F| Africa| 100|
+----+---------+-----+
>>> df1 = df.alias("a").join(df.alias("b"), col("a.Continent") != col("b.Continent"),"left").select(col("a.*"), col("b.price").alias("b_price"))
>>> df1.groupBy("Continent").agg(avg(col("Price")).alias("Average Price"), avg(col("b_price")).alias("Average Price for cities not in this continent")).show()
+---------+-------------+----------------------------------------------+
|Continent|Average Price|Average Price for cities not in this continent|
+---------+-------------+----------------------------------------------+
| Europe| 130.0| 92.5|
| Africa| 80.0| 117.5|
| Asia| 105.0| 105.0|
+---------+-------------+----------------------------------------------+
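If the self-join is a concern on larger data, here is a sketch of an alternative that derives the complement average from the global totals instead (same df as above):
from pyspark.sql import functions as F

# per-continent sum/count, plus the global totals attached via a cross join with a one-row frame
per_cont = df.groupBy("Continent").agg(F.sum("Price").alias("cont_sum"),
                                       F.count("Price").alias("cont_cnt"))
totals = df.agg(F.sum("Price").alias("tot_sum"), F.count("Price").alias("tot_cnt"))
result = per_cont.crossJoin(totals).select(
    "Continent",
    (F.col("cont_sum") / F.col("cont_cnt")).alias("Average Price"),
    ((F.col("tot_sum") - F.col("cont_sum")) /
     (F.col("tot_cnt") - F.col("cont_cnt"))).alias("Average Price for cities not in this continent"),
)
result.show()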

Expand last value of string column to groupby Pandas Dataframe

I have the following Pandas dataframe:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2|John|
+--------+----+
What I want to achieve is to expand the last value of each group to the rest of the group:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2|John|
| 2|John|
| 2|John|
+--------+----+
It looks pretty easy but I am struggling to achieve it because of the columns' type.
What I've tried so far is:
df['name'] = df.groupby('id')['name'].transform('last')
This works for int or float columns, but not for string columns.
I am getting the following error:
No numeric types to aggregate
Thanks in advance.
Edit
bfill() is not valid because I can have the following:
+--------+----+
|id |name|
+--------+----+
| 1| |
| 1| |
| 1| |
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3| |
| 3| |
| 3|John|
+--------+----+
In this case, I want id = 2 to remain as NaN, but with bfill() it would end up as John, which is incorrect. The desired output would be:
+--------+----+
|id |name|
+--------+----+
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 1|Carl|
| 2| |
| 2| |
| 2| |
| 3|John|
| 3|John|
| 3|John|
+--------+----+
If the empty values are NaN, you could try a backward fill:
df['name'] = df['name'].bfill()
If not, replace the empty strings with NaN first.
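A minimal sketch of that suggestion combined with the question's original transform, assuming a pandas version where 'last' handles object (string) columns (older versions raised the "No numeric types to aggregate" error mentioned above):
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "name": ["", "", "", "Carl", "", "", "", "", "", "John"]})
# treat empty strings as missing, then broadcast each group's last non-null name;
# groups with no name at all (id = 2 here) stay NaN
df["name"] = df["name"].replace("", np.nan)
df["name"] = df.groupby("id")["name"].transform("last")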
Try this.
import pandas as pd
import numpy as np

dff = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                    "name": ["", "", "", "car1", "", "", "", "", "", "john"]})
dff = dff.replace(r'', np.NaN)

def c(x):
    # if the group has at least one non-null value, broadcast that value to the whole group
    if sum(pd.isnull(x)) != np.size(x):
        l = [v for v in x if type(v) == str]
        return [l[0]] * np.size(x)
    else:
        return [""] * np.size(x)

df = dff.groupby('id')["name"].apply(lambda x: c(list(x)))
df = df.to_frame().reset_index()
df = df.set_index('id').name.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'name'})
output
id name
0 1 car1
1 1 car1
2 1 car1
3 1 car1
0 2
1 2
2 2
0 3 john
1 3 john
2 3 john

How to calculate rolling sum with varying window sizes in PySpark

I have a spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Predictions for a window size of next N values?
Input Data
+-----------+---------+------------+------------+---+
| ProductId | StoreId | Date | Prediction | N |
+-----------+---------+------------+------------+---+
| 1 | 100 | 2019-07-01 | 0.92 | 2 |
| 1 | 100 | 2019-07-02 | 0.62 | 2 |
| 1 | 100 | 2019-07-03 | 0.89 | 2 |
| 1 | 100 | 2019-07-04 | 0.57 | 2 |
| 2 | 200 | 2019-07-01 | 1.39 | 3 |
| 2 | 200 | 2019-07-02 | 1.22 | 3 |
| 2 | 200 | 2019-07-03 | 1.33 | 3 |
| 2 | 200 | 2019-07-04 | 1.61 | 3 |
+-----------+---------+------------+------------+---+
Expected Output Data
+-----------+---------+------------+------------+---+------------------------+
| ProductId | StoreId | Date | Prediction | N | RollingSum |
+-----------+---------+------------+------------+---+------------------------+
| 1 | 100 | 2019-07-01 | 0.92 | 2 | sum(0.92, 0.62) |
| 1 | 100 | 2019-07-02 | 0.62 | 2 | sum(0.62, 0.89) |
| 1 | 100 | 2019-07-03 | 0.89 | 2 | sum(0.89, 0.57) |
| 1 | 100 | 2019-07-04 | 0.57 | 2 | sum(0.57) |
| 2 | 200 | 2019-07-01 | 1.39 | 3 | sum(1.39, 1.22, 1.33) |
| 2 | 200 | 2019-07-02 | 1.22 | 3 | sum(1.22, 1.33, 1.61 ) |
| 2 | 200 | 2019-07-03 | 1.33 | 3 | sum(1.33, 1.61) |
| 2 | 200 | 2019-07-04 | 1.61 | 3 | sum(1.61) |
+-----------+---------+------------+------------+---+------------------------+
There are lots of questions and answers to this problem in Python but I couldn't find any in PySpark.
Similar Question 1
There is a similar question here, but in that one the frame size is fixed to 3. The provided answer uses the rangeBetween function, which only works with fixed-size frames, so I cannot use it for varying sizes.
Similar Question 2
There is also a similar question here. In that one, writing cases for all possible sizes is suggested, but that is not applicable in my case since I don't know how many distinct frame sizes I would need to handle.
Solution attempt 1
I've tried to solve the problem using a pandas udf:
rolling_sum_predictions = predictions.groupBy('ProductId', 'StoreId').apply(calculate_rolling_sums)
calculate_rolling_sums is a pandas udf where I solve the problem in Python. This solution works with a small amount of test data. However, when the data gets bigger (in my case, the input df has around 1B rows), the calculations take too long.
Solution attempt 2
I have used a workaround based on the answer to Similar Question 1 above. I've calculated the biggest possible N, created a list of predictions using it, and then calculated the sum of predictions by slicing that list.
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.sql.window import Window

predictions = predictions.withColumn('DayIndex', F.rank().over(Window.partitionBy('ProductId', 'StoreId').orderBy('Date')))
# find the biggest period
biggest_period = predictions.agg({"N": "max"}).collect()[0][0]
# calculate rolling predictions starting from the DayIndex
w = (Window.partitionBy(F.col("ProductId"), F.col("StoreId")).orderBy(F.col('DayIndex')).rangeBetween(0, biggest_period - 1))
rolling_prediction_lists = predictions.withColumn("next_preds", F.collect_list("Prediction").over(w))
# calculate rolling forecast sums
pred_sum_udf = udf(lambda preds, period: float(np.sum(preds[:period])), FloatType())
rolling_pred_sums = rolling_prediction_lists \
    .withColumn("RollingSum", pred_sum_udf("next_preds", "N"))
This solution also works with the test data. I haven't had a chance to test it with the original data yet, but whether it works or not, I don't like this solution. Is there any smarter way to solve this?
If you're using spark 2.4+, you can use the new higher-order array functions slice and aggregate to efficiently implement your requirement without any UDFs:
summed_predictions = predictions\
    .withColumn("summed", F.collect_list("Prediction").over(
        Window.partitionBy("ProductId", "StoreId").orderBy("Date")
              .rowsBetween(Window.currentRow, Window.unboundedFollowing)))\
    .withColumn("summed", F.expr("aggregate(slice(summed,1,N), cast(0 as double), (acc,d) -> acc + d)"))
summed_predictions.show()
+---------+-------+-------------------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| summed|
+---------+-------+-------------------+----------+---+------------------+
| 1| 100|2019-07-01 00:00:00| 0.92| 2| 1.54|
| 1| 100|2019-07-02 00:00:00| 0.62| 2| 1.51|
| 1| 100|2019-07-03 00:00:00| 0.89| 2| 1.46|
| 1| 100|2019-07-04 00:00:00| 0.57| 2| 0.57|
| 2| 200|2019-07-01 00:00:00| 1.39| 3| 3.94|
| 2| 200|2019-07-02 00:00:00| 1.22| 3| 4.16|
| 2| 200|2019-07-03 00:00:00| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04 00:00:00| 1.61| 3| 1.61|
+---------+-------+-------------------+----------+---+------------------+
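In newer PySpark (3.1+), aggregate is also exposed as F.aggregate in the DataFrame API, so the SQL lambda can be written as a Python lambda; a sketch of the same second step under that assumption:
from pyspark.sql import functions as F

# assumption: PySpark >= 3.1; summed_predictions already holds the collect_list column named "summed"
summed_predictions = summed_predictions.withColumn(
    "summed",
    F.aggregate(F.expr("slice(summed, 1, N)"), F.lit(0.0), lambda acc, d: acc + d),
)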
It might not be the best, but you can get distinct "N" column values and loop like below.
val arr = df.select("N").distinct.collect
for (n <- arr) df.filter(col("N") === n.get(0))
  .withColumn("RollingSum", sum(col("Prediction"))
    .over(Window.partitionBy("N").orderBy("N").rowsBetween(Window.currentRow, n.get(0).toString.toLong - 1)))
  .show
This will give you like:
+---------+-------+----------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| RollingSum|
+---------+-------+----------+----------+---+------------------+
| 2| 200|2019-07-01| 1.39| 3| 3.94|
| 2| 200|2019-07-02| 1.22| 3| 4.16|
| 2| 200|2019-07-03| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04| 1.61| 3| 1.61|
+---------+-------+----------+----------+---+------------------+
+---------+-------+----------+----------+---+----------+
|ProductId|StoreId| Date|Prediction| N|RollingSum|
+---------+-------+----------+----------+---+----------+
| 1| 100|2019-07-01| 0.92| 2| 1.54|
| 1| 100|2019-07-02| 0.62| 2| 1.51|
| 1| 100|2019-07-03| 0.89| 2| 1.46|
| 1| 100|2019-07-04| 0.57| 2| 0.57|
+---------+-------+----------+----------+---+----------+
Then you can do a union of all the dataframes inside the loop.
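A sketch of that loop-and-union idea in Python (variable names are illustrative; the window here partitions by ProductId/StoreId and orders by Date, matching the question's layout, rather than by N as in the Scala snippet):
from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# one filtered frame per distinct N, each with its own fixed-size window, then union them all
distinct_ns = [row[0] for row in predictions.select("N").distinct().collect()]
per_n_frames = [
    predictions.filter(F.col("N") == n).withColumn(
        "RollingSum",
        F.sum("Prediction").over(
            Window.partitionBy("ProductId", "StoreId").orderBy("Date")
                  .rowsBetween(Window.currentRow, n - 1)),
    )
    for n in distinct_ns
]
rolling = reduce(lambda a, b: a.unionByName(b), per_n_frames)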
