Doing cumulative sum for each year and month in Spark SQL - apache-spark

Input:
item  loc  qty  year  month
A     IND  10   2019  13
A     IND  20   2020  1
A     IND  10   2020  2
A     IND  40   2020  3
A     IND  50   2020  5
A     IND  10   2020  6
OUTPUT:
item  loc  sum(qty)  year  month
A     IND  0         2019  13
A     IND  10        2020  1
A     IND  30        2020  2
A     IND  40        2020  3
A     IND  50        2020  5
A     IND  90        2020  6
Description:
The output is computed as follows:
If I am calculating for year 2020 and month 3, then I need to consider the sum(qty) between (month - 3) and (month - 1), i.e. in this case from year 2019 month 12 to year 2020 month 2.
So for year 2020 and month 3 the output will be sum(qty) = 10 + 20 + 10 = 40.
Now for year 2020 and month 6, sum(qty) is taken between year 2020, month - 3 = 3 and year 2020, month - 1 = 5,
so sum(qty) = 40 + 0 (0 for month 4, which is not in the table) + 50 = 90.

Try this:
df.createOrReplaceTempView("test")
spark.sql("""
  SELECT
    item,
    loc,
    COALESCE(
      SUM(qty) OVER (
        PARTITION BY item
        -- (year - 2000) * 13 + month maps each (year, month) pair to a single
        -- increasing month index (13 months per year, matching the sample data),
        -- so the RANGE frame below covers exactly the previous three months.
        ORDER BY (year - 2000) * 13 + month
        RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING
      ), 0) AS sum_qty,
    year,
    month
  FROM test
""").show
+----+---+-------+----+-----+
|item|loc|sum_qty|year|month|
+----+---+-------+----+-----+
| A|IND| 0|2019| 13|
| A|IND| 10|2020| 1|
| A|IND| 30|2020| 2|
| A|IND| 40|2020| 3|
| A|IND| 50|2020| 5|
| A|IND| 90|2020| 6|
+----+---+-------+----+-----+
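For reference, the same window can be written with the DataFrame API (a PySpark sketch, assuming the same df and an active SparkSession; the coalesce handles the first row, which has no preceding rows in range):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One increasing index per (year, month); 13 months per year matches the
# sample data, which contains month 13.
month_index = (F.col("year") - 2000) * 13 + F.col("month")

w = (Window.partitionBy("item")
     .orderBy(month_index)
     .rangeBetween(-3, -1))  # the previous three months, excluding the current one

result = df.select(
    "item",
    "loc",
    F.coalesce(F.sum("qty").over(w), F.lit(0)).alias("sum_qty"),
    "year",
    "month",
)
result.show()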

Related

How to calculate cumulative sum based on months in a pandas dataframe?

I want to calculate the cumulative sum of values in a pandas dataframe column based on months.
code:
import pandas as pd
import numpy as np

data = {'month': ['April', 'May', 'June', 'July', 'August', 'September',
                  'October', 'November', 'December', 'January', 'February', 'March'],
        'kpi': ['sales', 'sales quantity', 'sales', 'sales', 'sales', 'sales',
                'sales', 'sales quantity', 'sales', 'sales', 'sales', 'sales'],
        're_o': [1, 1, 1, 11, 11, 11, 12, 12, 12, 13, 13, 13]}

# Create DataFrame
df = pd.DataFrame(data)

df['Q-Total'] = 0
df['Q-Total'] = np.where((df['month'] == 'April') | (df['month'] == 'May') | (df['month'] == 'June'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
df['Q-Total'] = np.where((df['month'] == 'July') | (df['month'] == 'August') | (df['month'] == 'September'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
df['Q-Total'] = np.where((df['month'] == 'October') | (df['month'] == 'November') | (df['month'] == 'December'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
df['Q-Total'] = np.where((df['month'] == 'January') | (df['month'] == 'February') | (df['month'] == 'March'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
print(df)
My required output is given below:
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 11
4 August sales 11 22
5 September sales 11 33
6 October sales 12 12
7 November sales quantity 12 12
8 December sales 12 24
9 January sales 13 13
10 February sales 13 26
11 March sales 13 39
But when I run this code, I get an output like below:
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 13
4 August sales 11 24
5 September sales 11 35
6 October sales 12 47
7 November sales quantity 12 13
8 December sales 12 59
9 January sales 13 72
10 February sales 13 85
11 March sales 13 98
I want to calculate the cumulative sum in the following manner:
If the months are April, May and June, take the cumulative sum only from April, May and June.
If the months are July, August and September, take the cumulative sum only from July, August and September.
If the months are October, November and December, take the cumulative sum only from October, November and December.
If the months are January, February and March, take the cumulative sum only from January, February and March.
Can anyone suggest a solution?
You can create quarterly periods for the groups and then use GroupBy.cumsum:
g = pd.to_datetime(df['month'], format='%B').dt.to_period('Q')
df['Q-Total'] = df.groupby([g,'kpi'])['re_o'].cumsum()
print (df)
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 11
4 August sales 11 22
5 September sales 11 33
6 October sales 12 12
7 November sales quantity 12 12
8 December sales 12 24
9 January sales 13 13
10 February sales 13 26
11 March sales 13 39
Details:
print (df.assign(q = g))
month kpi re_o Q-Total q
0 April sales 1 1 1900Q2
1 May sales quantity 1 1 1900Q2
2 June sales 1 2 1900Q2
3 July sales 11 11 1900Q3
4 August sales 11 22 1900Q3
5 September sales 11 33 1900Q3
6 October sales 12 12 1900Q4
7 November sales quantity 12 12 1900Q4
8 December sales 12 24 1900Q4
9 January sales 13 13 1900Q1
10 February sales 13 26 1900Q1
11 March sales 13 39 1900Q1
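A quick illustration of the conversion used to build g (the year defaults to 1900 because only the month name is parsed):
import pandas as pd

# %B parses a bare month name; the missing year defaults to 1900,
# and to_period('Q') then yields the quarter of that 1900 date.
ts = pd.to_datetime('April', format='%B')
print(ts)                 # 1900-04-01 00:00:00
print(ts.to_period('Q'))  # 1900Q2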
You can define custom groups from a list of lists:
groups = [['January', 'February', 'March'],
['April', 'May', 'June'],
['July', 'August', 'September'],
['October', 'November', 'December'],
]
# make mapper
d = {k:v for v,l in enumerate(groups) for k in l}
df['Q-Total'] = df.groupby([df['month'].map(d), 'kpi'])['re_o'].cumsum()
output:
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 11
4 August sales 11 22
5 September sales 11 33
6 October sales 12 12
7 November sales quantity 12 12
8 December sales 12 24
9 January sales 13 13
10 February sales 13 26
11 March sales 13 39
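For clarity, the dict comprehension simply flattens the list of lists into a month-name to group-index mapper; a small illustrative check:
groups = [['January', 'February', 'March'],
          ['April', 'May', 'June'],
          ['July', 'August', 'September'],
          ['October', 'November', 'December']]
# Each month name maps to the index of the sub-list it belongs to.
d = {k: v for v, l in enumerate(groups) for k in l}
print(d['January'], d['April'], d['December'])  # prints: 0 1 3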

How to sum by month in timestamp Data Frame?

I have a dataframe like this:
trx_date    trx_amount
2013-02-11  35
2014-03-10  26
2011-02-9   10
2013-02-12  5
2013-01-11  21
How do I group this by month and year so that I can sum the trx_amount?
Example expected output:
trx_monthly  trx_sum
2013-02      40
2013-01      21
2014-02      35
You can convert values to month periods by Series.dt.to_period and then aggregate sum:
df['trx_date'] = pd.to_datetime(df['trx_date'])
df1 = (df.groupby(df['trx_date'].dt.to_period('m').rename('trx_monthly'))['trx_amount']
.sum()
.reset_index(name='trx_sum'))
print (df1)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert datetimes to strings in format YYYY-MM by Series.dt.strftime:
df2 = (df.groupby(df['trx_date'].dt.strftime('%Y-%m').rename('trx_monthly'))['trx_amount']
.sum()
.reset_index(name='trx_sum'))
print (df2)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert to year and month separately; the output is then different - 3 columns:
df2 = (df.groupby([df['trx_date'].dt.year.rename('year'),
df['trx_date'].dt.month.rename('month')])['trx_amount']
.sum()
.reset_index(name='trx_sum'))
print (df2)
year month trx_sum
0 2011 2 10
1 2013 1 21
2 2013 2 40
3 2014 3 26
You can try this (note it groups by calendar month only, ignoring the year):
df['trx_date'] = pd.to_datetime(df['trx_date'])
df['trx_month'] = df['trx_date'].dt.month
df_agg = df.groupby('trx_month')['trx_amount'].sum()
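For reference, a self-contained run of the first approach on the sample data from the question (a sketch; the date strings are parsed first):
import pandas as pd

df = pd.DataFrame({
    'trx_date': ['2013-02-11', '2014-03-10', '2011-02-9', '2013-02-12', '2013-01-11'],
    'trx_amount': [35, 26, 10, 5, 21],
})
df['trx_date'] = pd.to_datetime(df['trx_date'])

# Group by the monthly period of each date and sum the amounts.
df1 = (df.groupby(df['trx_date'].dt.to_period('m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print(df1)
#   trx_monthly  trx_sum
# 0     2011-02       10
# 1     2013-01       21
# 2     2013-02       40
# 3     2014-03       26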

Handle ValueError while creating date in pd

I'm reading a csv file with p, day and month, and putting it into a df. The goal is to create a date from day, month and the current year, and I run into this error for the 29th of Feb:
ValueError: cannot assemble the datetimes: day is out of range for month
When this error occurs, I would like to replace the day with the day before. How can we do that? Below are a few lines of my df; datex at the end is what I would like to get:
p day month year datex
0 p1 29 02 2021 28Feb-2021
1 p2 18 07 2021 18Jul-2021
2 p3 12 09 2021 12Sep-2021
Right now, my code for the date is only the line below, so I have NaT where the date doesn't exist.
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
You could try something like this :
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
Indeed, you get NaT:
p day year month datex
0 p1 29 2021 2 NaT
1 p2 18 2021 7 2021-07-18
2 p3 12 2021 9 2021-09-12
You could then handle these NaT rows as a special case:
df.loc[df.datex.isnull(), 'previous_day'] = df.day - 1
p day year month datex previous_day
0 p1 29 2021 2 NaT 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
df.loc[df.datex.isnull(), 'datex'] = pd.to_datetime(df[['previous_day', 'year', 'month']].rename(columns={'previous_day': 'day'}))
p day year month datex previous_day
0 p1 29 2021 2 2021-02-28 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
You have to create a new day column if you want to keep day = 29 in the day column.
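A compact variant of the same idea (a sketch, assuming a single retry with day - 1 is enough, i.e. the only failures are 29-Feb-style out-of-range days):
import pandas as pd

df = pd.DataFrame({'p': ['p1', 'p2', 'p3'],
                   'day': [29, 18, 12],
                   'month': [2, 7, 9],
                   'year': [2021, 2021, 2021]})

# First pass: invalid combinations such as 29 Feb 2021 become NaT.
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')

# Second pass: retry only the failed rows with the previous day.
mask = df['datex'].isna()
retry = df.loc[mask, ['year', 'month', 'day']].assign(day=lambda x: x['day'] - 1)
df.loc[mask, 'datex'] = pd.to_datetime(retry, errors='coerce')
print(df)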

Pandas: How to create a DateTime index

There is a Pandas DataFrame as follows:
year month count
0 2014 Jan 12
1 2014 Feb 10
2 2015 Jan 12
3 2015 Feb 10
How to create a DateTime index from 'year' and 'month', so the result would be:
count
2014.01.31 12
2014.02.28 10
2015.01.31 12
2015.02.28 10
Use to_datetime with DataFrame.pop to use and remove the columns, then add offsets.MonthEnd:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.offsets.MonthEnd()
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
Or:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.to_timedelta(dates.dt.daysinmonth - 1, unit='d')
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
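If an actual end-of-month timestamp is not required, a monthly PeriodIndex is a lighter alternative (a sketch built from the same sample data):
import pandas as pd

df = pd.DataFrame({'year': [2014, 2014, 2015, 2015],
                   'month': ['Jan', 'Feb', 'Jan', 'Feb'],
                   'count': [12, 10, 12, 10]})

# Parse year + month-name to the first of the month, then keep a monthly period.
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates.dt.to_period('M')
print(df)
#          count
# 2014-01     12
# 2014-02     10
# 2015-01     12
# 2015-02     10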

Excel Vlookup a total that equals to the sum of some rows in an array

2 tables:

day    | product  | detail amount
Sunday | water    | 9
Sunday | mango    | 10
Sunday | onion    | 15
Sunday | cocacola | 3
Sunday | tomato   | 14

day    | place       | total amount
Sunday | grocery     | 39
Sunday | supermarket | 12

I want this result:

day    | product  | detail amount | place
Sunday | water    | 9             | supermarket
Sunday | mango    | 10            | grocery
Sunday | onion    | 15            | grocery
Sunday | cocacola | 3             | supermarket
Sunday | tomato   | 14            | grocery

The logic of it is:
1. Only on Sunday.
2. Search for the numbers in the details table that sum to the amounts in the totals table (see the sketch below).
3. The two tables are always equal in grand total.
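Outside Excel, the matching logic can be sketched as a brute-force subset search (illustrative Python only; it assumes small tables and that every detail row belongs to exactly one place):
from itertools import combinations

details = [('water', 9), ('mango', 10), ('onion', 15), ('cocacola', 3), ('tomato', 14)]
totals = {'grocery': 39, 'supermarket': 12}

# Try every subset of detail rows against the first total; the remaining
# rows must then add up to the second total.
place_a, place_b = totals
assignment = {}
for r in range(len(details) + 1):
    for combo in combinations(details, r):
        rest = [row for row in details if row not in combo]
        if (sum(amount for _, amount in combo) == totals[place_a]
                and sum(amount for _, amount in rest) == totals[place_b]):
            assignment = {name: place_a for name, _ in combo}
            assignment.update({name: place_b for name, _ in rest})
            break
    if assignment:
        break

for product, amount in details:
    print('Sunday', product, amount, assignment[product])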
