How to calculate YTD (Year to Date) value using a Pandas DataFrame?

I want to calculate YTD for each month using a pandas DataFrame. I have two measurements, named sales and sales rate. For the sales measurement, YTD is calculated by taking the cumulative sum. The code is given below:
report_table['ytd_value'] = report_table.groupby(['financial_year', 'measurement', 'place', 'market', 'product'], sort=False)['value'].cumsum()
But in the case of the sales rate measurement, YTD is calculated in a different way.
The YTD calculation for sales rate is explained below:
First month (April) YTD value of the financial year = first month (April) value of the financial year.
From the second month of the financial year onwards, the YTD value is calculated using this formula:
May YTD value = ((April YTD value (sales) * April YTD value (sales rate)) + (April value (sales) * April value (sales rate))) / (April value (sales) + April value (sales rate))
Similarly for the other months. The DataFrame is given below.
import pandas as pd
data = {'Month': ['April', 'May', 'April', 'June', 'April', 'May'],
        'Year': [2022, 2022, 2022, 2022, 2022, 2022],
        'Financial_Year': [2023, 2023, 2023, 2023, 2023, 2023],
        'Measurement': ['sales', 'sales', 'sales', 'sales', 'sales rate', 'sales rate'],
        'Place': ['Delhi', 'Delhi', 'Delhi', 'Delhi', 'Delhi', 'Delhi'],
        'Market': ['Domestic', 'Domestic', 'Export', 'Domestic', 'Domestic', 'Domestic'],
        'Product': ['Biscuit', 'Biscuit', 'Chocolate', 'Biscuit', 'Biscuit', 'Biscuit'],
        'Value': ['10', '10', '20', '25', '10', '20']}
# Create DataFrame
df = pd.DataFrame(data)
df['Value'] = df['Value'].astype(float)
df['ytd_value'] = df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['Value'].cumsum()
This calculates ytd_value for both the sales and sales rate measurements, but I want to calculate ytd_value for sales rate in the way described above.
I have tried the code below, but it raises an error:
rslt_df = df[(df['Measurement'] == 'sales')]
df.loc[df['Measurement'] == "sales rate", 'ytd_value'] = (
    (df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['ytd_value']
     * rslt_df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['ytd_value']
     + df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['Value']
     * rslt_df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['Value'])
    / (rslt_df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['ytd_value']
       + rslt_df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['Value'])
)
Expected output:
Month Year Financial_Year ... Product Value ytd_value
0 April 2022 2023 ... Biscuit 10.0 10.0
1 May 2022 2023 ... Biscuit 10.0 20.0
2 April 2022 2023 ... Chocolate 20.0 20.0
3 June 2022 2023 ... Biscuit 25.0 45.0
4 April 2022 2023 ... Biscuit 10.0 10.0
5 May 2022 2023 ... Biscuit 20.0 10.0
Can anyone help me to solve this calculation?

I recommend you change your dataframe around a bit:
Month Year Financial_Year Place Market Product Sales Sales Rate
0 April 2022 2023 Delhi Domestic Biscuit 10.0 10.0
1 May 2022 2023 Delhi Domestic Biscuit 10.0 20.0
2 June 2022 2023 Delhi Domestic Biscuit 25.0 0.0
You may be able to get here by aggregating the sales values across each month, but the point is that you have a single Sales value and Sales Rate value for each month.
Once you have this, you can set the YTD value for April, and then iterate through the following months to calculate their values.
I think there's an error in the formula you posted for YTD calculations, but using that as is, here's some sample code:
import pandas as pd
data = {'Month': ['April', 'May', 'June'],
        'Year': [2022, 2022, 2022],
        'Financial_Year': [2023, 2023, 2023],
        'Place': ['Delhi', 'Delhi', 'Delhi'],
        'Market': ['Domestic', 'Domestic', 'Domestic'],
        'Product': ['Biscuit', 'Biscuit', 'Biscuit'],
        'Sales': [10, 10, 25],
        'Sales Rate': [10, 20, 0]}
# Create DataFrame
df = pd.DataFrame(data)
df['Sales'] = df['Sales'].astype(float)
df['Sales Rate'] = df['Sales Rate'].astype(float)
df['YTD'] = 0.0
df.at[0,'YTD'] = df.iloc[0]['Sales']
for rowidx in range(1, len(df)):
    prevrow = df.iloc[rowidx - 1]
    tmp = prevrow['Sales'] * prevrow['Sales Rate']
    df.at[rowidx, 'YTD'] = tmp + tmp / tmp
print(df)
This outputs, for example:
Month Year Financial_Year Place Market Product Sales Sales Rate YTD
0 April 2022 2023 Delhi Domestic Biscuit 10.0 10.0 10.0
1 May 2022 2023 Delhi Domestic Biscuit 10.0 20.0 101.0
2 June 2022 2023 Delhi Domestic Biscuit 25.0 0.0 201.0
You should be able to use this as an example to implement the correct function to calculate the YTD values.
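If the intended sales-rate YTD is actually a sales-weighted running average of the rate, which is one plausible reading of the posted formula, the whole calculation vectorizes without a loop. This is an assumption about the intent, not the formula as written:

```python
import pandas as pd

# Hypothetical reading: YTD rate = cumulative(sales * rate) / cumulative(sales)
df = pd.DataFrame({
    'Month': ['April', 'May', 'June'],
    'Sales': [10.0, 10.0, 25.0],
    'Sales Rate': [10.0, 20.0, 0.0],
})

cum_sales = df['Sales'].cumsum()
cum_weighted = (df['Sales'] * df['Sales Rate']).cumsum()
df['YTD Rate'] = cum_weighted / cum_sales
print(df)
```

With this reading, April gives 10.0 (equal to the April rate, as the first-month rule requires), May gives (100 + 200) / 20 = 15.0, and June gives 300 / 45 ≈ 6.67.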

Related

how to get employee count by Hour and Date using pySpark / python?

I have employee IDs and their clock-in and clock-out timings by day. I want to calculate the number of employees present in the office by hour, by date.
Example Data
import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
         'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24', '12/23/2021 22:45', '12/23/2021 23:29'],
         'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37', '12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)
Example of output
import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021', '3/30/2021', '3/30/2021', '3/30/2021', '12/23/2021', '12/23/2021', '12/24/2021', '12/24/2021'],
         'Hour': ['01:00', '01:00', '02:00', '03:00', '04:00', '05:00', '22:00', '23:00', '01:00', '02:00'],
         'emp_count': [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]}
df2 = pd.DataFrame(data2)
Try this:
# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")
# Generate time series at hourly frequency between adjusted clock in and clock
# out time
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()
# Final result
hours.groupby(hours).count()
Result:
2021-03-30 02:00:00 1
2021-03-30 03:00:00 1
2021-03-30 04:00:00 1
2021-03-30 05:00:00 1
2021-08-07 01:00:00 1
2021-12-05 01:00:00 1
2021-12-05 02:00:00 1
2021-12-05 03:00:00 1
2021-12-05 04:00:00 1
2021-12-23 23:00:00 1
2021-12-24 00:00:00 2
2021-12-24 01:00:00 2
2021-12-24 02:00:00 1
dtype: int64
It's slightly different from your expected output but consistent with your business rules.
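If you need the result shaped like df2, with separate Date and Hour columns, the grouped counts can be reformatted afterwards. A sketch building on the same floor/ceil rounding (assumes pandas ≥ 1.4 for the inclusive keyword):

```python
import pandas as pd

df1 = pd.DataFrame({
    'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
    'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24',
                '12/23/2021 22:45', '12/23/2021 23:29'],
    'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37',
                 '12/24/2021 0:42', '12/24/2021 1:42'],
})

clock_in = pd.to_datetime(df1['Clockin']).dt.floor('H')
clock_out = pd.to_datetime(df1['Clockout']).dt.ceil('H')
hours = pd.Series(
    [pd.date_range(in_, out_, freq='H', inclusive='right')
     for in_, out_ in zip(clock_in, clock_out)]
).explode()
counts = hours.groupby(hours).count()

# Split the hourly timestamps back into Date / Hour columns
idx = pd.DatetimeIndex(counts.index)
result = pd.DataFrame({
    'Date': idx.strftime('%m/%d/%Y'),
    'Hour': idx.strftime('%H:00'),
    'emp_count': counts.values,
})
print(result)
```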

How to calculate cumulative sum based on months in a pandas dataframe?

I want to calculate the cumulative sum of values in a pandas DataFrame column based on months.
Code:
import pandas as pd
import numpy as np
data = {'month': ['April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December', 'January', 'February', 'March'],
        'kpi': ['sales', 'sales quantity', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales quantity', 'sales', 'sales', 'sales', 'sales'],
        're_o': [1, 1, 1, 11, 11, 11, 12, 12, 12, 13, 13, 13]
        }
# Create DataFrame
df = pd.DataFrame(data)
df['Q-Total'] = 0
df['Q-Total'] = np.where((df['month'] == 'April') | (df['month'] == 'May') | (df['month'] == 'June'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
df['Q-Total'] = np.where((df['month'] == 'July') | (df['month'] == 'August') | (df['month'] == 'September'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
df['Q-Total'] = np.where((df['month'] == 'October') | (df['month'] == 'November') | (df['month'] == 'December'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
df['Q-Total'] = np.where((df['month'] == 'January') | (df['month'] == 'February') | (df['month'] == 'March'),
                         df.groupby(['kpi'], sort=False)['re_o'].cumsum(), df['Q-Total'])
print(df)
My required output is given below:
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 11
4 August sales 11 22
5 September sales 11 33
6 October sales 12 12
7 November sales quantity 12 12
8 December sales 12 24
9 January sales 13 13
10 February sales 13 26
11 March sales 13 39
But when I run this code, I get the output below:
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 13
4 August sales 11 24
5 September sales 11 35
6 October sales 12 47
7 November sales quantity 12 13
8 December sales 12 59
9 January sales 13 72
10 February sales 13 85
11 March sales 13 98
I want to calculate the cumulative sum in the following manner:
If the months are April, May and June, take the cumulative sum only over April, May and June.
If the months are July, August and September, take the cumulative sum only over July, August and September.
If the months are October, November and December, take the cumulative sum only over October, November and December.
If the months are January, February and March, take the cumulative sum only over January, February and March.
Can anyone suggest a solution?
You can create quarter periods for the groups and then use GroupBy.cumsum:
g = pd.to_datetime(df['month'], format='%B').dt.to_period('Q')
df['Q-Total'] = df.groupby([g,'kpi'])['re_o'].cumsum()
print (df)
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 11
4 August sales 11 22
5 September sales 11 33
6 October sales 12 12
7 November sales quantity 12 12
8 December sales 12 24
9 January sales 13 13
10 February sales 13 26
11 March sales 13 39
Details:
print (df.assign(q = g))
month kpi re_o Q-Total q
0 April sales 1 1 1900Q2
1 May sales quantity 1 1 1900Q2
2 June sales 1 2 1900Q2
3 July sales 11 11 1900Q3
4 August sales 11 22 1900Q3
5 September sales 11 33 1900Q3
6 October sales 12 12 1900Q4
7 November sales quantity 12 12 1900Q4
8 December sales 12 24 1900Q4
9 January sales 13 13 1900Q1
10 February sales 13 26 1900Q1
11 March sales 13 39 1900Q1
You can define custom groups from a list of lists:
groups = [['January', 'February', 'March'],
          ['April', 'May', 'June'],
          ['July', 'August', 'September'],
          ['October', 'November', 'December'],
          ]
# make mapper
d = {k:v for v,l in enumerate(groups) for k in l}
df['Q-Total'] = df.groupby([df['month'].map(d), 'kpi'])['re_o'].cumsum()
output:
month kpi re_o Q-Total
0 April sales 1 1
1 May sales quantity 1 1
2 June sales 1 2
3 July sales 11 11
4 August sales 11 22
5 September sales 11 33
6 October sales 12 12
7 November sales quantity 12 12
8 December sales 12 24
9 January sales 13 13
10 February sales 13 26
11 March sales 13 39
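As a further variant (a sketch, not from the answers above), the quarter group can be derived directly from the month number using the standard library's calendar module, with no datetime parsing or hand-written month lists:

```python
import calendar
import pandas as pd

df = pd.DataFrame({
    'month': ['April', 'May', 'June', 'July', 'August', 'September',
              'October', 'November', 'December', 'January', 'February', 'March'],
    'kpi': ['sales', 'sales quantity', 'sales', 'sales', 'sales', 'sales',
            'sales', 'sales quantity', 'sales', 'sales', 'sales', 'sales'],
    're_o': [1, 1, 1, 11, 11, 11, 12, 12, 12, 13, 13, 13],
})

# Map month name -> 1..12, then bucket into calendar quarters 0..3
month_num = {name: i for i, name in enumerate(calendar.month_name) if name}
quarter = (df['month'].map(month_num) - 1) // 3
df['Q-Total'] = df.groupby([quarter, 'kpi'])['re_o'].cumsum()
print(df)
```

This reproduces the required output, since cumsum restarts within each (quarter, kpi) group.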

Group By Quarterly Avg and Get Values That Were Used in Avg Calculation -pandas

I have a df like this,
time value
0 2019-07-30 124.00
1 2019-07-19 123.00
2 2019-08-28 191.46
3 2019-10-25 181.13
4 2019-11-01 24.23
5 2019-11-13 340.00
6 2020-01-01 36.12
7 2020-01-25 56.12
8 2020-01-30 121.00
9 2020-02-04 115.62
10 2020-02-06 63.62
I want to group by quarterly average and get the values that were used in average calculation. Something like below.
Year Quarter Values Avg
2019 Q3 124, 123, 191 146
2019 Q4 181.13, 24.23, 340 181.78
2020 Q1 36.12, 26.12, 121, 115.62, 63.62 72.96
How can I achieve my desired result?
Use GroupBy.agg with quarter periods created by Series.dt.quarter, joining the values converted to strings and taking the mean in named aggregations:
df['time'] = pd.to_datetime(df['time'])
df1 = (df.assign(Year = df['time'].dt.year,
                 Q = 'Q' + df['time'].dt.quarter.astype(str),
                 vals = df['value'].astype(str))
         .groupby(['Year','Q'])
         .agg(Values=('vals', ', '.join), Avg = ('value','mean'))
         .reset_index())
print (df1)
Year Q Values Avg
0 2019 Q3 124.0, 123.0, 191.46 146.153333
1 2019 Q4 181.13, 24.23, 340.0 181.786667
2 2020 Q1 36.12, 56.12, 121.0, 115.62, 63.62 78.496000
EDIT:
df['time'] = pd.to_datetime(df['time'])
df1 = (df.groupby(df['time'].dt.to_period('Q').rename('YearQ'))['value']
         .agg([('Values', lambda x: ', '.join(x.astype(str))), ('Avg', 'mean')])
         .reset_index()
         .assign(Year = lambda x: x['YearQ'].dt.year,
                 Q = lambda x: 'Q' + x['YearQ'].dt.quarter.astype(str))
         .reindex(['Year', 'Q', 'Values', 'Avg'], axis=1))
print (df1)
Year Q Values Avg
0 2019 Q3 124.0, 123.0, 191.46 146.153333
1 2019 Q4 181.13, 24.23, 340.0 181.786667
2 2020 Q1 36.12, 56.12, 121.0, 115.62, 63.62 78.496000
Create a grouper, groupby and reshape the index to year and quarter:
grouper = pd.Grouper(key='time', freq='Q')
res = (df
       .assign(temp = df.value.astype(str))
       .groupby(grouper)
       .agg(Values=('temp', ','.join),
            Avg = ('value', np.mean))
       )
res.index = [res.index.year, 'Q' + res.index.quarter.astype(str)]
res.index = res.index.set_names(['Year','Quarter'])
Values Avg
Year Quarter
2019 Q3 123.0,124.0,191.46 146.153333
Q4 181.13,24.23,340.0 181.786667
2020 Q1 36.12,56.12,121.0,115.62,63.62 78.496000
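If downstream code needs the underlying numbers rather than a display string, a variation on the same groupby (a sketch, not from the original answers) can collect the raw values into lists instead of joining them:

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['2019-07-30', '2019-07-19', '2019-08-28', '2019-10-25',
             '2019-11-01', '2019-11-13', '2020-01-01', '2020-01-25',
             '2020-01-30', '2020-02-04', '2020-02-06'],
    'value': [124.00, 123.00, 191.46, 181.13, 24.23, 340.00,
              36.12, 56.12, 121.00, 115.62, 63.62],
})
df['time'] = pd.to_datetime(df['time'])

# Values stays a Python list per quarter; Avg is the quarterly mean
out = (df.groupby(df['time'].dt.to_period('Q'))['value']
         .agg(Values=list, Avg='mean')
         .reset_index())
print(out)
```

Keeping lists avoids the string round-trip if the per-quarter values are needed for further computation.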

Python Calculate New Date Based on Date Range

I have a Python Pandas DataFrame containing birth dates of hockey players that looks like this:
Player Birth Year Birth Date
Player A 1990 1990-05-12
Player B 1991 1991-10-30
Player C 1992 1992-09-10
Player D 1990 1990-11-15
I want to create a new column labeled 'Draft Year' that is calculated based on this rule:
If MM-DD is before 09-15, Draft Year = Birth Year + 18
Else if MM-DD is after 09-15 Draft Year = Birth Year + 19
This would make the output from the example:
Player Birth Year Birth Date Draft Year
Player A 1990 1990-05-12 2008
Player B 1991 1991-10-30 2010
Player C 1992 1992-09-10 2010
Player D 1990 1990-11-15 2009
I've tried separating the MM-DD from the date format by using
Data['Birth Date'] = Data['Birth Date'].str.split('-').str[1:]
But that returns me a list of [mm, dd] which is tricky to work with. Any suggestions on how to do this concisely would be greatly appreciated!
Use numpy.where:
data['Birth Date'] = pd.to_datetime(data['Birth Date'])  # convert to datetime
cond = (data['Birth Date'].dt.month >= 9) & (data['Birth Date'].dt.day >= 15)
cond2 = (data['Birth Date'].dt.month >= 10)
data['Draft Year'] = np.where(cond | cond2, data['Birth Year'] + 19, data['Birth Year'] + 18)
print(data)
Output
Player Birth Year Birth Date Draft Year
0 PlayerA 1990 1990-05-12 2008
1 PlayerB 1991 1991-10-30 2010
2 PlayerC 1992 1992-09-10 2010
3 PlayerD 1990 1990-11-15 2009
Dates in the form yyyy-mm-dd are sortable as strings. This solution takes advantage of that fact:
df['Draft Year'] = df['Birth Year'] + np.where(df['Birth Date'].dt.strftime('%m-%d') < '09-15', 18, 19)
Quick and Dirty
Make a column that is 100 * the month and add it to the day
cutoff = df['Birth Date'].pipe(lambda d: d.dt.month * 100 + d.dt.day)
df['Draft Year'] = df['Birth Year'] + 18 + (cutoff > 915)
df
Player Birth Year Birth Date Draft Year
0 Player A 1990 1990-05-12 2008
1 Player B 1991 1991-10-30 2010
2 Player C 1992 1992-09-10 2010
3 Player D 1990 1990-11-15 2009
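As a sketch, the string-comparison one-liner above can be put into a self-contained script (assuming Birth Date is parsed to datetime first, since .dt.strftime requires a datetime column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C', 'Player D'],
    'Birth Year': [1990, 1991, 1992, 1990],
    'Birth Date': ['1990-05-12', '1991-10-30', '1992-09-10', '1990-11-15'],
})
df['Birth Date'] = pd.to_datetime(df['Birth Date'])

# '%m-%d' strings compare correctly because both parts are zero-padded
df['Draft Year'] = df['Birth Year'] + np.where(
    df['Birth Date'].dt.strftime('%m-%d') < '09-15', 18, 19)
print(df)
```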

How do I change the date format for a Pandas Index?

I'm loading some time series data in the following way:
snp = web.DataReader("^GSPC", 'yahoo', start, end)['Adj Close']
The index is then automatically formatted as 'datetime64[ns]'
I then resample the daily data to yearly like this:
snp_yr = snp.resample('A')
The date formatting is still the same as described above. How do I change this into the year only (%Y)?
E.g. from '2015-12-31 00:00:00' to '2015'.
I think you need DatetimeIndex.year, and if you then need to convert to string, add astype:
df.index = df.index.year
Sample:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10, freq='3M')
df = pd.DataFrame({'a': range(10)},index=rng)
print (df)
a
2015-02-28 0
2015-05-31 1
2015-08-31 2
2015-11-30 3
2016-02-29 4
2016-05-31 5
2016-08-31 6
2016-11-30 7
2017-02-28 8
2017-05-31 9
df.index = df.index.year.astype(str)
print (df)
a
2015 0
2015 1
2015 2
2015 3
2016 4
2016 5
2016 6
2016 7
2017 8
2017 9
print (df.index)
Index(['2015', '2015', '2015', '2015', '2016', '2016', '2016', '2016', '2017',
'2017'],
dtype='object')
Another solution with strftime:
df.index = df.index.strftime('%Y')
print (df)
a
2015 0
2015 1
2015 2
2015 3
2016 4
2016 5
2016 6
2016 7
2017 8
2017 9
print (df.index)
Index(['2015', '2015', '2015', '2015', '2016', '2016', '2016', '2016', '2017',
'2017'],
dtype='object')
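Tying this back to the question's resample step: in recent pandas versions, resample('A') returns a Resampler object and needs an aggregation such as .last() before the index can be reduced to years. A sketch with synthetic data standing in for the Yahoo download:

```python
import pandas as pd

# Synthetic daily series standing in for the 'Adj Close' download
idx = pd.date_range('2015-01-01', '2017-12-31', freq='D')
snp = pd.Series(range(len(idx)), index=idx)

snp_yr = snp.resample('A').last()   # one value per year-end
snp_yr.index = snp_yr.index.year    # 2015-12-31 00:00:00 -> 2015
print(snp_yr)
```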
