How to fill empty cell value in pandas with condition - python-3.x

My sample dataset is below. Actual data is available through 2020.
Item Year Amount final_sales
A1 2016 123 400
A2 2016 23 40
A3 2016 6
A4 2016 10 100
A5 2016 5 200
A1 2017 123 400
A2 2017 23
A3 2017 6
A4 2017 10
A5 2017 5 200
I have to extrapolate the final_sales column for 2017 (and subsequent years) from 2016 for every Item where the 2017 value is not available.
In the dataset above, final_sales is missing for A2 and A4 in 2017 but available for 2016. How can I bring in the 2016 final_sales value when the corresponding year's value is missing?
Expected results are below. Thanks.
Item Year Amount final_sales
A1 2016 123 400
A2 2016 23 40
A3 2016 6
A4 2016 10 100
A5 2016 5 200
A1 2017 123 400
A2 2017 23 40
A3 2017 6
A4 2017 10 100
A5 2017 5 200

It looks like you want to fill forward where data is missing.
You can do this with ffill, which is available on pandas groupby objects. Older answers use fillna(method='pad'), which does the same thing but is deprecated in recent pandas versions.
In your case, you only want to fill forward within each item, so first group by Item and then fill forward. The fill carries values forward in row order, which is why we sort by Year first.
df['final_sales'] = df.sort_values('Year').groupby('Item')['final_sales'].ffill()
Note that on your example data, A3 is missing for 2016 as well, so there is nothing to carry forward and it remains missing for 2017.
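To make the role of the sort concrete, here is a minimal, self-contained sketch built from the question's sample data. The row reversal is only an assumption added for illustration: it shows that sorting by Year before the group-wise fill gives the expected result even when the rows do not arrive in year order, because the assignment aligns on the index.

```python
import pandas as pd

NA = float("nan")  # marks a missing final_sales value

# Sample data from the question
df = pd.DataFrame({
    "Item": ["A1", "A2", "A3", "A4", "A5"] * 2,
    "Year": [2016] * 5 + [2017] * 5,
    "Amount": [123, 23, 6, 10, 5] * 2,
    "final_sales": [400, 40, NA, 100, 200, 400, NA, NA, NA, 200],
})

# Reverse the rows (illustration only) so that 2017 comes before 2016
df = df.iloc[::-1]

# Sort by Year, forward-fill within each Item; the assignment aligns on the index
df["final_sales"] = df.sort_values("Year").groupby("Item")["final_sales"].ffill()

# Restore the original row order for display
df = df.sort_index()
print(df)
```

A2 and A4 pick up their 2016 values for 2017, while A3 stays missing because it has no 2016 value to carry forward.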

GroupBy.ffill works for me; the Year column only needs to be sorted, as it is in the question's sample data:
# sort by both columns if necessary
df = df.sort_values(['Year', 'Item'])
df['final_sales'] = df.groupby('Item')['final_sales'].ffill()
print(df)
Item Year Amount final_sales
0 A1 2016 123 400.0
1 A2 2016 23 40.0
2 A3 2016 6 NaN
3 A4 2016 10 100.0
4 A5 2016 5 200.0
5 A1 2017 123 400.0
6 A2 2017 23 40.0
7 A3 2017 6 NaN
8 A4 2017 10 100.0
9 A5 2017 5 200.0

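Something like this?
import pandas as pd

def fill_final(x):
    # Keep the row's own value when it is already present
    if pd.notna(x['final_sales']):
        return x['final_sales']
    # Otherwise take the 2016 value for the same Item (if there is one)
    base = df[(df['Year'] == 2016) & (df['Item'] == x['Item'])]['final_sales']
    return base.iloc[0] if not base.empty else x['final_sales']

df['final_sales'] = df.apply(fill_final, axis=1)
I did not test this, but it should set you on the right path.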
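The per-row lookup above can also be written without apply. The following is a sketch under the assumption that each Item appears at most once in 2016; the name base_2016 is introduced here for illustration and is not from the original answer.

```python
import pandas as pd

NA = float("nan")  # marks a missing final_sales value

# Sample data from the question
df = pd.DataFrame({
    "Item": ["A1", "A2", "A3", "A4", "A5"] * 2,
    "Year": [2016] * 5 + [2017] * 5,
    "Amount": [123, 23, 6, 10, 5] * 2,
    "final_sales": [400, 40, NA, 100, 200, 400, NA, NA, NA, 200],
})

# Lookup table: Item -> 2016 final_sales (assumes one 2016 row per Item)
base_2016 = df.loc[df["Year"] == 2016].set_index("Item")["final_sales"]

# Fill only the missing values with the matching Item's 2016 value
df["final_sales"] = df["final_sales"].fillna(df["Item"].map(base_2016))
print(df)
```

Unlike a forward fill, this always falls back to the 2016 value, even for later years, which may or may not be what is wanted once data beyond 2017 is involved.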

Related

Excel function to dynamically SUM UP data based on matching rows and columns

I have a table with metrics shown as rows and months shown as columns.
An example is below:
Quarter  2022-01-01  2022-01-01  2022-01-01  2022-04-01  2022-04-01  2022-04-01  2022-07-01  2022-07-01  2022-07-01  2022-10-01  2022-10-01  2022-10-01
Month    2022-01-01  2022-02-01  2022-03-01  2022-04-01  2022-05-01  2022-06-01  2022-07-01  2022-08-01  2022-09-01  2022-10-01  2022-11-01  2022-12-01
Metrics  Jan 2022    Feb 2022    Mar 2022    Apr 2022    May 2022    Jun 2022    Jul 2022    Aug 2022    Sep 2022    Oct 2022    Nov 2022    Dec 2022
Revenue  1000        1000        1000        500         500         500         100         100         100         0           0           0
Cost     10          10          10          10          10          10          20          20          20          0           5           10
I want a dynamic summary table of quarterly data. I can use SUMIFS and look up the quarter with this function:
SUMIFS([Value row range],[Quarter range],[Quarter wanted])
However, I still have to manually select the correct value row range to sum. Is it possible to select the entire table and then match the correct row based on its label (the metric in this case)?
Insert Report Month: Dec-22
Last 3 quarter report
Metrics  Q2 2022  Q3 2022  Q4 2022
Revenue  1500     300      0
Cost     30       60       15
I'm aware of the INDEX and MATCH functions, but they only find the first match and do not sum all the months in the same quarter.
Thanks for helping!
Excel 365 for Mac should have the BYCOL function.
Given:
Your data table is a Table named Metrics.
Report_Month is a Named Range containing a "real date" in the final month of the last desired quarter.
The following formula will return your output and will adjust as you add columns to the data table.
A11: =Metrics[[#All],[Metrics]]
B11: =LET(x,EDATE(Report_Month,SEQUENCE(,3,-6,3)),TEXT(MONTH(x)/3,"\Q0 ") & YEAR(x))
B12: =BYCOL(XLOOKUP(TEXT(DATE(YEAR(Report_Month),MONTH(Report_Month)-9+SEQUENCE(3,,1,1)+SEQUENCE(,3,0,3),1),"mmm-yy"),Metrics[#Headers],INDEX(Metrics,XMATCH(A12,Metrics[Metrics]),0)),LAMBDA(arr,SUM(arr)))
Select B12 and fill down as far as needed.
Notes
DATE(YEAR(Report_Month),MONTH(Report_Month)-9+SEQUENCE(3,,1,1)+SEQUENCE(,3,0,3),1)
creates a matrix of the first days of the nine months ending with the report month, with each column holding one quarter.
So for 12/1/2022 =>
4/1/2022  7/1/2022  10/1/2022
5/1/2022  8/1/2022  11/1/2022
6/1/2022  9/1/2022  12/1/2022
The TEXT function then formats the same as the column headers in the Metrics table.
XLOOKUP will then return the appropriate columns from the table into that matrix, and using the BYCOL allows us to SUM by column which is the relevant quarter.
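The month arithmetic in the notes can be checked outside Excel. The following Python sketch is not part of the original answer: the helper names month_start, quarter_matrix, and quarter_sums are hypothetical, and keying the sample values by month number assumes every month in the matrix falls in 2022. It rebuilds the 3x3 month matrix and sums each column the way BYCOL with SUM does.

```python
from datetime import date

def month_start(year, month):
    """First day of a month; like Excel's DATE, months outside 1..12 roll over."""
    year_shift, month_index = divmod(month - 1, 12)
    return date(year + year_shift, month_index + 1, 1)

def quarter_matrix(report_month):
    """3x3 grid of month starts: one column per quarter, oldest quarter first."""
    base = report_month.month - 9
    return [
        [month_start(report_month.year, base + row + col) for col in (0, 3, 6)]
        for row in (1, 2, 3)
    ]

# Monthly values from the question, keyed by month number in 2022
revenue = dict(zip(range(1, 13), [1000] * 3 + [500] * 3 + [100] * 3 + [0] * 3))
cost = dict(zip(range(1, 13), [10] * 6 + [20] * 3 + [0, 5, 10]))

matrix = quarter_matrix(date(2022, 12, 1))

def quarter_sums(values):
    # Sum down each column of the matrix, one total per quarter
    return [sum(values[row[col].month] for row in matrix) for col in range(3)]

print(quarter_sums(revenue))
print(quarter_sums(cost))
```

For a report month of December 2022, the column totals agree with the expected summary table in the question (Q2, Q3, Q4 2022).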

Reshaping Multi Indexed DF [duplicate]

I have a dataframe that is structured like so (similar to a pivot table):
A   B   December 2022  January 2023
A1  B1  100            200
A1  B2  101            201
I'd like to transpose my dataframe so that it reads:
Month          A   B   Value
December 2022  A1  B1  100
December 2022  A1  B2  101
January 2023   A1  B1  200
January 2023   A1  B2  201
etc. I've attempted
df.T
But it gives me:
A              A1   A1
B              B1   B2
December 2022  100  101
January 2023   200  201
You should use pd.melt:
>>> df.melt(id_vars=['A', 'B'], var_name='Month', value_name='Value')
A B Month Value
0 A1 B1 December 2022 100
1 A1 B2 December 2022 101
2 A1 B1 January 2023 200
3 A1 B2 January 2023 201
Then, to reorder the columns, you can use this trick:
>>> df.melt(id_vars=['A', 'B'], var_name='Month', value_name='Value') \
.set_index('Month').reset_index()
Month A B Value
0 December 2022 A1 B1 100
1 December 2022 A1 B2 101
2 January 2023 A1 B1 200
3 January 2023 A1 B2 201
Update: according to @sammywemmy's comment:
var_cols = ['A', 'B']
out = df.melt(id_vars=var_cols, var_name='Month', value_name='Value') \
[['Month'] + var_cols + ['Value']]
print(out)
# Output
Month A B Value
0 December 2022 A1 B1 100
1 December 2022 A1 B2 101
2 January 2023 A1 B1 200
3 January 2023 A1 B2 201
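For completeness, here is a self-contained version of the melt approach that rebuilds the question's frame first. It assumes A and B are ordinary columns; if they are index levels instead (as the title's "Multi Indexed" suggests they might be), calling df.reset_index() before melt should bring them back as columns.

```python
import pandas as pd

# Wide frame from the question: one column per month
df = pd.DataFrame({
    "A": ["A1", "A1"],
    "B": ["B1", "B2"],
    "December 2022": [100, 101],
    "January 2023": [200, 201],
})

id_cols = ["A", "B"]

# Unpivot the month columns into (Month, Value) pairs
out = df.melt(id_vars=id_cols, var_name="Month", value_name="Value")

# Put Month first by selecting the columns in the desired order
out = out[["Month"] + id_cols + ["Value"]]
print(out)
```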

Excel: Dynamic Range Date used in other fields: Sumproduct

I am using a SUMPRODUCT formula to get net sales for the first four months, then the second four months, then the third four months, up to one month before today. This is the formula I use:
=IFERROR(SUMPRODUCT($B3:$Y3*(COLUMN($B3:$Y3)>=AGGREGATE(15,6,COLUMN($B3:$Y3)/($B3:$Y3<>0),1)+4*(COLUMNS(B3)-1))*(COLUMN($B3:$Y3)<AGGREGATE(15,6,COLUMN($B3:$Y3)/($B3:$Y3<>0),1)+4*(COLUMNS(B3)))*($B$1:$Y$1<EOMONTH(TODAY(),-1)+1)),0)
However, I need to capture the same range for other measures, such as COGS in my example. I cannot use the formula above for those measures because they are sometimes zero within the range where Net Sales is non-zero, and I need to capture those zeros as well.
Net Sales
Jan  Feb  Mar  Apr  May  June  July  Aug  Sept  Oct  Nov  Dec
0    0    2    3    4    5     2     3    2     3    2    4
---> 1st period = 14, 2nd period = 10
COGS (follows the same date range as Net Sales)
Jan  Feb  Mar  Apr  May  June  July  Aug  Sept  Oct  Nov  Dec
0    0    0    0    0    2     1     4    2     3    2    4
---> 1st period = 2, 2nd period = 11
You can keep the entire range-check logic from the first formula and change only the value range. The first formula in my sample:
=IFERROR(SUMPRODUCT($A3:$L3*(COLUMN($A3:$L3)>=AGGREGATE(15,6,COLUMN($A3:$L3)/($A3:$L3<>0),1)+4*(COLUMN(A3)-1))*(COLUMN($A3:$L3)<AGGREGATE(15,6,COLUMN($A3:$L3)/($A3:$L3<>0),1)+4*(COLUMN(A3)))*($A$2:$L$2<EOMONTH(TODAY(),-1)+1)),0)
The second formula, for COGS:
=IFERROR(SUMPRODUCT($O3:$Z3*(COLUMN($A3:$L3)>=AGGREGATE(15,6,COLUMN($A3:$L3)/($A3:$L3<>0),1)+4*(COLUMN(A3)-1))*(COLUMN($A3:$L3)<AGGREGATE(15,6,COLUMN($A3:$L3)/($A3:$L3<>0),1)+4*(COLUMN(A3)))*($A$2:$L$2<EOMONTH(TODAY(),-1)+1)),0)

Handle ValueError while creating date in pd

I'm reading a CSV file with the columns p, day, and month into a DataFrame. The goal is to create a date from the day, the month, and the current year, but I run into this error for the 29th of February:
ValueError: cannot assemble the datetimes: day is out of range for month
When this error occurs, I would like to replace the day with the previous day. How can I do that? Below are a few lines of my DataFrame; the datex column at the end is what I would like to get:
p day month year datex
0 p1 29 02 2021 28Feb-2021
1 p2 18 07 2021 18Jul-2021
2 p3 12 09 2021 12Sep-2021
Right now, my code for the date is only the line below, so I get NaN where the date doesn't exist.
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
You could try something like this:
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
As expected, you get NaT for the invalid date:
p day year month datex
0 p1 29 2021 2 NaT
1 p2 18 2021 7 2021-07-18
2 p3 12 2021 9 2021-09-12
You could then handle these missing values as a special case:
df.loc[df.datex.isnull(), 'previous_day'] = df.day - 1
p day year month datex previous_day
0 p1 29 2021 2 NaT 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
df.loc[df.datex.isnull(), 'datex'] = pd.to_datetime(df[['previous_day', 'year', 'month']].rename(columns={'previous_day': 'day'}))
p day year month datex previous_day
0 p1 29 2021 2 2021-02-28 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
You have to create a new day column if you want to keep day = 29 in the day column.
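Subtracting one day works for 29 February but would still fail for an input such as 31 February. As an alternative sketch (not the answer's method), the day can be capped at the length of its month before the date is assembled. This assumes day, month, and year are already integer columns; safe_day and month_start are names introduced here for illustration.

```python
import pandas as pd

# Sample rows from the question, with integer day/month/year columns
df = pd.DataFrame({
    "p": ["p1", "p2", "p3"],
    "day": [29, 18, 12],
    "month": [2, 7, 9],
    "year": [2021, 2021, 2021],
})

# The first day of each row's month always exists, so this cannot fail
month_start = pd.to_datetime(df[["year", "month"]].assign(day=1))

# Cap the day at the number of days in that month (28 for February 2021)
days_in_month = month_start.dt.days_in_month
safe_day = df["day"].where(df["day"] <= days_in_month, days_in_month)

# Assemble the final date from year, month, and the capped day
df["datex"] = pd.to_datetime(df[["year", "month"]].assign(day=safe_day))
print(df)
```

The original day column is left untouched, so the raw value of 29 remains available.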

Difference between value from multiple sheets

I would like to find the difference between the values for matching serial numbers across multiple Excel sheets.
Sheet 1
June July
B 10 20
A 50 90
Sheet 2
June July
A 6 3
C 5 9
B 10 5
Sheet 3 (results)
June July
A 44 87
B 0 15
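One possible way to get this result in pandas is sketched below. It assumes each sheet has been loaded into a DataFrame indexed by serial number (in practice, for example, with pd.read_excel and its sheet_name and index_col parameters); the frames here are built inline from the sample values so the example runs on its own.

```python
import pandas as pd

# Sheet 1 and Sheet 2 from the question, indexed by serial number
sheet1 = pd.DataFrame({"June": [10, 50], "July": [20, 90]}, index=["B", "A"])
sheet2 = pd.DataFrame({"June": [6, 5, 10], "July": [3, 9, 5]}, index=["A", "C", "B"])

# Subtraction aligns on row labels; serials present in only one sheet
# produce NaN rows, which are dropped
result = (sheet1 - sheet2).dropna().astype(int)
print(result)
```

Serial C appears only in Sheet 2, so it drops out, leaving the differences for A and B.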
