Pandas DF Lookup - if value for record not available take the latest available - python-3.x

I am new to pandas and I've been trying to accomplish this task for a couple of days now without success. In the beginning I had 3 dataframes that I was supposed to turn into a single one with all the info. I managed to merge two of them correctly, which is now df1; however, the third one involves a tricky piece of logic that I haven't figured out yet. The data structure is the following:
df1.head()
Out[12]:
               Concat  YearNb_x  MonthNb_x  WeekNb_x NatCoCode VariantCode  \
1  BN2004384AAA112017      2017          1         1       AAA   BN2004384
2  BN2004388AAA112017      2017          1         1       AAA   BN2004388
4  BN2004510AAA112017      2017          1         1       AAA   BN2004510
5  BN2004645AAA112017      2017          1         1       AAA   BN2004645
6  BN2004780AAA112017      2017          1         1       AAA   BN2004780

  Suppliercode_x       ModelName_x  SumOfVolume       Price
1         HUAWEI          P9 (Eva)          745  399.991667
2         HUAWEI   P9 lite (Venus)         1770  211.666667
4        SAMSUNG         A3 (2016)         6210  205.000000
5          APPLE    iPhone 6s Plus            2  724.166667
6        SAMSUNG  Galaxy J5 (2016)         4571  190.000000
df2.head()
Out[13]:
   YearNb  MonthNb  WeekNb NatCoCode VariantCode Suppliercode  \
0    2016        1       1       BBB   BN2001707        APPLE
1    2016        1       2       BBB   BN2001707        APPLE
2    2016        1       3       BBB   BN2001707        APPLE
3    2016        1       4       BBB   BN2001707        APPLE
4    2016        1       1       BBB   BN2002345      SAMSUNG

           ModelName  LocalPrice ProductCategoryCode
0          iPhone 4S       385.0                  HS
1          iPhone 4S       385.0                  HS
2          iPhone 4S       385.0                  HS
3          iPhone 4S       385.0                  HS
4  G. Note 2 (N7100)       395.0                  HS
All info except for the prices is supposed to be the same. What I need to do is look up the prices in df2 (it can be by month; WeekNb can be ignored) for the same combination of items (NatCoCode, VariantCode, Suppliercode, etc.), and IF the price for the respective month is not available, df1 should take the LATEST available one.
I was trying the following logic, which obviously doesn't work:
import pandas as pd

df1 = pd.read_excel('output2.xlsx')
df2 = pd.read_excel('localtest.xlsx')

def PriceAssignment(df1, df2):
    i = 1
    while i >= 5:
        for i in df1['VariantCode'], df2['BNCode']:
            if df1.loc[df1[i], df1['YearNb_x'], df1['WeekNb_x'], df1['NatCoCode'], df1['VariantCode']] == df2.loc[df2[i], df2['YearNb_x'], df2['WeekNb_x'], df2['NatCoCode'], df2['VariantCode']]:
                df1['LocalPrice'] == df2.loc['Price']
            elif df2['MonthNb'] == 12:
                df2['YearNb'] -= i
            else:
                df2['MonthNb'] -= i
        i += 1
    return df1
The output would be something like:
From:
2017 2 OBE BN2004780BBB622017 SAMSUNG Galaxy J5 (2016) 500
2017 2 OBE BN2005184BBB622017 APPLE iPhone 6s Plus 300
2017 1 OBE BN2005190BBB622017 APPLE iPhone 7 350
To:
771 BN2004780BBB622017 2017 2 6 BBB BN2004780 SAMSUNG Galaxy J5 (2016) 67 171.9008264
772 BN2005184BBB622017 2017 2 6 BBB BN2005184 APPLE iPhone 6s Plus 13 614.8760331
773 BN2005190BBB622017 2017 2 6 BBB BN2005190 APPLE iPhone 7 1261 690.9090909
Result:
771 BN2004780BBB622017 2017 2 6 BBB BN2004780 SAMSUNG Galaxy J5 (2016) 67 171.9008264 500
772 BN2005184BBB622017 2017 2 6 BBB BN2005184 APPLE iPhone 6s Plus 13 614.8760331 300
773 BN2005190BBB622017 2017 2 6 BBB BN2005190 APPLE iPhone 7 1261 690.9090909 350
In this example, record 777 doesn't have a local price for the same month (03). In that case I would like to assign the latest available value to this item; here the latest value available for this item is from the month before, so that is what would be added in the LocalPrice column.
I was trying to check whether there was an available price for the same item within the last five months (subjective). The data (spreadsheets) can be found HERE
Does anyone have an idea or know a proper way to perform such an operation?
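One way to get this fallback behaviour is pd.merge_asof, which joins each row to the most recent match at or before its own key. A minimal sketch, using the column names from the excerpts above; the MonthDate helper column and the tolerance value (approximating the five-month lookback) are assumptions:

import pandas as pd

# Build a month-resolution date on both frames
df1['MonthDate'] = pd.to_datetime({'year': df1['YearNb_x'], 'month': df1['MonthNb_x'], 'day': 1})
df2['MonthDate'] = pd.to_datetime({'year': df2['YearNb'], 'month': df2['MonthNb'], 'day': 1})

# One price per item per month, keeping the last observation
prices = (df2.sort_values('MonthDate')
             [['NatCoCode', 'VariantCode', 'MonthDate', 'LocalPrice']]
             .drop_duplicates(['NatCoCode', 'VariantCode', 'MonthDate'], keep='last'))

# For each df1 row, take the same month's price if present,
# otherwise the latest earlier one within the tolerance window
result = pd.merge_asof(
    df1.sort_values('MonthDate'),
    prices,
    on='MonthDate',
    by=['NatCoCode', 'VariantCode'],
    direction='backward',
    tolerance=pd.Timedelta(days=155),  # roughly five months; an assumption
)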

Related

Looking for specific record matches in another dataframe [duplicate]

I have df1 as follows:
Name | ID
________|_____
Banana | 10
Orange | 21
Peach | 115
Then I have a df2 like this:
ID Price
10 2.34
10 2.34
115 6.00
I want to modify df2 to add another column named Fruit, to get this as output:
ID Fruit Price
10 Banana 2.34
10 Banana 2.34
115 Peach 6.00
200 NA NA
I can use iloc to get one specific match, but how do I do it for all records in df2?
Have you tried looking at the merge function?
pd.merge(df1, df2)
Output:
     Name   ID  Price
0  Banana   10   2.34
1  Banana   10   2.34
2   Peach  115   6.00
EDIT:
If you want to add only a specific column from df2:
df = pd.merge(df1, df2[['ID', 'Price']], on='ID', how='left')
Output:
     Name   ID  Price
0  Banana   10   2.34
1  Banana   10   2.34
2  Orange   21    NaN
3   Peach  115   6.00
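To put the Fruit column onto df2 itself, as the question originally asked, the same merge can simply be reversed so that every row of df2 is kept (a sketch; the rename just matches the wording of the desired output):

df2 = pd.merge(df2, df1, on='ID', how='left').rename(columns={'Name': 'Fruit'})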

Pandas - Groupby Company and drop rows according to criteria based off the Dates of values being out of order

I have a history data log and want to calculate the number of days between progress stages by Company (the timestamp of an earlier stage must be smaller than that of a later stage).
Company Progress Time
AAA 3. Contract 07/10/2020
AAA 2. Discuss 03/09/2020
AAA 1. Start 02/02/2020
BBB 3. Contract 11/13/2019
BBB 3. Contract 07/01/2019
BBB 1. Start 06/22/2019
BBB 2. Discuss 04/15/2019
CCC 3. Contract 05/19/2020
CCC 2. Discuss 04/08/2020
CCC 2. Discuss 03/12/2020
CCC 1. Start 01/01/2020
Expected outputs:
Progress (1. Start --> 2. Discuss)
Company Progress Time
AAA 1. Start 02/02/2020
AAA 2. Discuss 03/09/2020
CCC 1. Start 01/01/2020
CCC 2. Discuss 03/12/2020
Progress (2. Discuss --> 3. Contract)
Company Progress Time
AAA 2. Discuss 03/09/2020
AAA 3. Contract 07/10/2020
CCC 2. Discuss 03/12/2020
CCC 3. Contract 05/19/2020
I did try some clumsy approaches, but they still need manual filtering in Excel. Below is my code:
df_stage1_stage2 = df[(df['Progress']=='1. Start')|(df['Progress']=='2. Discuss ')]
pd.pivot_table(df_stage1_stage2 ,index=['Company','Progress'],aggfunc={'Time':min})
Can anyone help with this problem? Thanks.
Create some masks to filter out the relevant rows. m1 and m2 filter out groups where 1. Start is not the "first" datetime when read in reverse order (since your dates are sorted by Company ascending and date descending). You can create more masks if you also need to check that 2. Discuss and 3. Contract are in order; the current logic only checks that 1. Start is in order, but with the data you provided it returns the correct output:
import numpy as np

m1 = df.groupby('Company')['Progress'].transform('last')
m2 = np.where(m1 == '1. Start', 'keep', 'drop')
df = df[m2 == 'keep']
df
intermediate output:
Company Progress Time
0 AAA 3. Contract 07/10/2020
1 AAA 2. Discuss 03/09/2020
2 AAA 1. Start 02/02/2020
7 CCC 3. Contract 05/19/2020
8 CCC 2. Discuss 04/08/2020
9 CCC 2. Discuss 03/12/2020
10 CCC 1. Start 01/01/2020
From there, filter as you have indicated by sorting and dropping duplicates based on a subset of the first two columns, keeping the 'first' duplicate:
final df1 and df2 output:
df1
df1 = df[df['Progress'] != '3. Contract'] \
    .sort_values(['Company', 'Time'], ascending=[True, True]) \
    .drop_duplicates(subset=['Company', 'Progress'], keep='first')
df1 output:
Company Progress Time
2 AAA 1. Start 02/02/2020
1 AAA 2. Discuss 03/09/2020
10 CCC 1. Start 01/01/2020
9 CCC 2. Discuss 03/12/2020
df2
df2 = df[df['Progress'] != '1. Start'] \
    .sort_values(['Company', 'Time'], ascending=[True, True]) \
    .drop_duplicates(subset=['Company', 'Progress'], keep='first')
df2 output:
Company Progress Time
1 AAA 2. Discuss 03/09/2020
0 AAA 3. Contract 07/10/2020
9 CCC 2. Discuss 03/12/2020
7 CCC 3. Contract 05/19/2020
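To go on to the day counts the question ultimately asks for, each filtered frame can then be diffed per company. A minimal sketch, assuming Time is first converted with pd.to_datetime:

df1['Time'] = pd.to_datetime(df1['Time'])
# Days elapsed between consecutive stages within each company
df1['days'] = df1.groupby('Company')['Time'].diff().dt.days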
Something like this could work, assuming an already sorted df:
(full example)
import numpy as np
import pandas as pd

data = {
    'Company': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB', 'CCC', 'CCC', 'CCC', 'CCC'],
    'Progress': ['3. Contract', '2. Discuss', '1. Start', '3. Contract', '3. Contract',
                 '2. Discuss', '1. Start', '3. Contract', '2. Discuss', '2. Discuss', '1. Start'],
    'Time': ['07-10-2020', '03-09-2020', '02-02-2020', '11-13-2019', '07-01-2019', '06-22-2019',
             '04-15-2019', '05-19-2020', '04-08-2020', '03-12-2020', '01-01-2020'],
}
df = pd.DataFrame(data)
df['Time'] = pd.to_datetime(df['Time'])

# We want to measure from the first occurrence (latest date) if duplicated:
df.drop_duplicates(subset=['Company', 'Progress'], keep='first', inplace=True)

# Except for the '1. Start' rows, calculate the difference in days
df['days_delta'] = np.where(df['Progress'] != '1. Start', df.Time.diff(-1), 0)
Output:
Company Progress Time days_delta
0 AAA 3. Contract 2020-07-10 123 days
1 AAA 2. Discuss 2020-03-09 36 days
2 AAA 1. Start 2020-02-02 0 days
3 BBB 3. Contract 2019-11-13 144 days
5 BBB 2. Discuss 2019-06-22 68 days
6 BBB 1. Start 2019-04-15 0 days
7 CCC 3. Contract 2020-05-19 41 days
8 CCC 2. Discuss 2020-04-08 98 days
10 CCC 1. Start 2020-01-01 0 days
If you do not want the 'days' word in the output, use:
df['days_delta'] = df['days_delta'].dt.days
First Problem
# Coerce Time to datetime
df['Time'] = pd.to_datetime(df['Time'])

# groupby().nth() to slice out the last two stages of each consecutive order
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([-2, -1])))\
        .sort_values(by=['Company', 'Time'], ascending=[True, True])

# The universal rule for this problem: after the groupby/nth slice,
# drop any group whose Progress values are duplicated
df2 = df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]

# Calculate the diff() in Time within each group
df2['diff'] = df2.sort_values(by='Progress').groupby('Company')['Time'].diff().dt.days.fillna(0)

# Filter out the groups where the Start and Discuss times are in conflict
df2[~df2.Company.isin(df2.loc[df2['diff'] < 0, 'Company'].unique())]
Company Progress Time diff
1 AAA 1.Start 2020-02-02 0.0
0 AAA 2.Discuss 2020-03-09 36.0
5 CCC 1.Start 2020-01-01 0.0
4 CCC 2.Discuss 2020-03-12 71.0
Second Problem
# groupby().nth() to slice out the right consecutive stages
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([0, 1])))\
        .sort_values(by=['Company', 'Time'], ascending=[True, True])

# Drop any group whose Progress values are duplicated after grouping
df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]
Company Progress Time
1 AAA 2.Discuss 2020-03-09
0 AAA 3.Contract 2020-07-10
5 CCC 2.Discuss 2020-04-08
4 CCC 3.Contract 2020-05-19

Groupby one column and forward replace values in multiple columns based on condition using Pandas

Given a dataframe as follows:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 xd dt 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh pd 2020 5
Say there are typo errors in the city and district columns for rows whose year is 2020, so I want to group by id and ffill those columns with the previous values.
How could I do that in Pandas? Thanks a lot.
The desired output will like this:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
The following code works, but I'm not sure it's the best solution.
If you have other approaches, feel free to share. Thanks.
df.loc[df['year'].isin(['2020']), ['city', 'district']] = np.nan
df[['city', 'district']] = df[['city', 'district']].fillna(df.groupby('id')[['city', 'district']].ffill())
Out:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
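Since the question invites other approaches, here is a slightly shorter variant of the same idea (a sketch, assuming year is stored as an integer; adjust the comparison if it is a string):

import numpy as np

cols = ['city', 'district']
# Blank out the suspect 2020 values, then forward-fill within each id
df.loc[df['year'] == 2020, cols] = np.nan
df[cols] = df.groupby('id')[cols].ffill()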

Add rows according to other rows

My DataFrame object is similar to this one:
     io Product StoreFrom StoreTo      Date
1   out   melon    StoreQ  StoreP  20170602
2   out  cherry    StoreW  StoreO  20170614
3   out   Apple    StoreE  StoreU  20170802
4    in   Apple    StoreE  StoreU  20170812
I want to avoid duplications; the 3rd and 4th rows describe the same action. I am trying to reach:
     io Product StoreFrom StoreTo      Date  Days
1   out   melon    StoreQ  StoreP  20170602
2   out  cherry    StoreW  StoreO  20170614
5    in   Apple    StoreE  StoreU  20170812    10
and I have more than 10k entries. I could not find any similar work on this. Any help would be very useful.
# Group key: the same product moving between the same two stores
cols = ['Product', 'StoreFrom', 'StoreTo']

# Parse the integer dates into datetimes
d1 = df.assign(Date=pd.to_datetime(df.Date.astype(str)))

# Within each group, measure the days elapsed since the first (out) record
d2 = d1.assign(Days=d1.groupby(cols).Date.apply(lambda x: x - x.iloc[0]))

# Keep only the closing record of each movement
d2.drop_duplicates(cols, keep='last')
io Product StoreFrom StoreTo Date Days
1 out melon StoreQ StoreP 2017-06-02 0 days
2 out cherry StoreW StoreO 2017-06-14 0 days
4 in Apple StoreE StoreU 2017-08-12 10 days

SUM(IF(ColA=ColA AND ColB=ColB,ColC,0)

This SUMIF calculation has stumped me within Excel (2013).
A B C D E
Created Source Conv Rev RPConv
Jan,1 2014 Apples 3 5.00 =Rev/Conv
Jan,1 2014 Oranges 2 4.00 =Rev/Conv
Jan,7 2014 Apples 3 5.00 =Rev/Conv
Feb,1 2014 Apples 5 5.00 =Rev/Conv
Feb,1 2014 Oranges 3 4.00 =Rev/Conv
CURRENT: =SUM(IF(MONTH($A:$A)=1 AND $B:$B='Apples',$D:$D,0)
What I expect it to return is:
5.00+5.00
but unfortunately Excel rejects the formula altogether.
Given the tag and assuming Month 1 is January 2014:
=SUMIFS(D:D,A:A,">"&41639,A:A,"<"&41671,B:B,"Apples")
Or as an array formula, confirmed with Ctrl+Shift+Enter. Note that AND does not evaluate element-wise in array formulas, so the two conditions are multiplied instead:
=SUM(IF((MONTH($A:$A)=1)*($B:$B="Apples"),$D:$D,0))
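Alternatively, SUMPRODUCT avoids the need for array entry altogether. The row bounds below are illustrative and should be adjusted to the actual data, since MONTH over a full column would fail on the text header:
=SUMPRODUCT((MONTH($A$2:$A$6)=1)*($B$2:$B$6="Apples")*$D$2:$D$6)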
