Pandas - Groupby Company and drop rows according to criteria based off the Dates of values being out of order - python-3.x

I have a history data log and want to calculate the number of days between progress stages by Company (the timestamp of an earlier stage must be smaller than that of the later stage).
Company Progress Time
AAA 3. Contract 07/10/2020
AAA 2. Discuss 03/09/2020
AAA 1. Start 02/02/2020
BBB 3. Contract 11/13/2019
BBB 3. Contract 07/01/2019
BBB 1. Start 06/22/2019
BBB 2. Discuss 04/15/2019
CCC 3. Contract 05/19/2020
CCC 2. Discuss 04/08/2020
CCC 2. Discuss 03/12/2020
CCC 1. Start 01/01/2020
Expected outputs:
Progress (1. Start --> 2. Discuss)
Company Progress Time
AAA 1. Start 02/02/2020
AAA 2. Discuss 03/09/2020
CCC 1. Start 01/01/2020
CCC 2. Discuss 03/12/2020
Progress (2. Discuss --> 3. Contract)
Company Progress Time
AAA 2. Discuss 03/09/2020
AAA 3. Contract 07/10/2020
CCC 2. Discuss 03/12/2020
CCC 3. Contract 05/19/2020
I did try some stupid ways to do the work, but they still need manual filtering in Excel. Below is my code:
df_stage1_stage2 = df[(df['Progress'] == '1. Start') | (df['Progress'] == '2. Discuss')]
pd.pivot_table(df_stage1_stage2, index=['Company', 'Progress'], aggfunc={'Time': min})
Can anyone help with the problem? Thanks!

Create some masks to filter out the relevant rows. m1 and m2 drop groups where 1. Start is not the "first" datetime when read in reverse order (since your dates are sorted by Company ascending and date descending, an in-order group ends with 1. Start). You can create more masks if you also need to check that 2. Discuss and 3. Contract are in order, instead of the current logic, which only checks 1. Start. With the data you provided, though, this returns the correct output:
import numpy as np
import pandas as pd

# Last Progress value per Company; with dates descending, an in-order
# group ends with '1. Start'
m1 = df.groupby('Company')['Progress'].transform('last')
m2 = np.where(m1 == '1. Start', 'keep', 'drop')
df = df[m2 == 'keep']
df
intermediate output:
Company Progress Time
0 AAA 3. Contract 07/10/2020
1 AAA 2. Discuss 03/09/2020
2 AAA 1. Start 02/02/2020
7 CCC 3. Contract 05/19/2020
8 CCC 2. Discuss 04/08/2020
9 CCC 2. Discuss 03/12/2020
10 CCC 1. Start 01/01/2020
From there, filter as you indicated by sorting and dropping duplicates on a subset of the first two columns, keeping the 'first' duplicate:
df1:
df1 = df[df['Progress'] != '3. Contract'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')
df1 output:
Company Progress Time
2 AAA 1. Start 02/02/2020
1 AAA 2. Discuss 03/09/2020
10 CCC 1. Start 01/01/2020
9 CCC 2. Discuss 03/12/2020
df2:
df2 = df[df['Progress'] != '1. Start'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')
df2 output:
Company Progress Time
1 AAA 2. Discuss 03/09/2020
0 AAA 3. Contract 07/10/2020
9 CCC 2. Discuss 03/12/2020
7 CCC 3. Contract 05/19/2020
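If you also need the number of days between stages that the question asks for, a per-company diff on these filtered frames gets there. A minimal sketch (assuming Time is first converted with pd.to_datetime, which this answer leaves as strings; the same applies to df2):
# Convert to datetime, then diff within each company; the first stage gets 0 days
df1['Time'] = pd.to_datetime(df1['Time'])
df1['days'] = df1.groupby('Company')['Time'].diff().dt.days.fillna(0)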

Something like this could work, assuming an already sorted df:
(full example)
import numpy as np
import pandas as pd

data = {
'Company':['AAA', 'AAA', 'AAA', 'BBB','BBB','BBB','BBB','CCC','CCC','CCC','CCC',],
'Progress':['3. Contract', '2. Discuss', '1. Start', '3. Contract', '3. Contract', '2. Discuss', '1. Start', '3. Contract', '2. Discuss', '2. Discuss', '1. Start', ],
'Time':['07-10-2020','03-09-2020','02-02-2020','11-13-2019','07-01-2019','06-22-2019','04-15-2019','05-19-2020','04-08-2020','03-12-2020','01-01-2020',],
}
df = pd.DataFrame(data)
df['Time'] = pd.to_datetime(df['Time'])
# We want to measure from the first occurrence (last date) if duplicated:
df.drop_duplicates(subset=['Company', 'Progress'], keep='first', inplace=True)
# Except for the '1. Start' rows, compute the days between this stage and
# the next row down (the previous stage, since dates are descending)
df['days_delta'] = np.where(df['Progress'] != '1. Start', df.Time.diff(-1), 0)
Output:
Company Progress Time days_delta
0 AAA 3. Contract 2020-07-10 123 days
1 AAA 2. Discuss 2020-03-09 36 days
2 AAA 1. Start 2020-02-02 0 days
3 BBB 3. Contract 2019-11-13 144 days
5 BBB 2. Discuss 2019-06-22 68 days
6 BBB 1. Start 2019-04-15 0 days
7 CCC 3. Contract 2020-05-19 41 days
8 CCC 2. Discuss 2020-04-08 98 days
10 CCC 1. Start 2020-01-01 0 days
If you do not want the 'days' word in the output, use:
df['days_delta'] = df['days_delta'].dt.days
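Note that the plain diff(-1) only stays inside each company because every group here ends with a 1. Start row that the np.where zeroes out. If that were not guaranteed, a per-group diff would be safer; a minimal sketch of that variant:
# Same delta computed per group, so it can never span two companies;
# the last row of each group comes back as NaT and is zero-filled
df['days_delta'] = (
    df.groupby('Company')['Time'].diff(-1)
      .fillna(pd.Timedelta(0))
)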

First Problem
# Coerce Time to datetime
df['Time'] = pd.to_datetime(df['Time'])
# groupby().nth() to slice the consecutive rows (last two per group)
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([-2, -1]))
         .sort_values(by=['Company', 'Time'], ascending=[True, True]))
# Universal rule for this problem: after the nth slice, drop any group
# whose sliced rows duplicate a single stage
df2 = df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]
# Calculate the diff() in Time within each group
df2['diff'] = df2.sort_values(by='Progress').groupby('Company')['Time'].diff().dt.days.fillna(0)
# Filter out the groups where Start and Discuss times are in conflict
df2 = df2[~df2.Company.isin(df2.loc[df2['diff'] < 0, 'Company'].unique())]
df2
Company Progress Time diff
1 AAA 1. Start 2020-02-02 0.0
0 AAA 2. Discuss 2020-03-09 36.0
5 CCC 1. Start 2020-01-01 0.0
4 CCC 2. Discuss 2020-03-12 71.0
Second Problem
# groupby().nth() to slice the right consecutive rows (first two per group)
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([0, 1]))
         .sort_values(by=['Company', 'Time'], ascending=[True, True]))
# Drop any group whose sliced rows duplicate a single stage
df2 = df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]
df2
Company Progress Time
1 AAA 2. Discuss 2020-03-09
0 AAA 3. Contract 2020-07-10
5 CCC 2. Discuss 2020-04-08
4 CCC 3. Contract 2020-05-19
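The question also asks for the day counts; the same diff pattern as in the first problem applies to this slice. A sketch, reusing the filtered df2 above:
# df2 is sorted by Company and ascending Time, so within each company
# diff() yields Contract minus Discuss in days
df2['diff'] = df2.groupby('Company')['Time'].diff().dt.days.fillna(0)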

Related

Excel formula to find specific data based on max row result

Good day beautiful people of Scotl... Stackoverflow.
I have run into an issue in Excel which I have no idea how to solve. I tried many formulas, but I believe the problem is in my head, since I have trouble imagining the logic it should follow.
I have attached a screenshot to clarify my problem:
Excel screenshot
Description of a screenshot
Column B - data name,
Rows C3:H3 - product name,
Table C4:H15 - some data (description, dates, etc.).
Column I is my own addition and does not have to be there.
Desired result
I want to pull data from the table above into the table below, but if a "DataX" appears more than once, I want Excel to pick the "DataX" row with the largest number of filled cells (I have marked them blue for each DataX).
For example, for:
Data 1 - row 4,
Data 2 - row 7,
Data 3 - (obviously) row 9,
Data 4 - rows 11,
Data 5 - row 13.
If two or more records tie (their cells are empty / filled alike), I don't care which row is presented as the result.
What I have tried
I added a calculation (column I) that shows how many cells in the row are filled, and I tried combinations of VLOOKUP/HLOOKUP + MAX, but they didn't work correctly.
I also wrote VBA code for it, which worked... almost well, but then I was told that macros are a no-go zone for this project.
Logic
I strongly believe the logic should be as follows:
Find matching DataX,
Find max value in row I (or include it in formula),
Find corresponding rows / columns for this record.
A    B       C      D      E      F      G      H      I
2            CAT 1  CAT 2  CAT 3  CAT 4  CAT 5  CAT 6  Count not blank
3            1      2      3      4      5      6
4    Data 1  AAA    BBB    CCC           EEE    FFF    =$H$3-COUNTBLANK(C4:H4)
5    Data 1         BBB    CCC    DDD                  =$H$3-COUNTBLANK(C5:H5)
6    Data 1  AAA    BBB                  EEE    FFF    =$H$3-COUNTBLANK(C6:H6)
7    Data 2  AAA    BBB    CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C7:H7)
8    Data 2  AAA    BBB    CCC    DDD           FFF    =$H$3-COUNTBLANK(C8:H8)
9    Data 3  AAA    BBB    CCC           EEE    FFF    =$H$3-COUNTBLANK(C9:H9)
10   Data 4                CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C10:H10)
11   Data 4  AAA    BBB    CCC    DDD           FFF    =$H$3-COUNTBLANK(C11:H11)
12   Data 4  AAA    BBB    CCC           EEE    FFF    =$H$3-COUNTBLANK(C12:H12)
13   Data 5  AAA    BBB    CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C13:H13)
14   Data 5         BBB    CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C14:H14)
15   Data 5  AAA    BBB           DDD    EEE    FFF    =$H$3-COUNTBLANK(C15:H15)
Hello dear son of Scotl.. overflow!
Please add this additional formula to column J (range J4:J15):
=CONCATENATE(B4,I4)
and then paste this into C19:
=INDEX(C$4:C$15,MATCH(CONCATENATE($B19,MAX(IF($B$4:$B$15=$B19,$I$4:$I$15,0))), $J$4:$J$15,0))
Paste it as an array formula, i.e. press Ctrl+Shift+Enter simultaneously. Then populate it to the rest of the desired range. The MAX(IF(...)) part finds the highest count in column I among the rows whose column B matches $B19; MATCH then locates the row whose key in column J is that DataX name concatenated with the max count, and INDEX returns that row's value from the current column.
The numbers in my example table do not mean anything; it's the count in column I that matters.
Regards!!

Pandas: Sort a dataframe based on multiple columns

I know that this question has been asked several times. But none of the answers match my case.
I have a pandas dataframe with columns Department and Employee_Count. I need to sort by Employee_Count in descending order, but if there is a tie between two counts, the tied rows should be sorted alphabetically by Department.
Department Employee_Count
0 abc 10
1 adc 10
2 bca 11
3 cde 9
4 xyz 15
required output:
Department Employee_Count
0 xyz 15
1 bca 11
2 abc 10
3 adc 10
4 cde 9
This is what I've tried.
df = df.sort_values(['Department','Employee_Count'],ascending=[True,False])
But this just sorts the departments alphabetically.
I've also tried to sort by Department first and then by Employee_Count. Like this:
df = df.sort_values(['Department'],ascending=[True])
df = df.sort_values(['Employee_Count'],ascending=[False])
This doesn't give me the correct output either:
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10
0 abc 10
3 cde 9
It gives 'adc' first and then 'abc'.
Kindly help me.
You can swap the columns in the list, and likewise the values in the ascending parameter.
Explanation:
The order of the column names is the order of sorting: first sort descending by Employee_Count; wherever there are duplicates in Employee_Count, only those tied rows are then sorted ascending by Department.
df1 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, True])
print (df1)
Department Employee_Count
4 xyz 15
2 bca 11
0 abc 10 <-
1 adc 10 <-
3 cde 9
Or, to compare: if you use False for the second value, the tied rows are sorted descending:
df2 = df.sort_values(['Employee_Count', 'Department',],ascending=[False, False])
print (df2)
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10 <-
0 abc 10 <-
3 cde 9
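As an aside, your two-step attempt can be made to work, but only if the second sort is stable, so that tied rows keep the order produced by the first pass. A sketch:
# Sort by the secondary key first, then by the primary key with a stable
# algorithm ('mergesort'), so tied Employee_Count rows keep their
# alphabetical Department order from the first pass
df3 = df.sort_values('Department')
df3 = df3.sort_values('Employee_Count', ascending=False, kind='mergesort')
print (df3)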

Pandas DF Lookup - if value for record not available take the latest available

I am new to pandas and I've been trying to accomplish this task for a couple of days now without success. In the beginning I had 3 dataframes that I was supposed to turn into a single one with all the info. I managed to merge two of them correctly, which is now df1; however, for the third one there is a tricky piece of logic that I couldn't figure out yet. The data structure is the following:
df1.head()
Out[12]:
Concat YearNb_x MonthNb_x WeekNb_x NatCoCode VariantCode \
1 BN2004384AAA112017 2017 1 1 AAA BN2004384
2 BN2004388AAA112017 2017 1 1 AAA BN2004388
4 BN2004510AAA112017 2017 1 1 AAA BN2004510
5 BN2004645AAA112017 2017 1 1 AAA BN2004645
6 BN2004780AAA112017 2017 1 1 AAA BN2004780
Suppliercode_x ModelName_x SumOfVolume Price
1 HUAWEI P9 (Eva) 745 399.991667
2 HUAWEI P9 lite (Venus) 1770 211.666667
4 SAMSUNG A3 (2016) 6210 205.000000
5 APPLE iPhone 6s Plus 2 724.166667
6 SAMSUNG Galaxy J5 (2016) 4571 190.000000
df2.head()
Out[13]:
YearNb MonthNb WeekNb NatCoCode VariantCode Suppliercode \
0 2016 1 1 BBB BN2001707 APPLE
1 2016 1 2 BBB BN2001707 APPLE
2 2016 1 3 BBB BN2001707 APPLE
3 2016 1 4 BBB BN2001707 APPLE
4 2016 1 1 BBB BN2002345 SAMSUNG
ModelName LocalPrice ProductCategoryCode
0 iPhone 4S 385.0 HS
1 iPhone 4S 385.0 HS
2 iPhone 4S 385.0 HS
3 iPhone 4S 385.0 HS
4 G. Note 2 (N7100) 395.0 HS
All info except for the prices is supposed to be the same. What I need to do is look up the prices (it can be by month; WeekNb can be ignored) in df2 for the same combination of items (NatCoCode, VariantCode, Supplier, etc.), and IF the price for the respective month is not available, df1 should take the LATEST available.
I was trying the following logic, which obviously doesn't work:
import pandas as pd

df1 = pd.read_excel('output2.xlsx')
df2 = pd.read_excel('localtest.xlsx')

def PriceAssignment(df1, df2):
    i = 1
    while i >= 5:
        for i in df1['VariantCode'], df2['BNCode']:
            if df1.loc[df1[i], df1['YearNb_x'], df1['WeekNb_x'], df1['NatCoCode'], df1['VariantCode']] == df2.loc[df2[i], df2['YearNb_x'], df2['WeekNb_x'], df2['NatCoCode'], df2['VariantCode']]:
                df1['LocalPrice'] == df2.loc['Price']
            elif df2['MonthNb'] == 12:
                df2['YearNb'] -= i
            else:
                df2['MonthNb'] -= i
            i += 1
    return df1
The output would be something like:
From:
2017 2 OBE BN2004780BBB622017 SAMSUNG Galaxy J5 (2016) 500
2017 2 OBE BN2005184BBB622017 APPLE iPhone 6s Plus 300
2017 1 OBE BN2005190BBB622017 APPLE iPhone 7 350
To:
771 BN2004780BBB622017 2017 2 6 BBB BN2004780 SAMSUNG Galaxy J5 (2016) 67 171.9008264
772 BN2005184BBB622017 2017 2 6 BBB BN2005184 APPLE iPhone 6s Plus 13 614.8760331
773 BN2005190BBB622017 2017 2 6 BBB BN2005190 APPLE iPhone 7 1261 690.9090909
Result:
771 BN2004780BBB622017 2017 2 6 BBB BN2004780 SAMSUNG Galaxy J5 (2016) 67 171.9008264 500
772 BN2005184BBB622017 2017 2 6 BBB BN2005184 APPLE iPhone 6s Plus 13 614.8760331 300
773 BN2005190BBB622017 2017 2 6 BBB BN2005190 APPLE iPhone 7 1261 690.9090909 350
In this example, record 777 doesn't have a local price for the same month (03); in that case I would like to assign the latest available value to this item. Here the latest available value for this item is from the month before, so that is what would be added in the LocalPrice column.
I was trying to check whether there was an available price for the same item in the last five months (subjective). The data (spreadsheets) can be found HERE
Does anyone have an idea or know a proper way on how to perform such operation?
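One way to express "take the price for the same month if present, otherwise the latest earlier month" is pd.merge_asof with direction='backward'. Below is a hedged sketch, not a tested solution: the column names (YearNb_x/MonthNb_x on df1; YearNb, MonthNb, LocalPrice on df2) are taken from the excerpts above, and the roughly five-month tolerance mirrors the question's own suggestion.
import pandas as pd

# Build a month-start date in each frame (merge_asof needs one sortable key)
df1['Date'] = pd.to_datetime(
    df1[['YearNb_x', 'MonthNb_x']]
        .rename(columns={'YearNb_x': 'year', 'MonthNb_x': 'month'})
        .assign(day=1)
)
df2['Date'] = pd.to_datetime(
    df2[['YearNb', 'MonthNb']]
        .rename(columns={'YearNb': 'year', 'MonthNb': 'month'})
        .assign(day=1)
)

# Both frames must be sorted by the 'on' key for merge_asof
df1 = df1.sort_values('Date')
prices = df2[['Date', 'NatCoCode', 'VariantCode', 'LocalPrice']].sort_values('Date')

# For each df1 row, take LocalPrice for the same month if available,
# otherwise the most recent earlier month, looking back at most ~5 months
result = pd.merge_asof(
    df1, prices,
    on='Date',
    by=['NatCoCode', 'VariantCode'],
    direction='backward',
    tolerance=pd.Timedelta(days=155),
)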

Split a column's values by a special character and group by pandas

I have a df like this,
Owner Messages
AAA (YY) Duplicates
AAA Missing Number; (VV) Corrected Value; (YY) Duplicates
AAA (YY) Duplicates
BBB (YY) Duplicates
BBB Missing Measure; Missing Number
When I do a normal groupby like this,
df_grouped = df.groupby(['Owner', 'Messages']).size().reset_index(name='count')
df_grouped
I get this as expected,
Owner Messages count
0 AAA (YY) Duplicates 2
1 AAA Missing Number; (VV) Corrected Value; (YY) Duplicates 1
2 BBB (YY) Duplicates 1
3 BBB Missing Measure; Missing Number 1
However, I need something like this (desired output), splitting by ; inside the Messages column.
Owner Messages count
0 AAA (YY) Duplicates 3
1 AAA Missing Number 1
2 AAA (VV) Corrected Value 1
3 BBB (YY) Duplicates 1
4 BBB Missing Measure 1
5 BBB Missing Number 1
So far, based on this post (LeoRochael's answer), I can split the Messages column's values by ; and put them into a list. However, I cannot get the individual counts after the split.
Any ideas how to get my desired output?
You need to unnest your original dataframe first; then we just take the group size:
s=df.set_index('Owner').Messages.str.split('; ',expand=True).stack().to_frame('Messages').reset_index()
s.groupby(['Owner','Messages']).size()
Out[1213]:
Owner Messages
AAA (VV) Corrected Value 1
(YY) Duplicates 3
Missing Number 1
BBB (YY) Duplicates 1
Missing Measure 1
Missing Number 1
dtype: int64
Alternatively, count the unnested (Owner, Message) pairs directly with a Counter:
from collections import Counter
import pandas as pd
pd.Series(
    Counter([(o, m) for o, M in df.values for m in M.split('; ')])
).rename_axis(['Owner', 'Message']).reset_index(name='Count')
Owner Message Count
0 AAA (VV) Corrected Value 1
1 AAA (YY) Duplicates 3
2 AAA Missing Number 1
3 BBB (YY) Duplicates 1
4 BBB Missing Measure 1
5 BBB Missing Number 1
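On pandas 0.25+, the unnest step can also be written with explode; a small sketch of the same idea:
# Split into lists, explode to one message per row, then count the pairs
exploded = df.assign(Messages=df['Messages'].str.split('; ')).explode('Messages')
exploded.groupby(['Owner', 'Messages']).size().reset_index(name='count')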

Delete entire row if Consequent Data is Matching in Excel 2007

My problem: if a row's data in both columns matches another row, the duplicate row should be deleted.
For example:
Before
column A column B
aaa 10
aaa 10
aaa 5
bbb 6
aaa 10
bbb 5
After
column A column B
aaa 10
aaa 5
bbb 6
bbb 5
Select all the data in columns A and B, then on the Data ribbon select Remove Duplicates.
