iteratively merging varying number of rows - python-3.x

Earlier discussion, with help from @Joe Ferndz, here:
merging varying number of rows and columns by multiple conditions in python
How the dataset looks:
connector type q_text a_text var1
1 1111 1 aaaa None xxxx
2 9999 2 None tttt jjjj
3 1111 2 None uuuu None
4 9999 1 bbbb None yyyy
5 9999 1 cccc None zzzz
Logic: merge every row with type = 1 into its corresponding (same value in connector) type = 2 row. Code that does this:
df.loc[df['type'] == 2, 'var1.1'] = df['var1']
my_cols = ['q_text','a_text','var1']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['q_text'], inplace=True)
df.reset_index(drop=True,inplace=True)
How the dataset then looks:
connector q_text a_text var1 var1.1
1 1111 aaaa uuuu xxxx None
2 9999 bbbb tttt yyyy jjjj
3 9999 cccc None zzzz zzzz
Problem: multiple rows have type = 1 but only one row has type = 2 (same connector value), so I need to merge the type = 2 row potentially multiple times.
Question: why does it merge only one row?
How the dataset should look (compare the row 3 values and you will see what I mean):
connector q_text a_text var1 var1.1
1 1111 aaaa uuuu xxxx None
2 9999 bbbb tttt yyyy jjjj
3 9999 cccc tttt zzzz jjjj
a_text follows left-join logic, so values can be overridden without adding an extra column. By contrast, var1 values are non-exclusionary with regard to the row's connector value, which is why I want an extra column (var1.1) for those values (jjjj). There are rows with a unique connector value that will never be merged, but I want to keep those.

You want to merge rows with type = 1 to rows having type = 2, but the code/logic you showed doesn't use the pandas.merge method, which will actually do what you want.
First, segregate the rows with type = 1 and type = 2 into two different dataframes, df1 and df2. Then simply merge these two dataframes on the connector values. This will automatically map multiple type = 1 rows in df1 to the single type = 2 row in df2 (with the same connector value). Also, since you want to keep rows with a unique connector value that will never be merged, pass how='outer' to perform an outer merge and keep all values.
After the merge, select the columns you finally want and rename them accordingly:
df1 = df.loc[df.type == 1].copy()
df2 = df.loc[df.type == 2].copy()
merged_df = pd.merge(df1, df2, on='connector', how='outer')
merged_df = merged_df.loc[:,['connector','q_text_x','a_text_y','var1_x','var1_y']]
merged_df.rename(columns={'q_text_x':'q_text','a_text_y':'a_text','var1_x':'var1','var1_y':'var1.1'}, inplace=True)
>>> merged_df
connector q_text a_text var1 var1.1
0 1111 aaaa uuuu xxxx None
1 9999 bbbb tttt yyyy jjjj
2 9999 cccc tttt zzzz jjjj
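For reference, a minimal self-contained sketch that rebuilds the sample frame from the question (values copied from the table above), so the steps can be reproduced end to end:
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'connector': [1111, 9999, 1111, 9999, 9999],
    'type':      [1, 2, 2, 1, 1],
    'q_text':    ['aaaa', None, None, 'bbbb', 'cccc'],
    'a_text':    [None, 'tttt', 'uuuu', None, None],
    'var1':      ['xxxx', 'jjjj', None, 'yyyy', 'zzzz'],
})

# Split by type, merge on connector, then trim and rename as shown above
df1 = df.loc[df.type == 1].copy()
df2 = df.loc[df.type == 2].copy()
merged_df = pd.merge(df1, df2, on='connector', how='outer')
merged_df = merged_df.loc[:, ['connector', 'q_text_x', 'a_text_y', 'var1_x', 'var1_y']]
merged_df.rename(columns={'q_text_x': 'q_text', 'a_text_y': 'a_text',
                          'var1_x': 'var1', 'var1_y': 'var1.1'}, inplace=True)
print(merged_df)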

Related

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the closest combination of rows from df2's Sales column that sums up to the target value in df1's Total Sales column. Columns Name & Date should match in both dataframes when joining (as shown in the expected output).
For example: df1 row 0 should be matched only with df2 rows 0 & 1, since Name & Date are the same, namely Name: John and Date: 2021-10-01.
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one row at a time, but I'm not sure how to apply it to the pandas dataframes to get the expected output.
The variable numbers in the script below represents the Sales column in df2, and the variable target represents the Total Sales column in df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000
best_combination = ((None,))
best_result = math.inf
best_sum = 0
for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        sum = 0
        for number in combination:
            sum += number
        result = target - sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = sum
print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote which finds the best sum and turn it into a function (let's call it opt) that takes a target and a dataframe (which will be a subset of df2) as parameters. It needs to return the list of IDs that correspond to the optimal combination.
Write another function (let's call it calc) which takes 3 arguments: name, date and target. This function will filter df2 based on name and date, pass the result along with the target to the opt function, and return the result of that function. Finally, iterate through the rows of df1 and call calc with the row arguments (or alternatively use pandas.DataFrame.apply).
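A hedged sketch of that approach, reusing the brute-force search from the question (the names opt and calc come from the suggestion above; the exact column wiring and the comma-joined ID string are assumptions):
import itertools
import math
import pandas as pd

def opt(target, group):
    # Brute-force the subset of rows in `group` whose Sales sum is closest to `target`;
    # returns the matching IDs (comma-separated) and their total.
    best_ids, best_sum, best_diff = '', 0.0, math.inf
    rows = list(group.itertuples(index=False))
    for r in range(len(rows) + 1):
        for combo in itertools.combinations(rows, r):
            total = sum(row.Sales for row in combo)
            if abs(target - total) < best_diff:
                best_diff = abs(target - total)
                best_ids = ', '.join(row.ID for row in combo)
                best_sum = float(total)
    return best_ids, best_sum

def calc(name, date, target):
    # Filter df2 to the matching Name/Date rows and delegate to opt
    subset = df2[(df2['Name'] == name) & (df2['Date'] == date)]
    return opt(target, subset)

df1[['Comb IDs', 'Comb Total']] = df1.apply(
    lambda row: pd.Series(calc(row['Name'], row['Date'], row['Total Sales'])),
    axis=1,
)
print(df1)
The brute-force search is exponential in the group size, which is fine for the handful of rows per Name/Date group here.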

Filter pandas group with if else condition

I have a pandas dataframe like this:
ID    Tier1  Tier2
1111  RF     B
1111  OK     B
2222  RF     B
2222  RF     E
3333  OK     B
3333  LO     B
I need to cut down the table so the IDs are unique, but do so with the following hierarchy: RF>OK>LO for Tier1. Then B>E for Tier2.
So the expected output will be:
ID    Tier1  Tier2
1111  RF     B
2222  RF     B
2222  RF     E
3333  OK     B
then:
ID    Tier1  Tier2
1111  RF     B
2222  RF     B
3333  OK     B
I am struggling to figure out how to do this. My initial attempt is to group the table with grouped = df.groupby('ID') and then:
grouped = df.groupby('ID')
for key, group in grouped:
    check_rf = group['Tier1'] == 'RF'
    check_ok = group['Tier1'] == 'OK'
    if check_rf.any():
        group = group[group['Tier1'] == 'RF']
    elif check_ok.any():
        # and so on
I think this is working to filter each group, but I have no idea how the groups can then relate back to the parent table (df). And I am sure there is a better way to do this.
Thanks!
Let's use pd.Categorical & drop_duplicates
df['Tier1'] = pd.Categorical(df['Tier1'],['RF','OK','LO'],ordered=True)
df['Tier2'] = pd.Categorical(df['Tier2'],['B','E'],ordered=True)
df1 = df.sort_values(['Tier1','Tier2']).drop_duplicates(subset=['ID'],keep='first')
print(df1)
ID Tier1 Tier2
0 1111 RF B
2 2222 RF B
4 3333 OK B
Looking at Tier1 you can see the ordering.
print(df['Tier1'])
0 RF
1 OK
2 RF
3 RF
4 OK
5 LO
Name: Tier1, dtype: category
Categories (3, object): ['RF' < 'OK' < 'LO']
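One side note on this approach, not part of the original answer: pd.Categorical changes the dtype of the columns, so if plain strings are wanted downstream the columns can simply be cast back:
# Optional: drop the categorical dtype again after deduplication
df1[['Tier1', 'Tier2']] = df1[['Tier1', 'Tier2']].astype(str)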
You can use two groupby+agg pandas calls. Since the orderings RF>OK>LO and B>E are respectively compliant with the (reverse) lexicographic ordering, you can use the trivial min/max functions for the aggregation (otherwise you can write your own custom aggregation; see the sketch after the output below).
Here is how to do that (using a 2-pass filtering):
tmp = df.groupby(['ID', 'Tier2']).agg(max).reset_index() # Step 1
output = tmp.groupby(['ID', 'Tier1']).agg(min).reset_index() # Step 2
Here is the result in output:
ID Tier1 Tier2
0 1111 RF B
1 2222 RF B
2 3333 OK B
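If the desired priority did not happen to match lexicographic order, a hedged alternative (not from the original answer) is to map each label to an explicit rank and deduplicate on that:
import pandas as pd

# Assumed priorities: lower rank = higher priority
tier1_rank = {'RF': 0, 'OK': 1, 'LO': 2}
tier2_rank = {'B': 0, 'E': 1}

output = (df.assign(r1=df['Tier1'].map(tier1_rank),
                    r2=df['Tier2'].map(tier2_rank))
            .sort_values(['ID', 'r1', 'r2'])
            .drop_duplicates(subset='ID', keep='first')
            .drop(columns=['r1', 'r2'])
            .reset_index(drop=True))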

merging varying number of rows and columns by multiple conditions in python

Updated problem: why does it not merge a_date, a_par, a_cons, a_ment and a_le? These are appended as columns without values, but in the original dataset they have values.
Here is how the dataset looks:
connector type q_text a_text var1 var2
1 1111 1 aa None xx ps
2 9999 2 None tt jjjj pppp
3 1111 2 None uu None oo
4 9999 1 bb None yy Rt
5 9999 1 cc None zz tR
Goal: how the dataset should look:
connector q_text a_text var1 var1.1 var2 var2.1
1 1111 aa uu xx None ps oo
2 9999 bb tt yy jjjj Rt pppp
3 9999 cc tt zz jjjj tR pppp
Logic: Column type has either value 1 or 2; multiple rows have value 1 but only one row (with the same value in connector) has value 2.
Here are the main merging rules:
Merge every row of type=1 with its corresponding (connector) type=2 row.
Since multiple rows of type=1 have the same connector value, I don't want to merge solely one row of type=1 but all of them, each with the sole type==2 row.
Since some columns (e.g. a_text) follow left-join logic, values can be overridden without adding an extra column.
Since var2 values cannot be merged by left-join, because they are non-exclusionary with regard to the row's connector value, I want to have extra columns (var1.1, var2.1) for those values (pppp, jjjj).
In summary (keeping in mind that I only speak of rows that have the same connector value): if q_text is None, I first want to replace the values in a_text with the a_text value (see tt and uu in the table above) of the corresponding row (same connector value), and secondly I want to append some other values (var1 and var2) of that very same corresponding row as new columns.
Also, there are rows with a unique connector value that is not going to be matched. I want to keep those rows though.
I only want to "drop" the type=2 rows that get merged with their corresponding type=1 row(s). In other words: I don't want to keep the rows of type=2 that have a match and get merged into their corresponding (connector) type=1 rows. I want to keep all other rows though.
Solution by @victor__von__doom here:
merging varying number of rows by multiple conditions in python
was answered when I originally wanted to keep all of the type=2 columns (values).
Code I used (it merges Perso, q_text and a_text):
df.loc[df['type'] == 2, 'a_date'] = df['q_date']
df.loc[df['type'] == 2, 'a_par'] = df['par']
df.loc[df['type'] == 2, 'a_cons'] = df['cons']
df.loc[df['type'] == 2, 'a_ment'] = df['pret']
df.loc[df['type'] == 2, 'a_le'] = df['q_le']
my_cols = ['Perso', 'q_text','a_text', 'a_le', 'q_le', 'q_date', 'par', 'cons', 'pret', 'q_le', 'a_date','a_par', 'a_cons', 'a_ment', 'a_le']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['a_text', 'Perso'],inplace=True)
df.reset_index(drop=True,inplace=True)
Data: This is a representation of the core dataset. Unfortunately I cannot share the actual data due to privacy laws.
Perso | ID | per | q_le | a_le | pret | par | form | q_date | name | IO_ID | part | area | q_text | a_text | country | cons | dig | connector | type
J Ws | 1-1/4/2001-11-12/1 | 1999-2009 | None | 4325 | 'Mi, h', 'd' | Cew | Thre | 2001-11-12 | None | 345 | rede | s — H | None | wr ede | Terd e | e r | 2001-11-12.1.g9 | 999999999 | 2
S ts | 9-3/6/2003-10-14/1 | 1994-2004 | None | 23 | 'sd, h' | d-g | Thre | 2003-10-14 | None | 34555 | The | l? I | None | Tre | Thr ede | re | 2001-04-16.1.a9 | 333333333 | 2
On d | 6-1/6/2005-09-03/1 | 1992-2006 | None | 434 | 'uu h' | d-g | Thre | 2005-09-03 | None | 7313 | Thde | l? I | None | T e | Th rede | dre | 2001-08-07.1.e4 | 111111111 | 2
None | 3-4/4/2000-07-07/1 | 1992-2006 | 1223 | None | 'uu h' | dfs | Thre | 2000-07-07 | Th r | 7413 | Thde | Tddde | Thd de | None | Thre de | 2001-07-06.1.j3 | 111111111 | 1
None | 2-1/6/2001-11-12/1 | 1999-2009 | 1444 | None | 'Mi, h', 'd' | d-g | Thre | 2001-11-12 | T rj | 7431 | Thde | l? I | Th dde | None | Thr ede | 2001-11-12.1.s7 | 999999999 | 1
None | 1-6/4/2007-11-01/1 | 1993-2010 | 2353 | None | None | d-g | Thre | 2007-11-01 | Thrj | 444 | Thed | l. I | Tgg gg | None | Thre de | we e | 2001-06-11.1.g9 | 654982984 | 1
EDIT v2 with additional columns
This version ensures the values in the additional columns are not impacted.
c = ['connector','type','q_text','a_text','var1','var2','cumsum','country','others']
d = [[1111, 1, 'aa', None, 'xx', 'ps', 0, 'US', 'other values'],
[9999, 2, None, 'tt', 'jjjj', 'pppp', 0, 'UK', 'no values'],
[1111, 2, None, 'uu', None, 'oo', 1, 'US', 'some values'],
[9999, 1, 'bb', None, 'yy', 'Rt', 1, 'UK', 'more values'],
[9999, 1, 'cc', None, 'zz', 'tR', 2, 'UK', 'less values']]
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.DataFrame(d,columns=c)
print (df)
df.loc[df['type'] == 2, 'var1.1'] = df['var1']
df.loc[df['type'] == 2, 'var2.1'] = df['var2']
my_cols = ['q_text','a_text','var1','var2','var1.1','var2.1']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['q_text'],inplace=True)
df.reset_index(drop=True,inplace=True)
print (df)
Original DataFrame:
connector type q_text a_text var1 var2 cumsum country others
0 1111 1 aa None xx ps 0 US other values
1 9999 2 None tt jjjj pppp 0 UK no values
2 1111 2 None uu None oo 1 US some values
3 9999 1 bb None yy Rt 1 UK more values
4 9999 1 cc None zz tR 2 UK less values
Updated DataFrame
connector type q_text a_text var1 var2 cumsum country others var1.1 var2.1
0 1111 1 aa uu xx ps 0 US other values None oo
1 9999 1 bb tt yy Rt 1 UK more values jjjj pppp
2 9999 1 cc tt zz tR 2 UK less values jjjj pppp

merging varying number of rows by multiple conditions in python

Problem: merging varying number of rows by multiple conditions
Here is a stylized example of how the dataset looks:
"index" "connector" "type" "q_text" "a_text" "varx" ...
1 1111 1 aa NA xx
2 9999 2 NA tt NA
3 1111 2 NA uu NA
4 9999 1 bb NA yy
5 9999 1 cc NA zz
Goal: how the dataset should look:
"index" "connector" "type" "type.1" "q_text" "q_text.1" "a_text" "a_text.1 " "varx" "varx.1" ...
1 1111 1 2 aa NA NA uu xx NA
2 9999 1 2 bb NA NA tt yy NA
3 9999 1 2 cc NA NA tt zz NA
Logic: Column "type" has either value 1 or 2 while multiple rows have value 1 but only one row (with the same value in "connector") has value 2
If rows share the same value in "connector", then merge the rows of "type"=2 with the rows of "type"=1; but (because multiple rows of "type"=1 have the same value in "connector") duplicate the corresponding rows of type=2 and merge them with all of the other rows that also have the same value in "connector" and are of "type"=1.
My results: not all rows are merged, because multiple rows of "type"=1 are associated with unique rows of "type"=2.
Most similar questions are answered using SQL, which I cannot use here.
df2 = df.copy()
df.index.astype(str)
df2.index.astype(str)
pd.merge(df,df2, how='left', on='connector',right_index=True, left_index=True)
df3 = pd.merge(df.set_index('connector'),df2.set_index('connector'), right_index=True, left_index=True).reset_index()
dfNew = df.merge(df2, how='left', left_on=['connector'], right_on = ['connector'])
Can I achieve my goal with groupby()?
Solution by @victor__von__doom
if __name__ == '__main__':
    df = df.groupby('connector', sort=True).apply(lambda c: list(zip(*c.values[:, 2:].tolist()))).reset_index(name='merged')
    df[['here', 'are', 'all', 'columns', 'except', 'for', 'the', 'connector', 'column']] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
First off, it is really messy to just keep concatenating new columns onto your original DataFrame when rows are merged, especially when the number of columns is very large. Furthermore, if you end up merging 3 rows for 1 connector value and 4 rows for another (for example), the only way to include all values is to make empty columns for some rows, which is never a good idea. Instead, I've made it so that the merged rows get combined into tuples, which can then be parsed efficiently while keeping the size of your DataFrame manageable:
import numpy as np
import pandas as pd

if __name__ == '__main__':
    data = np.array([[1, 2, 3, 4, 5],
                     [1111, 9999, 1111, 9999, 9999],
                     [1, 2, 2, 1, 1],
                     ['aa', 'NA', 'NA', 'bb', 'cc'],
                     ['NA', 'tt', 'uu', 'NA', 'NA'],
                     ['xx', 'NA', 'NA', 'yy', 'zz']])
    df = pd.DataFrame(data.T, columns=["index", "connector", "type", "q_text", "a_text", "varx"])
    df = df.groupby("connector", sort=True).apply(lambda c: list(zip(*c.values[:, 2:].tolist()))).reset_index(name='merged')
    df[["type", "q_text", "a_text", "varx"]] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
The final DataFrame looks like:
connector type q_text a_text varx ...
0 1111 (1, 2) (aa, NA) (NA, uu) (xx, NA) ...
1 9999 (2, 1, 1) (NA, bb, cc) (tt, NA, NA) (NA, yy, zz) ...
Which is much more compact and readable.
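If the wide layout from the question (type, type.1, q_text, q_text.1, ...) is still needed afterwards, the tuples can be expanded back into suffixed columns; a hedged sketch, in stored group order rather than re-sorted by type:
import pandas as pd

# Expand each tuple column into numbered columns (col, col.1, col.2, ...)
wide = df.copy()
for col in ["type", "q_text", "a_text", "varx"]:
    expanded = pd.DataFrame(wide[col].tolist(), index=wide.index)
    expanded.columns = [col if i == 0 else f"{col}.{i}" for i in expanded.columns]
    wide = wide.drop(columns=col).join(expanded)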

Pandas - Groupby Company and drop rows according to criteria based off the Dates of values being out of order

I have a history data log and want to calculate the number of days between progress stages by Company (the timestamp of the earlier stage must be smaller than that of the later stage).
Company Progress Time
AAA 3. Contract 07/10/2020
AAA 2. Discuss 03/09/2020
AAA 1. Start 02/02/2020
BBB 3. Contract 11/13/2019
BBB 3. Contract 07/01/2019
BBB 1. Start 06/22/2019
BBB 2. Discuss 04/15/2019
CCC 3. Contract 05/19/2020
CCC 2. Discuss 04/08/2020
CCC 2. Discuss 03/12/2020
CCC 1. Start 01/01/2020
Expected outputs:
Progress (1. Start --> 2. Discuss)
Company Progress Time
AAA 1. Start 02/02/2020
AAA 2. Discuss 03/09/2020
CCC 1. Start 01/01/2020
CCC 2. Discuss 03/12/2020
Progress (2. Discuss --> 3. Contract)
Company Progress Time
AAA 2. Discuss 03/09/2020
AAA 3. Contract 07/10/2020
CCC 2. Discuss 03/12/2020
CCC 3. Contract 05/19/2020
I did try some crude ways to do the work, but I still need to manually filter in Excel; below is my code:
df_stage1_stage2 = df[(df['Progress']=='1. Start')|(df['Progress']=='2. Discuss ')]
pd.pivot_table(df_stage1_stage2 ,index=['Company','Progress'],aggfunc={'Time':min})
Can anyone help with the problem? thanks
Create some masks to filter out the relevant rows. m1 and m2 filter out groups where 1. Start is not the "first" datetime when looking at it in reverse order (since your dates are sorted by Company ascending and date descending). You can create more masks if you need to also check that 2. Discuss and 3. Contract are in order, instead of the current logic which only checks that 1. Start is in order. But with the data you provided this returns the correct output:
m1 = df.groupby('Company')['Progress'].transform('last')
m2 = np.where((m1 == '1. Start'), 'drop', 'keep')
df = df[m2=='drop']
df
intermediate output:
Company Progress Time
0 AAA 3. Contract 07/10/2020
1 AAA 2. Discuss 03/09/2020
2 AAA 1. Start 02/02/2020
7 CCC 3. Contract 05/19/2020
8 CCC 2. Discuss 04/08/2020
9 CCC 2. Discuss 03/12/2020
10 CCC 1. Start 01/01/2020
From there, filter as you have indicated by sorting and dropping duplicates based on a subset of the first two columns, keeping the 'first' duplicate:
final df1 and df2 output:
df1
df1 = df[df['Progress'] != '3. Contract'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')
df1 output:
Company Progress Time
2 AAA 1. Start 02/02/2020
1 AAA 2. Discuss 03/09/2020
10 CCC 1. Start 01/01/2020
9 CCC 2. Discuss 03/12/2020
df2
df2 = df[df['Progress'] != '1. Start'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')
df2 output:
Company Progress Time
1 AAA 2. Discuss 03/09/2020
0 AAA 3. Contract 07/10/2020
9 CCC 2. Discuss 03/12/2020
7 CCC 3. Contract 05/19/2020
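Since the original goal was the number of days between the stages, a hedged follow-up on df1 (the same idea applies to df2), not part of the answer above:
import pandas as pd

# Days between '1. Start' and '2. Discuss' per Company; the Start rows get NaN
df1 = df1.assign(Time=pd.to_datetime(df1['Time']))
df1['days_delta'] = (df1.sort_values(['Company', 'Time'])
                        .groupby('Company')['Time']
                        .diff()
                        .dt.days)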
Something like this could work, assuming an already sorted df:
(full example)
import numpy as np
import pandas as pd

data = {
'Company':['AAA', 'AAA', 'AAA', 'BBB','BBB','BBB','BBB','CCC','CCC','CCC','CCC',],
'Progress':['3. Contract', '2. Discuss', '1. Start', '3. Contract', '3. Contract', '2. Discuss', '1. Start', '3. Contract', '2. Discuss', '2. Discuss', '1. Start', ],
'Time':['07-10-2020','03-09-2020','02-02-2020','11-13-2019','07-01-2019','06-22-2019','04-15-2019','05-19-2020','04-08-2020','03-12-2020','01-01-2020',],
}
df = pd.DataFrame(data)
df['Time'] = pd.to_datetime(df['Time'])
# We want to measure from the first occurrence (last date) if duplicated:
df.drop_duplicates(subset=['Company', 'Progress'], keep='first', inplace=True)
# Except for the rows of 'start', calculate the difference in days
df['days_delta'] = np.where((df['Progress'] != '1. Start'), df.Time.diff(-1), 0)
Output:
Company Progress Time days_delta
0 AAA 3. Contract 2020-07-10 123 days
1 AAA 2. Discuss 2020-03-09 36 days
2 AAA 1. Start 2020-02-02 0 days
3 BBB 3. Contract 2019-11-13 144 days
5 BBB 2. Discuss 2019-06-22 68 days
6 BBB 1. Start 2019-04-15 0 days
7 CCC 3. Contract 2020-05-19 41 days
8 CCC 2. Discuss 2020-04-08 98 days
10 CCC 1. Start 2020-01-01 0 days
If you do not want the 'days' word in output use:
df['days_delta'] = df['days_delta'].dt.days
First Problem
#Coerce Time to Datetime
df['Time']=pd.to_datetime(df['Time'])
#groupby().nth() to slice the consecutive order
df2=(df.merge(df.groupby(['Company'])['Time'].nth([-2,-1]))).sort_values(by=['Company','Time'], ascending=[True, True])
#Apply the universal rule for this problem: after groupby().nth(), drop any group with duplicates
df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique')==1].Company.values)]
#Calculate the diff() in Time in each group
df2['diff'] = df2.sort_values(by='Progress').groupby('Company')['Time'].diff().dt.days.fillna(0)#.groupby('Company')['Time'].diff() / np.timedelta64(1, 'D')
#Filter out the groups where start and Discuss Time are in conflict
df2[~df2.Company.isin(df2.loc[df2['diff']<0, 'Company'].unique())]
Company Progress Time diff
1 AAA 1.Start 2020-02-02 0.0
0 AAA 2.Discuss 2020-03-09 36.0
5 CCC 1.Start 2020-01-01 0.0
4 CCC 2.Discuss 2020-03-12 71.0
Second Problem
#groupby().nth() to slice the right consecutive groups
df2=(df.merge(df.groupby(['Company'])['Time'].nth([0,1]))).sort_values(by=['Company','Time'], ascending=[True, True])
#Drop any groups after grouping that have duplicates
df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique')==1].Company.values)]
Company Progress Time
1 AAA 2.Discuss 2020-03-09
0 AAA 3.Contract 2020-07-10
5 CCC 2.Discuss 2020-04-08
4 CCC 3.Contract 2020-05-19
