pandas dataframe groups check the number of unique values of a column is one but exclude empty strings - python-3.x

I have the following df,
id invoice_no
1 6636
1 6637
2 6639
2 6639
3
3
4 6635
4 6635
4 6635
The invoice_no values for id 3 are all empty strings or spaces; I want to do
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
but also treat the spaces and empty-string invoice_no values in each group as same_invoice_no = False; I am wondering how to do that. The result should look like,
id invoice_no same_invoice_no
1 6636 False
1 6637 False
2 6639 True
2 6639 True
3 False
3 False
4 6635 True
4 6635 True
4 6635 True

Empty strings are counted by nunique, so a group of empty strings still evaluates to True, but NaNs are excluded from the count. Replace the empty strings with NumPy NaN:
import numpy as np
df.replace('', np.nan, inplace=True)
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
id invoice_no same_invoice_no
0 1 6636.0 False
1 1 6637.0 False
2 2 6639.0 True
3 2 6639.0 True
4 3 NaN False
5 3 NaN False
6 4 6635.0 True
7 4 6635.0 True
8 4 6635.0 True
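Note that replace('', np.nan) only catches truly empty strings, while the question also mentions values that are spaces. A regex replace handles both cases; this is a sketch assuming invoice_no is stored as strings:
import numpy as np
# Treat empty and whitespace-only strings as missing, so nunique() excludes them
df['invoice_no'] = df['invoice_no'].replace(r'^\s*$', np.nan, regex=True)
df['same_invoice_no'] = df.groupby('id')['invoice_no'].transform('nunique') == 1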

Related

Checking for specific value change between columns in pandas

I've got 4 columns with numeric values between 1 and 4, and I'm trying to see which rows change from a value of 1 to a value of 4 progressing from column a to column d within those 4 columns. Currently I'm pulling the difference between each of the columns and looking for a value of 3. Is there a better way to do this?
Here's what I'm looking for (with 0's in place of nan):
ID a b c d check
1 1 0 1 4 True
2 1 0 1 1 False
3 1 1 1 4 True
4 1 3 3 4 True
5 0 0 1 4 True
6 1 2 3 3 False
7 1 0 0 4 True
8 1 4 4 4 True
9 1 4 3 4 True
10 1 4 1 1 True
You can just use cummax:
col = ['a','b','c','d']
s = df[col].cummax(axis=1)
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
Out[523]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
dtype: bool
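To see why this works: cummax carries the running row maximum from left to right, so a row qualifies when the running max is still 1 somewhere in a..c (the row started at 1) and has reached 4 by column d. A minimal self-contained sketch, assuming the ten rows from the question with 0 in place of NaN:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'b': [0, 0, 1, 3, 0, 2, 0, 4, 4, 4],
                   'c': [1, 1, 1, 3, 1, 3, 0, 4, 3, 1],
                   'd': [4, 1, 4, 4, 4, 3, 4, 4, 4, 1]})

col = ['a', 'b', 'c', 'd']
s = df[col].cummax(axis=1)                 # running maximum across each row
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
print(df['new'])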
You can also try comparing the indices of 4 and 1 inside apply:
cols = ['a', 'b', 'c', 'd']
def get_index(lst, num):
return lst.index(num) if num in lst else -1
df['Check'] = df[cols].apply(lambda row: get_index(row.tolist(), 4) > get_index(row.tolist(), 1), axis=1)
print(df)
ID a b c d check Check
0 1 1 0 1 4 True True
1 2 1 0 1 1 False False
2 3 1 1 1 4 True True
3 4 1 3 3 4 True True
4 5 0 0 1 4 True True
5 6 1 2 3 3 False False
6 7 1 0 0 4 True True
7 8 1 4 4 4 True True
8 9 1 4 3 4 True True
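One caveat, under the assumptions of this approach: get_index returns -1 for a missing value, so a row containing a 4 but no 1 still comes out True, and list.index finds only the first occurrence of each value. A quick check of that edge case:
# 4 present but 1 absent: index(4) >= 0 beats the -1 sentinel, giving True
print(get_index([2, 3, 4, 2], 4) > get_index([2, 3, 4, 2], 1))   # True
# neither present: -1 > -1 is False
print(get_index([2, 3, 2, 2], 4) > get_index([2, 3, 2, 2], 1))   # False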

pandas drop rows based on condition on groupby

I have a DataFrame like the one below.
I am trying to group by the cell column and drop the NaN values where the group size is > 1.
Required output:
How do I get my expected output? How do I filter on a condition and drop rows in a groupby statement?
From your DataFrame, first we group by cell to get the size of each group:
>>> df_grouped = df.groupby(['cell'], as_index=False).size()
>>> df_grouped
cell size
0 A 3
1 B 1
2 D 3
Then, we merge the result with the original DataFrame like so:
>>> df_merged = pd.merge(df, df_grouped, on='cell', how='left')
>>> df_merged
cell value kpi size
0 A 5.0 thpt 3
1 A 6.0 ret 3
2 A NaN thpt 3
3 B NaN acc 1
4 D 8.0 int 3
5 D NaN ps 3
6 D NaN yret 3
To finish, we filter the DataFrame to get the expected result:
>>> df_filtered = df_merged[~((df_merged['value'].isna()) & (df_merged['size'] > 1))]
>>> df_filtered[['cell', 'value', 'kpi']]
cell value kpi
0 A 5.0 thpt
1 A 6.0 ret
3 B NaN acc
4 D 8.0 int
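The intermediate merge can also be avoided by computing each group's size in place with transform; this is a sketch assuming the same column names:
# Drop rows whose value is NaN when their cell group has more than one row
mask = ~(df['value'].isna() & df.groupby('cell')['value'].transform('size').gt(1))
df_filtered = df[mask]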
Use a boolean mask:
>>> df[df.groupby('cell').cumcount().eq(0) | df['value'].notna()]
cell value kpi
0 A crud thpt
1 A 6 ret
3 B NaN acc
4 D hi int
Details:
m1 = df.groupby('cell').cumcount().eq(0)
m2 = df['value'].notna()
df.assign(keep_at_least_one=m1, keep_notna=m2, keep_rows=m1|m2)
# Output:
cell value kpi keep_at_least_one keep_notna keep_rows
0 A crud thpt True True True
1 A 6 ret False True True
2 A NaN thpt False False False
3 B NaN acc True False True
4 D hi int True True True
5 D NaN ps False False False
6 D NaN yret False False False
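Note a subtle difference between the two masks: cumcount().eq(0) keeps the first row of every group unconditionally, so a NaN that happened to be the first row of a multi-row group would survive, which the size-based filter above would drop. If that matters, tying the keep condition to the group size avoids it; a sketch with the same column names:
# Keep rows whose value is present, or whose group has only one row
df[df['value'].notna() | df.groupby('cell')['value'].transform('size').eq(1)]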

Counting True or False

I have the following dataframe:
True_False
2018-01-02 True
2018-01-03 True
2018-01-04 False
2018-01-05 False
2018-01-08 False
... ...
2020-01-20 True
2020-01-21 True
2020-01-22 True
2020-01-23 True
2020-01-24 False
504 rows × 1 columns
I want to know how many successive True or False values there are, not the total; the count must restart whenever the value toggles between True and False. From that I eventually want to calculate the mean(), max() and min() in days. Is it possible to show this data in pandas?
Solution if all datetimes are consecutive:
You can create helper Series for consecutive groups by Series.shift and Series.cumsum, then get counts by GroupBy.size:
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s)
True_False True_False
False 2 3
4 1
True 1 2
3 4
dtype: int64
And last, aggregate min, max and mean per the first level of the MultiIndex:
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 2 3 1
True 3 4 2
If the datetimes are not consecutive, the first step is DataFrame.asfreq:
df = df.asfreq('d')
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 1.333333 2 1
True 3.000000 4 2
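A minimal end-to-end sketch with made-up consecutive dates, to show the intermediate grouper g and the final aggregation:
import pandas as pd

idx = pd.date_range('2018-01-01', periods=8, freq='D')
df = pd.DataFrame({'True_False': [True, True, False, False, False, True, True, False]}, index=idx)

g = df['True_False'].ne(df['True_False'].shift()).cumsum()   # new group id at each toggle
s = df.groupby(['True_False', g]).size()                     # length of every run
print(s.groupby(level=0).agg(['mean', 'max', 'min']))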

Get date from most recent ID with corresponding Boolean-Updated2x

I am trying to find the date of the most recent row with the same ID that has a corresponding True value.
I've utilized df.id.rolling to locate my desired duplicates in my date range window. I just need to identify how far the duplicates are from the most recent occurrence of duplicates.
This is what my starting df looks like
df_input:
date id duplicate
1/10/18 1 true
1/12/18 2 true
1/20/18 1 false
1/31/18 1 false
This is what I'm trying to get to:
df_output:
date id duplicate most_recent
1/10/18 1 true Nan
1/12/18 2 true Nan
1/20/18 1 false 1/10/18
1/31/18 1 false 1/10/18
Any tips are helpful!
Edited:
Thanks for the tip, but this doesn't seem to find the most recent instance, only the first instance in a series. This returns the first event:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 1/10/18
I'm looking for:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 2/1/18
Thanks for the help; I don't think I fully realized or explained my problem.
Updated:
The code provided works, so maybe I should re-post, but I need to find the most recent date and append it as a column, and then find it again based on conditions laid out in an if statement inside a for loop. See the code example below.
list2 = []
df.loc[~df.duplicates, 'most_recent'] = df['date'].where(df.duplicates).groupby(df['id']).ffill()
for index, row in df.iterrows():
    dup = row['duplicates']
    date = row['date']
    ndate = row['most_recent']
    d1 = date - ndate
    if d1 > timedelta(days=14):
        x = True
    if x == True:
        list2.append(x)
    else:
        list2.append(dup)
df.loc[~df.duplicates, 'most_recent'] = df['date'].where(df.duplicates).groupby(df['id']).ffill()
Example output:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 2/1/18
Some code
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 True 2/1/18
What I would do, using ffill:
df.loc[~df.duplicate,'most_recent']=df['date'].where(df.duplicate).groupby(df['id']).ffill()
df
Out[740]:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
Use the transform function for your code:
df.loc[df.duplicate, 'column_name_you_are_looking_for'] = df.groupby('id').date.transform('first')
df
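For reference, a sketch applying the ffill answer to the updated six-row example, assuming date has been parsed with pd.to_datetime and the flag column is named duplicate:
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['1/10/18', '1/12/18', '1/20/18',
                                           '1/31/18', '2/1/18', '2/8/18']),
                   'id': [1, 2, 1, 1, 1, 1],
                   'duplicate': [True, True, False, False, True, False]})

# Keep only the dates of True rows, then forward-fill them within each id
df.loc[~df.duplicate, 'most_recent'] = df['date'].where(df.duplicate).groupby(df['id']).ffill()
print(df)   # row 5 picks up 2/1/18, the most recent True date for id 1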

How to delete the entire row if any of its value is 0 in pandas

In the example below I only want to retain rows 1 and 2.
I want to delete every row that has a 0 anywhere across its columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
the output should read like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it deletes a row only if all of its columns are 0.
You are really close; you need DataFrame.all to check that all values in a row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
An alternative solution uses any to check for at least one True per row, with the mask changed to df == 0 and inverted by ~:
df = df.loc[~(df==0).any(axis=1)]
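The two masks are equivalent by De Morgan's law: "every value in the row is nonzero" is the negation of "some value in the row is zero". A quick check on the sample data:
import pandas as pd

df = pd.DataFrame({'kt': [1, 2, 3, 0, 5], 'b': [1, 2, 3, 4, 5],
                   'tt': [1, 2, 0, 0, 5], 'mky': [1, 2, 3, 0, 5],
                   'depth': [4, 2, 3, 0, 0]})
assert ((df != 0).all(axis=1) == ~(df == 0).any(axis=1)).all()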
