pandas dataframe groups check the number of unique values of a column is one but exclude empty strings - python-3.x

I have the following df,
id invoice_no
1 6636
1 6637
2 6639
2 6639
3
3
4 6635
4 6635
4 6635
The invoice_no values for id 3 are all empty strings or spaces; I want to do
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
but also treat the spaces and empty-string invoice_no values in each group as same_invoice_no = False; I am wondering how to do that. The result should look like,
id invoice_no same_invoice_no
1 6636 False
1 6637 False
2 6639 True
2 6639 True
3 False
3 False
4 6635 True
4 6635 True
4 6635 True

Empty strings are counted by nunique, so a group of empty strings still evaluates to True, but NaNs are excluded from the count. Replace the empty strings with NumPy NaN:
import numpy as np
df.replace('', np.nan, inplace=True)
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
id invoice_no same_invoice_no
0 1 6636.0 False
1 1 6637.0 False
2 2 6639.0 True
3 2 6639.0 True
4 3 NaN False
5 3 NaN False
6 4 6635.0 True
7 4 6635.0 True
8 4 6635.0 True
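Note that replace('', np.nan) only catches truly empty strings, while the question also mentions values that are spaces. A regex replace handles both cases; this is a sketch assuming invoice_no is stored as strings:
import numpy as np
# Treat empty and whitespace-only strings as missing, so nunique() excludes them
df['invoice_no'] = df['invoice_no'].replace(r'^\s*$', np.nan, regex=True)
df['same_invoice_no'] = df.groupby('id')['invoice_no'].transform('nunique') == 1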

Related

Checking for specific value change between columns in pandas

I've got 4 columns with numeric values between 1 and 4, and I'm trying to see which rows change from a value of 1 to a value of 4 progressing from column a to column d within those 4 columns. Currently I'm pulling the difference between each of the columns and looking for a value of 3. Is there a better way to do this?
Here's what I'm looking for (with 0's in place of nan):
ID a b c d check
1 1 0 1 4 True
2 1 0 1 1 False
3 1 1 1 4 True
4 1 3 3 4 True
5 0 0 1 4 True
6 1 2 3 3 False
7 1 0 0 4 True
8 1 4 4 4 True
9 1 4 3 4 True
10 1 4 1 1 True
You can just use cummax:
col = ['a','b','c','d']
s = df[col].cummax(axis=1)
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
Out[523]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
dtype: bool
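To see why this works: cummax carries the running row maximum from left to right, so a row qualifies when the running max is still 1 somewhere in a..c (the row started at 1) and has reached 4 by column d. A minimal self-contained sketch, assuming the ten rows from the question with 0 in place of NaN:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'b': [0, 0, 1, 3, 0, 2, 0, 4, 4, 4],
                   'c': [1, 1, 1, 3, 1, 3, 0, 4, 3, 1],
                   'd': [4, 1, 4, 4, 4, 3, 4, 4, 4, 1]})

col = ['a', 'b', 'c', 'd']
s = df[col].cummax(axis=1)                 # running maximum across each row
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
print(df['new'])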
You can also try comparing the indices of 4 and 1 inside apply:
cols = ['a', 'b', 'c', 'd']
def get_index(lst, num):
return lst.index(num) if num in lst else -1
df['Check'] = df[cols].apply(lambda row: get_index(row.tolist(), 4) > get_index(row.tolist(), 1), axis=1)
print(df)
ID a b c d check Check
0 1 1 0 1 4 True True
1 2 1 0 1 1 False False
2 3 1 1 1 4 True True
3 4 1 3 3 4 True True
4 5 0 0 1 4 True True
5 6 1 2 3 3 False False
6 7 1 0 0 4 True True
7 8 1 4 4 4 True True
8 9 1 4 3 4 True True
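One caveat, under the assumptions of this approach: get_index returns -1 for a missing value, so a row containing a 4 but no 1 still comes out True, and list.index finds only the first occurrence of each value. A quick check of that edge case:
# 4 present but 1 absent: index(4) >= 0 beats the -1 sentinel, giving True
print(get_index([2, 3, 4, 2], 4) > get_index([2, 3, 4, 2], 1))   # True
# neither present: -1 > -1 is False
print(get_index([2, 3, 2, 2], 4) > get_index([2, 3, 2, 2], 1))   # False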

pandas drop rows based on condition on groupby

I have a DataFrame like the one below.
I am trying to group by the cell column and drop the NaN values where the group size is > 1.
Required output:
How do I get my expected output? How do I filter on a condition and drop rows in a groupby statement?
From your DataFrame, first we group by cell to get the size of each group:
>>> df_grouped = df.groupby(['cell'], as_index=False).size()
>>> df_grouped
cell size
0 A 3
1 B 1
2 D 3
Then, we merge the result with the original DataFrame like so:
>>> df_merged = pd.merge(df, df_grouped, on='cell', how='left')
>>> df_merged
cell value kpi size
0 A 5.0 thpt 3
1 A 6.0 ret 3
2 A NaN thpt 3
3 B NaN acc 1
4 D 8.0 int 3
5 D NaN ps 3
6 D NaN yret 3
To finish, we filter the DataFrame to get the expected result:
>>> df_filtered = df_merged[~((df_merged['value'].isna()) & (df_merged['size'] > 1))]
>>> df_filtered[['cell', 'value', 'kpi']]
cell value kpi
0 A 5.0 thpt
1 A 6.0 ret
3 B NaN acc
4 D 8.0 int
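The intermediate merge can also be avoided by computing each group's size in place with transform; this is a sketch assuming the same column names:
# Drop rows whose value is NaN when their cell group has more than one row
mask = ~(df['value'].isna() & df.groupby('cell')['value'].transform('size').gt(1))
df_filtered = df[mask]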
Use a boolean mask:
>>> df[df.groupby('cell').cumcount().eq(0) | df['value'].notna()]
cell value kpi
0 A crud thpt
1 A 6 ret
3 B NaN acc
4 D hi int
Details:
m1 = df.groupby('cell').cumcount().eq(0)
m2 = df['value'].notna()
df.assign(keep_at_least_one=m1, keep_notna=m2, keep_rows=m1|m2)
# Output:
cell value kpi keep_at_least_one keep_notna keep_rows
0 A crud thpt True True True
1 A 6 ret False True True
2 A NaN thpt False False False
3 B NaN acc True False True
4 D hi int True True True
5 D NaN ps False False False
6 D NaN yret False False False
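Note a subtle difference between the two masks: cumcount().eq(0) keeps the first row of every group unconditionally, so a NaN that happened to be the first row of a multi-row group would survive, which the size-based filter above would drop. If that matters, tying the keep condition to the group size avoids it; a sketch with the same column names:
# Keep rows whose value is present, or whose group has only one row
df[df['value'].notna() | df.groupby('cell')['value'].transform('size').eq(1)]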

Counting True or False

I have the following dataframe:
True_False
2018-01-02 True
2018-01-03 True
2018-01-04 False
2018-01-05 False
2018-01-08 False
... ...
2020-01-20 True
2020-01-21 True
2020-01-22 True
2020-01-23 True
2020-01-24 False
504 rows × 1 columns
I want to know how many successive True or False values there are, not the total; the count must restart whenever the value toggles between True and False. From that I eventually want to calculate the mean(), max() and min() in days. Is it possible to show this data in pandas?
Solution if all datetimes are consecutive:
You can create helper Series for consecutive groups by Series.shift and Series.cumsum, then get counts by GroupBy.size:
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s)
True_False True_False
False 2 3
4 1
True 1 2
3 4
dtype: int64
And last, aggregate min, max and mean per the first level of the MultiIndex:
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 2 3 1
True 3 4 2
If the datetimes are not consecutive, the first step is DataFrame.asfreq:
df = df.asfreq('d')
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 1.333333 2 1
True 3.000000 4 2
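A minimal end-to-end sketch with made-up consecutive dates, to show the intermediate grouper g and the final aggregation:
import pandas as pd

idx = pd.date_range('2018-01-01', periods=8, freq='D')
df = pd.DataFrame({'True_False': [True, True, False, False, False, True, True, False]}, index=idx)

g = df['True_False'].ne(df['True_False'].shift()).cumsum()   # new group id at each toggle
s = df.groupby(['True_False', g]).size()                     # length of every run
print(s.groupby(level=0).agg(['mean', 'max', 'min']))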

Get date from most recent ID with corresponding Boolean-Updated2x

I am trying to find the date of the most recent row with the same ID that has a corresponding True value.
I've utilized df.id.rolling to locate my desired duplicates in my date range window. I just need to identify how far the duplicates are from the most recent occurrence of duplicates.
This is what my starting df looks like
df_input:
date id duplicate
1/10/18 1 true
1/12/18 2 true
1/20/18 1 false
1/31/18 1 false
This is what I'm trying to get to:
df_output:
date id duplicate most_recent
1/10/18 1 true Nan
1/12/18 2 true Nan
1/20/18 1 false 1/10/18
1/31/18 1 false 1/10/18
Any tips are helpful!
Edited:
Thanks for the tip, but this doesn't seem to find the most recent instance, only the first instance in a series. This returns the first event:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 1/10/18
I'm looking for:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 2/1/18
Thanks for the help; I don't think I fully realized or explained my problem.
Updated:
The code provided works, so maybe I should re-post, but I need to find the most recent date and append it as a column, and then find it again based on conditions laid out in an if statement inside a for loop. See the code example below.
list2 = []
df.loc[~df.duplicates, 'most_recent'] = df['date'].where(df.duplicates).groupby(df['id']).ffill()
for index, row in df.iterrows():
    dup = row['duplicates']
    date = row['date']
    ndate = row['most_recent']
    d1 = date - ndate
    if d1 > timedelta(days=14):
        x = True
    if x == True:
        list2.append(x)
    else:
        list2.append(dup)
df.loc[~df.duplicates, 'most_recent'] = df['date'].where(df.duplicates).groupby(df['id']).ffill()
Example output:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 2/1/18
Some code
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 True 2/1/18
What I would do, using ffill:
df.loc[~df.duplicate,'most_recent']=df['date'].where(df.duplicate).groupby(df['id']).ffill()
df
Out[740]:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
Use the transform function for your code:
df.loc[df.duplicate, 'column_name_you_are_looking_for'] = df.groupby('id').date.transform('first')
df
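For reference, a sketch applying the ffill answer to the updated six-row example, assuming date has been parsed with pd.to_datetime and the flag column is named duplicate:
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['1/10/18', '1/12/18', '1/20/18',
                                           '1/31/18', '2/1/18', '2/8/18']),
                   'id': [1, 2, 1, 1, 1, 1],
                   'duplicate': [True, True, False, False, True, False]})

# Keep only the dates of True rows, then forward-fill them within each id
df.loc[~df.duplicate, 'most_recent'] = df['date'].where(df.duplicate).groupby(df['id']).ffill()
print(df)   # row 5 picks up 2/1/18, the most recent True date for id 1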

How to delete the entire row if any of its value is 0 in pandas

In the example below I only want to retain rows 1 and 2.
I want to delete every row that has a 0 anywhere across its columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
the output should read like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it deletes a row only if all of its columns are 0.
You are really close; you need DataFrame.all to check that all values in a row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
An alternative solution uses any to check for at least one True per row, with the mask changed to df == 0 and inverted by ~:
df = df.loc[~(df==0).any(axis=1)]
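The two masks are equivalent by De Morgan's law: "every value in the row is nonzero" is the negation of "some value in the row is zero". A quick check on the sample data:
import pandas as pd

df = pd.DataFrame({'kt': [1, 2, 3, 0, 5], 'b': [1, 2, 3, 4, 5],
                   'tt': [1, 2, 0, 0, 5], 'mky': [1, 2, 3, 0, 5],
                   'depth': [4, 2, 3, 0, 0]})
assert ((df != 0).all(axis=1) == ~(df == 0).any(axis=1)).all()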
