Get date from most recent ID with corresponding Boolean (updated 2x) - python-3.x

I am trying to find the date of the most recent row with the same ID that has a True duplicate value.
I've used df.id.rolling to locate my desired duplicates inside my date-range window. I just need to identify how far each duplicate is from the most recent occurrence.
This is what my starting df looks like
df_input:
date id duplicate
1/10/18 1 true
1/12/18 2 true
1/20/18 1 false
1/31/18 1 false
This is what I'm trying to get to:
df_output:
date id duplicate most_recent
1/10/18 1 true NaN
1/12/18 2 true NaN
1/20/18 1 false 1/10/18
1/31/18 1 false 1/10/18
Any tips are helpful!
Edited: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Thanks for the tip, but this doesn't seem to find the most recent instance, only the first instance in a series. This returns the first event:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 1/10/18
I'm looking for:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 2/1/18
Thanks for the help; I don't think I fully realized or explained my problem.
Updated ~~~~~
The code provided works, so maybe I should re-post, but I need to find the most recent occurrence and append it as a column, then find it again based on conditions laid out in an if statement inside a for loop. See below for the code example:
from datetime import timedelta

# dates must be real datetimes for the subtraction below
df['date'] = pd.to_datetime(df['date'])

# forward-fill the date of the most recent True row within each id
df.loc[~df.duplicate, 'most_recent'] = df['date'].where(df.duplicate).groupby(df['id']).ffill()

list2 = []
for index, row in df.iterrows():
    dup = row['duplicate']
    date = row['date']
    ndate = row['most_recent']
    # rows with no most_recent (the True rows) keep their existing flag
    if pd.notna(ndate) and (date - ndate) > timedelta(days=14):
        list2.append(True)
    else:
        list2.append(dup)
df['duplicate'] = list2

# recompute most_recent now that the flags have changed
df.loc[~df.duplicate, 'most_recent'] = df['date'].where(df.duplicate).groupby(df['id']).ffill()
Example output:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 False 2/1/18
And the desired output once the loop has updated the duplicate flag:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
4 2/1/18 1 True NaN
5 2/8/18 1 True 2/1/18

What I would do, using ffill:
df.loc[~df.duplicate,'most_recent']=df['date'].where(df.duplicate).groupby(df['id']).ffill()
df
Out[740]:
date id duplicate most_recent
0 1/10/18 1 True NaN
1 1/12/18 2 True NaN
2 1/20/18 1 False 1/10/18
3 1/31/18 1 False 1/10/18
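The reason this returns the most recent occurrence rather than the first is that ffill carries the date of the last True row forward within each id group. A minimal sketch that reproduces the output above:
import pandas as pd

df = pd.DataFrame({'date': ['1/10/18', '1/12/18', '1/20/18', '1/31/18'],
                   'id': [1, 2, 1, 1],
                   'duplicate': [True, True, False, False]})
# keep the date only on True rows, then forward-fill within each id
df.loc[~df.duplicate, 'most_recent'] = df['date'].where(df.duplicate).groupby(df['id']).ffill()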

Use the transform function for your code:
df.loc[df.duplicate, 'column_name_you_are_looking_for'] = df.groupby('id').date.transform('first')
df
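For reference, transform('first') returns a Series aligned with the original frame, so each row receives its group's first date. On the example input it gives:
df.groupby('id').date.transform('first')
0    1/10/18
1    1/12/18
2    1/10/18
3    1/10/18
Name: date, dtype: object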

Related

Checking for specific value change between columns in pandas

I've got 4 columns with numeric values between 1 and 4, and I'm trying to see which rows change from a value of 1 to a value of 4 progressing from column a to column d. Currently I'm taking the difference between each pair of columns and looking for a value of 3. Is there a better way to do this?
Here's what I'm looking for (with 0's in place of NaN):
ID a b c d check
1 1 0 1 4 True
2 1 0 1 1 False
3 1 1 1 4 True
4 1 3 3 4 True
5 0 0 1 4 True
6 1 2 3 3 False
7 1 0 0 4 True
8 1 4 4 4 True
9 1 4 3 4 True
10 1 4 1 1 True
You can just do cummax:
col = ['a','b','c','d']
s = df[col].cummax(axis=1)
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
Out[523]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
dtype: bool
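For intuition, s is the row-wise running maximum, so once a 4 appears it persists through to column d, while a 1 still present in the first three cummax columns means the row started at 1. On the first three rows of the example:
print(s.head(3))
   a  b  c  d
0  1  1  1  4
1  1  1  1  1
2  1  1  1  4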
You can try comparing the positions of the first 4 and the first 1 in apply:
cols = ['a', 'b', 'c', 'd']
def get_index(lst, num):
    return lst.index(num) if num in lst else -1
df['Check'] = df[cols].apply(lambda row: get_index(row.tolist(), 4) > get_index(row.tolist(), 1), axis=1)
print(df)
ID a b c d check Check
0 1 1 0 1 4 True True
1 2 1 0 1 1 False False
2 3 1 1 1 4 True True
3 4 1 3 3 4 True True
4 5 0 0 1 4 True True
5 6 1 2 3 3 False False
6 7 1 0 0 4 True True
7 8 1 4 4 4 True True
8 9 1 4 3 4 True True
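If apply becomes slow on a large frame, the same first-position comparison can be sketched in NumPy (Check2 is a hypothetical column name):
import numpy as np

vals = df[cols].to_numpy()
# argmax on a boolean array gives the first True position; guard rows
# where the value never occurs with -1, mirroring get_index
first4 = np.where((vals == 4).any(axis=1), (vals == 4).argmax(axis=1), -1)
first1 = np.where((vals == 1).any(axis=1), (vals == 1).argmax(axis=1), -1)
df['Check2'] = first4 > first1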

Python how to count boolean in dataframe and compute percentage

I would like to do two manipulations with pandas but cannot manage to do them.
I have a dataframe that looks like this:
df = pd.DataFrame({'Name': {0: 'Eric', 1: 'Mattieu', 2: 'Eric', 3: 'Mattieu', 4: 'Mattieu', 5: 'Eric', 6: 'Mattieu', 7: 'Franck', 8: 'Franck', 9: 'Jack', 10: 'Jack'},
                   'Value': {0: False, 1: False, 2: True, 3: False, 4: True, 5: True, 6: False, 7: True, 8: True, 9: False, 10: False}})
df=df.sort_values(["Name"])
df
output:
Name Value
0 Eric False
2 Eric True
5 Eric True
7 Franck True
8 Franck True
9 Jack False
10 Jack False
1 Mattieu False
3 Mattieu False
4 Mattieu True
6 Mattieu False
Manipulation 1: I would like the number of True values, the number of False values, the total count, and the mean of True values for each Name, like this:
Name Nbr True Nbr False Total Value Mean (True/(False+True))
0 Eric 2 1 3 0.67
1 Franck 2 0 2 1.00
2 Jack 0 2 2 0.00
3 Mattieu 1 3 4 0.25
Manipulation 2: I would like the mean of the column "Mean (True/(False+True))" grouped by "Total Value", like this:
Group by Total Value Mean of grouped Total Value
0 2 0.50
1 3 0.67
2 4 0.25
Thanks in advance for your help
The first one can be done with crosstab:
s1 = pd.crosstab(df['Name'], df['Value'], margins=True).drop('All').assign(Mean = lambda x : x[True]/x['All'])
Out[266]:
Value False True All Mean
Name
Eric 1 2 3 0.666667
Franck 0 2 2 1.000000
Jack 2 0 2 0.000000
Mattieu 3 1 4 0.250000
The second one can be done with groupby:
s2 = s1.groupby('All').apply(lambda x : sum(x[True])/sum(x['All'])).reset_index(name='Mean of ALL')
Out[274]:
All Mean of ALL
0 2 0.500000
1 3 0.666667
2 4 0.250000
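The first summary can also be built with plain groupby named aggregation, which may read more naturally than crosstab; the ** unpacking is only needed because the chosen column names contain spaces:
s1_alt = (df.groupby('Name')['Value']
            .agg(**{'Nbr True': 'sum', 'Total Value': 'count', 'Mean': 'mean'})
            .reset_index())
# booleans sum as integers, so False counts fall out by subtraction
s1_alt['Nbr False'] = s1_alt['Total Value'] - s1_alt['Nbr True']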

Counting True or False

I have the following dataframe:
True_False
2018-01-02 True
2018-01-03 True
2018-01-04 False
2018-01-05 False
2018-01-08 False
... ...
2020-01-20 True
2020-01-21 True
2020-01-22 True
2020-01-23 True
2020-01-24 False
504 rows × 1 columns
I want to count successive True or False values, not the totals: the count must restart whenever the value toggles between True and False. From the run lengths I eventually want to calculate mean(), max() and min() days. Is it possible to produce this with pandas?
Solution if all datetimes are consecutive:
You can create a helper Series of consecutive groups with Series.shift and Series.cumsum, then get the counts with GroupBy.size:
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s)
True_False True_False
False 2 3
4 1
True 1 2
3 4
dtype: int64
And last, aggregate min, max and mean per the first level of the MultiIndex:
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 2 3 1
True 3 4 2
If the datetimes are not consecutive, the first step is DataFrame.asfreq:
df = df.asfreq('d')
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 1.333333 2 1
True 3.000000 4 2
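A minimal sketch of the consecutive-groups trick on toy data, to make the helper g concrete:
import pandas as pd

df = pd.DataFrame({'True_False': [True, True, False, False, False, True]},
                  index=pd.date_range('2018-01-02', periods=6))
# a new group id starts whenever the value differs from the previous row
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 2, 3]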

pandas dataframe groups check the number of unique values of a column is one but exclude empty strings

I have the following df,
id invoice_no
1 6636
1 6637
2 6639
2 6639
3
3
4 6635
4 6635
4 6635
The invoice_no values for id 3 are all empty strings or spaces. I want to use
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
but I also want groups whose invoice_no values are all spaces or empty strings to get same_invoice_no = False; I am wondering how to do that. The result should look like:
id invoice_no same_invoice_no
1 6636 False
1 6637 False
2 6639 True
2 6639 True
3 False
3 False
4 6635 True
4 6635 True
4 6635 True
Empty strings are counted by nunique, but NaNs aren't. Replace empty strings with NumPy NaN:
df.replace('', np.nan, inplace = True)
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
id invoice_no same_invoice_no
0 1 6636.0 False
1 1 6637.0 False
2 2 6639.0 True
3 2 6639.0 True
4 3 NaN False
5 3 NaN False
6 4 6635.0 True
7 4 6635.0 True
8 4 6635.0 True
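If you'd rather not overwrite invoice_no in place, the same idea works on a temporary Series; the regex form below also catches whitespace-only strings, which the question mentions:
import numpy as np

# blanks and whitespace-only strings become NaN for the computation only
cleaned = df['invoice_no'].replace(r'^\s*$', np.nan, regex=True)
df['same_invoice_no'] = cleaned.groupby(df['id']).transform('nunique').eq(1)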

How to delete the entire row if any of its value is 0 in pandas

In the example below I only want to retain rows 1 and 2;
I want to delete every row that has a 0 anywhere across its columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
the output should read like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it deletes a row only if all of its columns are 0.
You are really close; you need DataFrame.all to check that all values in a row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
Alternative solution with any, checking for at least one True per row on the changed mask df == 0 and then inverting with ~:
df = df.loc[~(df==0).any(axis=1)]
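Equivalently, DataFrame.ne expresses the same mask without a comparison operator:
df = df[df.ne(0).all(axis=1)]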
