pandas dataframe create a new column whose values are based on groupby sum on another column - python-3.x

I am trying to create a new column amount_0_flag for a df. Its values are based on a groupby over another column, key: if a group's amount sum is 0, assign True to amount_0_flag, otherwise False. The df looks like:
key amount amount_0_flag negative_amount
1 1.0 True False
1 1.0 True True
2 2.0 False True
2 3.0 False False
2 4.0 False False
So with df.groupby('key'), every row in the cluster with key=1 should get True for amount_0_flag, since within that cluster one element has amount -1 and the other has amount +1, which sum to 0.
df.groupby('key')['amount'].sum()
only gives the sum of amount for each cluster without considering the values in negative_amount, and I am wondering how to find the clusters (and their rows) whose amounts sum to 0 once negative_amount is taken into account, using pandas/numpy.

Let's try this. I created a 'new_column' so you can compare it against your 'amount_0_flag':
import numpy as np

df['new_column'] = (df.assign(amount_n=df.amount * np.where(df.negative_amount, -1, 1))
                      .groupby('key')['amount_n']
                      .transform(lambda x: x.sum() == 0))
Output:
key amount amount_0_flag negative_amount new_column
0 1 1.0 True False True
1 1 1.0 True True True
2 2 2.0 False True False
3 2 3.0 False False False
4 2 4.0 False False False
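
Equivalently, a sketch (assuming the same df, and that negative_amount marks amounts to negate) that avoids the Python-level lambda by using the built-in 'sum' aggregation:
import numpy as np

# Apply the sign, then broadcast each group's total back to its rows.
signed = df['amount'] * np.where(df['negative_amount'], -1, 1)
df['new_column'] = signed.groupby(df['key']).transform('sum').eq(0)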

Related

how to get row index of a Pandas dataframe from a regex match

This question has been asked but I didn't find the answers complete. I have a dataframe that has unnecessary values in the first row and I want to find the row index of the animals:
import re
import pandas as pd

df = pd.DataFrame({'a': ['apple', 'rhino', 'gray', 'horn'],
                   'b': ['honey', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})
a b c
0 apple honey cheese
1 rhino elephant lion
2 gray gray beige
3 horn trunk mane
ani_pat = r"rhino|zebra|lion"
That means I want to find "1", the index of the row that matches the pattern. One solution I saw was like this; applying it to my problem:
def findIdx(df, pattern):
    return df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).values.nonzero()
animal = findIdx(df, ani_pat)
print(animal)
(array([1, 1], dtype=int64), array([0, 2], dtype=int64))
That output is a tuple of NumPy arrays. I've got the basics of NumPy and Pandas, but I'm not sure what to do with this or how it relates to the df above.
I altered that lambda expression like this:
df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE))
a b c
0 False False False
1 True False True
2 False False False
3 False False False
That makes a little more sense, but I'm still trying to get the row index of the True values. How can I do that?
We can select from the DataFrame index the rows whose match filter contains any True value:
idx = df.index[
df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
]
idx:
Int64Index([1], dtype='int64')
any with axis=1 takes the boolean DataFrame and reduces it to a single dimension: a row becomes True if any of its values is True.
Before any:
a b c
0 False False False
1 True False True
2 False False False
3 False False False
After any:
0 False
1 True
2 False
3 False
dtype: bool
We can then use these boolean values as a mask on the index (selecting indexes which have a True value):
Int64Index([1], dtype='int64')
If needed we can use tolist to get a list instead:
idx = df.index[
df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
].tolist()
idx:
[1]
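As for the tuple returned by the original findIdx, it pairs up positions: the first array holds the row positions of the matches and the second the corresponding column positions. A small sketch, assuming the animal tuple from the question, to recover just the row indices:
import numpy as np

rows, cols = animal   # rows: array([1, 1]), cols: array([0, 2])
np.unique(rows)       # unique matching row positions: array([1])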

Python: How to set n previous rows as True if row x is True in a DataFrame

My df (using pandas):
Value Class
1 False
5 False
7 False
2 False
4 False
3 True
2 False
If a row has Class as True, I want to set the n previous rows to True as well. Let's say n = 3; then the desired output is:
Value Class
1 False
5 False
7 True
2 True
4 True
3 True
2 False
I've looked up similar questions but they seem to focus on adding new columns. I would like to avoid that and just change the values of the existing one. My knowledge is pretty limited so I don't know how to tackle this.
The idea is to replace False with missing values using Series.where, back-fill them with a limit parameter using Series.bfill, and finally replace the remaining missing values with False and convert back to boolean:
n = 3
df['Class'] = df['Class'].where(df['Class']).bfill(limit=n).fillna(0).astype(bool)
print (df)
Value Class
0 1 False
1 5 False
2 7 True
3 2 True
4 4 True
5 3 True
6 2 False
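
An alternative sketch for the same idea, assuming the df and n above: reverse the column, take a rolling maximum over a window of n + 1 (the row itself plus what were the n previous rows), then reverse back:
n = 3
df['Class'] = (df['Class'].astype(int)[::-1]
                          .rolling(n + 1, min_periods=1)
                          .max()[::-1]
                          .astype(bool))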

How to find the number of rows that has been updated in pandas

How can we find the number of rows that got updated in pandas? The update is done like this:
New['Updated'] = np.where((New.Class == 'B') & (New.Flag == 'Y'), 'N',
                 np.where((New.Class == 'R') & (New.Flag == 'N'), 'Y', New.Flag))
data.Flag = data['Tracking_Nbr'].map(New.set_index('Tracking_Nbr').Updated)
You need to store Flag before the change; here I am using Flag1:
df2['Updated'] = np.where((df2.Class == 'B') & (df2.Flag == 'Y'), 'N',
                 np.where((df2.Class == 'R') & (df2.Flag == 'N'), 'Y', df2.Flag))
df1['Flag1'] = df1['Flag']
df1.Flag = df1['Tracking_Nbr'].map(df2.set_index('Tracking_Nbr').Updated)
df1[df1['Flag1'] != df1['Flag']]
More information
df1['Flag1']!=df1['Flag']
Out[716]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
dtype: bool
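Since the question asks for the number of rows, summing this boolean mask counts the True values directly:
(df1['Flag1'] != df1['Flag']).sum()   # 7 for the mask shown above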

pandas create a Boolean column for a df based on one condition on a column of another df

I have two dfs, A and B. A is like,
date id
2017-10-31 1
2017-11-01 2
2017-08-01 3
B is like,
type id
1 1
2 2
3 3
I would like to create a new boolean column has_b for A: set the value to True if the corresponding row in B (A joins B on id) does not have type == 1 and the row's date is more than 90 days before datetime.utcnow(); otherwise False. Here is my solution:
B = B[B['type'] != 1]
# Note: .day is the day of the month (1-31), so this difference can never
# exceed 90; the answer below compares full dates instead.
A['has_b'] = A.merge(B[['id', 'type']], how='left', on='id')['date'].apply(lambda x: datetime.utcnow().day - x.day > 90)
A['has_b'].fillna(value=False, inplace=True)
expect to see A result in,
date id has_b
2017-10-31 1 False
2017-11-01 2 False
2017-08-01 3 True
I am wondering if there is a better way to do this, in terms of more concise and efficient code.
First merge A and B on id -
i = A.merge(B, on='id')
Now, compute has_b -
x = i.type.ne(1)
y = (pd.to_datetime('today') - i.date).dt.days.gt(90)
i['has_b'] = (x & y)
Merge back i and A -
C = A.merge(i[['id', 'has_b']], on='id')
C
date id has_b
0 2017-10-31 1 False
1 2017-11-01 2 False
2 2017-08-01 3 True
Details
x will return a boolean mask for the first condition.
i.type.ne(1)
0 False
1 True
2 True
Name: type, dtype: bool
y will return a boolean mask for the second condition. Use to_datetime('today') to get the current date, subtract this from the date column, and access the days component with dt.days.
(pd.to_datetime('today') - i.date).dt.days.gt(90)
0 False
1 False
2 True
Name: date, dtype: bool
In case, A and B's IDs do not align, you may need a left merge instead of an inner merge, for the last step -
C = A.merge(i[['id', 'has_b']], on='id', how='left')
C's has_b column will contain NaNs in this case.
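If those NaNs should be read as False (no matching row in B means the condition cannot hold), a one-line cleanup, assuming the C from above:
C['has_b'] = C['has_b'].fillna(False)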

How to check if pandas dataframe rows have certain values in various columns, scalability

I have implemented the CN2 classification algorithm, it induces rules to classify the data of the form:
IF Attribute1 = a AND Attribute4 = b THEN class = class 1
My current implementation loops through a pandas DataFrame containing the training data using the iterrows() function and returns True or False for each row depending on whether it satisfies the rule. However, I am aware this is a highly inefficient solution. I would like to vectorise the code; my current attempt is as follows:
df:
age prescription astigmatism tear rate
1 1 2 1
2 2 1 1
2 1 1 2
rule = {'age':[1],'prescription':[1],'astigmatism':[1,2],'tear rate':[1,2]}
df.isin(rule)
This produces:
age prescription astigmatism tear rate
True True True True
False False True True
False True True True
I have coded the rule to be a dictionary which contains a single value for target attributes and the set of all possible values for non-target attributes.
The result I would like is a single True or False for each row, indicating whether the conditions of the rule are met, plus the index of the rows which evaluate to all True. Currently I can only get a DataFrame with a True/False for each value. To be concrete, in the example I have shown, I wish the result to be the index of the first row, which is the only row that satisfies the rule.
I think you need to check if at least one value per row is True, using DataFrame.any:
mask = df.isin(rule).any(axis=1)
print (mask)
0 True
1 True
2 True
dtype: bool
Or, to check if all values in a row are True, use DataFrame.all:
mask = df.isin(rule).all(axis=1)
print (mask)
0 True
1 False
2 False
dtype: bool
For filtering, it is possible to use boolean indexing:
df = df[mask]
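To also get the index of the rows that satisfy the rule, as asked, the all mask can be combined with the index, mirroring the earlier filtering (a sketch assuming the df and rule above):
idx = df.index[df.isin(rule).all(axis=1)]
print (idx)
Int64Index([0], dtype='int64')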
