How to find the number of rows that have been updated in pandas - python-3.x

How can we find the number of rows that got updated in pandas?
New['Updated'] = np.where((New.Class == 'B') & (New.Flag == 'Y'), 'N',
                 np.where((New.Class == 'R') & (New.Flag == 'N'), 'Y', New.Flag))
data.Flag = data['Tracking_Nbr'].map(New.set_index('Tracking_Nbr').Updated)

You need to store the Flag column before the change; here I am using Flag1:
df2['Updated'] = np.where((df2.Class == 'B') & (df2.Flag == 'Y'), 'N',
                 np.where((df2.Class == 'R') & (df2.Flag == 'N'), 'Y', df2.Flag))
# Keep a copy of the original values so the update can be compared afterwards
df1['Flag1'] = df1['Flag']
df1.Flag = df1['Tracking_Nbr'].map(df2.set_index('Tracking_Nbr').Updated)
# Rows whose Flag changed
df1[df1['Flag1'] != df1['Flag']]
More information:
df1['Flag1']!=df1['Flag']
Out[716]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
dtype: bool
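Since True counts as 1 when summed, the number of updated rows follows directly from the same mask:
(df1['Flag1'] != df1['Flag']).sum()
# 7 in the example output above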

Related

Excel Formula based on previous rows

There are 3 columns:
Date, Name, Bonus_Point?
If a player scores 4 or lower in the Name column for three consecutive Dates, then Bonus_Point? should return 'Yes', otherwise 'No'.
For example, for 1/30/22 the result is 'Yes' because there were 3 consecutive instances (including 1/30/22) where the score was less than or equal to 4.
But for 2/2/22, Bonus_Point? would be 'No' because on the third day the player scored a 5.
Assuming your columns are A through C, row 1 is the header row, and your data is in rows 2 and down, enter this formula in C4:
=AND(B2<=4,B3<=4,B4<=4)
Then fill down. (See further down for "Yes" and "No".)
Date      Name  Bonus_Point?
1/28/22   3
1/29/22   3
1/30/22   3     TRUE
1/31/22   3     TRUE
2/1/22    4     TRUE
2/2/22    5     FALSE
2/3/22    2     FALSE
2/4/22    5     FALSE
2/5/22    4     FALSE
2/6/22    3     FALSE
2/7/22    2     TRUE
2/8/22    3     TRUE
2/9/22    4     TRUE
2/10/22   3     TRUE
2/11/22   2     TRUE
2/12/22   2     TRUE
3/13/22   3     TRUE
If you want "Yes" and "No", you can do that through formatting or add it to the formula:
=IF(AND(B2<=4,B3<=4,B4<=4),"Yes","No")
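Since the rest of this page is pandas-focused, here is a minimal pandas sketch of the same three-in-a-row check; the DataFrame df and its columns are assumptions mirroring the first rows of the table above:
import pandas as pd

# Hypothetical data mirroring the first rows of the table above
df = pd.DataFrame({'Date': ['1/28/22', '1/29/22', '1/30/22', '1/31/22', '2/1/22', '2/2/22'],
                   'Name': [3, 3, 3, 3, 4, 5]})

# True where the current score and the two previous scores are all <= 4
hit = (df['Name'] <= 4).astype(int).rolling(window=3).sum() == 3
df['Bonus_Point?'] = hit.map({True: 'Yes', False: 'No'})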

Python: How to set n previous rows as True if row x is True in a DataFrame

My df (using pandas):
Value Class
1 False
5 False
7 False
2 False
4 False
3 True
2 False
If a row has Class as True, I want to set all n previous rows as true as well. Let's say n = 3, then the desired output is:
Value Class
1 False
5 False
7 True
2 True
4 True
3 True
2 False
I've looked up similar questions but they seem to focus on adding new columns. I would like to avoid that and just change the values of the existing one. My knowledge is pretty limited so I don't know how to tackle this.
The idea is to replace False with missing values using Series.where, then back-fill with the limit parameter of Series.bfill, and finally replace the remaining missing values with False and convert the values back to boolean:
n = 3
df['Class'] = df['Class'].where(df['Class']).bfill(limit=n).fillna(0).astype(bool)
print (df)
Value Class
0 1 False
1 5 False
2 7 True
3 2 True
4 4 True
5 3 True
6 2 False
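An alternative sketch under the same assumptions (not from the original answer): a reversed rolling maximum achieves the same thing, since a window of n + 1 looking "forward" (via the reversal) marks a row True whenever it or any of the next n rows is True:
n = 3
s = df['Class'].astype(int)
df['Class'] = s[::-1].rolling(n + 1, min_periods=1).max()[::-1].astype(bool)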

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare each row-wise value of that column to another column called Flag (a row-wise comparison for equality) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or NumPy yet.
I'm trying to think of something efficient as well, as I have a large data set to perform this on.
Most of this is pretty free thought, so I am just trying lots of different methods.
Some attempts so far, using the iterate-over-each-column method:
## Sample Data
df = pd.DataFrame([['123', '456', '789', '123'],
                   ['357', '125', '234', '863'],
                   ['168', '298', '573', '298'],
                   ['123', '234', '573', '902']])
df = df.rename(columns={3: 'Flag'})

## Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check to see if the row-wise values match between 'Flag' and the columns 0-2. Then return if they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then return the row if any of the columns match with .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
0 1 2 Flag indicator
0 123 456 789 123 True
1 357 125 234 863 False
2 168 298 573 298 True
3 123 234 573 902 False
To see what is happening, the element-wise comparison on its own returns a boolean frame:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
0 1 2
0 True False False
1 False False False
2 False True False
3 False False False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
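Because m is a boolean mask aligned with df, boolean indexing then selects only the matching rows:
df[m]

0 1 2 Flag indicator
0 123 456 789 123 True
2 168 298 573 298 True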

pandas dataframe create a new column whose values are based on groupby sum on another column

I am trying to create a new column amount_0_flag for a df. The values in that column are based on a groupby on another column, key: if the signed amount sum within a group is 0, amount_0_flag is assigned True for every row of that group, otherwise False. The df looks like:
key amount amount_0_flag negative_amount
1 1.0 True False
1 1.0 True True
2 2.0 False True
2 3.0 False False
2 4.0 False False
So with df.groupby('key'), the cluster with key=1 will have True assigned to amount_0_flag for each of its elements, since within that cluster one element has an amount of negative 1 and another has positive 1.
df.groupby('key')['amount'].sum()
only gives the sum of amount for each cluster without considering the values in negative_amount, and I am wondering how to also find the clusters and their rows with a 0 sum of amounts, taking the negative_amount values into account, using pandas/numpy.
Let's try this, where I create a 'new_column' showing the comparison to your 'amount_0_flag':
import numpy as np

df['new_column'] = (df.assign(amount_n=df.amount * np.where(df.negative_amount, -1, 1))
                      .groupby('key')['amount_n']
                      .transform(lambda x: sum(x) <= 0))
Output:
key amount amount_0_flag negative_amount new_column
0 1 1.0 True False True
1 1 1.0 True True True
2 2 2.0 False True False
3 2 3.0 False False False
4 2 4.0 False False False
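A slightly more direct variant along the same lines (a sketch, not from the original answer): build the signed amount as its own Series and compare the group sum to 0, which matches the question's "sum is 0" wording exactly (the answer above used <= 0):
signed = df['amount'] * np.where(df['negative_amount'], -1, 1)
df['new_column'] = signed.groupby(df['key']).transform('sum') == 0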

How to check if pandas dataframe rows have certain values in various columns, scalability

I have implemented the CN2 classification algorithm, it induces rules to classify the data of the form:
IF Attribute1 = a AND Attribute4 = b THEN class = class 1
My current implementation loops through a pandas DataFrame containing the training data using the iterrows() function and returns True or False for each row, depending on whether it satisfies the rule; however, I am aware this is a highly inefficient solution. I would like to vectorise the code. My current attempt is like so:
The DataFrame df:
age prescription astigmatism tear rate
1 1 2 1
2 2 1 1
2 1 1 2
rule = {'age':[1],'prescription':[1],'astigmatism':[1,2],'tear rate':[1,2]}
df.isin(rule)
This produces:
age prescription astigmatism tear rate
True True True True
False False True True
False True True True
I have coded the rule to be a dictionary which contains a single value for target attributes and the set of all possible values for non-target attributes.
The result I would like is a single True or False for each row, indicating whether the conditions of the rule are met, together with the index of the rows which evaluate to all True. Currently I can only get a DataFrame with a T/F for each value. To be concrete, in the example I have shown, I wish the result to be the index of the first row, which is the only row that satisfies the rule.
I think you need to check if at least one value per row is True, using DataFrame.any:
mask = df.isin(rule).any(axis=1)
print (mask)
0 True
1 True
2 True
dtype: bool
Or, to check if all values are True, use DataFrame.all:
mask = df.isin(rule).all(axis=1)
print (mask)
0 True
1 False
2 False
dtype: bool
For filtering, it is possible to use boolean indexing:
df = df[mask]
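The asker also wanted the index of the rows that satisfy the rule; since the mask is aligned with df, that falls out directly:
mask = df.isin(rule).all(axis=1)
print(df.index[mask])
# here only row 0 satisfies every condition of the rule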
