Iterating over columns and comparing each row value of that column to another column's value in Pandas - python-3.x

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare that column's values, row-wise, to another column in the same frame called Flag (a row-wise equality comparison). I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or Numpy yet.
I'm also trying to think of something efficient, as I have a large data set to perform this on.
Most of this is pretty free thought, so I am just trying lots of different methods.
Some attempts so far using the iterate over each column method:
##Sample Data
df = pd.DataFrame([['123','456','789','123'],['357','125','234','863'],['168','298','573','298'], ['123','234','573','902']])
df = df.rename(columns = {3: 'Flag'})
##Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check to see if the row-wise values match between 'Flag' and the columns 0-2. Then return if they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.

You can use iloc in combination with eq, then return the row if any of the columns match with .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
0 1 2 Flag indicator
0 123 456 789 123 True
1 357 125 234 863 False
2 168 298 573 298 True
3 123 234 573 902 False
The mask you get back can be used for boolean indexing to select the matching rows. Breaking the expression down, the element-wise comparison returns a boolean DataFrame:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
0 1 2
0 True False False
1 False False False
2 False True False
3 False False False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
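If you also want the matching rows themselves, or which of the columns matched, here is a minimal sketch building on the mask m above (the match_col name is just an illustration, not from the question):
# Select only the rows where at least one of columns 0-2 equals Flag
matched_rows = df[m]

# Optionally record which column matched: idxmax picks the first True per row,
# and .where(m) blanks out the rows that had no match at all
eq = df.iloc[:, :-1].eq(df['Flag'], axis=0)
df['match_col'] = eq.idxmax(axis=1).where(m)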

Related

Error in Iterating through Pandas DataFrame with if statement

I have the following pandas DataFrame.
Id UserId Name Date Class TagBased
0 2 23 Autobiographer 2016-01-12T18:44:49.267 3 False
1 3 22 Autobiographer 2016-01-12T18:44:49.267 3 False
2 4 21 Autobiographer 2016-01-12T18:44:49.267 3 False
3 5 20 Autobiographer 2016-01-12T18:44:49.267 3 False
4 6 19 Autobiographer 2016-01-12T18:44:49.267 3 False
I want to iterate through the "TagBased" column and put the User Ids in a list where TagBased=True.
I have used the following code, but I am getting no output, which is incorrect because there are 18 True values in TagBased.
user_tagBased = []
for i in range(len(df)):
    if (df['TagBased'] is True):
        user_tagBased.append(df['UserId'])
print(user_tagBased)
Output: []
As others are suggesting, using Pandas conditional filtering is the best choice here without using loops! However, to still explain why your code did not work as expected:
You are appending df['UserId'] inside the for-loop, but df['UserId'] is an entire column. The same goes for the df['TagBased'] check, which is also a column.
I assume you want to append the userId at the current row in the for-loop.
You can do that by iterating through the df rows:
user_tagBased = []
for index, row in df.iterrows():
    if row['TagBased'] == 'True':  # because it is a string and not a boolean here
        user_tagBased.append(row['UserId'])
Try this, you don't need to use loops for this:
user_list = df[df['TagBased']==True]['UserId'].tolist()
print(user_list)
[19, 19]
There is no need to use any loop.
Note that:
- df.TagBased yields a Series of bool type (the TagBased column); I assume that the TagBased column is of bool type.
- df[df.TagBased] is an example of boolean indexing: it retrieves the rows where TagBased is True.
- df[df.TagBased].UserId limits the above result to just UserId. This is almost what you want, but it is a Series, whereas you want a list.
So the code to produce your expected result, saved in the destination variable, is:
user_tagBased = df[df.TagBased].UserId.to_list()
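Equivalently, a small sketch using .loc to do the row selection and the column selection in one step:
# Boolean mask on the rows, 'UserId' as the column label, then convert to a list
user_tagBased = df.loc[df.TagBased, 'UserId'].to_list()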

Looking for NaN values in a specific column in df [duplicate]

Now I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now I am checking that dataframe across multiple columns. The first value is the column name (column1), and the next value is the specific value I am looking for in that column (banana). I am then checking another column (column2) for a specific value (green). If both of these are true I want to carry out a specific task. However, if it is false I want to do something else.
so something like:
if (df['column1']=='banana') & (df['colour']=='green'):
    do something
else:
    do something
If you want to check whether any row of the DataFrame meets your conditions, you can use .any() along with your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
A B
0 1 2
1 3 4
2 5 6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
This is because, in pandas, when you compare a Series against a scalar value, each row of that Series is compared against the scalar, and the result is a Series of True/False values indicating the outcome of each comparison. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And & performs a row-wise AND of the two Series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check if any of the values in this Series is True, you can use .any(); to check if all of the values are True, you can use .all().
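For completeness, a minimal sketch: .all() on the same toy frame, and the question's condition wrapped in the if/else the asker wanted (the print calls are just placeholders):
# Only row 0 satisfies both conditions, so .any() is True but .all() is False
((df['A'] == 1) & (df['B'] == 2)).all()    # False

# The question's condition in an if/else: .any() asks "does at least one row match?"
if ((df['column1'] == 'banana') & (df['colour'] == 'green')).any():
    print('at least one matching row')     # do something
else:
    print('no matching rows')              # do something else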

Count String Values in a Numeric Column Pandas

I have a dataframe:
Name Hours_Worked
1 James 3
2 Sam 2.5
3 Billy T
4 Sarah A
5 Felix 5
1st how do I count the number of rows in which I have non-numeric values?
2nd how do I filter to identify the rows that contain non-numeric values?
Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then create the mask with isna:
mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna()
# older pandas versions
#mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isnull()
Then count the True values with sum:
a = mask.sum()
print (a)
2
And filter by boolean indexing:
df1 = df[mask]
print (df1)
Name Hours_Worked
3 Billy T
4 Sarah A
Detail:
print (mask)
1 False
2 False
3 True
4 True
5 False
Name: Hours_Worked, dtype: bool
Another way to check for numeric values:
def check_num(x):
    try:
        float(x)
        return False
    except ValueError:
        return True

mask = df['Hours_Worked'].apply(check_num)
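The resulting mask can be used exactly like the to_numeric one above, for example:
# Same count and filter as before, just built from the apply-based mask
non_numeric_count = mask.sum()   # 2
df1 = df[mask]                   # the Billy / Sarah rows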
At the end of the day I did this to evaluate strings in my numeric column:
df['Hr_String'] = pd.to_numeric(df['Hours_Worked'], errors='coerce')
I wanted it in a new column so I could filter, and it felt a little more fluid for me:
df[df['Hr_String'].isnull()]
It returns:
Name Hours_Worked Hr_String
2 Billy T NaN
3 Sarah A NaN
I then did
df['Hr_String'].isnull().sum()
It returns:
2
Then I wanted the percentage of total rows so I did this:
teststr['Hr_String'].isnull().sum() / teststr.shape[0]
It returns:
0.4
Overall this approach worked for me. It helped me understand what string values are messing with my numeric column and lets me see the percentage; if it is really small I may just drop those rows for my analysis, and if it is large I'd have to figure out whether I can impute them or work something else out for them.

How to check if pandas dataframe rows have certain values in various columns, scalability

I have implemented the CN2 classification algorithm, it induces rules to classify the data of the form:
IF Attribute1 = a AND Attribute4 = b THEN class = class 1
My current implementation loops through a pandas DataFrame containing the training data using the iterrows() function and returns True or False for each row depending on whether it satisfies the rule; however, I am aware this is a highly inefficient solution. I would like to vectorise the code. My current attempt is like so:
DataFrame = df
age prescription astigmatism tear rate
1 1 2 1
2 2 1 1
2 1 1 2
rule = {'age':[1],'prescription':[1],'astigmatism':[1,2],'tear rate':[1,2]}
df.isin(rule)
This produces:
age prescription astigmatism tear rate
True True True True
False False True True
False True True True
I have coded the rule to be a dictionary which contains a single value for target attributes and the set of all possible values for non-target attributes.
The result I would like is a single True or False for each row indicating whether the conditions of the rule are met, plus the index of the rows which evaluate to all True. Currently I can only get a DataFrame with a T/F for each value. To be concrete, in the example I have shown, I wish the result to be the index of the first row, which is the only row that satisfies the rule.
I think you need to check if at least one value per row is True; use DataFrame.any:
mask = df.isin(rule).any(axis=1)
print (mask)
0 True
1 True
2 True
dtype: bool
Or, to check if all values are True, use DataFrame.all:
mask = df.isin(rule).all(axis=1)
print (mask)
0 True
1 False
2 False
dtype: bool
For filtering, it is possible to use boolean indexing:
df = df[mask]
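Since the goal is the index of the rows that satisfy the whole rule, a small sketch on top of the all-columns mask:
# Index labels of the rows where every column value is allowed by the rule
matching_index = df[mask].index.tolist()   # [0] for this data
# or, keeping it as an Index object
matching_index = df.index[mask]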

Variable lengths don't respond to filters...then they do. Can someone explain?

Edit: All sorted, Imanol Luengo set me straight. Here's the end result, in all its glory.
I don't understand the counts of my variables; maybe someone can explain? I'm filtering two columns of pass/fails for two locations, and I want a count of all 4 pass/fails.
Here's the header of the columns. There are 126 values in total:
WT Result School
0 p Milan
1 p Roma
2 p Milan
3 p Milan
4 p Roma
Code so far:
data2 = pd.DataFrame(data[['WT Result', 'School']])
data2.dropna(inplace=True)
# Milan Counts
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
milan_p = (m_p==True)
milan_pass = np.count_nonzero(milan_p) # Count of Trues for Milano
# Rome Counts
r_p = (data2['School']=='Roma') & (data2['WT Result']=='p')
rome_p = (r_p==True)
rome_pass = np.count_nonzero(rome_p) # Count of Trues for Rome
So what I've done, after stripping the excess columns (data2), is:
filter by location and == 'p' (vars m_p and r_p)
filter then by ==True (vars milan_p and rome_p)
Do a count_nonzero() for a count of 'True' (vars milan_pass and rome_pass)
Here's what I don't understand - these are the lengths of the variables:
data2: 126
m_p: 126
r_p: 126
milan_p: 126
rome_p: 126
milan_pass: 55
rome_pass: 47
Why do the lengths remain 126 once the filtering starts? To me, this shows that neither the filtering by location nor the filtering by 'p' worked. But when I do the final count_nonzero() the results are suddenly separated by location. What is happening?
You are not filtering, you are masking. Step by step:
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
Here m_p is a boolean array with the same length as a column from data2. Each element of m_p is set to True if it satisfies those two conditions, or to False otherwise.
milan_p = (m_p==True)
The above line is completely redundant. m_p is already a boolean array; comparing it to True will just create a copy of it. Thus, milan_p will be another boolean array with the same length as m_p.
milan_pass = np.count_nonzero(milan_p)
This just counts the number of nonzero (i.e. True) elements of milan_p. Of course, it matches the number of elements that you want to filter, but you are not filtering anything here.
Exactly the same applies to the Rome condition.
If you want to filter rows in pandas, you have to slice the dataframe with your newly generated mask:
filtered_milan = data2[m_p]
or alternatively
filtered_milan = data2[milan_p] # as m_p == milan_p
The above lines select the rows that have a True value in the mask (or condition), ignoring the False rows in the process.
The same applies to the second problem, Rome.
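Putting it together, a minimal sketch of filtering first and then counting; the counts match the np.count_nonzero results:
# Keep only the Milan passes, then count them
filtered_milan = data2[m_p]
milan_pass = len(filtered_milan)   # same number as m_p.sum()

# Same idea for Rome
filtered_rome = data2[r_p]
rome_pass = len(filtered_rome)     # same number as r_p.sum()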
