How to check if pandas dataframe rows have certain values in various columns, scalability - python-3.x

I have implemented the CN2 classification algorithm; it induces rules to classify the data, of the form:
IF Attribute1 = a AND Attribute4 = b THEN class = class 1
My current implementation loops through a pandas DataFrame containing the training data using the iterrows() function and returns True or False for each row depending on whether it satisfies the rule. However, I am aware this is a highly inefficient solution, and I would like to vectorise the code. My current attempt looks like this:
df:
   age  prescription  astigmatism  tear rate
0    1             1            2          1
1    2             2            1          1
2    2             1            1          2
rule = {'age':[1],'prescription':[1],'astigmatism':[1,2],'tear rate':[1,2]}
df.isin(rule)
This produces:
     age  prescription  astigmatism  tear rate
0   True          True         True       True
1  False         False         True       True
2  False          True         True       True
I have coded the rule as a dictionary which contains a single value for the target attributes and the set of all possible values for the non-target attributes.
The result I would like is a single True or False per row indicating whether the conditions of the rule are met, plus the index of the rows which evaluate to all True. Currently I can only get a DataFrame with a True/False for each value. To be concrete, in the example I have shown, I want the result to be the index of the first row, which is the only row that satisfies the rule.

I think you need to check if at least one value per row is True; use DataFrame.any:
mask = df.isin(rule).any(axis=1)
print (mask)
0 True
1 True
2 True
dtype: bool
Or, to check if all values per row are True, use DataFrame.all:
mask = df.isin(rule).all(axis=1)
print (mask)
0 True
1 False
2 False
dtype: bool
For filtering it is possible to use boolean indexing:
df = df[mask]
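
Putting it together for the goal stated above (one boolean per row plus the index of the fully matching rows), a minimal sketch assuming the data shown in the question:

import pandas as pd

df = pd.DataFrame({'age': [1, 2, 2],
                   'prescription': [1, 2, 1],
                   'astigmatism': [2, 1, 1],
                   'tear rate': [1, 1, 2]})
rule = {'age': [1], 'prescription': [1], 'astigmatism': [1, 2], 'tear rate': [1, 2]}

mask = df.isin(rule).all(axis=1)   # one True/False per row
idx = df.index[mask]               # Int64Index([0]): only the first row satisfies the rule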

Related

how to get row index of a Pandas dataframe from a regex match

This question has been asked but I didn't find the answers complete. I have a dataframe that has unnecessary values in the first row and I want to find the row index of the animals:
df = pd.DataFrame({'a': ['apple', 'rhino', 'gray', 'horn'],
                   'b': ['honey', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})
       a         b       c
0  apple     honey  cheese
1  rhino  elephant    lion
2   gray      gray   beige
3   horn     trunk    mane
ani_pat = r"rhino|zebra|lion"
That means I want to find "1" - the row index that matches the pattern. One solution I saw here looked like this; applying it to my problem:
import re

def findIdx(df, pattern):
    return df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).values.nonzero()
animal = findIdx(df, ani_pat)
print(animal)
(array([1, 1], dtype=int64), array([0, 2], dtype=int64))
That output is a tuple of NumPy arrays. I've got the basics of NumPy and Pandas, but I'm not sure what to do with this or how it relates to the df above.
I altered that lambda expression like this:
df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE))
       a      b      c
0  False  False  False
1   True  False   True
2  False  False  False
3  False  False  False
That makes a little more sense, but I'm still trying to get the row index of the True values. How can I do that?
We can select the DataFrame index at the rows that have any True value in them:
idx = df.index[
    df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
]
idx:
Int64Index([1], dtype='int64')
any on axis 1 will take the boolean DataFrame and reduce it to a single dimension based on the contents of the rows.
Before any:
       a      b      c
0  False  False  False
1   True  False   True
2  False  False  False
3  False  False  False
After any:
0 False
1 True
2 False
3 False
dtype: bool
We can then use these boolean values as a mask on the index, selecting the index labels at positions which hold a True value:
Int64Index([1], dtype='int64')
If needed we can use tolist to get a list instead:
idx = df.index[
    df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
].tolist()
idx:
[1]
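
As an aside on the original findIdx output: .values.nonzero() returns a pair of arrays (row positions, column positions), so (array([1, 1]), array([0, 2])) means matches at (row 1, column 0) and (row 1, column 2). A sketch of recovering the row index from that tuple, assuming the df and ani_pat above:

import re
import numpy as np

rows, cols = df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).values.nonzero()
row_idx = np.unique(rows).tolist()   # [1]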

Looking for NaN values in a specific column in df [duplicate]

Now I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
import os
import fnmatch
import pandas as pd

for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now I am checking that dataframe across multiple columns: the first value is the column name (column1), and the next is the specific value I am looking for in that column (banana). I am then checking another column (column2) for a specific value (green). If both of these are true I want to carry out a specific task, and if either is false I want to do something else.
So, something like:
if (df['column1'] == 'banana') & (df['colour'] == 'green'):
    # do something
else:
    # do something else
If you want to check whether any row of the DataFrame meets your conditions, you can use .any() along with your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
   A  B
0  1  2
1  3  4
2  5  6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
In pandas, when you compare a Series against a scalar value, each row of the Series is compared against that scalar, and the result is a Series of True/False values indicating the outcome for each row. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And & performs a row-wise and of the two Series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check if any of the values in this series is True, use .any(); to check if all the values in the series are True, use .all().
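
Tying it back to the original problem, a sketch (assuming the column names column1 and colour from the question):

condition = (df['column1'] == 'banana') & (df['colour'] == 'green')
if condition.any():
    pass   # do something: at least one row is a green banana
else:
    pass   # do something else: no row matched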

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare each row-wise value of that column to another column called Flag (a row-wise comparison for equality) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or NumPy yet.
I'm trying to think of something efficient as well, as I have a large data set to perform this on.
Most of this is pretty free thought, so I am just trying lots of different methods.
Some attempts so far, using the iterate-over-each-column method:
## Sample Data
df = pd.DataFrame([['123', '456', '789', '123'],
                   ['357', '125', '234', '863'],
                   ['168', '298', '573', '298'],
                   ['123', '234', '573', '902']])
df = df.rename(columns={3: 'Flag'})
## Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each of the columns named 0-2, check whether the row-wise values match between 'Flag' and that column, then return whether they matched. I am not entirely sure of the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then return the row if any of the columns match with .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
     0    1    2 Flag  indicator
0  123  456  789  123       True
1  357  125  234  863      False
2  168  298  573  298       True
3  123  234  573  902      False
The result you get back can be used to select rows by boolean indexing. Breaking the chain down, the eq comparison gives:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
       0      1      2
0   True  False  False
1  False  False  False
2  False   True  False
3  False  False  False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
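
The question also mentions returning the matching field. A sketch of recovering which column matched, assuming the frame above: idxmax(axis=1) returns the label of the first True in each row, and where blanks out the rows with no match at all.

matches = df.iloc[:, :-1].eq(df['Flag'], axis=0)
df['match_col'] = matches.idxmax(axis=1).where(matches.any(axis=1))   # 0, NaN, 1, NaN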

pandas dataframe create a new column whose values are based on groupby sum on another column

I am trying to create a new column amount_0_flag for a df. The values in that column are based on a groupby on another column, key: if the signed sum of amount within a group is 0, amount_0_flag is True, otherwise False. The df looks like:
   key  amount  amount_0_flag  negative_amount
0    1     1.0           True            False
1    1     1.0           True             True
2    2     2.0          False             True
3    2     3.0          False            False
4    2     4.0          False            False
So with df.groupby('key'), the cluster with key=1 will be assigned True for amount_0_flag for each element of the cluster, since within the cluster one element has an amount of negative 1 and another has an amount of positive 1.
df.groupby('key')['amount'].sum()
only gives the sum of amount for each cluster without considering the values in negative_amount, and I am wondering how to find the clusters (and their rows) whose amounts sum to 0 once negative_amount is taken into account, using pandas/numpy.
Let's try this; I created a 'new_column' showing the comparison to your 'amount_0_flag':
import numpy as np

df['new_column'] = (df.assign(amount_n=df.amount * np.where(df.negative_amount, -1, 1))
                      .groupby('key')['amount_n']
                      .transform(lambda x: sum(x) <= 0))
Output:
   key  amount  amount_0_flag  negative_amount  new_column
0    1     1.0           True            False        True
1    1     1.0           True             True        True
2    2     2.0          False             True       False
3    2     3.0          False            False       False
4    2     4.0          False            False       False
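
A self-contained sketch of the same idea (column names as in the question), using an exact zero check rather than <= 0, since the question asks for groups whose signed amounts sum to 0:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 2, 2, 2],
                   'amount': [1.0, 1.0, 2.0, 3.0, 4.0],
                   'negative_amount': [False, True, True, False, False]})

# sign the amounts, then flag every row whose group sums to exactly zero
signed = df['amount'] * np.where(df['negative_amount'], -1, 1)
df['amount_0_flag'] = signed.groupby(df['key']).transform('sum').eq(0)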

finding values in pandas series - Python3

I have this excruciatingly annoying problem (I'm quite new to Python):
df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1 = df['col1']
Why does col1[1] in col1 return False?
To check values, use boolean indexing:
#get value where index is 1
print (col1[1])
2
#more common with loc
print (col1.loc[1])
2
print (col1 == '2')
0 False
1 True
2 False
3 False
Name: col1, dtype: bool
And if you need to get the rows:
print (col1[col1 == '2'])
1 2
Name: col1, dtype: object
To check multiple values at once (an or over several values), use isin:
print (col1.isin(['2', '4']))
0 False
1 True
2 False
3 True
Name: col1, dtype: bool
print (col1[col1.isin(['2', '4'])])
1 2
3 4
Name: col1, dtype: object
And here is what the docs say about using in to test membership:
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
#1 is in index
print (1 in col1)
True
#5 is not in index
print (5 in col1)
False
#string 2 is not in index
print ('2' in col1)
False
#number 2 is in index
print (2 in col1)
True
Here you are trying to find the string '2' among the index values:
print (col1[1])
2
print (type(col1[1]))
<class 'str'>
print (col1[1] in col1)
False
I might be missing something, and this is years later, but as I read the question, you are trying to get the in keyword to work on your pandas Series? Then you probably want to do:
col1[1] in col1.values
Because as mentioned above, pandas is looking through the index, and you need to specifically ask it to look at the values of the series, not the index.
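
Putting it together, a quick sketch:

import pandas as pd

df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1 = df['col1']

print(col1[1] in col1)             # False: in tests the index labels
print(col1[1] in col1.values)      # True: tests the underlying values
print(col1.isin([col1[1]]).any())  # True: the idiomatic pandas membership test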
