How to check if pandas dataframe rows have certain values in various columns, scalability - python-3.x

I have implemented the CN2 classification algorithm; it induces rules to classify the data, of the form:
IF Attribute1 = a AND Attribute4 = b THEN class = class 1
My current implementation loops through a pandas DataFrame containing the training data using the iterrows() function and returns True or False for each row depending on whether it satisfies the rule. However, I am aware this is a highly inefficient solution, and I would like to vectorise the code. My current attempt looks like this:
df:
   age  prescription  astigmatism  tear rate
0    1             1            2          1
1    2             2            1          1
2    2             1            1          2
rule = {'age':[1],'prescription':[1],'astigmatism':[1,2],'tear rate':[1,2]}
df.isin(rule)
This produces:
     age  prescription  astigmatism  tear rate
0   True          True         True       True
1  False         False         True       True
2  False          True         True       True
I have coded the rule as a dictionary which contains a single value for the target attributes and the set of all possible values for the non-target attributes.
The result I would like is a single True or False per row indicating whether the conditions of the rule are met, plus the index of the rows which evaluate to all True. Currently I can only get a DataFrame with a True/False for each value. To be concrete, in the example I have shown, I want the result to be the index of the first row, which is the only row that satisfies the rule.

I think you need to check if at least one value per row is True; use DataFrame.any:
mask = df.isin(rule).any(axis=1)
print (mask)
0 True
1 True
2 True
dtype: bool
Or, to check if all values per row are True, use DataFrame.all:
mask = df.isin(rule).all(axis=1)
print (mask)
0 True
1 False
2 False
dtype: bool
For filtering it is possible to use boolean indexing:
df = df[mask]
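
Putting it together for the goal stated above (one boolean per row plus the index of the fully matching rows), a minimal sketch assuming the data shown in the question:

import pandas as pd

df = pd.DataFrame({'age': [1, 2, 2],
                   'prescription': [1, 2, 1],
                   'astigmatism': [2, 1, 1],
                   'tear rate': [1, 1, 2]})
rule = {'age': [1], 'prescription': [1], 'astigmatism': [1, 2], 'tear rate': [1, 2]}

mask = df.isin(rule).all(axis=1)   # one True/False per row
idx = df.index[mask]               # Int64Index([0]): only the first row satisfies the rule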

Related

how to get row index of a Pandas dataframe from a regex match

This question has been asked but I didn't find the answers complete. I have a dataframe that has unnecessary values in the first row and I want to find the row index of the animals:
df = pd.DataFrame({'a': ['apple', 'rhino', 'gray', 'horn'],
                   'b': ['honey', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})
       a         b       c
0  apple     honey  cheese
1  rhino  elephant    lion
2   gray      gray   beige
3   horn     trunk    mane
ani_pat = r"rhino|zebra|lion"
That means I want to find "1" - the row index that matches the pattern. One solution I saw here looked like this; applying it to my problem:
import re

def findIdx(df, pattern):
    return df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).values.nonzero()
animal = findIdx(df, ani_pat)
print(animal)
(array([1, 1], dtype=int64), array([0, 2], dtype=int64))
That output is a tuple of NumPy arrays. I've got the basics of NumPy and Pandas, but I'm not sure what to do with this or how it relates to the df above.
I altered that lambda expression like this:
df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE))
       a      b      c
0  False  False  False
1   True  False   True
2  False  False  False
3  False  False  False
That makes a little more sense, but I'm still trying to get the row index of the True values. How can I do that?
We can select the DataFrame index at the rows that have any True value in them:
idx = df.index[
    df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
]
idx:
Int64Index([1], dtype='int64')
any on axis 1 will take the boolean DataFrame and reduce it to a single dimension based on the contents of the rows.
Before any:
       a      b      c
0  False  False  False
1   True  False   True
2  False  False  False
3  False  False  False
After any:
0 False
1 True
2 False
3 False
dtype: bool
We can then use these boolean values as a mask on the index, selecting the index labels at positions which hold a True value:
Int64Index([1], dtype='int64')
If needed we can use tolist to get a list instead:
idx = df.index[
    df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
].tolist()
idx:
[1]
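
As an aside on the original findIdx output: .values.nonzero() returns a pair of arrays (row positions, column positions), so (array([1, 1]), array([0, 2])) means matches at (row 1, column 0) and (row 1, column 2). A sketch of recovering the row index from that tuple, assuming the df and ani_pat above:

import re
import numpy as np

rows, cols = df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).values.nonzero()
row_idx = np.unique(rows).tolist()   # [1]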

Looking for NaN values in a specific column in df [duplicate]

Now I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
import os
import fnmatch
import pandas as pd

for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now I am checking that dataframe across multiple columns: the first value is the column name (column1), and the next is the specific value I am looking for in that column (banana). I am then checking another column (column2) for a specific value (green). If both of these are true I want to carry out a specific task, and if either is false I want to do something else.
So, something like:
if (df['column1'] == 'banana') & (df['colour'] == 'green'):
    # do something
else:
    # do something else
If you want to check whether any row of the DataFrame meets your conditions, you can use .any() along with your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
   A  B
0  1  2
1  3  4
2  5  6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
In pandas, when you compare a Series against a scalar value, each row of the Series is compared against that scalar, and the result is a Series of True/False values indicating the outcome for each row. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And & performs a row-wise and of the two Series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check if any of the values in this series is True, use .any(); to check if all the values in the series are True, use .all().
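
Tying it back to the original problem, a sketch (assuming the column names column1 and colour from the question):

condition = (df['column1'] == 'banana') & (df['colour'] == 'green')
if condition.any():
    pass   # do something: at least one row is a green banana
else:
    pass   # do something else: no row matched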

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare each row-wise value of that column to another column called Flag (a row-wise comparison for equality) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or NumPy yet.
I'm trying to think of something efficient as well, as I have a large data set to perform this on.
Most of this is pretty free thought, so I am just trying lots of different methods.
Some attempts so far, using the iterate-over-each-column method:
## Sample Data
df = pd.DataFrame([['123', '456', '789', '123'],
                   ['357', '125', '234', '863'],
                   ['168', '298', '573', '298'],
                   ['123', '234', '573', '902']])
df = df.rename(columns={3: 'Flag'})
## Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each of the columns named 0-2, check whether the row-wise values match between 'Flag' and that column, then return whether they matched. I am not entirely sure of the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then return the row if any of the columns match with .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
     0    1    2 Flag  indicator
0  123  456  789  123       True
1  357  125  234  863      False
2  168  298  573  298       True
3  123  234  573  902      False
The result you get back can be used to select rows by boolean indexing. Breaking the chain down, the eq comparison gives:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
       0      1      2
0   True  False  False
1  False  False  False
2  False   True  False
3  False  False  False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
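
The question also mentions returning the matching field. A sketch of recovering which column matched, assuming the frame above: idxmax(axis=1) returns the label of the first True in each row, and where blanks out the rows with no match at all.

matches = df.iloc[:, :-1].eq(df['Flag'], axis=0)
df['match_col'] = matches.idxmax(axis=1).where(matches.any(axis=1))   # 0, NaN, 1, NaN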

pandas dataframe create a new column whose values are based on groupby sum on another column

I am trying to create a new column amount_0_flag for a df. The values in that column are based on a groupby on another column, key: if the signed sum of amount within a group is 0, amount_0_flag is True, otherwise False. The df looks like:
   key  amount  amount_0_flag  negative_amount
0    1     1.0           True            False
1    1     1.0           True             True
2    2     2.0          False             True
3    2     3.0          False            False
4    2     4.0          False            False
So with df.groupby('key'), the cluster with key=1 will be assigned True for amount_0_flag for each element of the cluster, since within the cluster one element has an amount of negative 1 and another has an amount of positive 1.
df.groupby('key')['amount'].sum()
only gives the sum of amount for each cluster without considering the values in negative_amount, and I am wondering how to find the clusters (and their rows) whose amounts sum to 0 once negative_amount is taken into account, using pandas/numpy.
Let's try this; I created a 'new_column' showing the comparison to your 'amount_0_flag':
import numpy as np

df['new_column'] = (df.assign(amount_n=df.amount * np.where(df.negative_amount, -1, 1))
                      .groupby('key')['amount_n']
                      .transform(lambda x: sum(x) <= 0))
Output:
   key  amount  amount_0_flag  negative_amount  new_column
0    1     1.0           True            False        True
1    1     1.0           True             True        True
2    2     2.0          False             True       False
3    2     3.0          False            False       False
4    2     4.0          False            False       False
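
A self-contained sketch of the same idea (column names as in the question), using an exact zero check rather than <= 0, since the question asks for groups whose signed amounts sum to 0:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 2, 2, 2],
                   'amount': [1.0, 1.0, 2.0, 3.0, 4.0],
                   'negative_amount': [False, True, True, False, False]})

# sign the amounts, then flag every row whose group sums to exactly zero
signed = df['amount'] * np.where(df['negative_amount'], -1, 1)
df['amount_0_flag'] = signed.groupby(df['key']).transform('sum').eq(0)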

finding values in pandas series - Python3

I have this excruciatingly annoying problem (I'm quite new to Python):
df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1 = df['col1']
Why does col1[1] in col1 return False?
To check values, use boolean indexing:
#get value where index is 1
print (col1[1])
2
#more common with loc
print (col1.loc[1])
2
print (col1 == '2')
0 False
1 True
2 False
3 False
Name: col1, dtype: bool
And if you need to get the rows:
print (col1[col1 == '2'])
1 2
Name: col1, dtype: object
To check multiple values at once (an or over several values), use isin:
print (col1.isin(['2', '4']))
0 False
1 True
2 False
3 True
Name: col1, dtype: bool
print (col1[col1.isin(['2', '4'])])
1 2
3 4
Name: col1, dtype: object
And here is what the docs say about using in to test membership:
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
#1 is in index
print (1 in col1)
True
#5 is not in index
print (5 in col1)
False
#string 2 is not in index
print ('2' in col1)
False
#number 2 is in index
print (2 in col1)
True
Here you are trying to find the string '2' among the index values:
print (col1[1])
2
print (type(col1[1]))
<class 'str'>
print (col1[1] in col1)
False
I might be missing something, and this is years later, but as I read the question, you are trying to get the in keyword to work on your pandas Series? Then you probably want to do:
col1[1] in col1.values
Because as mentioned above, pandas is looking through the index, and you need to specifically ask it to look at the values of the series, not the index.
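
Putting it together, a quick sketch:

import pandas as pd

df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1 = df['col1']

print(col1[1] in col1)             # False: in tests the index labels
print(col1[1] in col1.values)      # True: tests the underlying values
print(col1.isin([col1[1]]).any())  # True: the idiomatic pandas membership test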
