I have this excruciatingly annoying problem (I'm quite new to Python):
df = pd.DataFrame({'col1': ['1', '2', '3', '4']})
col1=df['col1']
Why does col1[1] in col1 return False?
To check values, use boolean indexing:
#get value where index is 1
print (col1[1])
2
#more common with loc
print (col1.loc[1])
2
print (col1 == '2')
0 False
1 True
2 False
3 False
Name: col1, dtype: bool
And if you need to get the matching rows:
print (col1[col1 == '2'])
1 2
Name: col1, dtype: object
To check for several values at once (an OR condition), use isin:
print (col1.isin(['2', '4']))
0 False
1 True
2 False
3 True
Name: col1, dtype: bool
print (col1[col1.isin(['2', '4'])])
1 2
3 4
Name: col1, dtype: object
The docs say this about using in to test for membership:
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
#1 is in index
print (1 in col1)
True
#5 is not in index
print (5 in col1)
False
#string 2 is not in index
print ('2' in col1)
False
#number 2 is in index
print (2 in col1)
True
You are trying to find the string '2' among the index values:
print (col1[1])
2
print (type(col1[1]))
<class 'str'>
print (col1[1] in col1)
False
I might be missing something, and this is years later, but as I read the question, you are trying to get the in keyword to work on the values of your pandas Series. So you probably want to do:
col1[1] in col1.values
Because as mentioned above, pandas is looking through the index, and you need to specifically ask it to look at the values of the series, not the index.
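As a minimal sketch of the difference between the two checks, using the same col1 as in the question (the variable names below are only illustrative):

```python
import pandas as pd

col1 = pd.DataFrame({'col1': ['1', '2', '3', '4']})['col1']

value = col1.loc[1]  # the string '2' stored under index label 1

# `in` on a Series looks at the index labels (0, 1, 2, 3), not the values.
in_index = value in col1

# Three ways to test membership among the values instead.
in_values = value in col1.values
in_values_isin = col1.isin([value]).any()
in_values_eq = (col1 == value).any()

print(in_index, in_values, in_values_isin, in_values_eq)
```

The first check is False because '2' is a string and the index holds the integers 0 to 3; the other three look at the stored values and are True.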
I have a pandas dataframe, and I want to create a new column with values 'in list' or 'not in list', based on whether an entry in the first column is in a list. To illustrate I have a toy example below. I have a solution which works, however it seems very cumbersome and not very pythonic. I do also get a SettingWithCopyWarning. Is there a better or more recommended way to achieve this in python?
#creating a toy dataframe with one column
df = pd.DataFrame({'col_1': [1,2,3,4,6]})
#the list we want to check if any value in col_1 is in
list_ = [2,3,3,3]
#creating a new empty column
df['col_2'] = None
col_1 col_2
0 1 None
1 2 None
2 3 None
3 4 None
4 6 None
My solution is to loop through the first column and populate the second
for index, i in enumerate(df['col_1']):
    if i in list_:
        df['col_2'].iloc[index] = 'in list'
    else:
        df['col_2'].iloc[index] = 'not in list'
col_1 col_2
0 1 not in list
1 2 in list
2 3 in list
3 4 not in list
4 6 not in list
Which produces the correct result, but I would like to learn a more pythonic way of achieving this.
Use Series.isin with Series.map:
In [1197]: df['col_2'] = df.col_1.isin(list_).map({False: 'not in list', True: 'in list'})
In [1198]: df
Out[1198]:
col_1 col_2
0 1 not in list
1 2 in list
2 3 in list
3 4 not in list
4 6 not in list
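For reference, here is a self-contained sketch of the isin plus map approach on the toy data from the question. The np.where line is an alternative that is not part of the answer above, shown only for comparison, and col_3 is a hypothetical extra column name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 6]})
list_ = [2, 3, 3, 3]

# Boolean Series: True where the value of col_1 occurs in list_.
is_member = df['col_1'].isin(list_)

# Approach from the answer: map the booleans to labels.
df['col_2'] = is_member.map({True: 'in list', False: 'not in list'})

# Alternative (an assumption, not from the answer): numpy.where picks a label per row.
df['col_3'] = np.where(is_member, 'in list', 'not in list')

print(df)
```

Both assignments set a whole column at once, so there is no chained .iloc assignment and no SettingWithCopyWarning.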
I now know how to check a dataframe for specific values across multiple columns. However, I can't seem to work out how to write an if statement based on the boolean result.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename),error_bad_lines=False)
Now I check that dataframe across multiple columns. First I look in one column (column1) for a specific value (banana). Then I check another column (colour) for a specific value (green). If both conditions are true I want to carry out a specific task; if not, I want to do something else.
so something like:
if (df['column1']=='banana') & (df['colour']=='green'):
    do something
else:
    do something
If you want to check whether any row of the DataFrame meets your conditions, you can call .any() on the combined condition:
if ((df['column1']=='banana') & (df['colour']=='green')).any():
For example:
In [16]: df
Out[16]:
A B
0 1 2
1 3 4
2 5 6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is needed because your condition, ((df['column1']=='banana') & (df['colour']=='green')), returns a Series of True/False values rather than a single boolean.
In pandas, comparing a Series against a scalar compares each row against that scalar and returns a Series of True/False values, one per row. For example:
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
The & operator then performs a row-wise logical AND of the two Series. For example:
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
To check whether any of the values in this Series is True, use .any(); to check whether all of them are True, use .all().
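Put together, the if/else from the question could look roughly like this. The data is made up for illustration, and the outcome strings stand in for whatever the two branches should actually do:

```python
import pandas as pd

# Hypothetical data using the column names from the question.
df = pd.DataFrame({
    'column1': ['banana', 'apple', 'banana'],
    'colour': ['green', 'green', 'yellow'],
})

# Row-wise condition: one True/False per row.
mask = (df['column1'] == 'banana') & (df['colour'] == 'green')

# Writing `if mask:` would raise ValueError (the truth value of a Series is
# ambiguous), so reduce the mask to a single boolean first.
if mask.any():
    outcome = 'at least one green banana'
else:
    outcome = 'no green bananas'

every_row_matches = mask.all()

print(outcome, every_row_matches)
```

Use .any() when a single matching row is enough to trigger the task, and .all() when every row has to match.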
I have a dataframe:
Name Hours_Worked
1 James 3
2 Sam 2.5
3 Billy T
4 Sarah A
5 Felix 5
First, how do I count the number of rows that contain non-numeric values?
Second, how do I filter to identify the rows that contain non-numeric values?
Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then create a mask with isna:
mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna()
#older pandas versions
#mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isnull()
Then count the True values with sum:
a = mask.sum()
print (a)
2
And filter by boolean indexing:
df1 = df[mask]
print (df1)
Name Hours_Worked
3 Billy T
4 Sarah A
Detail:
print (mask)
1 False
2 False
3 True
4 True
5 False
Name: Hours_Worked, dtype: bool
Another way to check for non-numeric values:
def check_num(x):
    try:
        float(x)
        return False
    except ValueError:
        return True

mask = df['Hours_Worked'].apply(check_num)
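A self-contained sketch that runs both approaches on the sample data and compares the masks. It assumes the Hours_Worked values are stored as strings, which the question does not state explicitly:

```python
import pandas as pd

# Assumption: the column holds strings, as it would after reading mixed data.
df = pd.DataFrame(
    {'Name': ['James', 'Sam', 'Billy', 'Sarah', 'Felix'],
     'Hours_Worked': ['3', '2.5', 'T', 'A', '5']},
    index=[1, 2, 3, 4, 5],
)

# Approach 1: coerce to numbers; anything unparseable becomes NaN.
mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna()


def check_num(x):
    """Return True when x cannot be converted to a float."""
    try:
        float(x)
        return False
    except ValueError:
        return True


# Approach 2: test each value individually.
mask_apply = df['Hours_Worked'].apply(check_num)

non_numeric_count = int(mask.sum())
non_numeric_rows = df[mask]

print(non_numeric_count)
print(non_numeric_rows)
```

One difference to keep in mind: values that are already missing (NaN) are also flagged by the to_numeric mask, whereas float(nan) succeeds and so check_num does not flag them.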
At the end of the day I did this to kind of evaluate string in my numeric column:
df['Hr_String'] = pd.to_numeric(df['Hours_Worked'], errors='coerce')
I wanted it in a new column so I could filter on it, which felt a little more fluid to me:
df[df['Hr_String'].isnull()]
It returns:
Name Hours_Worked Hr_String
2 Billy T NaN
3 Sarah A NaN
I then did
df['Hr_String'].isnull().sum()
It returns:
2
Then I wanted the percentage of total rows so I did this:
df['Hr_String'].isnull().sum() / df.shape[0]
It returns:
0.4
Overall this approach worked for me. It helped me understand which string values are messing with my numeric column, and it lets me see what percentage of rows they affect. If the percentage is really small, I may just drop those rows from my analysis. If it is large, I'd have to figure out whether I can impute them or handle them some other way.
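As a small aside, the share of non-numeric rows can also be read straight off the boolean mask, since the mean of a boolean Series is the fraction of True values. A sketch on a hypothetical column mirroring the example:

```python
import pandas as pd

# Hypothetical column: two of the five entries are not numeric.
hours = pd.Series(['3', '2.5', 'T', 'A', '5'], name='Hours_Worked')

is_bad = pd.to_numeric(hours, errors='coerce').isna()

# Equivalent to is_bad.sum() / len(is_bad).
share_bad = is_bad.mean()

print(share_bad)
```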
Say we have dataframe one df1 and dataframe two df2.
import pandas as pd
dict1= {'group':['A','A','B','C','C','C'],'col2':[1,7,4,2,1,0],'col3':[1,1,3,4,5,3]}
df1 = pd.DataFrame(data=dict1).set_index('group')
dict2 = {'group':['A','A','B','C','C','C'],'col2':[1,7,400,2,1,0],'col3':[1,1,3,4,5,3500]}
df2 = pd.DataFrame(data=dict2).set_index('group')
df1
col2 col3
group
A 1 1
A 7 1
B 4 3
C 2 4
C 1 5
C 0 3
df2
col2 col3
group
A 1 1
A 7 1
B 400 3
C 2 4
C 1 5
C 0 3500
In pandas it is easy to compare the equality of these two dataframes with df1.equals(df2). In this case False.
However, we can see that some of the groups are equal (A in the given toy example) and some are not (groups B and C). I want to check for equality group by group. In other words, compare the rows with index A in df1 against the rows with index A in df2, then do the same for B, and so on.
Here is my attempt. We wish to group the data
g1 = df1.groupby('group')
g2 = df2.groupby('group')
Naively trying g1.equals(g2) gives the error Cannot access callable attribute 'equals' of 'DataFrameGroupBy' objects, try using the 'apply' method.
However, if we try
g1.apply(lambda x: x.equals(g2))
We get a series
group
A False
B False
C False
dtype: bool
However, the first entry should be True, since group A is equal between the two dataframes.
I can see that I could laboriously construct nested loops to do this, but that's slow. I feel there is a way to do this in pandas without using loops. Am I misusing the apply method?
You can call get_group on g2 to retrieve the group to compare against. Inside apply, the name of the current group is available through the attribute .name:
In[316]:
g1.apply(lambda x: x.equals(g2.get_group(x.name)))
Out[316]:
group
A True
B False
C False
dtype: bool
EDIT
To handle non-existent groups:
In[320]:
g1.apply(lambda x: x.equals(g2.get_group(x.name)) if x.name in g2.groups else False)
Out[320]:
group
A True
B False
C False
dtype: bool
Example:
In[323]:
dict1= {'group':['A','A','B','C','C','C','D'],'col2':[1,7,4,2,1,0,-1],'col3':[1,1,3,4,5,3,-1]}
df1 = pd.DataFrame(data=dict1).set_index('group')
g1 = df1.groupby('group')
g1.apply(lambda x: x.equals(g2.get_group(x.name)) if x.name in g2.groups else False)
Out[323]:
group
A True
B False
C False
D False
dtype: bool
Here .groups returns a dict of the groups whose keys are the group names/labels. We can test for existence using x.name in g2.groups and modify the lambda to handle non-existent groups.
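The whole comparison, written out as one runnable sketch with a named helper in place of the lambda (same_as_in_g2 is a hypothetical name; the data is the toy data from this thread, including the extra group D):

```python
import pandas as pd

dict1 = {'group': ['A', 'A', 'B', 'C', 'C', 'C', 'D'],
         'col2': [1, 7, 4, 2, 1, 0, -1],
         'col3': [1, 1, 3, 4, 5, 3, -1]}
dict2 = {'group': ['A', 'A', 'B', 'C', 'C', 'C'],
         'col2': [1, 7, 400, 2, 1, 0],
         'col3': [1, 1, 3, 4, 5, 3500]}

df1 = pd.DataFrame(dict1).set_index('group')
df2 = pd.DataFrame(dict2).set_index('group')

g1 = df1.groupby('group')
g2 = df2.groupby('group')


def same_as_in_g2(group):
    """Compare one group of df1 with the group of the same name in df2."""
    if group.name not in g2.groups:
        return False  # the group does not exist in df2 at all
    return group.equals(g2.get_group(group.name))


result = g1.apply(same_as_in_g2)
print(result)
```

Note that equals also compares the index and the column dtypes, so two groups holding the same numbers with different dtypes would be reported as unequal.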
I'm trying to filter a DataFrame's columns based on a value.
In[41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})
In[42]: df
Out[42]:
A B
0 a 6
1 2 7
2 3 8
3 4 9
4 5 10
Filtering columns:
In[43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
A
0 a
1 2
2 3
3 4
4 5
It works! But when I use a string,
In[44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first cast column B to the object dtype; the comparison will then work.
df.B = df.B.astype(object)
Always check data types of the columns before performing the operations using
df.info()
You could do this with masks instead, for example:
df[df.A!='a'].A
and to filter from any column:
df[df.apply(lambda x: sum([x_=='a' for x_ in x])==0, axis=1)]
The problem is due to the fact that there are numeric and string objects in the dataframe.
You can loop through each column and check each column as a series for a specific value using
(Series=='a').any()
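A sketch that combines these suggestions on the frame from the question: cast to object, build an element-wise mask, then reduce it per column or per row. The variable names are illustrative, and this is one possible way to do it rather than the only one:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Cast every column to object so the comparison with a string is
# element-wise on all columns, whatever their original dtype.
as_obj = df.astype(object)
equals_a = as_obj == 'a'

# One boolean per column: does the column contain 'a' anywhere?
has_a = equals_a.any()

# Keep only the columns that never contain 'a'.
cols_without_a = df.loc[:, ~has_a]

# Keep only the rows in which no column equals 'a'.
rows_without_a = df[~equals_a.any(axis=1)]

print(cols_without_a)
print(rows_without_a)
```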