Error in Iterating through Pandas DataFrame with if statement - python-3.x

I have the following pandas DataFrame.
   Id  UserId            Name                     Date  Class  TagBased
0   2      23  Autobiographer  2016-01-12T18:44:49.267      3     False
1   3      22  Autobiographer  2016-01-12T18:44:49.267      3     False
2   4      21  Autobiographer  2016-01-12T18:44:49.267      3     False
3   5      20  Autobiographer  2016-01-12T18:44:49.267      3     False
4   6      19  Autobiographer  2016-01-12T18:44:49.267      3     False
I want to iterate through the "TagBased" column and put the UserIds in a list wherever TagBased is True.
I have used the following code, but I am getting no output, which is incorrect because there are 18 True values in TagBased.
user_tagBased = []
for i in range(len(df)):
    if (df['TagBased'] is True):
        user_tagBased.append(df['UserId'])
print(user_tagBased)
Output: []

As others have suggested, Pandas conditional filtering is the best choice here; no loops needed. Still, to explain why your code did not work as expected:
You are appending df['UserId'] inside the loop, but df['UserId'] is an entire column, not a single value. The same goes for the df['TagBased'] check: df['TagBased'] is a column, and df['TagBased'] is True is always False, because a Series object is never the boolean singleton True - which is why your list stays empty.
I assume you want to append the UserId of the current row in the for-loop.
You can do that by iterating through the df rows:
user_tagBased = []
for index, row in df.iterrows():
    if row['TagBased'] == 'True':  # because it is a string here, not a boolean
        user_tagBased.append(row['UserId'])

Try this; you don't need loops for this:
user_list = df[df['TagBased']==True]['UserId'].tolist()
print(user_list)
[19, 19]

There is no need to use any loop.
Note that:
df.TagBased - yields a Series of bool type, the TagBased column (I assume the TagBased column is of bool dtype).
df[df.TagBased] - is an example of boolean indexing; it retrieves the rows where TagBased is True.
df[df.TagBased].UserId - limits the above result to just UserId. This is almost what you want, but it is a Series, whereas you want a list.
So the code to produce your expected result, saved to the destination variable, is:
user_tagBased = df[df.TagBased].UserId.to_list()
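Putting that together on the question's sample data, here is a minimal runnable sketch. It assumes TagBased may arrive either as real booleans or as the strings 'True'/'False' (as the iterrows answer above suspected), and normalizes it first:

import pandas as pd

# small frame mirroring the question's columns (values are made up)
df = pd.DataFrame({'UserId': [23, 22, 21],
                   'TagBased': ['False', 'True', 'True']})

# if TagBased came in as strings, convert it to real booleans first
if df['TagBased'].dtype == object:
    df['TagBased'] = df['TagBased'] == 'True'

user_tagBased = df[df.TagBased].UserId.to_list()
print(user_tagBased)  # [22, 21]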

Related

Populate a new column in a pandas data frame with a binary value if another columns value is in a list or set

I have a pandas dataframe, and I want to create a new column with values 'in list' or 'not in list', based on whether an entry in the first column is in a list. To illustrate, I have a toy example below. I have a solution which works; however, it seems very cumbersome and not very pythonic, and I also get a SettingWithCopyWarning. Is there a better or more recommended way to achieve this in python?
import pandas as pd

# creating a toy dataframe with one column
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 6]})
# the list we want to check col_1's values against
list_ = [2, 3, 3, 3]
# creating a new empty column
df['col_2'] = None
  col_1 col_2
0     1  None
1     2  None
2     3  None
3     4  None
4     6  None
My solution is to loop through the first column and populate the second:

for index, i in enumerate(df['col_1']):
    if i in list_:
        df['col_2'].iloc[index] = 'in list'
    else:
        df['col_2'].iloc[index] = 'not in list'
  col_1        col_2
0     1  not in list
1     2      in list
2     3      in list
3     4  not in list
4     6  not in list
This produces the correct result, but I would like to learn a more pythonic way of achieving it.
Use Series.isin with Series.map:
In [1197]: df['col_2'] = df.col_1.isin(list_).map({False: 'not in list', True: 'in list'})
In [1198]: df
Out[1198]:
   col_1        col_2
0      1  not in list
1      2      in list
2      3      in list
3      4  not in list
4      6  not in list
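A close alternative, as a sketch (numpy.where is not from the original answer, just another common idiom for the same mapping):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 6]})
list_ = [2, 3, 3, 3]

# isin builds the boolean mask; np.where picks one label per element
df['col_2'] = np.where(df['col_1'].isin(list_), 'in list', 'not in list')

Either way there is no row-by-row assignment, so the SettingWithCopyWarning from the original loop disappears as well.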

Removing repetitive/duplicate occurrences in Excel using python

I am trying to remove the repeated/duplicate names that appear under the NAME column. I just want to keep the first occurrence of each repeated name, using a Python script.
This is my input Excel file: [input screenshot omitted]
And I need output like this: [output screenshot omitted]
This isn't removing duplicates per se; you're just blanking out duplicate keys in one column. I would handle this as follows:
Create a mask that returns a True/False boolean for whether each row equals the row above it.
Assuming your dataframe is called df:
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask,'NAME'] = ''
Explanation:
What we are doing above is the following: first we select a single column, or in pandas terminology a Series, then we apply .ne (not equal to), which in effect is !=.
Let's see this in action.
import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','forth','fifth','fourth','third','fifth','fourth']
Now, let's create a dataframe similar to yours:
df = pd.DataFrame({'NAME': names,
                   'DEFAULT': defaults,
                   'CLASS': classes,
                   'AGE': [np.random.randint(1, 5) for _ in names],
                   'GROUP': [np.random.randint(1, 5) for _ in names]})  # being lazy with your AGE and GROUP variables
So, if we did df['NAME'].ne('Omar'), which is the same as df['NAME'] != 'Omar', we would get:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
So, with that out of the way, we want to see whether the name in row 1 (remember Python is 0-indexed, so row 1 is actually the 2nd physical row) is equal to the row above it.
We do this by calling .shift().
What .shift() basically does is shift the rows down by a given number of positions, let's call it n.
If we call df['NAME'].shift(1):
0 NaN
1 Rekha
2 Rekha
3 Jaya
4 Jaya
5 Sushma
6 Nita
7 Nita
We can see here that Rekha has moved down one row.
So, putting that all together:
df['NAME'].ne(df['NAME'].shift())
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
We assign this to a variable called mask; you could call it whatever you want.
We then use .loc, which lets you access your dataframe by labels or by a boolean array, in this instance an array.
However, we only want the rows where the mask is False, so we use ~, which inverts the logic of our array:
    NAME DEFAULT   CLASS  AGE  GROUP
1  Rekha           third    1      4
3   Jaya           fifth    1      1
6   Nita           fifth    1      2
7   Nita          fourth    1      4
All we need to do now is blank out the NAME in these rows, per your initial requirement, and we are left with:
     NAME    DEFAULT   CLASS  AGE  GROUP
0   Rekha              forth    2      2
1                      third    1      4
2    Jaya  c-default   forth    3      3
3                      fifth    1      1
4  Sushma             fourth    3      1
5    Nita  c-default   third    4      2
6                      fifth    1      2
7                     fourth    1      4
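As a consolidated sketch, reading from and writing back to Excel (the file names here are placeholders, not from the question; reading .xlsx also needs the openpyxl engine installed):

import pandas as pd

df = pd.read_excel('input.xlsx')   # hypothetical input file

# blank out every NAME that repeats the row directly above it
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask, 'NAME'] = ''

df.to_excel('output.xlsx', index=False)   # hypothetical output file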
hope that helps!

Looking for NaN values in a specific column in df [duplicate]

Now I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
import os
import fnmatch
import pandas as pd

for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        # error_bad_lines is deprecated in newer pandas; use on_bad_lines='skip' there
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now I am checking that dataframe across multiple columns. The first value is the column name (column1), the next value is the specific value I am looking for in that column (banana). I am then checking another column (column2) for a specific value (green). If both of these are true I want to carry out a specific task; however, if it is false I want to do something else.
so something like:
if (df['column1']=='banana') & (df['colour']=='green'):
    # do something
else:
    # do something else
If you want to check whether any row of the DataFrame meets your conditions, you can use .any() on your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
   A  B
0  1  2
1  3  4
2  5  6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This works because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
In pandas, when you compare a Series against a scalar value, each row of the Series is compared against that scalar, and the result is a Series of True/False values indicating the outcome of each row's comparison. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And & performs an element-wise logical AND of the two Series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check whether any of the values in this Series is True, you can use .any(); to check whether all of them are True, you can use .all().
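A short self-contained sketch contrasting the two reductions (same toy frame as above; note that putting the raw Series directly in an if raises ValueError: The truth value of a Series is ambiguous):

import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})
cond = (df['A'] == 1) & (df['B'] == 2)

# if cond: ...   would raise the ambiguous-truth-value ValueError
print(cond.any())   # True - at least one row matches
print(cond.all())   # False - not every row matches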

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration of a column I want to compare each row-wise value to another column called Flag (a row-wise comparison for equality) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or NumPy yet.
I'm also trying to think of something efficient, as I have a large data set to perform this on.
Most of this is pretty free thought, so I am just trying lots of different methods.
Some attempts so far using the iterate over each column method:
## Sample Data
df = pd.DataFrame([['123','456','789','123'],
                   ['357','125','234','863'],
                   ['168','298','573','298'],
                   ['123','234','573','902']])
df = df.rename(columns={3: 'Flag'})

## Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check to see if the row-wise values match between 'Flag' and the columns 0-2. Then return if they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then return the rows where any of the columns match using .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
     0    1    2 Flag  indicator
0  123  456  789  123       True
1  357  125  234  863      False
2  168  298  573  298       True
3  123  234  573  902      False
The comparison itself gives you back a boolean frame you can use for boolean indexing:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
       0      1      2
0   True  False  False
1  False  False  False
2  False   True  False
3  False  False  False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
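Since the question also asks for the matching field itself, one possible extension (my sketch, not part of the original answer) uses idxmax on the same boolean frame:

import pandas as pd

df = pd.DataFrame([['123','456','789','123'],
                   ['357','125','234','863'],
                   ['168','298','573','298'],
                   ['123','234','573','902']])
df = df.rename(columns={3: 'Flag'})

eq = df.iloc[:, :-1].eq(df['Flag'], axis=0)
# idxmax(axis=1) returns the first column label that is True in each row;
# where() blanks out rows with no match at all (NaN instead of column 0)
df['match_col'] = eq.idxmax(axis=1).where(eq.any(axis=1))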

Count String Values in a Numeric Column Pandas

I have a dataframe:
    Name Hours_Worked
1  James            3
2    Sam          2.5
3  Billy            T
4  Sarah            A
5  Felix            5
First, how do I count the number of rows that contain non-numeric values?
Second, how do I filter to identify the rows that contain non-numeric values?
Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then create a mask with isna:
mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna()
# for older pandas versions:
#mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isnull()
Then count the True values with sum:
a = mask.sum()
print (a)
2
And filter by boolean indexing:
df1 = df[mask]
print (df1)
    Name Hours_Worked
3  Billy            T
4  Sarah            A
Detail:
print (mask)
1 False
2 False
3 True
4 True
5 False
Name: Hours_Worked, dtype: bool
Another way to check for numeric values:
def check_num(x):
    try:
        float(x)
        return False
    except ValueError:
        return True

mask = df['Hours_Worked'].apply(check_num)
At the end of the day, I did this to evaluate the strings in my numeric column:
df['Hr_String'] = pd.to_numeric(df['Hours_Worked'], errors='coerce')
I wanted it in a new column so I could filter; that felt a little more fluid for me:
df[df['Hr_String'].isnull()]
It returns:
    Name Hours_Worked  Hr_String
2  Billy            T        NaN
3  Sarah            A        NaN
I then did
df['Hr_String'].isnull().sum()
It returns:
2
Then I wanted the percentage of total rows so I did this:
df['Hr_String'].isnull().sum() / df.shape[0]
It returns:
0.4
Overall this approach worked for me. It helped me understand which string values were messing with my numeric column, and it lets me see the percentage: if it is really small I may just drop those rows for my analysis, but if it is large I'd have to figure out whether I can impute them or find some other fix.
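For reference, a compact self-contained version of that workflow (frame and values reconstructed from the thread):

import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Sam', 'Billy', 'Sarah', 'Felix'],
                   'Hours_Worked': ['3', '2.5', 'T', 'A', '5']})

df['Hr_String'] = pd.to_numeric(df['Hours_Worked'], errors='coerce')
bad = df[df['Hr_String'].isnull()]     # rows whose hours are non-numeric
print(len(bad))                        # 2
print(len(bad) / len(df))              # 0.4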
