Count String Values in a Numeric Column Pandas - python-3.x

I have a dataframe:
Name Hours_Worked
1 James 3
2 Sam 2.5
3 Billy T
4 Sarah A
5 Felix 5
First, how do I count the number of rows that contain non-numeric values?
Second, how do I filter to identify the rows that contain non-numeric values?

Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then create a mask with isna:
mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna()
#older pandas versions
#mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isnull()
Then count the True values with sum:
a = mask.sum()
print (a)
2
And filter by boolean indexing:
df1 = df[mask]
print (df1)
Name Hours_Worked
3 Billy T
4 Sarah A
Detail:
print (mask)
1 False
2 False
3 True
4 True
5 False
Name: Hours_Worked, dtype: bool
Another way to check for non-numeric values is a custom function:
def check_num(x):
    try:
        float(x)
        return False
    except ValueError:
        return True

mask = df['Hours_Worked'].apply(check_num)
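Either mask works the same way for both parts of the question:
print(mask.sum())   # number of rows with non-numeric values
print(df[mask])     # the offending rows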

At the end of the day I did this to evaluate the string values in my numeric column:
df['Hr_String'] = pd.to_numeric(df['Hours_Worked'], errors='coerce')
I wanted it in a new column so I could filter on it, which felt a little more fluid for me:
df[df['Hr_String'].isnull()]
It returns:
Name Hours_Worked Hr_String
3 Billy T NaN
4 Sarah A NaN
I then did
df['Hr_String'].isnull().sum()
It returns:
2
Then I wanted the percentage of total rows so I did this:
df['Hr_String'].isnull().sum() / df.shape[0]
It returns:
0.4
Overall this approach worked for me. It helped me understand which string values were messing with my numeric column, and it lets me see the percentage: if it is really small I may just drop those rows for my analysis, and if it is large I'd have to figure out whether I can impute them or find some other way to handle them.
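For those two follow-up options, a minimal sketch building on the Hr_String helper column might look like this (the Hr_Imputed name is just for illustration):
# drop the rows whose Hours_Worked could not be parsed as a number
df_clean = df.dropna(subset=['Hr_String'])
# or impute them, e.g. with the median of the values that did parse
df['Hr_Imputed'] = df['Hr_String'].fillna(df['Hr_String'].median())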

Related

I want to merge 4 rows to form 1 row with 4 sub-rows in pandas Dataframe

This is my dataframe
I have tried this but it didn't work:
df1['quarter'].str.contains('/^[-+](20)$/', re.IGNORECASE).groupby(df1['quarter'])
Thanks in advance
Hi and welcome to the forum! If I understood your question correctly, you want to form groups per year?
Of course, you can simply do a group by per year as you already have the column.
Assuming you didn't have the year column, you can simply group by the whole string except the last 2 characters of the quarter column. Like this (I created a toy dataset for the answer):
import pandas as pd
d = {'quarter' : pd.Series(['1947q1', '1947q2', '1947q3', '1947q4', '1948q1']),
     'some_value' : pd.Series([1, 3, 2, 4, 5])}
df = pd.DataFrame(d)
df
This is our toy dataframe:
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
4 1948q1 5
Now we simply group by the year by dropping the last 2 characters of the quarter column:
grouped = df.groupby(df.quarter.str[:-2])
for name, group in grouped:
    print(name)
    print(group, '\n')
Output:
1947
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
1948
quarter some_value
4 1948q1 5
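If the goal is to actually merge the quarterly rows into one row per year rather than just print the groups, you can aggregate after grouping; summing some_value here is only an example, any aggregation works:
per_year = df.groupby(df.quarter.str[:-2])['some_value'].sum()
print(per_year)
#quarter
#1947    10
#1948     5
#Name: some_value, dtype: int64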
Additional comment: I used an operation that you can always apply to strings. Check this, for example:
s = 'Hi there, Dhruv!'
#Prints the first 2 characters of the string
print(s[:2])
#Output: "Hi"
#Prints everything after the third character
print(s[3:])
#Output: "there, Dhruv!"
#Prints the text between the 10th and the 15th character
print(s[10:15])
#Output: "Dhruv"

Looking for NaN values in a specific column in df [duplicate]

Now I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now I am checking that dataframe across multiple columns. The first value is the column name (column1), the next value is the specific value I am looking for in that column (banana). I am then checking another column (column2) for a specific value (green). If both of these are true I want to carry out a specific task; if not, I want to do something else.
so something like:
if (df['column1']=='banana') & (df['colour']=='green'):
    do something
else:
    do something
If you want to check whether any row of the DataFrame meets your conditions, you can use .any() along with your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
A B
0 1 2
1 3 4
2 5 6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
That happens because in pandas, when you compare a series against a scalar value, each row of the series is compared against that scalar, and the result is a series of True/False values indicating the outcome of each row's comparison. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And the & performs an element-wise (row-wise) AND of the two series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check whether any of the values in this series is True you can use .any(); to check whether all of the values are True, use .all().
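Tying this back to the question, a minimal sketch of the if/else could look like this (the print calls are just placeholders for whatever task you actually want to run):
mask = (df['column1']=='banana') & (df['colour']=='green')
if mask.any():
    print('at least one row matches both conditions')   # do something
else:
    print('no row matches both conditions')              # do something else
# use mask.all() instead if every row must satisfy both conditions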

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare each row-wise value of that column to another column called Flag (a row-wise comparison for equality) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach to concatenate those columns into a single list then iterate through that list and see if there are any matches to that extra column? I am not very well versed in Pandas or Numpy yet.
I'm trying to think of something efficient as well as I have a large data set to perform this on.
Most of this is pretty free thought so I am just trying lots of different methods
Some attempts so far using the iterate over each column method:
##Sample Data
df = pd.DataFrame([['123','456','789','123'],['357','125','234','863'],['168','298','573','298'], ['123','234','573','902']])
df = df.rename(columns = {3: 'Flag'})
##Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check to see if the row-wise values match between 'Flag' and the columns 0-2. Then return if they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then return the row if any of the columns match with .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
0 1 2 Flag indicator
0 123 456 789 123 True
1 357 125 234 863 False
2 168 298 573 298 True
3 123 234 573 902 False
The mask you get back can also be used to select rows with boolean indexing. Breaking it down, the element-wise comparison with eq returns a boolean DataFrame:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
0 1 2
0 True False False
1 False False False
2 False True False
3 False False False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
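If, as the question mentions, you also want to return the matching field itself, one possible sketch (match_col is just an illustrative column name) is:
matches = df.iloc[:, :-1].eq(df['Flag'], axis=0)
# first matching column label per row, NaN where nothing matched
df['match_col'] = matches.idxmax(axis=1).where(matches.any(axis=1))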

Extract dates in rows of pandas dataframe when column value below certain range

I have a pandas dataframe df with the contents below:
Date Factor Expiry Grade
0 12/31/1991 2.138766 3/30/1992 -3.33%
1 10/29/1992 2.031381 2/8/1993 -1.06%
2 5/20/1993 2.075670 6/4/1993 -6.38%
3 10/11/1994 1.441644 11/22/1994 -7.80%
4 1/11/1995 1.669600 1/20/1995 -7.39%
5 5/15/1995 1.655237 8/8/1995 -8.68%
6 10/17/1996 0.942000 10/22/1996 -7.39%
7 2/19/1998 0.838838 5/26/1998 13.19%
8 7/9/1998 1.303637 8/28/1998 -6.73%
9 12/29/1998 1.517232 1/21/1999 -11.03%
10 4/26/1999 1.613346 5/24/1999 -7.55%
11 7/8/1999 2.136339 9/23/1999 5.43%
12 3/22/2000 5.097782 3/29/2000 -6.44%
I would like to extract the dates under the Date column corresponding to the rows with Grade <= -8%.
The desired output is a list of strings like this:
output_dates = ['5/15/1995', '12/29/1998']
I am using python v3.6
Use rstrip to remove the trailing %, convert to float and compare with le (<=) to get a boolean mask, then filter by boolean indexing:
out = df.loc[df['Grade'].str.rstrip('%').astype(float).le(-8), 'Date']
print (out)
5 5/15/1995
9 12/29/1998
Name: Date, dtype: object
Or for list:
out = df.loc[df.Grade.str.rstrip('%').astype(float).le(-8), 'Date'].tolist()
print (out)
['5/15/1995', '12/29/1998']
Use
In [464]: df.loc[df.Grade.str[:-1].astype(float).lt(-8), 'Date']
Out[464]:
5 5/15/1995
9 12/29/1998
Name: Date, dtype: object
In [465]: df.loc[df.Grade.str[:-1].astype(float).lt(-8), 'Date'].tolist()
Out[465]: ['5/15/1995', '12/29/1998']
Or, use
df.Grade.str.replace('%', '').astype(float)
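That replace-based conversion slots into the same filtering pattern, for example with the same threshold as above:
df.loc[df.Grade.str.replace('%', '').astype(float).le(-8), 'Date'].tolist()
#['5/15/1995', '12/29/1998']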

Filter columns based on a value (Pandas): TypeError: Could not compare ['a'] with block values

I'm trying to filter DataFrame columns based on a value.
In[41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})
In[42]: df
Out[42]:
A B
0 a 6
1 2 7
2 3 8
3 4 9
4 5 10
Filtering columns:
In[43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
A
0 a
1 2
2 3
3 4
4 5
It works! But when I use strings,
In[44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first promote the dtype of column B to object; then it will work.
df.B = df.B.astype(object)
Always check data types of the columns before performing the operations using
df.info()
You could do this with masks instead, for example:
df[df.A!='a'].A
and to filter from any column:
df[df.apply(lambda x: sum([x_=='a' for x_ in x])==0, axis=1)]
The problem is due to the fact that there are numeric and string objects in the dataframe.
You can loop through each column and check each column as a series for a specific value using
(Series=='a').any()
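For instance, a small sketch of that per-column loop on the toy df above:
for col in df.columns:
    # each column is compared as a Series, so mixed dtypes across columns are not an issue
    print(col, (df[col] == 'a').any())
#A True
#B False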
