Extract dates from rows of a pandas dataframe when a column value is below a certain threshold - python-3.x

I have a pandas DataFrame df with the contents below:
Date Factor Expiry Grade
0 12/31/1991 2.138766 3/30/1992 -3.33%
1 10/29/1992 2.031381 2/8/1993 -1.06%
2 5/20/1993 2.075670 6/4/1993 -6.38%
3 10/11/1994 1.441644 11/22/1994 -7.80%
4 1/11/1995 1.669600 1/20/1995 -7.39%
5 5/15/1995 1.655237 8/8/1995 -8.68%
6 10/17/1996 0.942000 10/22/1996 -7.39%
7 2/19/1998 0.838838 5/26/1998 13.19%
8 7/9/1998 1.303637 8/28/1998 -6.73%
9 12/29/1998 1.517232 1/21/1999 -11.03%
10 4/26/1999 1.613346 5/24/1999 -7.55%
11 7/8/1999 2.136339 9/23/1999 5.43%
12 3/22/2000 5.097782 3/29/2000 -6.44%
I would like to extract the dates under the Date column for the rows where Grade <= -8%.
The desired output is a list of strings like this:
output_dates = ['5/15/1995', '12/29/1998']
I am using Python 3.6.

Use rstrip to remove the trailing %, convert to float, compare with le (<=) to build a boolean mask, and filter by boolean indexing:
out = df.loc[df['Grade'].str.rstrip('%').astype(float).le(-8), 'Date']
print (out)
5 5/15/1995
9 12/29/1998
Name: Date, dtype: object
Or, for a list:
out = df.loc[df.Grade.str.rstrip('%').astype(float).le(-8), 'Date'].tolist()
print (out)
['5/15/1995', '12/29/1998']

Use
In [464]: df.loc[df.Grade.str[:-1].astype(float).lt(-8), 'Date']
Out[464]:
5 5/15/1995
9 12/29/1998
Name: Date, dtype: object
In [465]: df.loc[df.Grade.str[:-1].astype(float).lt(-8), 'Date'].tolist()
Out[465]: ['5/15/1995', '12/29/1998']
Or, use
df.Grade.str.replace('%', '').astype(float)
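For reference, here is a minimal end-to-end sketch of the rstrip approach, assuming a DataFrame rebuilt from a few of the rows above (abbreviated for brevity):
import pandas as pd

# Abbreviated reconstruction of the question's data.
df = pd.DataFrame({
    'Date': ['5/15/1995', '12/29/1998', '3/22/2000'],
    'Grade': ['-8.68%', '-11.03%', '-6.44%'],
})

# Strip the trailing %, cast to float, and keep Date where Grade <= -8.
output_dates = df.loc[df['Grade'].str.rstrip('%').astype(float).le(-8), 'Date'].tolist()
print(output_dates)   # ['5/15/1995', '12/29/1998']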

Related

How can I get the operator type and apply a formula to a pandas dataframe?

I have a Python string like this: str1='PRODUCT1_PRD/2+PRODUCT2_NON-PROD-PRODUCT3_NON-PROD/2'
Here I want to extract the operators dynamically and, based on each operator, perform the corresponding operation. I have a pandas dataframe df like this:
PRODUCT PRD NON-PROD
PRODUCT1 3 5
PRODUCT2 4 6
PRODUCT3 5 8
As output, I want a variable var1 = (3/2)+6-(8/2) = 3.5 after applying the above formula. How can I do this in the most efficient way?
One thing to note: I have multiple formulas like the one above, all stored in a list of strings, so I have to apply them one by one.
First create a MultiIndex Series with DataFrame.set_index and DataFrame.stack, then join the index levels with '_' using map:
s = df.set_index('PRODUCT').stack()
s.index = s.index.map('_'.join)
print (s)
PRODUCT1_PRD 3
PRODUCT1_NON-PROD 5
PRODUCT2_PRD 4
PRODUCT2_NON-PROD 6
PRODUCT3_PRD 5
PRODUCT3_NON-PROD 8
dtype: int64
Then replace each key in the string with its value from the Series and call pandas.eval:
str1='PRODUCT1_PRD/2+PRODUCT2_NON-PROD-PRODUCT3_NON-PROD/2'
for k, v in s.items():
    str1 = str1.replace(k, str(v))
print (str1)
3/2+6-8/2
print (pd.eval(str1))
3.5
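Since the question mentions a whole list of formulas, here is a minimal sketch that applies the same substitution to each of them (formulas is a hypothetical name for that list, and the second formula is made up for illustration):
formulas = [
    'PRODUCT1_PRD/2+PRODUCT2_NON-PROD-PRODUCT3_NON-PROD/2',
    'PRODUCT3_PRD-PRODUCT1_NON-PROD/2',
]

results = []
for f in formulas:
    # Substitute each PRODUCT_COLUMN key with its numeric value from s.
    for k, v in s.items():
        f = f.replace(k, str(v))
    results.append(pd.eval(f))
print(results)   # [3.5, 2.5] for the hypothetical formulas above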

Count String Values in a Numeric Column Pandas

I have a dataframe:
Name Hours_Worked
1 James 3
2 Sam 2.5
3 Billy T
4 Sarah A
5 Felix 5
First, how do I count the number of rows that contain non-numeric values?
Second, how do I filter to identify the rows that contain non-numeric values?
Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then create a mask with isna:
mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna()
# older pandas versions
# mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isnull()
Then count the True values with sum:
a = mask.sum()
print (a)
2
And filter by boolean indexing:
df1 = df[mask]
print (df1)
Name Hours_Worked
3 Billy T
4 Sarah A
Detail:
print (mask)
1 False
2 False
3 True
4 True
5 False
Name: Hours_Worked, dtype: bool
Another way to check for numeric values:
def check_num(x):
    try:
        float(x)
        return False
    except ValueError:
        return True
mask = df['Hours_Worked'].apply(check_num)
At the end of the day I did this to kind of evaluate the strings in my numeric column:
df['Hr_String'] = pd.to_numeric(df['Hours_Worked'], errors='coerce')
I wanted it in a new column so I could filter, and it felt a little more fluid for me:
df[df['Hr_String'].isnull()]
It returns:
Name Hours_Worked Hr_String
3 Billy T NaN
4 Sarah A NaN
I then did
df['Hr_String'].isnull().sum()
It returns:
2
Then I wanted the percentage of total rows so I did this:
df['Hr_String'].isnull().sum() / df.shape[0]
It returns:
0.4
Overall this approach worked for me. It helped me understand which string values were messing with my numeric column, and it let me see the percentage; if that was really small, I might just drop those rows for my analysis. If the percentage was large, I'd have to figure out whether I could impute them or work something else out.
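As a side note, since the mask is boolean, its mean gives that fraction in a single step. A small sketch, assuming the same df as above:
# The mean of a boolean mask is the fraction of True values,
# i.e. the share of rows with non-numeric entries.
frac = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna().mean()
print(frac)   # 0.4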

How to calculate the time difference between rows, group by a column, and extract only the most recent ones?

I want to calculate the number of days between two rows with a groupby function and extract only one row with the latest date. I don't want all the rows with the same id value; instead I want the most recent one, with the number of days as a new column.
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-27 16:36:04
2 A 2016-11-29 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-28 16:35:46
I want the output as
id no of days
0 A 4(approx)
1 B 3(approx)
So what I want is only the row (index 2) with id A that has the most recent date and time, omitting the rest of the rows.
IIUC
df.time=pd.to_datetime(df.time)
df.groupby('id').time.apply(lambda x : (x.max()-x.min()).days)
Out[1186]:
id
A 4
B 3
Name: time, dtype: int64
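To get exactly the two-column shape shown in the question, the same aggregation can be turned back into a DataFrame. A minimal sketch, with 'no of days' as an assumed column name:
out = (df.groupby('id').time
         .apply(lambda x: (x.max() - x.min()).days)
         .reset_index(name='no of days'))
print(out)
#   id  no of days
# 0  A           4
# 1  B           3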

min() function in pandas column

I have a dataframe like the following (df1):
col1 val
0 A AX
1 A 2
2 A 11
3 A 13
4 A BX
5 A 20
I want to pick the row with minimum value. Hence I wrote the following:
df2 = df1.groupby(['col1'])['val'].min()
The output I get from this is:
col1
A 11
Name: val, dtype: object
It seems like the values AX and BX are causing the column to be read as object, so it sorts lexicographically and finds '11' as the minimum. How can I modify it so that it compares numerically and outputs:
A 2
Thanks in advance.
You need to convert the column to numeric first, because min works happily on strings and simply returns the lexicographically smallest value (compared character by character, by ASCII value):
df2 = pd.to_numeric(df1['val'], errors='coerce').groupby(df1['col1']).min().astype(int)
print (df2)
col1
A 2
Name: val, dtype: int32
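If the goal is the whole row with the minimum value rather than just the value itself, here is a sketch using idxmin on the coerced column (assuming the df1 above):
# Coerce to numeric (AX/BX become NaN), then locate the row with
# the smallest value per group via idxmin (NaNs are skipped).
num = pd.to_numeric(df1['val'], errors='coerce')
rows = df1.loc[num.groupby(df1['col1']).idxmin()]
print(rows)
#   col1 val
# 1    A   2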

Filter columns based on a value (Pandas): TypeError: Could not compare ['a'] with block values

I'm trying to filter a DataFrame's columns based on a value.
In[41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})
In[42]: df
Out[42]:
A B
0 a 6
1 2 7
2 3 8
3 4 9
4 5 10
Filtering columns:
In[43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
A
0 a
1 2
2 3
3 4
4 5
It works! But when I use strings,
In[44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first promote the dtype of column B to object; then it will work (plain object is used here, since np.object was removed from recent NumPy versions):
df.B = df.B.astype(object)
Always check the data types of the columns before performing operations, using
df.info()
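For the example frame above, df.dtypes gives a compact view of the mix (a quick sketch):
print(df.dtypes)
# A    object
# B     int64
# dtype: object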
You could do this with masks instead, for example:
df[df.A!='a'].A
and to filter out rows where any column equals 'a':
df[df.apply(lambda x: sum([x_=='a' for x_ in x])==0, axis=1)]
The problem is due to the fact that there are both numeric and string objects in the dataframe.
You can loop through the columns and check each one as a Series for a specific value using
(Series=='a').any()
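Putting the first suggestion together, here is a minimal sketch of the original column filter working once column B holds objects (this assumes an older pandas where the mixed comparison raised the TypeError):
import pandas as pd

df = pd.DataFrame({'A': ['a', 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Cast the numeric column to object so the elementwise string
# comparison no longer fails on older pandas versions.
df['B'] = df['B'].astype(object)

# Keep only the columns whose first-row value is not 'a'.
out = df.loc[:, (df != 'a').iloc[0]]
print(out)
#    B
# 0   6
# 1   7
# 2   8
# 3   9
# 4  10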
