Converting dataframe fraction to float - python-3.x

I would like to convert the string values in column B to float. Wondering how I should do it.
A B
1 16-1/4
2 3-1/4
3 21-1/4
4 8-1/4

Update:
Give map a try to avoid pd.eval's 100-row limit:
df['C'] = df.B.str.replace('-', '+').map(pd.eval)
Original:
As per your comment, it seems you are adding the fraction to the whole number, so the solution would be:
df['C'] = pd.eval(df.B.str.replace('-', '+'))
Out[5]:
A B C
0 1 16-1/4 16.25
1 2 3-1/4 3.25
2 3 21-1/4 21.25
3 4 8-1/4 8.25

Use the built-in Python function eval():
df.B = df.B.apply(eval)
Test:
In[1]: df
A B
0 1 16-1/4
1 2 3-1/4
2 3 21-1/4
3 4 8-1/4
In[2]: df.B = df.B.apply(eval)
In[3]: df
A B
0 1 15.75
1 2 2.75
2 3 20.75
3 4 7.75
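Note that plain eval treats the dash as subtraction (16-1/4 becomes 15.75), which differs from the mixed-number results above. If you'd rather avoid eval entirely, here is a minimal sketch using the standard fractions module, assuming every value in B has the whole-number-dash-fraction form shown in the question (the mixed_to_float helper is just illustrative):
from fractions import Fraction

def mixed_to_float(s):
    # split "16-1/4" into its whole and fractional parts and add them
    whole, frac = s.split('-')
    return int(whole) + float(Fraction(frac))

df['C'] = df['B'].apply(mixed_to_float)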

Related

Adding NaN changes dtype of column in Pandas dataframe

I have an int dataframe:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
But if I set a value to NaN, the whole column is cast to floats! Apparently int columns can't have NaN values. But why is that?
>>> df.iloc[2,1] = np.nan
>>> df
0 1 2
0 0 1.0 2
1 3 4.0 5
2 6 NaN 8
3 9 10.0 11
For performance reasons (which make a big impact in this case), pandas wants each column to hold a single type, and will do its best to keep it that way. NaN is a float value, and all your integers can be harmlessly converted to floats, so that's what happens.
If no harmless conversion exists, the column is upcast to the general object dtype, which is also what you would need for a mixed-type column:
>>> x = pd.DataFrame(np.arange(4).reshape(2,2))
>>> x
0 1
0 0 1
1 2 3
>>> x[1].dtype
dtype('int64')
>>> x.iloc[1, 1] = 'string'
>>> x
0 1
0 0 1
1 2 string
>>> x[1].dtype
dtype('O')
Since 1 can't be converted to a string in a reasonable manner (without guessing what the user wants), the dtype is converted to object, which is general and doesn't allow for any optimizations. This does, however, give you what you need for a mixed-type column:
>>> x[1] = x[1].astype('O') # Alternatively use a non-float NaN object
>>> x.iloc[1, 1] = np.nan # or float('nan')
>>> x
0 1
0 0 1
1 2 NaN
This is usually not recommended though, unless you really have to.
Not ideal, but visually nicer, is to use pd.NA rather than np.nan:
>>> df.iloc[2,1] = pd.NA
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 <NA> 8
3 9 10 11
This seems fine, but:
>>> df.dtypes
0 int64
1 object # <- not float, but object
2 int64
dtype: object
You can read more in the pandas documentation on missing data.
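If you want to keep integers alongside missing values, one option worth knowing is pandas' nullable integer extension dtype Int64 (available in recent pandas versions). A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3)).astype('Int64')
df.iloc[2, 1] = pd.NA   # the column stays Int64 instead of falling back to float or object
print(df.dtypes)        # all three columns remain Int64
print(df)               # the missing cell shows as <NA>, the rest stay integers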

Access substring in a dataframe column to create a new column

I've a dataframe
df = pd.DataFrame(np.random.randint(0,10,size=(5, 1)), columns=list('A'))
df.insert(0, 'n', ['this-text in presence 20-30%, and another string','id XDTV/HGF, publication',
'this-text, 37$degree','this-text K0.5, coefficient 0.007',' '])
>>> df
n A
0 this-text in presence 20-30%, and another string 2
1 id XDTV/HGF, publication 1
2 this-text, 37$degree 4
3 this-text K0.5, coefficient 0.007 1
4 2
I'd like to create a new column
>>> df
new A
0 this-text 2
1 1
2 this-text 4
3 this-text 1
4 2
I could save the column n in a list and check if each item of the list contains the substring this-text. But I'd like to know if there are better ways of doing this.
Suggestions will be really helpful.
Try str.findall or str.extract:
df['new']=df.n.str.findall('this-text').str[0]
#df.n.str.extract('(this-text)')[0]
df
Out[373]:
n A new
0 this-text in presence 20-30%, and another string 7 this-text
1 id XDTV/HGF, publication 4 NaN
2 this-text, 37$degree 6 this-text
3 this-text K0.5, coefficient 0.007 0 this-text
4 7 NaN
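Another sketch, assuming the df from the question: since you only need the fixed substring, str.contains with numpy.where avoids the intermediate list of matches (rows without a match get an empty string here instead of NaN):
import numpy as np

df['new'] = np.where(df['n'].str.contains('this-text', regex=False), 'this-text', '')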

Selective multiplication of a pandas dataframe

I have a pandas Dataframe and Series of the form
df = pd.DataFrame({'Key':[2345,2542,5436,2468,7463],
'Segment':[0] * 5,
'Values':[2,4,6,6,4]})
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 4
2 5436 0 6
3 2468 0 6
4 7463 0 4
s = pd.Series([5436, 2345])
print (s)
0 5436
1 2345
dtype: int64
In the original df, I want to multiply the third column (Values) by 7, except for the keys which are present in the series. So my final df should look like:
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
What would be the best way to achieve this in Python 3.x?
Use DataFrame.loc with Series.isin to filter the Values column, invert the condition to select non-members, and multiply by the scalar:
df.loc[~df['Key'].isin(s), 'Values'] *= 7
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
Another method could be to use numpy.where():
df['Values'] *= np.where(~df['Key'].isin([5436, 2345]), 7, 1)
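For completeness, a minimal sketch of the same idea with Series.where, which keeps the original value where the key is in s and otherwise takes the value multiplied by 7:
df['Values'] = df['Values'].where(df['Key'].isin(s), df['Values'] * 7)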

How to change the value of a cell that contains NaN to another specific value?

I have a dataframe that contains NaN values in a particular column. While iterating through the rows, if I come across a NaN (detected with isnan()), I need to change it to some other value (since I have some conditions). I tried replace() and fillna() with the limit parameter, but they modify the whole column as soon as they hit the first NaN. Is there a way to assign a value to a specific NaN rather than changing all the values of the column?
Example: the dataframe looks like this:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect the result to look like
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I wanted to change the target_class values based on some conditions and assign values from the above list.
I believe you need to replace NaN values with 1 only for the indexes specified in the list idx:
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
EDIT:
Per the comments, the solution is to assign values by index and column with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
I suppose your conditions for imputing the NaN values do not depend on how many of them are in a column. In the code below I stored all the imputation rules in one function that receives as parameters the entire row (containing the NaN) and the column you are investigating. If you also need the whole dataframe for the imputation rules, just pass it through the replace_nan function. In the example I impute the col element with the mean of the other columns.
import pandas as pd
import numpy as np

def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row

df = pd.DataFrame(np.random.rand(5, 3), columns=['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
The only thing you need to do is make the right assignment, that is, assign values only to the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in the 'label' column. Then, we load the dataset:
df = pd.read_csv('dataset.csv')
Now, we build the appropriate condition:
cond = df['label'].isnull()
Now, we make the assignment over these rows (I don't know your assignment logic, so I simply assign 1 to the NaNs):
df.loc[cond, 'label'] = 1
There are other, more precise approaches as well; for example, the fillna() method could be used once you know the fill logic.
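If different null rows need different fill values, a minimal sketch building on the event dataset above: fillna accepts a Series aligned on the index, so each row can receive its own value (the 1.0 and 0.0 below are hypothetical):
import pandas as pd

fills = pd.Series({3: 1.0, 4: 0.0})       # per-row fill values for the null rows
df['label'] = df['label'].fillna(fills)   # rows 3 and 4 get 1.0 and 0.0, other rows are untouched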

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new to Python.
I'm trying to compute a cumulative sum for each client to count the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1's therefore needs to be reset whenever a 0 occurs, and also whenever a new client starts. See the example below, where a is the client column and b holds the dates.
After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume I kind of need to put them together.
Adapting the code of 'Cumsum reset at NaN' to reset at 0 works:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply and cumsum after identifying runs of contiguous values within each group. Then use groupby.cumcount to get an integer count within each run, adding 1 since the count starts at 0.
Multiply by the original column to act as an AND, cancelling all zeros and keeping only the positive runs.
df['d'] = df.groupby('a')['c'] \
.apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way would be to apply a function after series.expanding on the groupby object, which computes values on the series from the first index up to the current index.
Then use reduce to apply a two-argument function cumulatively to the items, reducing the iterable to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
I think you need a custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    x.e = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return x
print (df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4
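A fully vectorized sketch without apply (not from the answers above, assuming the df from the question): build a block id that increases at every 0 within each client, then take the cumulative sum of c within each (client, block) group, so the count resets at every 0 and at every new client.
block = df['c'].eq(0).groupby(df['a']).cumsum()        # a new block starts after each 0, per client
df['d'] = df.groupby([df['a'], block])['c'].cumsum()   # running count of consecutive 1s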
