Adding NaN changes dtype of column in Pandas dataframe - python-3.x

I have an int dataframe:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
But if I set a value to NaN, the whole column is cast to floats! Apparently int columns can't have NaN values. But why is that?
>>> df.iloc[2,1] = np.nan
>>> df
0 1 2
0 0 1.0 2
1 3 4.0 5
2 6 NaN 8
3 9 10.0 11

For performance reasons (which have a big impact in this case), pandas wants all values in a column to be of the same type, and does its best to keep it that way. NaN is a float value, and all your integers can be harmlessly converted to floats, so that's what happens.
If no such conversion is possible, the column falls back to a more general type:
>>> x = pd.DataFrame(np.arange(4).reshape(2,2))
>>> x
0 1
0 0 1
1 2 3
>>> x[1].dtype
dtype('int64')
>>> x.iloc[1, 1] = 'string'
>>> x
0 1
0 0 1
1 2 string
>>> x[1].dtype
dtype('O')
Since 1 can't be converted to a string in a reasonable manner (without guessing what the user wants), the column is converted to object, which is general and doesn't allow for any optimizations. It does, however, give you what you need for a multi-type column:
>>> x[1] = x[1].astype('O') # Alternatively use a non-float NaN object
>>> x.iloc[1, 1] = np.nan # or float('nan')
>>> x
0 1
0 0 1
1 2 NaN
This is usually not recommended unless you really have to, though.

Not the best option, but visually nicer, is to use pd.NA rather than np.nan:
>>> df.iloc[2,1] = pd.NA
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 <NA> 8
3 9 10 11
This looks good, but:
>>> df.dtypes
0 int64
1 object # <- not float, but object
2 int64
dtype: object
You can read this page from the documentation.
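If the goal is an integer column that can still hold missing values, pandas' nullable integer dtype may be worth a look. The sketch below is a minimal illustration, assuming pandas 1.0 or later; the frame is rebuilt here only so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Rebuild the 4x3 integer frame from the question.
df = pd.DataFrame(np.arange(12).reshape(4, 3))

# Opt in to the nullable integer dtype (capital "I") for column 1.
df[1] = df[1].astype("Int64")

# pd.NA can now be stored without casting the column to float or object.
df.iloc[2, 1] = pd.NA

print(df)
print(df.dtypes)
```

Only column 1 changes dtype here; the other columns keep their plain NumPy integer dtype.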

Related

Multiplying 2 pandas dataframes generates nan

I have 2 dataframes as below
import pandas as pd
dat = pd.DataFrame({'val1' : [1,2,1,2,4], 'val2' : [1,2,1,2,4]})
dat1 = pd.DataFrame({'val3' : [1,2,1,2,4]})
Now I want to multiply each column of dat by dat1, so I did the following:
dat * dat1
However this generates nan value for all elements.
Could you please help on what is the correct approach? I could run a for loop with each column of dat, but I wonder if there are any better method available to perform the same.
Thanks for your pointer.
When doing multiplication (or any arithmetic operation), pandas does index alignment. For dataframes, this applies to both the index and the columns. Where labels match, it multiplies; otherwise it puts NaN, and the result has the union of the operands' indices and columns.
So, to "avoid" this alignment, make dat1 a label-unaware data structure, e.g., a NumPy array:
In [116]: dat * dat1.to_numpy()
Out[116]:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16
To see what's "really" being multiplied, you can align yourself:
In [117]: dat.align(dat1)
Out[117]:
( val1 val2 val3
0 1 1 NaN
1 2 2 NaN
2 1 1 NaN
3 2 2 NaN
4 4 4 NaN,
val1 val2 val3
0 NaN NaN 1
1 NaN NaN 2
2 NaN NaN 1
3 NaN NaN 2
4 NaN NaN 4)
(Extra: dat and dat1 currently share the same index; change the index of one of them and align again to see the union behaviour.)
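The union behaviour can be seen in a small sketch. The frames left and right are hypothetical and chosen only for illustration; they share a column name but only partly overlap in their row labels:

```python
import pandas as pd

# Same column name, but the row labels only overlap on 1 and 2.
left = pd.DataFrame({"val1": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"val1": [10, 20, 30]}, index=[1, 2, 3])

# Arithmetic aligns on labels first: the result index is the union
# of both indices, and labels present on only one side become NaN.
product = left * right
print(product)
```

Rows 0 and 3 exist on one side only, so they come out as NaN, while rows 1 and 2 are multiplied label by label.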
You need to change two things:
use mul with axis=0
use a Series instead of dat1 (otherwise the multiplication will try to align the labels, and your two dataframes have no column labels in common)
out = dat.mul(dat1['val3'], axis=0)
output:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16

Converting dataframe fraction to float

I would like to convert the string values in column B to float. How should I do it?
A B
1 16-1/4
2 3-1/4
3 21-1/4
4 8-1/4
Update:
Give map a try to avoid the 100-row limit of pd.eval:
df['C'] = df.B.str.replace('-', '+').map(pd.eval)
Original:
Per your comment, it seems you are adding the fraction to the whole number, so the solution would be:
df['C'] = pd.eval(df.B.str.replace('-', '+'))
Out[5]:
A B C
0 1 16-1/4 16.25
1 2 3-1/4 3.25
2 3 21-1/4 21.25
3 4 8-1/4 8.25
Use the built-in Python function eval() (note that this treats the dash as subtraction):
df.B = df.B.apply(eval)
Test:
In[1]: df
A B
0 1 16-1/4
1 2 3-1/4
2 3 21-1/4
3 4 8-1/4
In[2]: df.B = df.B.apply(eval)
In[3]: df
A B
0 1 15.75
1 2 2.75
2 3 20.75
3 4 7.75

Pairwise operations in Scikit-Learn and different filtering conditions on each pair

I have the following 2 data frames, say df1
a b c d
0 0 1 2 3
1 4 0 0 7
2 8 9 10 11
3 0 0 0 15
and df2
a b c d
0 5 1 2 3
What I am interested in doing is a pairwise operation on each row in df1 with the single row in df2. However, if a column in a row of df1 is 0, then that column is used in neither the df1 row nor df2 row to perform the pairwise operation. So each pairwise operation will work on pairs of rows of different length. Let me break down what the 4 comparisons should be.
Comparison 1
0 1 2 3 vs 5 1 2 3
The pairwise operation is done on 1 2 3 vs 1 2 3 as column a has a 0
Comparison 2
4 0 0 7 vs 5 1 2 3 is done on 4 7 vs 5 3 as we have 2 columns that need to be dropped
Comparison 3
8 9 10 11 vs 5 1 2 3 is done on 8 9 10 11 vs 5 1 2 3 as no columns are dropped
Comparison 4
0 0 0 15 vs 5 1 2 3 is done on 15 vs 3 as all but one column is dropped
The result of each pairwise operation is a scalar so the result is some sort of structure whether it be list, array, data frame, whatever with 4 (or the number of rows in df1) values. Also, I should note that values in df2 are irrelevant and no filtering is done based upon the value of any column in df2.
For simplicity, you could try looping over each row in the dataframe and do something like this:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# loop over each row in 'a'
for i in range(len(a)):
    # find indices of non-zero elements of the row
    non_zero = np.nonzero(a.iloc[i].to_numpy())[0]
    # perform pair-wise addition between non-zero elements in 'a' and the same elements in 'b'
    print(a.iloc[i].to_numpy()[non_zero] + b.iloc[0].to_numpy()[non_zero])
Here I used pair-wise addition but you could replace the addition with an operation of your choosing.
Edit:
We may want to vectorize this to avoid the loop if the dataframes are large. Here is an idea for that, where we convert zero values to nan so they are ignored in the row-wise operation:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# find positions of zeros (boolean mask with the same shape as 'a')
zeros = (a==0).values
# set zeros to nan
a[zeros] = np.nan
# tile and reshape 'b' so it's the same shape as 'a'
b = pd.DataFrame(np.tile(b, len(a)).reshape(np.shape(a)), columns=b.columns)
# set the same positions in 'b' to nan
b[zeros] = np.nan
print('a:')
print(a)
print('b:')
print(b)
# now do some row-wise operation. For example take the sum of each row
print('sum:')
print(np.sum(a+b, axis=1))
Output:
a:
a b c d
0 NaN 1.0 2.0 3
1 4.0 NaN NaN 7
2 8.0 9.0 10.0 11
3 NaN NaN NaN 15
b:
a b c d
0 NaN 1.0 2.0 3
1 5.0 NaN NaN 3
2 5.0 1.0 2.0 3
3 NaN NaN NaN 3
sum:
0 12.0
1 19.0
2 49.0
3 18.0
dtype: float64
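The same masking idea can be written more compactly with DataFrame.where. This is only a sketch, assuming (as in the answer above) that the pairwise operation is an element-wise addition followed by a row-wise sum; swap in your own operation as needed:

```python
import pandas as pd

a = pd.DataFrame([[0, 1, 2, 3], [4, 0, 0, 7], [8, 9, 10, 11], [0, 0, 0, 15]],
                 columns=["a", "b", "c", "d"])
b = pd.DataFrame([[5, 1, 2, 3]], columns=["a", "b", "c", "d"])

# Adding a Series to a DataFrame aligns the Series index with the columns,
# so the single row of 'b' is broadcast across every row of 'a'.
pairwise = a + b.iloc[0]

# Keep only positions where 'a' is non-zero; everything else becomes NaN
# and is skipped by sum().
result = pairwise.where(a != 0).sum(axis=1)
print(result)
```

This avoids tiling b and leaves both input frames unmodified.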

How to replace selected rows of pandas dataframe with a np array, sequentially?

I have a pandas dataframe
A B C
0 NaN 2 6
1 3.0 4 0
2 NaN 0 4
3 NaN 1 2
where I have a column A that has NaN values in some rows (not necessarily consecutive).
I want to replace these values not with a constant value (which pd.fillna does), but rather with the values from a numpy array.
So the desired outcome is:
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
I'm not sure the .replace method will help here as well, since that seems to replace value <-> value via dictionary. Whereas here I want to sequentially change NaN to its corresponding value (by index) in the np array.
I tried:
MWE:
huh = pd.DataFrame([[np.nan, 2, 6],
                    [3, 4, 0],
                    [np.nan, 0, 4],
                    [np.nan, 1, 2]],
                   columns=list('ABC'))
huh.A[huh.A.isnull()] = np.array([1,5,7]) # what i want to do, but this gives error
gives the error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I read the docs but I can't understand how to do this with .loc.
How do I do this properly, preferably without a for loop?
Other info:
The number of elements in the np array will always match the number of NaN in the dataframe, so your answer does not need to check for this.
You are really close; you need DataFrame.loc to avoid chained assignment:
huh.loc[huh.A.isnull(), 'A'] = np.array([1,5,7])
print (huh)
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
zip
This should account for uneven lengths
m = huh.A.isna()
a = np.array([1, 5, 7])
s = pd.Series(dict(zip(huh.index[m], a)))
huh.fillna({'A': s})
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
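Another way to express the same idea is to give the replacement values the row labels of the NaN positions and let Series.fillna align on the index. This is a minimal sketch, assuming (as the question guarantees) that the array has exactly as many elements as there are NaN values in column A:

```python
import numpy as np
import pandas as pd

huh = pd.DataFrame([[np.nan, 2, 6],
                    [3, 4, 0],
                    [np.nan, 0, 4],
                    [np.nan, 1, 2]],
                   columns=list("ABC"))
values = np.array([1, 5, 7])

missing = huh["A"].isna()
# Give each replacement value the row label of the NaN it should fill.
fill = pd.Series(values, index=huh.index[missing])
# fillna aligns on the index, so each NaN receives its own value.
huh["A"] = huh["A"].fillna(fill)
print(huh)
```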

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new at python.
I am trying to compute a cumulative sum for each client to see the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1's therefore needs to be reset when we have a 0. The reset needs to happen as well when we have a new client. See the example below, where a is the column of clients and b are the dates.
After some research, I found the question 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume that I kind of need to put them together.
Adapting the code from 'Cumsum reset at NaN' to reset at 0 works:
cumsum = v.cumsum().ffill()
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
This should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply with cumsum to label runs of contiguous values within each group, then groupby.cumcount to number the rows within each run, adding 1 so the count starts at 1.
Multiply by the original column to apply AND logic: the zeros cancel out and only the 1s keep their count.
# group_keys=False keeps the original row index so the result aligns with df
df['d'] = df.groupby('a', group_keys=False)['c'] \
    .apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way would be to apply a function after Series.expanding on the groupby object, which computes values on the series from the first index up to the current index.
Then use reduce to apply a function of two arguments cumulatively to the items of the iterable, reducing it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
I think you need a custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
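def f(x):
    x = x.copy()  # avoid modifying the group in place
    # .ix was removed from pandas; .loc is the label-based replacement
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    x.e = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return (x)
print (df.groupby('a', group_keys=False).apply(f))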
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4
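Since calculation time matters for a large dataset, it may help to avoid apply altogether. The following is only a sketch of one possible vectorized variant, using the question's a and c columns; block is a hypothetical helper name for the run labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   "c": [1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]})

# A new block starts whenever the flag changes or a new client begins.
new_block = df["c"].ne(df["c"].shift()) | df["a"].ne(df["a"].shift())
block = new_block.cumsum()

# Number the rows inside each block from 1, then multiply by the flag
# so that blocks of zeros stay at 0.
df["d"] = df["c"] * (df.groupby(block).cumcount() + 1)
print(df)
```

Including the client column in the block definition is what makes the count restart when the client changes, even if the flag stays at 1 across the boundary.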
