counting NaNs in 'for' loop in python - python-3.x

I am trying to iterate over the rows in df and count consecutive rows where a certain value is NaN or 0, restarting the count whenever the value changes away from NaN or 0. I would like to get something like this:
Value  Period
0      1
0      2
0      3
NaN    4
21     NaN
4      NaN
0      1
0      2
NaN    3
I wrote a function that takes a dataframe as an argument and returns it with an additional column that holds the count:
def calc_period(df):
    period_x = []
    sum_x = 0
    for i in range(1, df.shape[0]):
        if df.iloc[i, 0] == np.nan or df.iloc[i, 0] == 0:
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    period_x.append(sum_x)
    df['period_x'] = period_x
    return df
The function works well when the value is 0, but when the value is NaN the count is also NaN, and I get the following result:
Value  Period
0      1
0      2
0      3
NaN    NaN
NaN    NaN

Here is a revised version of your code:
import pandas as pd
import numpy as np
import math

def is_nan_or_zero(val):
    return math.isnan(val) or val == 0

def calc_period(df):
    is_first_nan_or_zero = is_nan_or_zero(df.iloc[0, 0])
    period_x = [1 if is_first_nan_or_zero else np.nan]
    sum_x = 1 if is_first_nan_or_zero else 0
    for i in range(1, df.shape[0]):
        val = df.iloc[i, 0]
        if is_nan_or_zero(val):
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    df['period_x'] = period_x
    return df
There were 2 fixes:
Replacing df.iloc[i,0] == np.nan with math.isnan(val) (np.nan == np.nan evaluates to False, so the original comparison never matched a NaN)
Removing period_x.append(sum_x) at the end, and adding the first period value instead (since we start iterating from the second value)
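As a quick sanity check, here is a minimal usage sketch (my own example data, mirroring the table in the question, not part of the original answer):

import numpy as np
import pandas as pd

# sample column mirroring the question's 'Value' data
df = pd.DataFrame({'Value': [0, 0, 0, np.nan, 21, 4, 0, 0, np.nan]})
result = calc_period(df)
print(result)
# the period_x column comes out as 1, 2, 3, 4, NaN, NaN, 1, 2, 3,
# matching the desired 'Period' column above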

Related

How to compare / update with two different DataFrames?

I'm trying to write code that checks whether data needs to be updated, and updates it if so.
The problem is efficiency. The only idea I have is a nested loop, and there should be a better way to do this.
There are two DataFrames: df_new and df_old.
I want to update df_old's data with the newer one.
I also want to write a changelog entry if there's a change (and if not, just a timestamp).
Here's my sample code:
import pandas as pd
from datetime import datetime

df_new = pd.DataFrame({"id": [11, 22, 33, 44, 55], "a2": [2, 3, 8, 9, 99], "a3": [2, 4, 2, 5, 99]})
df_old = pd.DataFrame({"id": [11, 22, 33, 44], "a2": [2, 3, 4, 7], "a3": [2, 2, 2, 7], "CHANGELOG": ["", "", "", ""]})

for row in df_new.itertuples():
    flag = 0
    for row2 in df_old.itertuples():
        if row[1] == row2[1]:
            p = str(datetime.now().date()) + "\n"
            if row[2] != row2[2]:
                p += "a2 : " + str(row[2]) + " -> " + str(row2[2]) + "\n"
                df_old.at[row2[0], "a2"] = str(row[2])
            if row[3] != row2[3]:
                p += "a3 : " + str(row[3]) + " -> " + str(row2[3]) + "\n"
                df_old.at[row2[0], "a3"] = str(row[3])
            df_old.at[row2[0], "CHANGELOG"] = p
            flag = 1
            break
    if flag == 0:
        df_old = df_old.append(pd.DataFrame([row], columns=row._fields), ignore_index=True)
        df_old.at[len(df_old) - 1, "CHANGELOG"] = str(datetime.now().date()) + "\n" + "Created"
The code actually works, but only with small datasets. If I run it with tens of thousands of rows (each), as you've probably already guessed, it takes too much time.
I've found that there's pd.compare, but it seems like it only works with two dataframes that have the same rows/columns. And... I'm stuck now.
Are there any functions or references that I can use?
Thank you in advance.
Indeed, as stated in the docs, pd.compare "[c]an only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames". So, let's first achieve this:
import pandas as pd
from datetime import date

df_old = pd.DataFrame({"id": [11, 22, 33, 44],
                       "a2": [2, 3, 4, 7],
                       "a3": [2, 2, 2, 7],
                       "CHANGELOG": ["", "", "", ""]})
df_new = pd.DataFrame({"id": [11, 22, 33, 44, 55],
                       "a2": [2, 3, 8, 9, 99],
                       "a3": [2, 4, 2, 5, 99]})

# get slice of `df_old` with just the columns that need comparing
df_slice = df_old.iloc[:, :3]

# get missing indices from `df_new`, build empty df and append to `df_slice`
missing_indices = set(df_new.index).difference(set(df_slice.index))
df_missing = pd.DataFrame(columns=df_slice.columns, index=missing_indices)
df_slice = pd.concat([df_slice, df_missing], axis=0)
print(df_slice)
    id   a2   a3
0   11    2    2
1   22    3    2
2   33    4    2
3   44    7    7
4  NaN  NaN  NaN
Now, we can use pd.compare:
# compare the two dfs, keep_shape=True: same rows remain in result
comp = df_slice.compare(df_new, keep_shape=True)
print(comp)
     id          a2          a3
   self other  self other  self other
0   NaN   NaN   NaN   NaN   NaN   NaN
1   NaN   NaN   NaN   NaN     2   4.0
2   NaN   NaN     4   8.0   NaN   NaN
3   NaN   NaN     7   9.0     7   5.0
4   NaN  55.0   NaN  99.0   NaN  99.0
Finally, let's apply a custom function to the comp df to generate the strings for the column CHANGELOG. Something like below:
# create func to build changelog strings per row
def create_change_log(row: pd.Series) -> str:
    """
    Parameters
    ----------
    row : pd.Series
        e.g. comp.iloc[1] with ('id','self') etc. as index

    Returns
    -------
    str
        change_log_string per row.
    """
    # start string for each row
    string = str(date.today()) + "\n"
    # get length pd.Series
    length = len(row)
    # get range for loop over 'self', so index 0,2,4 if len == 6
    self_iloc = [*range(0, length, 2)]
    # get level 0 from index to retrieve orig col names: ['id'] etc.
    cols = row.index.get_level_values(0)
    # for loop, check scenarios
    for i in self_iloc:
        temp = str()
        # both 'self' and 'other' vals are NaN, nothing changed
        if row.isna().all():
            break
        # all of 'self' == NaN, entire row is new
        if row.iloc[self_iloc].isna().all():
            temp = 'Created\n'
            string += temp
            break
        # set up comp for specific cols: comp 'self.1' == 'other.1' etc.
        self_val, other_val = row[i], row[i+1]
        # add `pd.notna()`, since np.nan == np.nan is actually `False`!
        if self_val != other_val and pd.notna(self_val):
            temp = f'{cols[i]} : {self_val} -> {other_val}\n'
            string += temp
    return string
Applied to comp:
change_log = comp.apply(lambda row: create_change_log(row), axis=1)
change_log.name = 'CHANGELOG'
# result
df_old_adj = pd.concat([df_new,change_log],axis=1)
print(df_old_adj)
   id  a2  a3                                  CHANGELOG
0  11   2   2                               2022-08-29\n
1  22   3   4                2022-08-29\na3 : 2 -> 4.0\n
2  33   8   2                2022-08-29\na2 : 4 -> 8.0\n
3  44   9   5  2022-08-29\na2 : 7 -> 9.0\na3 : 7 -> 5.0\n
4  55  99  99                      2022-08-29\nCreated\n
PS.1: my result has e.g. 2022-08-29\na3 : 2 -> 4.0\n where you generate 2022-08-29\na3 : 4 -> 2\n. The former seems correct to me; you want to convey that value 2 in column a3 has become (->) 4, no? Anyway, you can simply swap the variables in {self_val} -> {other_val}, of course.
PS.2: comp turns ints into floats automatically for other (= df_new). Hence, we end up with 2 -> 4.0 rather than 2 -> 4. I'd say the best way to 'fix' this depends on the type of values you are expecting; one possible tweak is sketched below.
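For PS.2, here is one possible sketch (my own assumption, not part of the question or the answer above) that formats whole floats back as ints inside the changelog string:

# hypothetical helper: render whole floats as ints, so "2 -> 4" instead of "2 -> 4.0"
def fmt(val):
    # only convert floats that are whole numbers; leave everything else untouched
    if isinstance(val, float) and val.is_integer():
        return str(int(val))
    return str(val)

# inside create_change_log, the f-string would then read:
#     temp = f'{cols[i]} : {fmt(self_val)} -> {fmt(other_val)}\n'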

Create Python function to look for ANY NULL value in a group

I am trying to write a function that will check a specified column for nulls, within a group in a dataframe. The example dataframe has two columns, ID and VALUE. Multiple rows exist per ID. I want to know if ANY of the rows for a particular ID have a NULL value in VALUE.
I have tried building the function with iterrows().
df = pd.DataFrame({'ID':[1,2,2,3,3,3],
                   'VALUE':[50,None,30,20,10,None]})

def nullValue(col):
    for i, row in col.iterrows():
        if ['VALUE'] is None:
            return 1
        else:
            return 0

df2 = df.groupby('ID').apply(nullVALUE)
df2.columns = ['ID','VALUE','isNULL']
df2
I am expecting to retrieve a dataframe with three columns, ID, VALUE, and isNULL. If any row in a grouped ID has a null, all of the rows for that ID should have a 1 under isNull.
Example:
ID  VALUE  isNULL
 1   50.0       0
 2    NaN       1
 2   30.0       1
 3   20.0       1
 3   10.0       1
 3    NaN       1
A quick solution, borrowed partially from this answer, is to use groupby with transform:
df = pd.DataFrame({'ID':[1,2,2,3,3,3,3],
                   'VALUE':[50,None,None,30,20,10,None]})
df['isNULL'] = (df.VALUE.isnull().groupby([df['ID']]).transform('sum') > 0).astype(int)
Out[51]:
   ID  VALUE  isNULL
0   1   50.0       0
1   2    NaN       1
2   2    NaN       1
3   3   30.0       1
4   3   20.0       1
5   3   10.0       1
6   3    NaN       1
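As a small variation (my own sketch, not from the linked answer), you can ask each group directly whether any value is null with transform('any'), skipping the sum-and-compare step:

# same idea, but let the groupby answer "is any value in this ID null?" directly
df['isNULL'] = df['VALUE'].isnull().groupby(df['ID']).transform('any').astype(int)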

How to display cross-tabs of all interactions?

I have a dataset which (in simplified form) looks like this:
import pandas as pd
df = pd.DataFrame({"target":[20,30,40], "x1":[1,0,1], "x2":[0,1,1], "x3":[0,0,1]}
And I want to find the average value of target for all possible two-variable (x_i, x_j)interactions. So the output should look like this:
How would I go about doing this in Pandas?
You can use pivot_table and, to add the combinations that do not exist in the data, reindex by a MultiIndex created with from_product:
df = df.pivot_table(index='x1',columns=['x2','x3'], values='target')
mux = pd.MultiIndex.from_product(df.columns.levels, names=df.columns.names)
df = df.reindex(columns=mux)
print (df)
x2      0           1
x3      0    1      0     1
x1
0     NaN  NaN   30.0   NaN
1    20.0  NaN    NaN  40.0
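For clarity, here is a quick illustration (my own snippet, using the levels the pivot above produces) of what MultiIndex.from_product builds: it enumerates every (x2, x3) combination, including ones absent from the data, which is what lets the reindex add the missing columns:

mux = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['x2', 'x3'])
print(mux.tolist())
# [(0, 0), (0, 1), (1, 0), (1, 1)]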
If you want to replace the NaNs with 0:
df = df.pivot_table(index='x1',columns=['x2','x3'], values='target', fill_value=0)
mux = pd.MultiIndex.from_product(df.columns.levels, names=df.columns.names)
df = df.reindex(columns=mux, fill_value=0)
print (df)
x2   0      1
x3   0  1   0   1
x1
0    0  0  30   0
1   20  0   0  40

Removing negative values in pandas column keeping NaN

I was wondering how I can remove rows which have a negative value but keep the NaNs. At the moment I am using:
DF = DF.ix[DF['RAF01Time'] >= 0]
But this removes the NaNs.
Thanks in advance.
You need boolean indexing with an additional isnull condition:
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
Sample:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'RAF01Time':[-1,2,3,np.nan]})
print (DF)
   RAF01Time
0       -1.0
1        2.0
2        3.0
3        NaN
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
print (DF)
   RAF01Time
1        2.0
2        3.0
3        NaN
Another solution with query:
DF = DF.query("~(RAF01Time < 0)")
print (DF)
   RAF01Time
1        2.0
2        3.0
3        NaN
You can just use < 0 and then take the inverse of the condition.
DF = DF[~(DF['RAF01Time'] < 0)]
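Using the same sample DF as above (re-created here so the snippet stands alone), a quick sketch shows the NaN rows survive, because NaN < 0 evaluates to False and the inversion keeps those rows:

import numpy as np
import pandas as pd

DF = pd.DataFrame({'RAF01Time': [-1, 2, 3, np.nan]})
print(DF[~(DF['RAF01Time'] < 0)])
#    RAF01Time
# 1        2.0
# 2        3.0
# 3        NaN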

Element-wise Maximum of Two DataFrames Ignoring NaNs

I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1, df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    res[(df1==df1)&(df2==df2)&(cond)] = df1[(df1==df1)&(df2==df2)&(cond)]
    res[(df1==df1)&(df2==df2)&(~cond)] = df2[(df1==df1)&(df2==df2)&(~cond)]
    res[(df1==df1)&(df2!=df2)&(~cond)] = df1[(df1==df1)&(df2!=df2)]
    res[(df1!=df1)&(df2==df2)&(~cond)] = df2[(df1!=df1)&(df2==df2)]
    return res
Any other ideas? Thank you for your time.
A more readable way to do this in recent versions of pandas is concat-and-max:
import numpy as np
import pandas as pd

A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
pd.concat([A, B]).max(level=0)
#
#      0    1    2
# 0  3.0  2.0  3.0
#
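A small caveat, assuming a recent pandas version: the level keyword on max has been deprecated and later removed, so on newer releases the same result is spelled with an explicit groupby:

pd.concat([A, B]).groupby(level=0).max()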
You can use where to test your df against another df: where the condition is True, the values from df are returned, and where it is False, the values from df1 are returned. Additionally, in the case where NaN values are in df1, an additional call to fillna(df) will use the values from df to fill those NaN and return the desired df:
In [178]:
df = pd.DataFrame(np.random.randn(5,3))
df.iloc[1,2] = np.NaN
print(df)
df1 = pd.DataFrame(np.random.randn(5,3))
df1.iloc[0,0] = np.NaN
print(df1)
          0         1         2
0  2.671118  1.412880  1.666041
1 -0.281660  1.187589       NaN
2 -0.067425  0.850808  1.461418
3 -0.447670  0.307405  1.038676
4 -0.130232 -0.171420  1.192321
          0         1         2
0       NaN -0.244273 -1.963712
1 -0.043011 -1.588891  0.784695
2  1.094911  0.894044 -0.320710
3 -1.537153  0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
          0         1         2
0  2.671118  1.412880  1.666041
1 -0.043011  1.187589  0.784695
2  1.094911  0.894044  1.461418
3 -0.447670  0.558547  1.038676
4 -0.130232 -0.171420  1.192321
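As an additional sketch (my own suggestion, not part of the answers above): numpy's fmax ufunc also takes the element-wise maximum while ignoring NaN on either side, returning NaN only where both inputs are NaN:

import numpy as np
import pandas as pd

# same sample frames as in the first answer
A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
print(np.fmax(A, B))
#      0    1    2
# 0  3.0  2.0  3.0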
