Replace NULL or NA in a column wrt to other column in pandas data frame [duplicate] - python-3.x

I have a table:
df = pd.DataFrame([[0.1, 2, 55, 0,np.nan],
[0.2, 4, np.nan, 1,99],
[0.6, np.nan, 22, 5,88],
[1.4, np.nan, np.nan, 4,77]],
columns=list('ABCDE'))
A B C D E
0 0.1 2.0 55.0 0 NaN
1 0.2 4.0 NaN 1 99.0
2 0.6 NaN 22.0 5 88.0
3 1.4 NaN NaN 4 77.0
I want to replace NaN values in Column B based on condition on Column A.
Example:
When B is NULL and the value in `column A is > 0.2 and < 0.6`, replace "NaN" in column B with 5
When B is NULL and the value in `column A is > 0.6 and < 2`, replace "NaN" in column B with 10
I tried something like this:
if df["A"]>=val1 and pd.isnull(df['B']):
    df["B"]=5
elif df["A"]>=val2 and df["A"]<val3 and pd.isnull(df['B']):
    df["B"]=10
elif df["A"]<val4 and pd.isnull(df['B']):
    df["B"]=15
The above code does not work.
Please let me know if there is an alternative approach, using a for loop or apply functions, to iterate over a pandas dataframe.

You can use mask:
df['B'] = df['B'].mask((df['A']>0.2) & (df['A']<0.6), df['B'].fillna(5))
df['B'] = df['B'].mask((df['A']>0.6) & (df['A']<2), df['B'].fillna(10))
Or you can try np.where, but I guess it will involve a long condition.
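If the nested np.where feels too long, np.select is one possible alternative. The following is only a sketch under the question's stated conditions; the names `missing`, `conditions` and `choices` are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 2, 55, 0, np.nan],
                   [0.2, 4, np.nan, 1, 99],
                   [0.6, np.nan, 22, 5, 88],
                   [1.4, np.nan, np.nan, 4, 77]],
                  columns=list('ABCDE'))

# Only rows where B is missing are candidates for replacement.
missing = df['B'].isna()
conditions = [
    missing & (df['A'] > 0.2) & (df['A'] < 0.6),
    missing & (df['A'] > 0.6) & (df['A'] < 2),
]
choices = [5, 10]

# Rows matching no condition keep their current B value.
df['B'] = np.select(conditions, choices, default=df['B'].to_numpy())
print(df)
```

With this sample data, the row where A equals 0.6 matches neither strict inequality, so its B value stays NaN.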

Related

Calculate the difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried using groupby at the col 1 level, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a counter per group of col 1 and reshape with DataFrame.set_index and Series.unstack. Then use DataFrame.diff, remove the first column (which contains only NaNs) with DataFrame.iloc, convert the timedeltas to days with Series.dt.days for all columns, and change the column names with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
.unstack()
.diff(axis=1)
.iloc[:, 1:]
.apply(lambda x: x.dt.days)
.add_prefix('diff_')
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with a counter for the new columns via DataFrame.assign, remove NaNs in c1 with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
c1 = df.groupby('col 1')['col 2'].diff().dt.days)
.dropna(subset=['c1'])
.pivot(index='col 1', columns='g', values='c1')
.add_prefix('diff_')
.rename_axis(None, axis=1)
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
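The snippets above assume that df already exists. Here is a minimal self-contained sketch that builds the sample data from the question and applies the same diff-and-pivot idea. The names `out`, `step` and `days` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'col 1': list('AAAABBCC'),
    'col 2': ['2020-07-13', '2020-07-15', '2020-07-18', '2020-07-19',
              '2020-07-13', '2020-07-19', '2020-07-13', '2020-07-18'],
})
df['col 2'] = pd.to_datetime(df['col 2'])

out = (df.assign(step=df.groupby('col 1').cumcount(),                 # position within each group
                 days=df.groupby('col 1')['col 2'].diff().dt.days)    # gap to the previous date
         .dropna(subset=['days'])                                     # first row of each group has no gap
         .pivot(index='col 1', columns='step', values='days')
         .add_prefix('diff_'))
print(out)
```

The keyword arguments to pivot are used here because recent pandas versions no longer accept them positionally.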
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference in nan
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
After assigning cumcount, df looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Pairwise operations in Scikit-Learn and different filtering conditions on each pair

I have the following 2 data frames, say df1
a b c d
0 0 1 2 3
1 4 0 0 7
2 8 9 10 11
3 0 0 0 15
and df2
a b c d
0 5 1 2 3
What I am interested in doing is a pairwise operation on each row in df1 with the single row in df2. However, if a column in a row of df1 is 0, then that column is used in neither the df1 row nor the df2 row when performing the pairwise operation. So each pairwise operation will work on pairs of rows of different lengths. Let me break down how the 4 comparisons should work.
Comparison 1
0 1 2 3 vs 5 1 2 3
The pairwise operation is done on 1 2 3 vs 1 2 3 as column a has a 0
Comparison 2
4 0 0 7 vs 5 1 2 3 is done on 4 7 vs 5 3 as we have 2 columns that need to be dropped
Comparison 3
8 9 10 11 vs 5 1 2 3 is done on 8 9 10 11 vs 5 1 2 3 as no columns are dropped
Comparison 4
0 0 0 15 vs 5 1 2 3 is done on 15 vs 3 as all but one column is dropped
The result of each pairwise operation is a scalar so the result is some sort of structure whether it be list, array, data frame, whatever with 4 (or the number of rows in df1) values. Also, I should note that values in df2 are irrelevant and no filtering is done based upon the value of any column in df2.
For simplicity, you could try looping over each row in the dataframe and do something like this:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# loop over each row in 'a'
for i in range(len(a)):
    # find indices of non-zero elements of the row
    non_zero = np.nonzero(a.iloc[i].to_numpy())[0]
    # perform pair-wise addition between non-zero elements in 'a' and the same elements in 'b'
    print(np.array(a.iloc[i])[(non_zero)] + np.array(b.iloc[0])[(non_zero)])
Here I used pair-wise addition but you could replace the addition with an operation of your choosing.
Edit:
We may want to vectorize this to avoid the loop if the dataframes are large. Here is an idea for that, where we convert zero values to nan so they are ignored in the row-wise operation:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# find indices of zeros
zeros = (a==0).values
# set zeros to nan
a[zeros] = np.nan
# tile and reshape 'b' so it's the same shape as 'a'
b = pd.DataFrame(np.tile(b, len(a)).reshape(np.shape(a)), columns=b.columns)
# set the zero indices to nan
b[zeros] = np.nan
print('a:')
print(a)
print('b:')
print(b)
# now do some row-wise operation. For example take the sum of each row
print('sum:')
print(np.sum(a+b, axis=1))
Output:
a:
a b c d
0 NaN 1.0 2.0 3
1 4.0 NaN NaN 7
2 8.0 9.0 10.0 11
3 NaN NaN NaN 15
b:
a b c d
0 NaN 1.0 2.0 3
1 5.0 NaN NaN 3
2 5.0 1.0 2.0 3
3 NaN NaN NaN 3
sum:
0 12.0
1 19.0
2 49.0
3 18.0
dtype: float64
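Another way to avoid both the loop and the tiling is to rely on NumPy broadcasting with a boolean mask. The sketch below uses Euclidean distance purely as an example of a pairwise operation that returns a scalar; the question does not say which operation is wanted, so that choice and the names `keep`, `diff` and `dist` are assumptions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 3], [4, 0, 0, 7], [8, 9, 10, 11], [0, 0, 0, 15]],
                   columns=list('abcd'))
df2 = pd.DataFrame([[5, 1, 2, 3]], columns=list('abcd'))

# True where a df1 column takes part in the comparison for that row.
keep = df1.to_numpy() != 0
# (4, 4) minus (1, 4): the single df2 row is broadcast against every df1 row.
diff = df1.to_numpy() - df2.to_numpy()
# Zero out the excluded columns, then reduce each row to one scalar.
dist = np.sqrt((diff ** 2 * keep).sum(axis=1))
print(dist)
```

Multiplying by the mask works here because an excluded column should contribute nothing to a sum. An operation with a different neutral element (a product, for example) would need a different fill value.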

Delete row from dataframe having "None" value in all the columns - Python

I need to completely delete every row in a dataframe that has a "None" value in all of its columns. I am using the following code -
df.dropna(axis=0,how='all',thresh=None,subset=None,inplace=True)
This makes no difference to the dataframe. The rows with "None" values are still there.
How can I achieve this?
These Nones are probably strings, so use replace first:
df = df.replace('None', np.nan).dropna(how='all')
df = pd.DataFrame({
'a':['None','a', 'None'],
'b':['None','g', 'None'],
'c':['None','v', 'b'],
})
print (df)
a b c
0 None None None
1 a g v
2 None None b
df1 = df.replace('None', np.nan).dropna(how='all')
print (df1)
a b c
1 a g v
2 NaN NaN b
Or test for values not equal to 'None' with DataFrame.ne and DataFrame.any:
df1 = df[df.ne('None').any(axis=1)]
print (df1)
a b c
1 a g v
2 None None b
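To illustrate why dropna may have had no effect in the question, the sketch below compares a frame holding real None objects with one holding the string 'None'. The frames `real` and `text` are made-up examples, and whether the asker's data actually contains strings is an assumption.

```python
import numpy as np
import pandas as pd

real = pd.DataFrame({'a': [None, 'a'], 'b': [None, 'g']})
text = pd.DataFrame({'a': ['None', 'a'], 'b': ['None', 'g']})

# Real None objects count as missing values, so the first row is dropped.
real_dropped = real.dropna(how='all')
# The string 'None' is ordinary data, so nothing is dropped.
text_dropped = text.dropna(how='all')
# Converting the strings to NaN first makes dropna behave as intended.
text_fixed = text.replace('None', np.nan).dropna(how='all')

print(len(real_dropped), len(text_dropped), len(text_fixed))
```

A quick check such as `df.iloc[0, 0] == 'None'` can help tell which of the two cases applies to a given frame.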
If the goal is to drop columns rather than rows, use axis=1. The how keyword controls whether columns with any or all NaN values are dropped. See the DataFrame.dropna documentation.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3], 'b':[-1, 0, np.nan], 'c':[np.nan, np.nan, np.nan]})
df
a b c
0 1 -1.0 NaN
1 2 0.0 NaN
2 3 NaN NaN
df.dropna(axis=1, how='any')
a
0 1
1 2
2 3
df.dropna(axis=1, how='all')
a b
0 1 -1.0
1 2 0.0
2 3 NaN

How to replace selected rows of pandas dataframe with a np array, sequentially?

I have a pandas dataframe
A B C
0 NaN 2 6
1 3.0 4 0
2 NaN 0 4
3 NaN 1 2
where I have a column A that has NaN values in some rows (not necessarily consecutive).
I want to replace these values not with a constant value (which pd.fillna does), but rather with the values from a numpy array.
So the desired outcome is:
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
I'm not sure the .replace method will help here as well, since that seems to replace value <-> value via dictionary. Whereas here I want to sequentially change NaN to its corresponding value (by index) in the np array.
I tried:
MWE:
huh = pd.DataFrame([[np.nan, 2, 6],
[3, 4, 0],
[np.nan, 0, 4],
[np.nan, 1, 2]],
columns=list('ABC'))
huh.A[huh.A.isnull()] = np.array([1,5,7]) # what i want to do, but this gives error
gives the error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I read the docs but I can't understand how to do this with .loc.
How do I do this properly, preferably without a for loop?
Other info:
The number of elements in the np array will always match the number of NaN in the dataframe, so your answer does not need to check for this.
You are really close. You need DataFrame.loc to avoid chained assignment:
huh.loc[huh.A.isnull(), 'A'] = np.array([1,5,7])
print (huh)
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
zip
This should account for uneven lengths
m = huh.A.isna()
a = np.array([1, 5, 7])
s = pd.Series(dict(zip(huh.index[m], a)))
huh.fillna({'A': s})
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
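To show what "uneven lengths" means for the zip approach, here is a small sketch in which the array is deliberately shorter than the number of NaNs. The array `short` and the result name `filled` are illustrative.

```python
import numpy as np
import pandas as pd

huh = pd.DataFrame({'A': [np.nan, 3, np.nan, np.nan], 'B': [2, 4, 0, 1]})

m = huh['A'].isna()
short = np.array([1, 5])  # only two values for three NaNs

# zip stops at the shorter input, so the last NaN index gets no replacement.
s = pd.Series(dict(zip(huh.index[m], short)))
# fillna aligns on the index, so unmatched positions simply stay NaN.
filled = huh['A'].fillna(s)
print(filled)
```

The direct .loc assignment, by contrast, expects the array length to match the number of selected rows.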

Copy and Paste Values Based on a Condition in Python

I am trying to populate column 'C' with values from column 'A' based on conditions in column 'B'. Example: if column 'B' is 'nan', then the value in column 'C' should equal the value in column 'A' for that row. If column 'B' is NOT 'nan', then leave column 'C' as is (i.e. 'nan'). Next, the values in column 'A' should be removed (only the values that were copied from column 'A' to column 'C').
Original Dataset:
index A B C
0 6 nan nan
1 6 nan nan
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 3 nan nan
8 4 nan nan
Output:
index A B C
0 nan nan 6
1 nan nan 6
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 nan nan 3
8 nan nan 4
Below is what I have tried so far, but it's not working.
def impute_unit(cols):
    Legal_Block = cols[0]
    Legal_Lot = cols[1]
    Legal_Unit = cols[2]
    if pd.isnull(Legal_Lot):
        return 3
    else:
        return Legal_Unit
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
'Legal_Unit']].apply(impute_unit, axis = 1)
Seems like you need
df['C'] = np.where(df.B.isna(), df.A, df.C)
df['A'] = np.where(df.B.isna(), np.nan, df.A)
A different, maybe fancy way to do it would be to swap A and C values only when B is np.nan
m = df.B.isna()
df.loc[m, ['A', 'C']] = df.loc[m, ['C', 'A']].values
In other words, change
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
'Legal_Unit']].apply(impute_unit, axis = 1)
for
bk_Final_tax['Legal_Unit'] = np.where(bk_Final_tax.Legal_Lot.isna(), bk_Final_tax.Legal_Block, bk_Final_tax.Legal_Unit)
bk_Final_tax['Legal_Block'] = np.where(bk_Final_tax.Legal_Lot.isna(), np.nan, bk_Final_tax.Legal_Block)
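The swap variant can be checked against the sample data from the question. This is a minimal sketch. Column A is created as float here (an assumption about the real data) so that it can hold NaN without a dtype change.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [6.0, 6.0, 9.0, 9.0, 2.0, 2.0, 3.0, 3.0, 4.0],
    'B': [np.nan, np.nan, 3, 3, 8, 8, 4, np.nan, np.nan],
    'C': [np.nan] * 9,
})

# Rows where B is missing are the ones whose A value moves to C.
m = df['B'].isna()
# to_numpy() drops the column labels, so values are assigned by position:
# A receives the old C values (NaN) and C receives the old A values.
df.loc[m, ['A', 'C']] = df.loc[m, ['C', 'A']].to_numpy()
print(df)
```

Converting to a plain array matters: assigning the relabelled DataFrame directly would align on column names and leave the values where they were.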
