Delete row from dataframe having "None" value in all the columns - Python - python-3.x

I need to completely delete the rows of a dataframe that have the "None" value in all columns. I am using the following code:
df.dropna(axis=0,how='all',thresh=None,subset=None,inplace=True)
This makes no difference to the dataframe; the rows with "None" values are still there.
How can I achieve this?

The Nones are probably strings, so replace them with NaN first:
df = df.replace('None', np.nan).dropna(how='all')
df = pd.DataFrame({
'a':['None','a', 'None'],
'b':['None','g', 'None'],
'c':['None','v', 'b'],
})
print (df)
a b c
0 None None None
1 a g v
2 None None b
df1 = df.replace('None', np.nan).dropna(how='all')
print (df1)
a b c
1 a g v
2 NaN NaN b
Or compare against the string 'None' with DataFrame.ne and keep the rows where any value differs, using DataFrame.any:
df1 = df[df.ne('None').any(axis=1)]
print (df1)
a b c
1 a g v
2 None None b
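To see why the original dropna call had no effect, it can help to compare a frame holding real missing values with one holding the literal string 'None'. This is only a sketch: the frames real and text are made up for illustration, and it assumes the asker's data looks like the second one.

```python
import numpy as np
import pandas as pd

# Illustrative frames (not from the question): one with real missing
# values, one with the literal string 'None'.
real = pd.DataFrame({"a": [None, "a"], "b": [None, "g"]})
text = pd.DataFrame({"a": ["None", "a"], "b": ["None", "g"]})

# dropna only recognises real missing values (None/NaN).
real_dropped = real.dropna(how="all")
text_dropped = text.dropna(how="all")
print(len(real_dropped), len(text_dropped))

# Converting the strings to NaN first makes dropna behave as expected.
fixed = text.replace("None", np.nan).dropna(how="all")
print(fixed)
```

If the second frame keeps all of its rows, the values are strings and need the replace step shown above.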

To drop columns rather than rows, use axis=1. The how keyword controls whether a column is dropped when any or all of its values are NaN. Check the docs:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3], 'b':[-1, 0, np.nan], 'c':[np.nan, np.nan, np.nan]})
df
a b c
0 1 -1.0 NaN
1 2 0.0 NaN
2 3 NaN NaN
df.dropna(axis=1, how='any')
a
0 1
1 2
2 3
df.dropna(axis=1, how='all')
a b
0 1 -1.0
1 2 0.0
2 3 NaN
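Since the question is about rows while the example above drops columns, a side-by-side comparison of the two axes may be useful. The frame below is made up for illustration, and the names rows_kept and cols_kept are not from the original posts.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 is entirely NaN, column 'c' is entirely NaN.
df = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [-1, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

# axis=0 works on rows: drop the rows in which every value is NaN.
rows_kept = df.dropna(axis=0, how="all")

# axis=1 works on columns: drop the columns in which every value is NaN.
cols_kept = df.dropna(axis=1, how="all")

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
```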

Related

Calculate difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried grouping by col 1, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount to build a per-group counter for col 1, reshape with DataFrame.set_index and Series.unstack, then take DataFrame.diff along the columns. Drop the first column, which contains only NaNs, with DataFrame.iloc, convert the timedeltas to days with Series.dt.days for every column, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
.unstack()
.diff(axis=1)
.iloc[:, 1:]
.apply(lambda x: x.dt.days)
.add_prefix('diff_')
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff together with a counter for the new columns in DataFrame.assign, remove the NaNs in c1 with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
c1 = df.groupby('col 1')['col 2'].diff().dt.days)
.dropna(subset=['c1'])
.pivot(index='col 1', columns='g', values='c1')
.add_prefix('diff_')
.rename_axis(None, axis=1)
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. pivot to wide form, discarding the first difference (always NaN)
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1
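For anyone who wants to run the idea end to end, here is a minimal sketch that rebuilds the sample data from the question and combines the diff, cumcount and pivot steps. The helper names gap, step and wide are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col 1": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "col 2": ["2020-07-13", "2020-07-15", "2020-07-18", "2020-07-19",
              "2020-07-13", "2020-07-19", "2020-07-13", "2020-07-18"],
})
df["col 2"] = pd.to_datetime(df["col 2"])

# Days since the previous date within each group (NaN for the first row).
gap = df.groupby("col 1")["col 2"].diff().dt.days
# Position of each row within its group: 0, 1, 2, ...
step = df.groupby("col 1").cumcount()

wide = (
    df.assign(gap=gap, step=step)
      .dropna(subset=["gap"])          # the first row of each group has no gap
      .pivot(index="col 1", columns="step", values="gap")
      .add_prefix("diff_")
)
print(wide)
```

Passing index, columns and values by keyword keeps the pivot call working on recent pandas versions, where positional arguments are no longer accepted.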

Groupby id and get each string from an id, in each different column

Hello, I just want to group the elements by id and show each string in a separate column.
Original dataframe:
id|elements|
1|a
1|b
1|c
1|d
2|a
2|b
2|b
3|a
3|a
3|b
3|c
3|c
3|c
Desired output:
id|column1|column2|column3|column4|column5|
1 |a|b|c|d| | |
2 |a|b|b|
3 |a|a|b|c|c|c|
Any ideas? Thank you very much in advance
Given your original data frame, you can simply do:
df.groupby('id').apply(lambda x: x['element'].to_list()).apply(pd.Series)
Output:
0 1 2 3 4 5
id
1 a b c d NaN NaN
2 a b b NaN NaN NaN
3 a a b c c c
If you do not want id to be the index, use .reset_index().
Try this
import pandas as pd
import numpy as np
F = {'id': [1,1,1,1,2,2,2,3,3,3,3,3], 'element': ['a','b','c','d','a','b','b','a','a','b','c','c']}
df = pd.DataFrame(data = F)
df2 = df.set_index('id').stack().groupby(level=[0,1]).apply(list).unstack()
df3 = pd.DataFrame(df2["element"].to_list(), columns=['element1', 'element2','element3', 'element4','element5'])
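Another possible route is to number the rows within each id and pivot on that number. This is only a sketch: it assumes the column is called elements as in the question's table, and pos and wide are illustrative names.

```python
import pandas as pd

# Data from the question (column name 'elements' is assumed).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    "elements": list("abcdabbaabccc"),
})

# 1-based position of each row within its id group.
pos = df.groupby("id").cumcount() + 1

# One column per position; shorter groups are padded with NaN.
wide = (
    df.assign(pos=pos)
      .pivot(index="id", columns="pos", values="elements")
      .add_prefix("column")
)
print(wide)
```

Unlike the list-based approaches, this keeps id as the index and does not require the number of output columns to be known in advance.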

Change/swap values one after another in pandas dataframe for selected rows

Dataframe:
col1 col2
A 0
A 1
A nan
B 0
B 1
C and so on...
I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2 wherever col1=='A'.
Code so far:
df.loc[(df.col1=='A') & (df.col2==0),'col2'] = 2
df.loc[(df.col1=='A') & (df.col2==1),'col2'] = 0
df.loc[(df.col1=='A') & (df.col2==2),'col2'] = 1
# Hope you understand why I am converting 0 to 2 first then to 1.
# Because if I convert all zeroes to 1 then all 1's will be converted to
# 0 in subsequent conversion.
Unique values in col2 are 0,1 and nan.
Is there a correct/better way of doing this?
Also, is there a way to directly swap these numbers instead of assignment operators?
One solution is to use Series.where with astype(bool) and ~ (the NOT operator), then convert back with astype(int). Then use loc with boolean indexing to assign the result back to the DataFrame:
df.loc[df.col1.eq('A'), 'col2'] = df.col2.where(df.col2.isna(),
(~df.col2.astype(bool)).astype(int))
[out]
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C NaN
You can also try with df.mask():
m=df.col1.eq('A') #condition: only rows where col1 is 'A'
df.col2=df.col2.mask(m, 1-df.col2) #1-NaN is NaN, so NaN stays as is
print(df)
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2
wherever col1=='A'.
Use np.where:
df['col2'] = np.where(df['col1'] == 'A', np.where(df['col2'] == 1, 0, np.where(df['col2'].isnull(), df['col2'], 1)), df['col2'])
Output
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C 0.0
In this case, you can also use your own function in combination with apply().
# import pandas
import pandas as pd
# make sample data
list_of_rows = [
    {'col1': 'A', 'col2': 1},
    {'col1': 'A', 'col2': 0},
    {'col1': 'A', 'col2': None},
    {'col1': 'B', 'col2': 0},
    {'col1': 'B', 'col2': 1},
    {'col1': 'B', 'col2': None},
]
# make a pandas data frame
df = pd.DataFrame(list_of_rows)
# define a function that swaps 0 and 1 only where col1 == 'A'
def change_values(row):
    if row['col1'] != 'A':
        return row['col2']
    if row['col2'] == 0:
        return 1
    if row['col2'] == 1:
        return 0
    # NaN (or any other value) is returned unchanged
    return row['col2']
# apply the function to each row of the dataframe
df['col2'] = df.apply(change_values, axis=1)
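On the question of swapping directly instead of going through a temporary value, one option is Series.map with a dictionary, restricted to the rows of interest. The sketch below uses a small made-up frame that mirrors the question; is_a is an illustrative name.

```python
import numpy as np
import pandas as pd

# Small frame mirroring the question's layout.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B"],
    "col2": [0, 1, np.nan, 0, 1],
})

# Boolean mask for the rows to change.
is_a = df["col1"].eq("A")

# map() looks every value up in the dict in one pass, so 0 -> 1 and
# 1 -> 0 cannot interfere with each other; NaN has no key and stays NaN.
df.loc[is_a, "col2"] = df.loc[is_a, "col2"].map({0: 1, 1: 0})
print(df)
```

Note that any value without a key in the dictionary becomes NaN, which is acceptable here only because the unique values are 0, 1 and NaN.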

Pandas insert alternate blank rows

Given the following data frame:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'A':['a','b','c','d'],
'B':['d',np.nan,'c','f']})
df1
A B
0 a d
1 b NaN
2 c c
3 d f
I'd like to insert blank rows before each row.
The desired result is:
A B
0 NaN NaN
1 a d
2 NaN NaN
3 b NaN
4 NaN NaN
5 c c
6 NaN NaN
7 d f
In reality, I have many rows.
Thanks in advance!
I think you could change your index like @bananafish did and then use reindex:
df1.index = range(1, 2*len(df1)+1, 2)
df2 = df1.reindex(index=range(2*len(df1)))
In [29]: df2
Out[29]:
A B
0 NaN NaN
1 a d
2 NaN NaN
3 b NaN
4 NaN NaN
5 c c
6 NaN NaN
7 d f
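The reindex idea can also be wrapped in a small function that leaves the original frame untouched. This is a sketch only; the function name interleave_blank_rows is made up for illustration.

```python
import numpy as np
import pandas as pd

def interleave_blank_rows(df):
    """Return a copy of df with an all-NaN row inserted before each row."""
    shifted = df.copy()
    # Move the existing rows to the odd positions 1, 3, 5, ...
    shifted.index = range(1, 2 * len(df), 2)
    # Reindexing over 0..2n-1 creates the missing even rows filled with NaN.
    return shifted.reindex(range(2 * len(df)))

df1 = pd.DataFrame({"A": ["a", "b", "c", "d"],
                    "B": ["d", np.nan, "c", "f"]})
out = interleave_blank_rows(df1)
print(out)
```

Because the function works on a copy, df1 keeps its original 0..3 index afterwards.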
Use numpy and pd.DataFrame
def pir(df):
    nans = np.where(np.empty_like(df.values), np.nan, np.nan)
    data = np.hstack([nans, df.values]).reshape(-1, df.shape[1])
    return pd.DataFrame(data, columns=df.columns)
pir(df1)
Testing and Comparison
Code
def banana(df):
    df1 = df.set_index(np.arange(1, 2*len(df)+1, 2))
    df2 = pd.DataFrame(index=range(0, 2*len(df1), 2), columns=df1.columns)
    return pd.concat([df1, df2]).sort_index()
def anton(df):
    df = df.set_index(np.arange(1, 2*len(df)+1, 2))
    return df.reindex(index=range(2*len(df)))
def pir(df):
    nans = np.where(np.empty_like(df.values), np.nan, np.nan)
    data = np.hstack([nans, df.values]).reshape(-1, df.shape[1])
    return pd.DataFrame(data, columns=df.columns)
Results
pd.concat([f(df1) for f in [banana, anton, pir]],
axis=1, keys=['banana', 'anton', 'pir'])
A bit roundabout but this works:
df1.index = range(1, 2*len(df1)+1, 2)
df2 = pd.DataFrame(index=range(0, 2*len(df1), 2), columns=df1.columns)
df3 = pd.concat([df1, df2]).sort_index()

Element-wise Maximum of Two DataFrames Ignoring NaNs

I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1,df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    res[(df1==df1)&(df2==df2)&(cond)] = df1[(df1==df1)&(df2==df2)&(cond)]
    res[(df1==df1)&(df2==df2)&(~cond)] = df2[(df1==df1)&(df2==df2)&(~cond)]
    res[(df1==df1)&(df2!=df2)&(~cond)] = df1[(df1==df1)&(df2!=df2)]
    res[(df1!=df1)&(df2==df2)&(~cond)] = df2[(df1!=df1)&(df2==df2)]
    return res
Any other ideas? Thank you for your time.
A more readable way to do this in recent versions of pandas is concat-and-max:
import numpy as np
import pandas as pd
A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
pd.concat([A, B]).groupby(level=0).max()
#
# 0 1 2
# 0 3.0 2.0 3.0
#
You can use where to compare df against df1: where the condition is True, the values from df are returned, and where it is False, the values from df1 are returned. If df1 contains NaN values, an additional call to fillna(df) fills them with the values from df and returns the desired result:
In [178]:
df = pd.DataFrame(np.random.randn(5,3))
df.iloc[1,2] = np.nan
print(df)
df1 = pd.DataFrame(np.random.randn(5,3))
df1.iloc[0,0] = np.nan
print(df1)
0 1 2
0 2.671118 1.412880 1.666041
1 -0.281660 1.187589 NaN
2 -0.067425 0.850808 1.461418
3 -0.447670 0.307405 1.038676
4 -0.130232 -0.171420 1.192321
0 1 2
0 NaN -0.244273 -1.963712
1 -0.043011 -1.588891 0.784695
2 1.094911 0.894044 -0.320710
3 -1.537153 0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
0 1 2
0 2.671118 1.412880 1.666041
1 -0.043011 1.187589 0.784695
2 1.094911 0.894044 1.461418
3 -0.447670 0.558547 1.038676
4 -0.130232 -0.171420 1.192321
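A further option, not mentioned in the answers above, is NumPy's fmax ufunc, which returns the non-NaN operand when only one of the two values is NaN. The sketch below assumes both frames share the same index and columns, as stated in the question; the frames a and b are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frames with identical shape and labels.
a = pd.DataFrame([[1.0, np.nan, 5.0], [np.nan, np.nan, 0.5]])
b = pd.DataFrame([[3.0, 2.0, np.nan], [4.0, np.nan, 0.1]])

# np.fmax ignores a NaN when the other operand is a number;
# the result is NaN only where both inputs are NaN.
result = pd.DataFrame(
    np.fmax(a.to_numpy(), b.to_numpy()),
    index=a.index,
    columns=a.columns,
)
print(result)
```

Working on the underlying arrays avoids any label alignment, so this is only appropriate when the two frames are already aligned.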
