Replace values in pandas column based on nan in another column - python-3.x

For pairs of columns, I want to replace the values of the second column with NaN wherever the value in the first column is NaN.
I have tried the following without success:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['r', np.nan, np.nan, 's'], 'b': [0.5, 0.5, 0.2, 0.02],
                   'c': ['n', 'r', np.nan, 's'], 'd': [1, 0.5, 0.2, 0.05]})

listA = ['a', 'c']
listB = ['b', 'd']
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] == np.nan
df remains unchanged.
Another attempt, using a function (also failed):
def Test(df):
    if df[color] == np.nan:
        return df[ratio] == np.nan
    else:
        return
for color, ratio in zip(listA, listB):
    df[ratio] = df.apply(Test, axis=1)
Thanks

It seems you have a typo: change == to = so that the line assigns instead of comparing:
for color, ratio in zip(listA, listB):
    df.loc[df[color].isnull(), ratio] = np.nan
print (df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
Another solution uses mask, which by default replaces values with NaN wherever the condition is True:
for color, ratio in zip(listA, listB):
    df[ratio] = df[ratio].mask(df[color].isnull())
print (df)
a b c d
0 r 0.50 n 1.00
1 NaN NaN r 0.50
2 NaN NaN NaN NaN
3 s 0.02 s 0.05
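The second attempt in the question fails for a separate reason: NaN never compares equal to anything, including itself, so a test written with == np.nan is always False. The following is a minimal sketch of that behaviour using the question's data, with isna used for the actual check; the variable names nan_equals_nan, eq_mask and isna_mask are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['r', np.nan, np.nan, 's'],
    'b': [0.5, 0.5, 0.2, 0.02],
    'c': ['n', 'r', np.nan, 's'],
    'd': [1, 0.5, 0.2, 0.05],
})

# NaN is not equal to itself, so equality tests cannot detect it.
nan_equals_nan = (np.nan == np.nan)

# Comparing a column with np.nan therefore yields False on every row...
eq_mask = df['a'] == np.nan
# ...while isna() (alias: isnull()) flags the missing rows.
isna_mask = df['a'].isna()

# Assignment (=), not comparison (==), is what modifies the frame.
for color, ratio in zip(['a', 'c'], ['b', 'd']):
    df.loc[df[color].isna(), ratio] = np.nan

print(df)
```

Because the mask is built per column pair, each ratio column is blanked only where its own partner column is missing.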

Related

Convert one dataframe's format and check if each row exists in another dataframe in Python

Given a small dataset df1 as follows:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a quarterly date range from 2019-q2 to 2021-q1 to use as the column names of a new dataframe df2, then check whether each city's year and quarter from df1 appear among those columns.
If they do, the cell should contain y; otherwise, it should contain NaN.
The final result should look like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert to bool, replace False and True to be the correct values, reindex to add missing columns, and cleanup axes and index to get exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
    .astype(bool)  # Counts to boolean
    .replace({False: np.nan, True: 'y'})  # Fill values
    .reindex(columns=col_names.strftime('%Y-q%q'))  # Add missing columns
    .rename_axis(columns=None)  # Cleanup axis name
    .reset_index()  # reset index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
'year': [2019, 2020, 2020, 2020, 2020, 2021],
'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
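The answer above starts its range at 2019-q1, while the question asks for 2019-q2 to 2021-q1. As a hedged sketch of one way to get exactly the requested columns, the labels can be built from pd.period_range and passed to reindex. The names periods, labels, key, counts and result are illustrative, and mapping booleans to 'y'/NaN with Series.map is a swapped-in alternative to replace, not the answer's own method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1'],
})

# Quarterly periods from 2019Q2 through 2021Q1, rendered as '2019-q2', ...
periods = pd.period_range(start='2019Q2', end='2021Q1', freq='Q')
labels = [f"{p.year}-q{p.quarter}" for p in periods]

# Count occurrences per city and quarter, then add the missing quarters.
key = df['year'].astype(str) + '-' + df['quarter']
counts = pd.crosstab(df['city'], key).reindex(columns=labels, fill_value=0)

# Positive counts become 'y', everything else becomes NaN.
result = (
    counts.gt(0)
    .apply(lambda col: col.map({True: 'y', False: np.nan}))
    .rename_axis(columns=None)
    .reset_index()
)
print(result)
```

Building the labels from Period objects keeps the column order chronological regardless of which quarters actually occur in the data.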

Calculate the difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried using groupby on col 1, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount to build a counter per group of col 1, then reshape with DataFrame.set_index and Series.unstack. Next, take differences across columns with DataFrame.diff, drop the first column (which contains only NaNs) with DataFrame.iloc, convert the timedeltas to days with Series.dt.days for each column, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff together with a counter, both added as new columns by DataFrame.assign, remove the rows where c1 is NaN with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
                c1 = df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. pivot on cumcount, discarding the first difference, which is always NaN
df2 = df[["col 1", "diff", "cumcount"]]\
    .pivot(index="col 1", columns="cumcount")\
    .drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
After assigning cumcount, df looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1
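For reference, the steps discussed in this thread can be put together into one self-contained sketch that rebuilds the question's data. It assumes the dates should be sorted within each ID before differencing (the sample data already is), and the names long_df, wide, step and days are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'col 1': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'col 2': ['2020-07-13', '2020-07-15', '2020-07-18', '2020-07-19',
              '2020-07-13', '2020-07-19', '2020-07-13', '2020-07-18'],
})
df['col 2'] = pd.to_datetime(df['col 2'])
# Assumption: differences should follow chronological order within each ID.
df = df.sort_values(['col 1', 'col 2'])

# Per-ID position (0, 1, 2, ...) and day gap to the previous record.
long_df = df.assign(
    step=df.groupby('col 1').cumcount(),
    days=df.groupby('col 1')['col 2'].diff().dt.days,
).dropna(subset=['days'])  # the first record of each ID has no gap

# One row per ID, one column per gap.
wide = (
    long_df.pivot(index='col 1', columns='step', values='days')
    .add_prefix('diff_')
    .rename_axis(columns=None)
    .reset_index()
)
print(wide)
```

IDs with fewer records than the longest group simply get NaN in the trailing columns.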

Summing up two columns of pandas dataframe ignoring NaN

I have a pandas dataframe as below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A"], 'col1':[np.nan, np.nan], 'col2':[np.nan, 5]})
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as sum(col1, col2), ignoring NaN only if one of the columns is NaN.
If both columns are NaN, it should return NaN, as below.
I tried the code below and it works fine. Is there any way to achieve the same with just one line of code?
df['new'] = df[['col1', 'col2']].sum(axis = 1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0
Use sum with min_count=1, so that a row needs at least one non-NaN value to produce a number:
df['new'] = df[['col1','col2']].sum(axis=1,min_count=1)
Out[78]:
0 NaN
1 5.0
dtype: float64
Use the add function on the two columns, which takes a fill_value argument that lets you replace NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64
Is this ok? Note that it also turns a genuine sum of 0 (for example 0 and 0) into NaN:
df['new'] = df[['col1', 'col2']].sum(axis = 1).replace(0,np.nan)
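To compare the three suggestions side by side, here is a small sketch on data extended with two hypothetical rows whose true sum is 0 (these rows are not in the question). It suggests that min_count=1 and add with fill_value=0 keep a real zero, while replacing 0 with NaN does not.

```python
import numpy as np
import pandas as pd

# Rows 2 and 3 are illustrative additions whose genuine sum is 0.
df = pd.DataFrame({
    'col1': [np.nan, np.nan, -5.0, 0.0],
    'col2': [np.nan, 5.0, 5.0, 0.0],
})

# NaN only when no value at all is present in the row.
with_min_count = df[['col1', 'col2']].sum(axis=1, min_count=1)

# fill_value replaces a NaN on one side only; NaN on both sides stays NaN.
with_add = df['col1'].add(df['col2'], fill_value=0)

# Plain sum gives 0 for all-NaN rows, but replace(0, NaN) also hits real zeros.
with_replace = df[['col1', 'col2']].sum(axis=1).replace(0, np.nan)

print(pd.DataFrame({'min_count': with_min_count,
                    'add': with_add,
                    'replace': with_replace}))
```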

Delete row from dataframe having "None" value in all the columns - Python

I need to delete the row completely in a dataframe having "None" value in all the columns. I am using the following code -
df.dropna(axis=0,how='all',thresh=None,subset=None,inplace=True)
This does not bring any difference to the dataframe. The rows with "None" value are still there.
How to achieve this?
The Nones are probably the string 'None' rather than real missing values, so use replace first:
df = df.replace('None', np.nan).dropna(how='all')
df = pd.DataFrame({
'a':['None','a', 'None'],
'b':['None','g', 'None'],
'c':['None','v', 'b'],
})
print (df)
a b c
0 None None None
1 a g v
2 None None b
df1 = df.replace('None', np.nan).dropna(how='all')
print (df1)
a b c
1 a g v
2 NaN NaN b
Or keep the rows where at least one value is not equal to 'None', using DataFrame.ne and DataFrame.any:
df1 = df[df.ne('None').any(axis=1)]
print (df1)
a b c
1 a g v
2 None None b
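The difference between a real None and the text 'None' can be checked directly. The sketch below uses two small hypothetical frames (real_none and text_none, not from the question): dropna removes the row only when the values are actual missing values.

```python
import pandas as pd

# Hypothetical frames: one with real None objects, one with the text 'None'.
real_none = pd.DataFrame({'a': [None, 'a', None], 'b': [None, 'g', 'b']})
text_none = pd.DataFrame({'a': ['None', 'a', 'None'], 'b': ['None', 'g', 'b']})

# Real None counts as missing, so the all-missing row 0 is dropped.
dropped_real = real_none.dropna(how='all')

# The string 'None' is ordinary data, so nothing is dropped.
dropped_text = text_none.dropna(how='all')

# For strings, build the mask explicitly: rows where every cell equals 'None'.
all_none_text = text_none.eq('None').all(axis=1)
cleaned = text_none[~all_none_text]

print(dropped_real, dropped_text, cleaned, sep='\n\n')
```

If the frame came from a CSV or an API, inspecting a single cell with type(df.iloc[0, 0]) is a quick way to tell which case applies.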
You should be dropping in the axis 1. Use the how keyword to drop columns with any or all NaN values. Check the docs
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3], 'b':[-1, 0, np.nan], 'c':[np.nan, np.nan, np.nan]})
df
a b c
0 1 -1.0 NaN
1 2 0.0 NaN
2 3 NaN NaN
df.dropna(axis=1, how='any')
a
0 1
1 2
2 3
df.dropna(axis=1, how='all')
a b
0 1 -1.0
1 2 0.0
2 3 NaN

How to combine different columns in a dataframe using comprehension-python

Suppose a dataframe contains
attacker_1 attacker_2 attacker_3 attacker_4
Lannister nan nan nan
nan Stark greyjoy nan
I want to create another column called AttackerCombo that aggregates the 4 columns into 1 column.
How would I go about defining such code in python?
I have been practicing Python and I reckon a list comprehension makes sense here. However, [list(x) for x in attackers],
where attackers is a numpy array of the 4 columns, aggregates all 4 columns into 1 column without removing the nans, which I would also like to drop.
So the result for each row, instead of looking like
starknannanlannister, would look like stark/lannister
I think you need apply with join and remove NaN by dropna:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join(x.dropna()), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
If you need an empty string as the separator, use DataFrame.fillna:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].fillna('') \
.apply(''.join, axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Starkgreyjoy
Two more solutions with a list comprehension: the first filters with pd.notnull, the second keeps only strings:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join([e for e in x if pd.notnull(e)]), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
#python 3 - isinstance(e, str), python 2 - isinstance(e, basestring)
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']] \
.apply(lambda x: '/'.join([e for e in x if isinstance(e, str)]), axis=1)
print (df)
attacker_1 attacker_2 attacker_3 attacker_4 attackers
0 Lannister NaN NaN NaN Lannister
1 NaN Stark greyjoy NaN Stark/greyjoy
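Since the question asks about a plain list comprehension over a numpy array, the same idea can be sketched without apply. This is only an illustration under the assumption that every real value is a string; the column name AttackerCombo comes from the question, and cols is an illustrative helper name.

```python
import numpy as np
import pandas as pd

cols = ['attacker_1', 'attacker_2', 'attacker_3', 'attacker_4']
df = pd.DataFrame({
    'attacker_1': ['Lannister', np.nan],
    'attacker_2': [np.nan, 'Stark'],
    'attacker_3': [np.nan, 'greyjoy'],
    'attacker_4': [np.nan, np.nan],
})

# Iterate over the rows of the underlying array and keep only the strings,
# which skips NaN because NaN is a float.
df['AttackerCombo'] = [
    '/'.join(v for v in row if isinstance(v, str))
    for row in df[cols].to_numpy()
]
print(df['AttackerCombo'])
```

A row with no attackers at all would end up as an empty string rather than NaN.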
You can set a new column in the dataframe and fill it using a lambda function. Note that this formats NaN values as the text nan:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].apply(lambda x : '{}{}{}{}'.format(*x), axis=1)
You don't specify how you want to aggregate them, so for instance, if you want them separated by a dash:
df['attackers'] = df[['attacker_1','attacker_2','attacker_3','attacker_4']].apply(lambda x : '{}-{}-{}-{}'.format(*x), axis=1)
