How to fill NaN with user defined value in pandas dataframe - python-3.x

How to fill NaN with user defined value in pandas dataframe.
For text columns like A and B, a user-defined text such as 'Missing' should be imputed. For discrete numeric variables like C and D, the median value should be imputed. I have many columns like these, and I would like to apply this rule to all variables in the dataframe.
DF
A B C D
A0A1 Railway 10 NaN
A1A1 Shipping NaN 1
NaN Shipping 3 2
B1A1 NaN 1 7
DF out:
A B C D
A0A1 Railway 10 2
A1A1 Shipping 3 1
Missing Shipping 3 2
B1A1 Missing 1 7

You can pass a dict to fillna that maps each column to its fill value:
df.fillna({'A':'Miss','B':"Your2",'C':df.C.median(),'D':df.D.mean()})
Out[373]:
A B C D
0 A0A1 Railway 10.0 3.333333
1 A1A1 Shipping 3.0 1.000000
2 Miss Shipping 3.0 2.000000
3 B1A1 Your2 1.0 7.000000

Fun way!
d = {np.dtype('O'): 'Missing'}
df.fillna(df.dtypes.map(d).fillna(df.median(numeric_only=True)))
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0

First fill the numeric columns with their medians, then fill the remaining NaNs in the non-numeric columns with 'Missing':
df = df.fillna(df.median(numeric_only=True)).fillna('Missing')
print(df)
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
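Since the question asks for a rule that covers many columns, here is a minimal sketch that selects the columns by dtype instead of naming them. It assumes every numeric column should get its median and every other column should get 'Missing'; the names num_cols, text_cols and out are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({
    "A": ["A0A1", "A1A1", np.nan, "B1A1"],
    "B": ["Railway", "Shipping", "Shipping", np.nan],
    "C": [10, np.nan, 3, 1],
    "D": [np.nan, 1, 2, 7],
})

# Split the columns by dtype (assumption: anything non-numeric is text)
num_cols = df.select_dtypes(include="number").columns
text_cols = df.columns.difference(num_cols)

out = df.copy()
# Numeric columns: fill each column with its own median
out[num_cols] = out[num_cols].fillna(out[num_cols].median())
# Remaining columns: fill with a fixed label
out[text_cols] = out[text_cols].fillna("Missing")
print(out)
```

Selecting by dtype avoids calling median on text columns, which recent pandas versions refuse to do without numeric_only=True.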

Related

Get value from grouped data frame maximum in another column

Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.
I'm trying to return and assign (ideally in a different column) the value of column c, determined by the row with the maximum in column b (which are dates). The dataframe is grouped by column a, but I keep getting errors. The data frame is relatively large, with over 16,000 rows.
df = pd.DataFrame({'a': ['a','a','a','b','b','b','c','c','c'],
'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
'c': [np.nan,6,8,2,1,np.nan,6,np.nan,7]})
df['b'] = pd.to_datetime(df['b'])
df
a b c
0 a 2008-11-01 NaN
1 a 2022-07-01 6.0
2 a 2017-02-01 8.0
3 b 2017-02-01 2.0
4 b 2018-02-01 1.0
5 b 2008-11-01 NaN
6 c 2014-11-01 6.0
7 c 2008-11-01 NaN
8 c 2022-07-01 7.0
I want the following result:
a b c d
0 a 2008-11-01 NaN 8.0
1 a 2022-07-01 6.0 8.0
2 a 2017-02-01 8.0 8.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
I've tried the following, but grouped series ('SeriesGroupBy') cannot be cast to another dtype:
df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()
I've set column b to int before running the above code, but get an error saying none of the values in c are in the index.
None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]
I've tried expanding the command, but it raises errors (which I just realized come from not passing an alternative value to the where function, but I don't know what the correct value would be):
df= df.groupby('a')
df['d'] = df.['c'].where(grouped['b'] == grouped['b'].max())
I've also tried using solutions provided here, here, and here.
Any help or direction would be appreciated.
I'm assuming the first three values should be 6.0 (because maximum date for group a is 2022-07-01 with value 6):
df["d"] = df["a"].map(
df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)
Prints:
a b c d
0 a 2008-11-01 NaN 6.0
1 a 2022-07-01 6.0 6.0
2 a 2017-02-01 8.0 6.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
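The same result can probably be reached without groupby.apply. The sketch below assumes, like the answer above, that d should hold the value of c on the row with the latest date in each group; the names idx and latest are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "b": ["2008-11-01", "2022-07-01", "2017-02-01", "2017-02-01", "2018-02-01",
          "2008-11-01", "2014-11-01", "2008-11-01", "2022-07-01"],
    "c": [np.nan, 6, 8, 2, 1, np.nan, 6, np.nan, 7],
})
df["b"] = pd.to_datetime(df["b"])

# Row label of the latest date within each group of column a
idx = df.groupby("a")["b"].idxmax()
# Value of c on those rows, indexed by the group key
latest = df.loc[idx].set_index("a")["c"]
# Broadcast the per-group value back to every row
df["d"] = df["a"].map(latest)
print(df)
```

Because idxmax is a built-in groupby method, this avoids calling a Python lambda once per group, which may matter on larger frames.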

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two table so that the margin will only be joined to the product with maximum sequence number, the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
.transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temporary column t to both frames (True in df_item only where the sequence is the group maximum, True everywhere in df_margin), then do an outer merge and drop the temporary column:
new_df = (
df_item.assign(
t=df_item
.groupby('item')['sequence']
.transform('max')
.eq(df_item['sequence'])
).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
All of these produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')
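A variation on the answers above, sketched with Series.map instead of merge. It assumes each item appears exactly once in df_margin; the names margin_by_item and is_last are only illustrative.

```python
import pandas as pd

df_margin = pd.DataFrame({"item": ["a", "b", "c"], "margin": [3, 4, 5]})
df_item = pd.DataFrame({"item": ["a", "a", "a", "b", "b", "c", "c", "c"],
                        "sequence": [1, 2, 3, 1, 2, 1, 2, 3]})

# Lookup table: item -> margin (assumes unique items in df_margin)
margin_by_item = df_margin.set_index("item")["margin"]
# True only on the row holding the maximum sequence of each item
is_last = df_item["sequence"].eq(
    df_item.groupby("item")["sequence"].transform("max")
)
# Map the margin onto every row, then blank out all but the last row per item
new_df = df_item.assign(margin=df_item["item"].map(margin_by_item).where(is_last))
print(new_df)
```

Using map keeps the row order and index of df_item unchanged, so no re-sorting is needed afterwards.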

Calculate difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount to build a per-group counter for column col 1 and reshape with DataFrame.set_index and Series.unstack. Then apply DataFrame.diff along the columns, drop the first column (which is all NaN) with DataFrame.iloc, convert the timedeltas to days with Series.dt.days for every column, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
.unstack()
.diff(axis=1)
.iloc[:, 1:]
.apply(lambda x: x.dt.days)
.add_prefix('diff_')
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff together with a counter for the new columns via DataFrame.assign, remove the rows where c1 is NaN with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
c1 = df.groupby('col 1')['col 2'].diff().dt.days)
.dropna(subset=['c1'])
.pivot(index='col 1', columns='g', values='c1')
.add_prefix('diff_')
.rename_axis(None, axis=1)
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference in nan
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
After assigning cumcount, df looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Pandas: GroupBy Shift And Cumulative Sum

I want to do a groupby, shift and cumsum, which seems like a pretty trivial task, but I am still banging my head over the result I'm getting. Can someone please tell me what I am doing wrong? All the results I found online show the same thing, or a variation of what I am doing. Below is my implementation.
temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID')['X'].cumsum().shift()
print(temp)
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 3.0
4 b 1 1.0
5 b 1 2.0
6 c 1 3.0
7 c 1 1.0
This is wrong because the actual or what I am looking for is as below:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
Thanks a lot in advance.
You could use transform() to feed the separate groups that are created at each level of groupby into the cumsum() and shift() methods.
temp['transformed'] = \
temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
For more info on transform() please see here:
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#Transformation
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#transformation
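You need to use apply here: cumsum is called on the groupby object, whereas the chained shift is applied to the whole column, so both have to run inside each group. Passing group_keys=False keeps the original index so the result can be assigned back:
temp['transformed'] = temp.groupby('ID', group_keys=False)['X'].apply(lambda x : x.cumsum().shift())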
temp
Out[287]:
ID X transformed
0 a 1 NaN
1 a 1 1.0
2 a 1 2.0
3 b 1 NaN
4 b 1 1.0
5 b 1 2.0
6 c 1 NaN
7 c 1 1.0
While working on this problem, I noticed that as the DataFrame grows, using lambdas with transform gets very slow. Using the built-in groupby methods (like cumsum and shift) instead of lambdas is much faster.
So here's my proposed solution, creating a 'temp' column to save the cumsum for each ID and then shifting in a different groupby:
df['temp'] = df.groupby("ID")['X'].cumsum()
df['transformed'] = df.groupby("ID")['temp'].shift()
df = df.drop(columns=["temp"])
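For reference, a self-contained sketch of the same two-step idea on the sample data from the question. It groups the intermediate Series by the ID column directly, so no temporary column is needed; the name running is only illustrative.

```python
import pandas as pd

temp = pd.DataFrame(
    data=[["a", 1], ["a", 1], ["a", 1], ["b", 1], ["b", 1], ["b", 1], ["c", 1], ["c", 1]],
    columns=["ID", "X"],
)

# Step 1: cumulative sum within each ID
running = temp.groupby("ID")["X"].cumsum()
# Step 2: shift within each ID, so the first row of every group becomes NaN
temp["transformed"] = running.groupby(temp["ID"]).shift()
print(temp)
```

Both steps use built-in groupby methods, so nothing is evaluated through a Python-level lambda.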

Pandas dataframe: Number of dates in group prior to row date

I would like to add a column in a dataframe that contains for each group G the number of distinct observations in variable x that happened before time t.
Note: t is in datetime format and missing values in the data are possible but can be ignored. The same x can appear multiple times in a group but then it is assigned the same date. The time assigned to x is not the same across groups.
I hope this example helps:
Input:
Group x t
1 a 2013-11-01
1 b 2015-04-03
1 b 2015-04-03
1 c NaT
2 a 2017-03-01
2 c 2013-11-06
2 d 2015-04-26
2 d 2015-04-26
2 d 2015-04-26
2 b NaT
Output:
Group x t Number of unique x before time t
1 a 2013-11-01 0
1 b 2015-04-03 1
1 b 2015-04-03 1
1 c NaT NaN
2 a 2017-03-01 2
2 c 2013-11-06 0
2 d 2015-04-26 1
2 d 2015-04-26 1
2 d 2015-04-26 1
2 b NaT NaN
The dataset is quite large, so I wonder if there is any vectorized way to do this (e.g. using groupby).
Many Thanks
Here's another method.
The initial sort makes it so fillna will work later on.
Create df2, which calculates the unique number of days within each group before that date.
Merge the number of days back to the original df. fillna then takes care of the days which were duplicated (the sort ensures this happens properly)
Dates with NaT were placed at the end for the cumsum so just reset them to NaN
If you want to reorder at the end to the original order, just sort the index df.sort_index(inplace=True)
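import pandas as pd
import numpy as np
# Convert first so the sort is chronological, with NaT last within each group
df['t'] = pd.to_datetime(df.t)
df = df.sort_values(by=['Group', 't'])
# One row per distinct (Group, x, t) with a valid date; copy to avoid SettingWithCopyWarning
df2 = df[df.t.notnull()].drop_duplicates().copy()
df2['temp'] = 1
df2['num_b4'] = df2.groupby('Group').temp.cumsum()-1
df = df.merge(df2[['num_b4']], left_index=True, right_index=True, how='left')
# Forward-fill the duplicated rows (same effect as the older fillna(method='ffill'))
df['num_b4'] = df['num_b4'].ffill()
df.loc[df.t.isnull(), 'num_b4'] = np.nan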
# Group x t num_b4
#0 1 a 2013-11-01 0.0
#1 1 b 2015-04-03 1.0
#2 1 b 2015-04-03 1.0
#3 1 c NaT NaN
#5 2 c 2013-11-06 0.0
#6 2 d 2015-04-26 1.0
#7 2 d 2015-04-26 1.0
#8 2 d 2015-04-26 1.0
#4 2 a 2017-03-01 2.0
#9 2 b NaT NaN
IIUC, for the new cases you only need to change a single line in the above code.
# df2 = df2.drop_duplicates()
df2 = df2.drop_duplicates(['Group', 't'])
With that, the same day that has multiple x values assigned to it does not cause the number of observations to increment. See the output for Group 3 below, in which I added 4 rows to your initial data.
Group x t
3 a 2015-04-03
3 b 2015-04-03
3 c 2015-04-03
3 c 2015-04-04
## Apply the Code changing the drop_duplicates() line
Group x t num_b4
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
5 2 c 2013-11-06 0.0
6 2 d 2015-04-26 1.0
7 2 d 2015-04-26 1.0
8 2 d 2015-04-26 1.0
4 2 a 2017-03-01 2.0
9 2 b NaT NaN
10 3 a 2015-04-03 0.0
11 3 b 2015-04-03 0.0
12 3 c 2015-04-03 0.0
13 3 c 2015-04-04 1.0
You can do it with a custom function that uses merge for a self-join, then groupby and nunique to count unique values:
def countunique(x):
    df_out = x.merge(x, on='Group')\
              .query('x_x != x_y and t_y < t_x')\
              .groupby(['x_x','t_x'])['x_y'].nunique()\
              .reset_index()
    result = x.merge(df_out, left_on=['x','t'],
                     right_on=['x_x','t_x'],
                     how='left')
    result = result[['Group','x','t','x_y']]
    result.loc[result.t.notnull(),'x_y'] = result.loc[result.t.notnull(),'x_y'].fillna(0)
    return result.rename(columns={'x_y':'No of unique x before t'})
df.groupby('Group', group_keys=False).apply(countunique)
Output:
Group x t No of unique x before t
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
0 2 a 2017-03-01 2.0
1 2 c 2013-11-06 0.0
2 2 d 2015-04-26 1.0
3 2 d 2015-04-26 1.0
4 2 d 2015-04-26 1.0
5 2 b NaT NaN
Explanation:
For each group,
Perform a self-join using merge on 'Group'.
Filter the result of the self-join, keeping only rows whose time is before that of the current record.
Use groupby with nunique to count only unique values of x from the self-join.
Merge the count of x back into the original dataframe, keeping all rows with how='left'.
Fill NaN values with zero where the record has a time.
Rename the column headings.
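A self-join grows quadratically with the group size, so on a large dataset a rank-based sketch may be worth trying. This is not the method used in the answers above, and it relies on the assumption stated in the question that a given x always has the same date within a group. The names uniq, n_before and out are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "x": ["a", "b", "b", "c", "a", "c", "d", "d", "d", "b"],
    "t": ["2013-11-01", "2015-04-03", "2015-04-03", None, "2017-03-01",
          "2013-11-06", "2015-04-26", "2015-04-26", "2015-04-26", None],
})
df["t"] = pd.to_datetime(df["t"])

# One row per (Group, x) that has a valid date
uniq = df.dropna(subset=["t"]).drop_duplicates(["Group", "x"])
# rank(method="min") is 1 + the number of rows with a strictly earlier date,
# so subtracting 1 gives the number of distinct x before t
uniq = uniq.assign(n_before=uniq.groupby("Group")["t"].rank(method="min") - 1)
# Bring the count back to every original row; rows with NaT stay NaN
out = df.merge(uniq[["Group", "x", "n_before"]], on=["Group", "x"], how="left")
print(out)
```

Because ties share the minimum rank, several x values on the same date are not counted as being before each other.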
