I have the following dataframe: df
                     name  width  length
timestamp
2019-08-01 00:00:08    10   10.0     NaN
2019-08-01 00:00:19    10    NaN     NaN
2019-08-01 00:00:56    10    NaN    86.0
2019-08-01 00:00:08    12    NaN      90
2019-08-01 00:00:19    12   12.0     NaN
2019-08-01 00:00:28    12    NaN     NaN
I would like to apply forward and backward fill to the columns 'width' and 'length' within each group of the column 'name'. The result would look like this:
                     name  width  length
timestamp
2019-08-01 00:00:08    10   10.0      86
2019-08-01 00:00:19    10   10.0      86
2019-08-01 00:00:56    10   10.0      86
2019-08-01 00:00:08    12   12.0      90
2019-08-01 00:00:19    12   12.0      90
2019-08-01 00:00:28    12   12.0      90
Any ideas how to do this?
We need groupby with apply, since we chain the two functions ffill and bfill together:
df.update(df.groupby('name').apply(lambda x : x.ffill().bfill()))
Since, as you said, each unique name has only one value of width and length, you may be able to avoid apply by using transform with max or first:
df.update(df.groupby('name')[['width','length']].transform('max'))
Out[87]:
name width length
timestamp
2019-08-01 00:00:08 10 10.0 86.0
2019-08-01 00:00:19 10 10.0 86.0
2019-08-01 00:00:56 10 10.0 86.0
2019-08-01 00:00:08 12 12.0 90.0
2019-08-01 00:00:19 12 12.0 90.0
2019-08-01 00:00:28 12 12.0 90.0
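A third option, shown here only as a minimal sketch, is to chain ffill and bfill inside transform, which keeps the row order of the original frame. The frame below is a hypothetical reconstruction of the question's data, and the variable names cols and filled are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the frame from the question
df = pd.DataFrame(
    {
        "name": [10, 10, 10, 12, 12, 12],
        "width": [10.0, np.nan, np.nan, np.nan, 12.0, np.nan],
        "length": [np.nan, np.nan, 86.0, 90.0, np.nan, np.nan],
    },
    index=pd.to_datetime(
        [
            "2019-08-01 00:00:08",
            "2019-08-01 00:00:19",
            "2019-08-01 00:00:56",
            "2019-08-01 00:00:08",
            "2019-08-01 00:00:19",
            "2019-08-01 00:00:28",
        ]
    ).rename("timestamp"),
)

cols = ["width", "length"]

# Fill forward, then backward, separately inside each name group
filled = df.groupby("name")[cols].transform(lambda s: s.ffill().bfill())

# transform returns rows in the original order, so a positional assignment
# sidesteps any alignment issues caused by the duplicated timestamps
df[cols] = filled.to_numpy()
print(df)
```

Unlike transform('max'), this does not rely on each group holding a single distinct value; it simply propagates whatever non-NaN neighbours exist in the group.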
Related
I have a dataframe as shown below
Date t_factor t1 t2 t3 t_function
2020-02-01 5 4 NaN NaN 4
2020-02-03 23 6 NaN NaN 6
2020-02-06 14 9 NaN NaN 9
2020-02-09 23 NaN NaN NaN 0
2020-02-10 23 NaN NaN NaN 0
2020-02-11 23 NaN NaN NaN 0
2020-02-13 30 NaN 3 NaN 3
2020-02-20 29 NaN 66 NaN 66
2020-02-29 100 NaN 291 NaN 291
2020-03-01 38 NaN NaN NaN 0
2020-03-10 38 NaN NaN NaN 0
2020-03-11 38 NaN NaN 4 4
2020-03-26 70 NaN NaN 4 4
2020-03-29 70 NaN NaN 4 4
In each column, I would like to fill the NaN values that follow a non-NaN value with the last non-NaN value of that column.
The columns I want to impute are t1, t2 and t3.
Expected Output
Date t_factor t1 t2 t3 t_function
2020-02-01 5 4 NaN NaN 4
2020-02-03 23 6 NaN NaN 6
2020-02-06 14 9 NaN NaN 9
2020-02-09 23 9 NaN NaN 0
2020-02-10 23 9 NaN NaN 0
2020-02-11 23 9 NaN NaN 0
2020-02-13 30 9 3 NaN 3
2020-02-20 29 9 66 NaN 66
2020-02-29 100 9 291 NaN 291
2020-03-01 38 9 291 NaN 0
2020-03-10 38 9 291 NaN 0
2020-03-11 38 9 291 4 4
2020-03-26 70 9 291 4 4
2020-03-29 70 9 291 4 4
Use ffill:
df[['t1', 't2', 't3']] = df[['t1', 't2', 't3']].ffill()
Result:
Date t_factor t1 t2 t3 t_function
0 2020-02-01 5 4.0 NaN NaN 4
1 2020-02-03 23 6.0 NaN NaN 6
2 2020-02-06 14 9.0 NaN NaN 9
3 2020-02-09 23 9.0 NaN NaN 0
4 2020-02-10 23 9.0 NaN NaN 0
5 2020-02-11 23 9.0 NaN NaN 0
6 2020-02-13 30 9.0 3.0 NaN 3
7 2020-02-20 29 9.0 66.0 NaN 66
8 2020-02-29 100 9.0 291.0 NaN 291
9 2020-03-01 38 9.0 291.0 NaN 0
10 2020-03-10 38 9.0 291.0 NaN 0
11 2020-03-11 38 9.0 291.0 4.0 4
12 2020-03-26 70 9.0 291.0 4.0 4
13 2020-03-29 70 9.0 291.0 4.0 4
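The leading NaNs in t2 and t3 stay in place because ffill only propagates values forward; there is nothing earlier in the column to copy from. A small sketch on a shortened, hypothetical version of the frame shows the same behaviour:

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the question's frame
df = pd.DataFrame(
    {
        "t1": [4, 6, 9, np.nan, np.nan, np.nan],
        "t2": [np.nan, np.nan, np.nan, 3, np.nan, np.nan],
    }
)

# Each NaN takes the most recent non-NaN value above it, column by column
filled = df[["t1", "t2"]].ffill()
print(filled)
```

Here t1 ends in a run of 9.0 while the first three rows of t2 remain NaN, which matches the expected output in the question.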
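We can define a function for that:
def improve(series):
    # Work on a plain list so the input Series is left untouched
    values = series.tolist()
    for i in range(1, len(values)):
        # Copy the previous value forward whenever the current one is missing
        if pd.isnull(values[i]):
            values[i] = values[i - 1]
    return pd.Series(values, index=series.index)
I hope you got the basic idea. The function works on a whole column, so call it directly rather than through apply (which would pass one value at a time):
df['t1'] = improve(df['t1'])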
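Here is how I would do it:
def fill_na(col):
    # Label of the last non-NaN value in the column (assumes the default integer index)
    ind = df[col].last_valid_index()
    # Assign through .loc; chained indexing with inplace=True may only modify a copy
    df.loc[ind + 1:, col] = df.loc[ind + 1:, col].fillna(df.loc[ind, col])
fill_na('t1')
fill_na('t2')
fill_na('t3')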
I have df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 -23
2020-02-06 14
2020-02-09 23
2020-02-10 -2
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
From the above, I would like to replace the negative values in the column t_factor with NaN.
Expected output:
Date t_factor
2020-02-01 5
2020-02-03 NaN
2020-02-06 14
2020-02-09 23
2020-02-10 NaN
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
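You can also use pandas' clip, which sets values outside a boundary to the boundary value, and then chain it with replace as below. Note that this only catches values less than or equal to -1, so it suits integer data; a value such as -0.5 would pass through unchanged: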
df['t_factor'] = df['t_factor'].clip(-1).replace(-1, np.nan)
df
Output:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
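One caveat with the clip approach is worth a quick check: a negative value between -1 and 0 is not clipped, so it is never replaced. The sketch below uses a made-up Series (not the question's data) to compare clip plus replace with a plain mask:

```python
import numpy as np
import pandas as pd

# Made-up values, including a fractional negative
s = pd.Series([5, -23, -0.5, -1, np.nan])

# clip raises everything below -1 to -1, then -1 is swapped for NaN
clipped = s.clip(lower=-1).replace(-1, np.nan)

# mask replaces every value where the condition is True
masked = s.mask(s < 0)

print(pd.DataFrame({"clipped": clipped, "masked": masked}))
```

In this run -0.5 survives in clipped but becomes NaN in masked, so the clip version is only safe when the data cannot contain such values.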
Use Series.mask:
df['t_factor'] = df['t_factor'].mask(df['t_factor'].lt(0))
Or use boolean indexing and assign np.nan:
df.loc[df['t_factor'].lt(0), 't_factor'] = np.nan
Result:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
Use pd.Series.where; by default it replaces values where the condition is False with NaN. Use >= rather than > so that zeros are kept:
df["t_factor"] = df.t_factor.where(df.t_factor >= 0)
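The choice of comparison operator matters as soon as the column can contain zeros. A short sketch on a made-up Series (the names strict, keep_zero and masked are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up values with a zero in them
s = pd.Series([5.0, 0.0, -2.0, np.nan])

strict = s.where(s > 0)      # keeps only strictly positive values
keep_zero = s.where(s >= 0)  # keeps zero as well
masked = s.mask(s < 0)       # same result as keep_zero, condition inverted

print(pd.DataFrame({"strict": strict, "keep_zero": keep_zero, "masked": masked}))
```

where keeps the values for which the condition is True, while mask replaces them, so `s.where(s >= 0)` and `s.mask(s < 0)` agree here.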
I have the following dataframe df:
             length            timestamp  width
name
testschip-1     NaN  2019-08-01 00:00:00    NaN
testschip-1     NaN  2019-08-01 00:00:09    NaN
testschip-1       2  2019-08-01 00:00:20    NaN
testschip-1       2  2019-08-01 00:00:27    NaN
testschip-1     NaN  2019-08-01 00:00:38      1
testschip-2       4  2019-08-01 00:00:39      2
testschip-2       4  2019-08-01 00:00:57    NaN
testschip-2       4  2019-08-01 00:00:58    NaN
testschip-2     NaN  2019-08-01 00:01:17    NaN
testschip-3     NaN  2019-08-01 00:02:27    NaN
testschip-3     NaN  2019-08-01 00:03:47    NaN
First, I want to remove the string "testschip-" from the index "name" so that the index contains integers only. Second, per unique index value I want to apply forward fill or backward fill (whichever is necessary to obtain no NaNs) on both columns 'length' and 'width'. Each unique index value has the same "length" and "width". On "testschip-3" I don't want to apply backward or forward fill. If I do a backward fill on "testschip-1" (which is needed to set the first two rows to '2'), I get an unwanted '4' in the last row of "testschip-1". I cannot judge beforehand whether I have to apply backward or forward fill, since I have 4 million rows of data to start with.
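Use:
# str.lstrip strips a set of characters rather than a prefix, so remove the exact substring instead
df.index = df.index.str.replace('testschip-', '', regex=False).astype(int)
#alternative
#df.index = df.index.str[10:].astype(int)
#df.index = df.index.str.split('-').str[-1].astype(int)
# group_keys=False keeps the original index; assign the result back, since apply does not modify df in place
df = df.groupby(level=0, group_keys=False).apply(lambda x: x.bfill().ffill())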
Output
length timestamp width
name
1 2.0 2019-08-01 00:00:00 1.0
1 2.0 2019-08-01 00:00:09 1.0
1 2.0 2019-08-01 00:00:20 1.0
1 2.0 2019-08-01 00:00:27 1.0
1 2.0 2019-08-01 00:00:38 1.0
2 4.0 2019-08-01 00:00:39 2.0
2 4.0 2019-08-01 00:00:57 2.0
2 4.0 2019-08-01 00:00:58 2.0
2 4.0 2019-08-01 00:01:17 2.0
3 NaN 2019-08-01 00:02:27 NaN
3 NaN 2019-08-01 00:03:47 NaN
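The same idea can be sketched with transform, which always returns a result aligned to the original rows. The frame below is a shortened, hypothetical version of the question's data (the timestamp column is left out), and filled is an illustrative name.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the question's frame
df = pd.DataFrame(
    {
        "length": [np.nan, 2, np.nan, 4, np.nan, np.nan, np.nan],
        "width": [np.nan, np.nan, 1, 2, np.nan, np.nan, np.nan],
    },
    index=pd.Index(
        ["testschip-1"] * 3 + ["testschip-2"] * 2 + ["testschip-3"] * 2,
        name="name",
    ),
)

# Keep only the part after the last "-" and convert it to int
df.index = df.index.str.split("-").str[-1].astype(int)

# Fill backward, then forward, inside each index group only;
# values never leak from one group into the next
filled = df.groupby(level=0).transform(lambda s: s.bfill().ffill())
print(filled)
```

Because the fill happens per group, group 1 never picks up the 4 from group 2, and group 3, which has no values at all, stays NaN.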
My df looks like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, the values in the param column repeat, and the transposed column names are built from the per values. Also, a new record is created whenever per starts again at 1. How can I achieve this?
The main problem here is the NaNs, especially in the last LKG group. First replace the missing values with a counter, built from a per-group cumulative sum of the missing-value mask, and assign the result to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print (df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then build a record counter by comparing per1 with 1 and taking the cumulative sum, put it into a MultiIndex together with param and per1, and reshape with unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1',g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print (df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
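The key step is the record counter g: every time per1 equals 1 the cumulative sum goes up by one, so all rows of one output record share the same number. A minimal sketch using the per1 values from the intermediate frame above (record_id is an illustrative name):

```python
import pandas as pd

# per1 values as they appear in the intermediate frame
per1 = pd.Series([1, 2, 1, 2, 3, 1, 1, 2, 3, 1, 1, 2])

# True wherever a new record starts; the running total labels each record
record_id = per1.eq(1).cumsum()
print(record_id.tolist())
```

The six distinct labels correspond to the six rows of the reshaped output.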
I have a dataframe like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 3.0 2019-10-01 11.25
3 MMG 1.0 2019-08-01 15.71
4 MMG 2.0 2020-10-01 11.50
5 MMG 3.0 2021-10-01 11.75
6 MMG 4.0 2014-01-01 14.00
I would like to have an output like this,
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 XYZ 1 2 3 NaN 2018-10-01 2017-08-01 2019-10-01 NaN 11.0 15.25 11.25 NaN
1 MMG 1 2 3 4 2019-08-01 2020-10-01 2021-10-01 2014-01-01 15.71 11.50 11.75 14.00
I tried the following,
df.vstack().reset_index().drop('level_1',axis=0)
This is not giving me the output I need.
As you can see, the per column holds incremental values that can become part of the column names when the data is transposed.
Any suggestion would be great.
Use GroupBy.cumcount as a counter, reshape with DataFrame.unstack, and finally flatten the column names with f-strings:
df = df.set_index(['param', df.groupby('param').cumcount().add(1)]).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 \
0 MMG 1.0 2.0 3.0 4.0 2019-08-01 2020-10-01 2021-10-01
1 XYZ 1.0 2.0 3.0 NaN 2018-10-01 2017-08-01 2019-10-01
per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 2014-01-01 15.71 11.50 11.75 14.0
1 NaN 11.00 15.25 11.25 NaN
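For anyone who wants to run this end to end, here is a self-contained sketch on a shortened, hypothetical version of the frame (per_date is omitted, and counter and wide are illustrative names):

```python
import pandas as pd

# Shortened, hypothetical version of the question's frame
df = pd.DataFrame(
    {
        "param": ["XYZ", "XYZ", "MMG", "MMG", "MMG"],
        "per": [1.0, 2.0, 1.0, 2.0, 3.0],
        "per_num": [11.0, 15.25, 15.71, 11.5, 11.75],
    }
)

# 1-based position of each row within its param group
counter = df.groupby("param").cumcount().add(1)

# Move the counter into the columns, then flatten the two-level column names
wide = df.set_index(["param", counter]).unstack()
wide.columns = [f"{a}_{b}" for a, b in wide.columns]
wide = wide.reset_index()
print(wide)
```

Note that unstack sorts the row index, so MMG comes before XYZ in the result; groups with fewer rows get NaN in the columns they cannot fill.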