Pandas forward and backward fill on distinct indices - python-3.x

I have the following dataframe df:
length timestamp width
name
testschip-1 NaN 2019-08-01 00:00:00 NaN
testschip-1 NaN 2019-08-01 00:00:09 NaN
testschip-1 2 2019-08-01 00:00:20 NaN
testschip-1 2 2019-08-01 00:00:27 NaN
testschip-1 NaN 2019-08-01 00:00:38 1
testschip-2 4 2019-08-01 00:00:39 2
testschip-2 4 2019-08-01 00:00:57 NaN
testschip-2 4 2019-08-01 00:00:58 NaN
testschip-2 NaN 2019-08-01 00:01:17 NaN
testschip-3 NaN 2019-08-01 00:02:27 NaN
testschip-3 NaN 2019-08-01 00:03:47 NaN
First, I want to strip the string "testschip-" from the index "name" so the index contains integers only. Second, within each unique index I want to apply forward fill or backward fill (whichever is necessary to remove all NaNs) on both the 'length' and 'width' columns. Each unique index has a single "length" and "width" value. On "testschip-3" I don't want to apply backward or forward fill at all. If I backward fill over the whole frame to fix "testschip-1" (which is needed to set its first two rows to '2'), the last row of "testschip-1" wrongly picks up the '4' from "testschip-2". And I cannot judge beforehand whether forward or backward fill is required, since I start with 4 million rows of data.

Use:
df.index = df.index.str.lstrip('testschip-').astype(int)
#note: lstrip strips a character *set*, not a prefix; it works here only because digits are not in that set
#safer alternatives
#df.index = df.index.str[10:].astype(int)
#df.index = df.index.str.split('-').str[-1].astype(int)
df = df.groupby(level=0).apply(lambda x: x.bfill().ffill())
Output
length timestamp width
name
1 2.0 2019-08-01 00:00:00 1.0
1 2.0 2019-08-01 00:00:09 1.0
1 2.0 2019-08-01 00:00:20 1.0
1 2.0 2019-08-01 00:00:27 1.0
1 2.0 2019-08-01 00:00:38 1.0
2 4.0 2019-08-01 00:00:39 2.0
2 4.0 2019-08-01 00:00:57 2.0
2 4.0 2019-08-01 00:00:58 2.0
2 4.0 2019-08-01 00:01:17 2.0
3 NaN 2019-08-01 00:02:27 NaN
3 NaN 2019-08-01 00:03:47 NaN
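For reference, a self-contained sketch of the same idea on a toy frame (the data below is abbreviated, not the asker's full 4-million-row set); passing group_keys=False keeps the original index so the result aligns with the input:

```python
import numpy as np
import pandas as pd

# Abbreviated version of the frame above
df = pd.DataFrame(
    {"length": [np.nan, 2.0, np.nan, 4.0, np.nan],
     "width": [np.nan, np.nan, 1.0, 2.0, np.nan]},
    index=pd.Index(["testschip-1", "testschip-1", "testschip-1",
                    "testschip-2", "testschip-3"], name="name"),
)

# Take the trailing number; split avoids lstrip's character-set pitfall
df.index = df.index.str.split("-").str[-1].astype(int)

# Fill within each group only, so values never leak between ships;
# testschip-3 has no values at all, so it stays NaN
filled = df.groupby(level=0, group_keys=False).apply(lambda x: x.bfill().ffill())
```

Because the fills run per group, the order of bfill and ffill does not matter and there is no risk of pulling a value in from a neighbouring ship.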

Related

Pandas Forward Backward fill on columns within column level

I have the following dataframe: df
name width length
timestamp
2019-08-01 00:00:08 10 10.0 NaN
2019-08-01 00:00:19 10 NaN NaN
2019-08-01 00:00:56 10 NaN 86.0
2019-08-01 00:00:08 12 NaN 90
2019-08-01 00:00:19 12 12.0 NaN
2019-08-01 00:00:28 12 NaN NaN
I would like to apply forward and backward fill on the columns 'width' and 'length' within each value of the column 'name'. The result would look like this:
name width length
timestamp
2019-08-01 00:00:08 10 10.0 86
2019-08-01 00:00:19 10 10.0 86
2019-08-01 00:00:56 10 10.0 86
2019-08-01 00:00:08 12 12.0 90
2019-08-01 00:00:19 12 12.0 90
2019-08-01 00:00:28 12 12.0 90
Any ideas how to do this?
We need groupby with apply, since we chain the two functions ffill and bfill together:
df.update(df.groupby('name').apply(lambda x : x.ffill().bfill()))
Since, as you said, each unique name has only one value of width and length, you can avoid apply by using transform with 'max' or 'first':
df.update(df.groupby('name')[['width','length']].transform('max'))
Out[87]:
name width length
timestamp
2019-08-01 00:00:08 10 10.0 86.0
2019-08-01 00:00:19 10 10.0 86.0
2019-08-01 00:00:56 10 10.0 86.0
2019-08-01 00:00:08 12 12.0 90.0
2019-08-01 00:00:19 12 12.0 90.0
2019-08-01 00:00:28 12 12.0 90.0
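A runnable sketch of the transform('max') variant on a reconstructed copy of the sample frame. Note this shortcut is only valid because each name carries at most one distinct width and one distinct length:

```python
import numpy as np
import pandas as pd

# Reconstructed sample (timestamps dropped; they play no role in the fill)
df = pd.DataFrame({
    "name": [10, 10, 10, 12, 12, 12],
    "width": [10.0, np.nan, np.nan, np.nan, 12.0, np.nan],
    "length": [np.nan, np.nan, 86.0, 90.0, np.nan, np.nan],
})

# max per group skips NaN, so it recovers the single real value,
# and update writes it back into every row of the group
df.update(df.groupby("name")[["width", "length"]].transform("max"))
```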

Transpose DF columns based on column values - Pandas

My df looks like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, the param column has repeating values, and the transposed column names are created from the per values. Also, a new record starts as soon as the per values restart at 1. How can I achieve this?
The main problem here is the NaNs in the last LKG group. First replace the missing values with a counter created by cumcount, and assign the result to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print (df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then create a MultiIndex with a group level built by comparing per1 with 1 and taking the cumulative sum, and reshape with unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1',g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print (df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
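Putting the steps together, a self-contained sketch on an abbreviated copy of the sample data. The helper series is renamed to g here (an assumption for readability) so that two index levels do not share the name per1:

```python
import numpy as np
import pandas as pd

# Abbreviated input: one per-group of XYZ dropped to keep the frame small
df = pd.DataFrame({
    "param": ["XYZ", "XYZ", "XYZ", "MMG", "LKG"],
    "per": [1.0, 2.0, 1.0, 1.0, np.nan],
    "per_date": ["2018-10-01", "2017-08-01", "2019-10-01", "2021-10-01", np.nan],
    "per_num": [11.0, 15.25, 11.25, 11.75, np.nan],
})

# Step 1: give the all-NaN rows a counter so every row has a position
s = df["per"].isna().groupby(df["param"]).cumsum()
df = df.assign(per1=df["per"].fillna(s).astype(int))

# Step 2: a new record starts whenever the counter resets to 1
g = df["per1"].eq(1).cumsum().rename("g")
out = df.set_index(["param", "per1", g]).unstack(1).sort_index(axis=1, level=1)
out.columns = [f"{a}_{b}" for a, b in out.columns]
out = out.reset_index(level=1, drop=True).reset_index()
```

After unstack the rows come back sorted by param, matching the printed output above (LKG first, then MMG, then the XYZ records).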

Transpose dataframe columns based on a column's value - Pandas

I have a dataframe like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 3.0 2019-10-01 11.25
3 MMG 1.0 2019-08-01 15.71
4 MMG 2.0 2020-10-01 11.50
5 MMG 3.0 2021-10-01 11.75
6 MMG 4.0 2014-01-01 14.00
I would like to have an output like this,
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 XYZ 1 2 3 NaN 2018-10-01 2017-08-01 2019-10-01 NaN 11.0 15.25 11.25 NaN
1 MMG 1 2 3 4 2019-08-01 2020-10-01 2021-10-01 2014-01-01 15.71 11.50 11.75 14.00
I tried the following,
df.vstack().reset_index().drop('level_1',axis=0)
This is not giving me the output I need.
As you can see, the per column has incremental values that can go into the column names when I transpose.
Any suggestion would be great.
Use GroupBy.cumcount for the counter, reshape with DataFrame.unstack, and finally flatten the column names with f-strings:
df = df.set_index(['param', df.groupby('param').cumcount().add(1)]).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 \
0 MMG 1.0 2.0 3.0 4.0 2019-08-01 2020-10-01 2021-10-01
1 XYZ 1.0 2.0 3.0 NaN 2018-10-01 2017-08-01 2019-10-01
per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 2014-01-01 15.71 11.50 11.75 14.0
1 NaN 11.00 15.25 11.25 NaN
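The whole recipe, runnable on a reduced copy of the sample frame (per_date omitted to keep the example short):

```python
import pandas as pd

# Reduced sample: two XYZ rows, three MMG rows
df = pd.DataFrame({
    "param": ["XYZ", "XYZ", "MMG", "MMG", "MMG"],
    "per": [1.0, 2.0, 1.0, 2.0, 3.0],
    "per_num": [11.0, 15.25, 15.71, 11.5, 11.75],
})

# Number the rows within each param, reshape, then flatten the MultiIndex
out = df.set_index(["param", df.groupby("param").cumcount().add(1)]).unstack()
out.columns = [f"{a}_{b}" for a, b in out.columns]
out = out.reset_index()
```

Groups shorter than the longest one (XYZ here) simply get NaN in the missing slots, exactly as in the output above.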

Replacing values in a string with NaN

I am facing a simple task, but I cannot solve it. There is a table in df:
Date X1 X2
02.03.2019 2 2
03.03.2019 1 1
04.03.2019 2 3
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
And for the rows where Date < 05.03.2019 I need to set X1=NaN and X2=NaN:
Date X1 X2
02.03.2019 NaN NaN
03.03.2019 NaN NaN
04.03.2019 NaN NaN
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
First convert the Date column to datetimes and then set the values with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
df.loc[df['Date'] < '2019-03-05', ['X1','X2']] = np.nan
print (df)
Date X1 X2
0 2019-03-02 NaN NaN
1 2019-03-03 NaN NaN
2 2019-03-04 NaN NaN
3 2019-03-05 1.0 12.0
4 2019-03-06 2.0 2.0
5 2019-03-07 3.0 3.0
6 2019-03-08 4.0 1.0
7 2019-03-09 1.0 2.0
If there is DatetimeIndex:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
#change datetime to 2019-03-04
df.loc[:'2019-03-04'] = np.nan
print (df)
X1 X2
Date
2019-03-02 NaN NaN
2019-03-03 NaN NaN
2019-03-04 NaN NaN
2019-03-05 1.0 12.0
2019-03-06 2.0 2.0
2019-03-07 3.0 3.0
2019-03-08 4.0 1.0
2019-03-09 1.0 2.0
Or:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
df.loc[df.index < '2019-03-05'] = np.nan
Don't use this solution; it's just another possible approach (-: (mask affects all columns, which is why the Date column is restored with combine_first; the comparison is also done on raw strings, which only works here because every date shares the same month and year)
df.mask(df.Date < '05.03.2019').combine_first(df[['Date']])
Date X1 X2
0 02.03.2019 NaN NaN
1 03.03.2019 NaN NaN
2 04.03.2019 NaN NaN
3 05.03.2019 1.0 12.0
4 06.03.2019 2.0 2.0
5 07.03.2019 3.0 3.0
6 08.03.2019 4.0 1.0
7 09.03.2019 1.0 2.0
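A self-contained sketch of the accepted approach; the X columns are created as floats here so that assigning NaN does not trigger a dtype change:

```python
import numpy as np
import pandas as pd

# Small copy of the table above
df = pd.DataFrame({
    "Date": ["02.03.2019", "03.03.2019", "04.03.2019", "05.03.2019"],
    "X1": [2.0, 1.0, 2.0, 1.0],
    "X2": [2.0, 1.0, 3.0, 12.0],
})

# Parse the day.month.year strings, then blank out the early rows in place
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df.loc[df["Date"] < "2019-03-05", ["X1", "X2"]] = np.nan
```

Once Date holds real datetimes, the '2019-03-05' string on the right-hand side is parsed by pandas, so the comparison is a genuine date comparison rather than a lexicographic one.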

How To create Multiple Columns From Values of The Same Column?

I have a DataFrame, and I want to create new columns based on the values of one of its columns; in each of these new columns the value should be the count of repetitions of Plate for that EURO class over time.
So I have this DataFrame:
Val_Tra.Head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
The value counts of the EURO column look like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
And this is my desired output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
So basically, what I want is the count of each time a Plate value repeats for a specific EURO type over a specific time.
Any suggestions would be much appreciated, thank you.
This is more like a get_dummies problem:
s=df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df=pd.concat([df,s],axis=1,sort=True)
df
Out[259]:
Plate EURO EURO0 EURO6
2013-11-0100:00:00 NaN NaN NaN NaN
2013-11-0101:00:00 dcc2f657e897ffef752003469c688381 0.0 1.0 0.0
2013-11-0102:00:00 a5ac0c2f48ea80707621e530780139ad 6.0 0.0 1.0
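A runnable sketch of the get_dummies approach on a reconstructed slice of the sample (the Plate hashes are truncated here for brevity):

```python
import numpy as np
import pandas as pd

# Truncated slice of the sample frame
df = pd.DataFrame({
    "Plate": [np.nan, "dcc2f657", "a5ac0c2f"],
    "EURO": [np.nan, 0.0, 6.0],
})

# One indicator column per EURO class; all-NaN rows are dropped first,
# so they come back as NaN in the indicator columns after the concat
s = df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix("EURO")
df = pd.concat([df, s], axis=1, sort=True)
```

Note that get_dummies only creates columns for classes actually present in the data, which is why the output above has EURO0 and EURO6 rather than the full EURO_1..EURO_7 set the asker sketched.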
