I have a dataframe like this:
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 3.0 2019-10-01 11.25
3 MMG 1.0 2019-08-01 15.71
4 MMG 2.0 2020-10-01 11.50
5 MMG 3.0 2021-10-01 11.75
6 MMG 4.0 2014-01-01 14.00
I would like to have an output like this:
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 XYZ 1 2 3 NaN 2018-10-01 2017-08-01 2019-10-01 NaN 11.0 15.25 11.25 NaN
1 MMG 1 2 3 4 2019-08-01 2020-10-01 2021-10-01 2014-01-01 15.71 11.50 11.75 14.00
I tried the following:
df.vstack().reset_index().drop('level_1',axis=0)
This is not giving me the output I need.
As you can see, the per column holds incremental values that can become part of the column names when I transpose.
Any suggestion would be great.
Use GroupBy.cumcount to create a counter, reshape with DataFrame.unstack, and finally flatten the column names with f-strings:
df = df.set_index(['param', df.groupby('param').cumcount().add(1)]).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 \
0 MMG 1.0 2.0 3.0 4.0 2019-08-01 2020-10-01 2021-10-01
1 XYZ 1.0 2.0 3.0 NaN 2018-10-01 2017-08-01 2019-10-01
per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 2014-01-01 15.71 11.50 11.75 14.0
1 NaN 11.00 15.25 11.25 NaN
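For a self-contained check, the sample input can be rebuilt like this (a minimal sketch, keeping the dates as plain strings) and fed through the snippet above:
import pandas as pd

df = pd.DataFrame({
    'param': ['XYZ', 'XYZ', 'XYZ', 'MMG', 'MMG', 'MMG', 'MMG'],
    'per': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0],
    'per_date': ['2018-10-01', '2017-08-01', '2019-10-01',
                 '2019-08-01', '2020-10-01', '2021-10-01', '2014-01-01'],
    'per_num': [11.0, 15.25, 11.25, 15.71, 11.50, 11.75, 14.00],
})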
I have df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 -23
2020-02-06 14
2020-02-09 23
2020-02-10 -2
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
From the above, I would like to replace the negative values in the t_factor column with NaN.
Expected output:
Date t_factor
2020-02-01 5
2020-02-03 NaN
2020-02-06 14
2020-02-09 23
2020-02-10 NaN
2020-02-11 23
2020-02-13 NaN
2020-02-20 29
You can use pandas' clip as well: it assigns values outside a boundary to the boundary value, and you can chain it with replace, as below. (One caveat: this maps everything below -1 to -1 before replacing, which is fine for integer data, but a float strictly between -1 and 0 would slip through.)
df['t_factor'] = df['t_factor'].clip(lower=-1).replace(-1, np.nan)
df
Output:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
Use Series.mask:
df['t_factor'] = df['t_factor'].mask(df['t_factor'].lt(0))
Or use boolean indexing and assign np.nan:
df.loc[df['t_factor'].lt(0), 't_factor'] = np.nan
Result:
Date t_factor
0 2020-02-01 5.0
1 2020-02-03 NaN
2 2020-02-06 14.0
3 2020-02-09 23.0
4 2020-02-10 NaN
5 2020-02-11 23.0
6 2020-02-13 NaN
7 2020-02-20 29.0
Use pd.Series.where: by default it replaces values where the condition is False with NaN. (Note that df.t_factor > 0 also nulls out zeros; use df.t_factor >= 0 if zeros should be kept.)
df["t_factor"] = df.t_factor.where(df.t_factor > 0)
Suppose there is a dataframe:
import pandas as pd
import numpy as np
users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
    ]
)
print(users[['id','date','balance_total','transaction_total']])
Dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 NaN NaN
3 1 01/04/2019 NaN NaN
4 1 01/05/2019 NaN -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 NaN NaN
8 2 01/05/2019 NaN -4.0
How can I do the following?
If both transaction_total and balance_total are NaN, fill in the previous date's balance_total (e.g. in user 1's 01/03/2019 row, where both transaction_total and balance_total are NaN, fill in the 100 from 01/02/2019; likewise the 01/04/2019 row takes the 100 from 01/03/2019).
If transaction_total is NOT NaN but balance_total is NaN, add the current row's transaction_total to the previous date's balance_total.
Taking user 1 on 01/05/2019 as an example: the balance total will be 100 + (-4) = 96, where 100 is 01/04/2019's balance total and (-4) is 01/05/2019's transaction total.
Desired output:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
Here is my code, but it doesn't work. I can't figure out how to express "when a row's value is null, do something" in pandas.
for i, row in df.iterrows():
    if(pd.isnull(row['transaction_total'] is True)):
        if(pd.isnull(row['balance_total'] is True)):
            df.loc[i,'transaction_total'] = df.loc[i-1,'transaction_total']
Could someone enlighten me?
IIUC, first create a dummy series with ffill, and then use np.where:
s = df["balance_total"].ffill()
df["balance_total"] = np.where(df["balance_total"].isnull()&df["transaction_total"].notnull(),
s.add(df["transaction_total"]), s)
print (df)
id date transaction_total balance_total
0 1 01/01/2019 -1.0 102.0
1 1 01/02/2019 -2.0 100.0
2 1 01/03/2019 NaN 100.0
3 1 01/04/2019 NaN 100.0
4 1 01/05/2019 -4.0 96.0
5 2 01/01/2019 -2.0 200.0
6 2 01/02/2019 -2.0 100.0
7 2 01/04/2019 NaN 100.0
8 2 01/05/2019 -4.0 96.0
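One caveat worth noting: the ffill above runs over the whole column, so it could carry a balance across id boundaries (it happens to be harmless for this sample). A group-aware variant of the same logic might look like this:
s = df.groupby('id')['balance_total'].ffill()  # forward fill within each id only
df['balance_total'] = np.where(
    df['balance_total'].isnull() & df['transaction_total'].notnull(),
    s + df['transaction_total'],
    s,
)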
I have the following dataframe df:
length timestamp width
name
testschip-1 NaN 2019-08-01 00:00:00 NaN
testschip-1 NaN 2019-08-01 00:00:09 NaN
testschip-1 2 2019-08-01 00:00:20 NaN
testschip-1 2 2019-08-01 00:00:27 NaN
testschip-1 NaN 2019-08-01 00:00:38 1
testschip-2 4 2019-08-01 00:00:39 2
testschip-2 4 2019-08-01 00:00:57 NaN
testschip-2 4 2019-08-01 00:00:58 NaN
testschip-2 NaN 2019-08-01 00:01:17 NaN
testschip-3 NaN 2019-08-01 00:02:27 NaN
testschip-3 NaN 2019-08-01 00:03:47 NaN
First, I want to remove the string "testschip-" from the index "name" so the indices become integers only. Second, per unique index I want to apply forward fill or backward fill (whichever is necessary to end up with no NaNs) on both the 'length' and 'width' columns. Each unique index has a single "length" and "width". On "testschip-3" I don't want to apply backward or forward fill at all. If I do a plain backward fill on "testschip-1" (which is needed to set its first two rows to '2'), I get an unwanted '4' in the last row of "testschip-1", pulled in from "testschip-2". I cannot judge beforehand whether backward or forward fill is needed, since I start with 4 million rows of data.
Use:
df.index = df.index.str.lstrip('testschip-').astype(int)
# note: str.lstrip strips any leading characters from the given set, not a literal
# prefix - it works for these names, but the alternatives below are safer in general
#df.index = df.index.str[10:].astype(int)
#df.index = df.index.str.split('-').str[-1].astype(int)
df.groupby(level=0).apply(lambda x: x.bfill().ffill())
Output:
length timestamp width
name
1 2.0 2019-08-01 00:00:00 1.0
1 2.0 2019-08-01 00:00:09 1.0
1 2.0 2019-08-01 00:00:20 1.0
1 2.0 2019-08-01 00:00:27 1.0
1 2.0 2019-08-01 00:00:38 1.0
2 4.0 2019-08-01 00:00:39 2.0
2 4.0 2019-08-01 00:00:57 2.0
2 4.0 2019-08-01 00:00:58 2.0
2 4.0 2019-08-01 00:01:17 2.0
3 NaN 2019-08-01 00:02:27 NaN
3 NaN 2019-08-01 00:03:47 NaN
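If you prefer not to push the whole frame through apply, a variant that fills only the two numeric columns and leaves timestamp untouched (same per-group logic, assuming the index has already been converted as above) is:
cols = ['length', 'width']
df[cols] = df.groupby(level=0)[cols].transform(lambda s: s.bfill().ffill())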
I have dozens of dataframes I would like to merge with a "reference" dataframe. I want to merge the columns when they exist in both dataframes, or conversely, create a new column when they don't already exist. I have a feeling this is closely related to this topic, but I cannot figure out how to make it work in my case.
Also, note that the key used for merging never contains duplicates.
# Reference dataframe
df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})
# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time': ['2018-06-01 00:30:00', '2018-06-01 01:00:00'],
                    'potato': [13, 21]})
df2 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00', '2018-06-01 02:30:00'],
                    'carrot': [14, 8, 32]})
df3 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00'],
                    'potato': [27, 31]})
df = df.merge(df1, how='left', on='date_time')
df = df.merge(df2, how='left', on='date_time')
df = df.merge(df3, how='left', on='date_time')
The result is :
date_time potato_x carrot potato_y
0 2018-06-01 00:00:00 NaN NaN NaN
1 2018-06-01 00:30:00 13.0 NaN NaN
2 2018-06-01 01:00:00 21.0 NaN NaN
3 2018-06-01 01:30:00 NaN 14.0 27.0
While I would like :
date_time potato carrot
0 2018-06-01 00:00:00 NaN NaN
1 2018-06-01 00:30:00 13.0 NaN
2 2018-06-01 01:00:00 21.0 NaN
3 2018-06-01 01:30:00 27.0 14.0
Edit (following #sammywemmy's answer):
I have no idea what the dataframes' column names will be before importing them (in a loop). Usually the dataframes merged with my reference dataframe contain about 100 columns, of which 90-95% are shared with the other dataframes.
I would pd.concat the similarly structured dataframes, then merge the others, like this:
df.merge(pd.concat([df1, df3]), on='date_time', how='left')\
.merge(df2, on='date_time', how='left')
Output:
date_time potato carrot
0 2018-06-01 00:00:00 NaN NaN
1 2018-06-01 00:30:00 13.0 NaN
2 2018-06-01 01:00:00 21.0 NaN
3 2018-06-01 01:30:00 27.0 14.0
Per comments below:
df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})
# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time': ['2018-06-01 00:30:00', '2018-06-01 01:00:00'],
                    'potato': [13, 21]})
df2 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00', '2018-06-01 02:30:00'],
                    'carrot': [14, 8, 32]})
df3 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00'],
                    'potato': [27, 31], 'zucchini': [11, 1]})
df.merge(pd.concat([df1, df3]), on='date_time', how='left').merge(df2, on='date_time', how='left')
Output:
date_time potato zucchini carrot
0 2018-06-01 00:00:00 NaN NaN NaN
1 2018-06-01 00:30:00 13.0 NaN NaN
2 2018-06-01 01:00:00 21.0 NaN NaN
3 2018-06-01 01:30:00 27.0 11.0 14.0
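Since the real case involves dozens of dataframes whose column names are unknown up front, one generic pattern is to concatenate everything on the key and keep the first non-null value per column (a sketch; dfs stands for whatever list you build while importing, and it assumes date_time never repeats within a single dataframe):
dfs = [df1, df2, df3]  # hypothetical: the imported dataframes

merged = (pd.concat([d.set_index('date_time') for d in dfs])
            .groupby(level=0).first()  # first non-null value per key and column
            .reset_index())
out = df.merge(merged, on='date_time', how='left')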
Continuing from your code, use the filter method to pull out the potato-related columns, sum them along the columns axis, and then drop the suffixed potato_... columns:
df['potato'] = df.filter(like='potato').sum(axis=1, min_count=1)  # min_count=1 keeps all-NaN rows as NaN
exclude_columns = df.columns.str.contains('potato_[a-z]')
df = df.loc[:,~exclude_columns]
date_time carrot potato
0 2018-06-01 00:00:00 NaN NaN
1 2018-06-01 00:30:00 NaN 13.0
2 2018-06-01 01:00:00 NaN 21.0
3 2018-06-01 01:30:00 14.0 27.0
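If the overlapping names aren't known in advance, the same cleanup generalizes over every _x/_y pair the merges produce (a sketch assuming pandas' default merge suffixes, to be run on the merged frame before any manual cleanup):
import re

bases = {re.sub(r'_[xy]$', '', c) for c in df.columns if re.search(r'_[xy]$', c)}
for base in bases:
    dupes = df.filter(regex=rf'^{re.escape(base)}_[xy]$').columns
    df[base] = df[dupes].sum(axis=1, min_count=1)  # all-NaN rows stay NaN
    df = df.drop(columns=dupes)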
My df looks like this:
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output to look like this:
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
If you see param column has values that are repeating and transposed column names are created from these values. Also, a new records gets created as soon as param values starts with 1. How can I achieve this?
The main problem here is the NaNs in the last LKG group. First, replace the missing values with a per-group counter (a cumulative sum of the NaN flags within each param group) and assign the result to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print (df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then build the MultiIndex: a record-group id comes from comparing per1 to 1 and taking a cumulative sum; then reshape with unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1',g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print (df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
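For completeness, the input for this last example can be rebuilt like this (a minimal sketch, dates as plain strings), after which the two steps above reproduce the printed result:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'param': ['XYZ'] * 6 + ['MMG'] * 4 + ['LKG'] * 2,
    'per': [1.0, 2.0, 1.0, 2.0, 3.0, np.nan, 1.0, 2.0, 3.0, 1.0, np.nan, np.nan],
    'per_date': ['2018-10-01', '2017-08-01', '2019-10-01', '2019-08-01',
                 '2020-10-01', np.nan, '2021-10-01', '2014-01-01',
                 '2021-10-01', '2014-01-01', np.nan, np.nan],
    'per_num': [11.0, 15.25, 11.25, 15.71, 11.50, np.nan,
                11.75, 14.00, 12.50, 15.00, np.nan, np.nan],
})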