Avoid duplicate columns while merging with pandas - python-3.x

I have dozens of dataframes I would like to merge with a "reference" dataframe. I want to merge the columns when they exist in both dataframes, or conversely, create a new column when they don't already exist. I have the feeling that this is closely related to this topic, but I cannot figure out how to make it work in my case.
Also, note that the key used for merging never contains duplicates.
import pandas as pd

# Reference dataframe
df = pd.DataFrame({'date_time': ['2018-06-01 00:00:00', '2018-06-01 00:30:00',
                                 '2018-06-01 01:00:00', '2018-06-01 01:30:00']})
# Dataframes to merge into the reference dataframe
df1 = pd.DataFrame({'date_time': ['2018-06-01 00:30:00', '2018-06-01 01:00:00'],
                    'potato': [13, 21]})
df2 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00',
                                  '2018-06-01 02:30:00'],
                    'carrot': [14, 8, 32]})
df3 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00'],
                    'potato': [27, 31]})

df = df.merge(df1, how='left', on='date_time')
df = df.merge(df2, how='left', on='date_time')
df = df.merge(df3, how='left', on='date_time')
The result is:
date_time potato_x carrot potato_y
0 2018-06-01 00:00:00 NaN NaN NaN
1 2018-06-01 00:30:00 13.0 NaN NaN
2 2018-06-01 01:00:00 21.0 NaN NaN
3 2018-06-01 01:30:00 NaN 14.0 27.0
While I would like:
date_time potato carrot
0 2018-06-01 00:00:00 NaN NaN
1 2018-06-01 00:30:00 13.0 NaN
2 2018-06-01 01:00:00 21.0 NaN
3 2018-06-01 01:30:00 27.0 14.0
Edit (following @sammywemmy's answer):
I have no idea what the dataframe column names will be before importing them (in a loop). Usually, the dataframes merged with my reference dataframe contain about 100 columns, of which 90-95% are common with the other dataframes.

I would pd.concat similarly structured dataframes, then merge the others, like this:
df.merge(pd.concat([df1, df3]), on='date_time', how='left')\
.merge(df2, on='date_time', how='left')
Output:
date_time potato carrot
0 2018-06-01 00:00:00 NaN NaN
1 2018-06-01 00:30:00 13.0 NaN
2 2018-06-01 01:00:00 21.0 NaN
3 2018-06-01 01:30:00 27.0 14.0
Per comments below:
df = pd.DataFrame({'date_time': ['2018-06-01 00:00:00', '2018-06-01 00:30:00',
                                 '2018-06-01 01:00:00', '2018-06-01 01:30:00']})
# Dataframes to merge into the reference dataframe
df1 = pd.DataFrame({'date_time': ['2018-06-01 00:30:00', '2018-06-01 01:00:00'],
                    'potato': [13, 21]})
df2 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00',
                                  '2018-06-01 02:30:00'],
                    'carrot': [14, 8, 32]})
df3 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00'],
                    'potato': [27, 31],
                    'zucchini': [11, 1]})

df.merge(pd.concat([df1, df3]), on='date_time', how='left').merge(df2, on='date_time', how='left')
Output:
date_time potato zucchini carrot
0 2018-06-01 00:00:00 NaN NaN NaN
1 2018-06-01 00:30:00 13.0 NaN NaN
2 2018-06-01 01:00:00 21.0 NaN NaN
3 2018-06-01 01:30:00 27.0 11.0 14.0

Continuing from your code, use the filter method to pull out the potato-related columns, sum them along the columns axis, and then drop the suffixed potato_... columns:
df['potato'] = df.filter(like='potato').fillna(0).sum(axis=1)  # coalesce potato_x / potato_y
exclude_columns = df.columns.str.contains('potato_[a-z]')      # matches the suffixed duplicates
df = df.loc[:, ~exclude_columns]
date_time carrot potato
0 2018-06-01 00:00:00 NaN 0.0
1 2018-06-01 00:30:00 NaN 13.0
2 2018-06-01 01:00:00 NaN 21.0
3 2018-06-01 01:30:00 14.0 27.0
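Note that fillna(0).sum(axis=1) turns the all-NaN first row into 0.0; df.filter(like='potato').sum(axis=1, min_count=1) would keep it as NaN, matching the desired output.
For the looped case from the edit, where the column names are unknown in advance, here is a minimal sketch (starting again from the reference df; '_dup' is an arbitrary marker suffix chosen for illustration, not a pandas convention): merge each incoming frame with an empty left suffix, then coalesce any column that comes back suffixed.
# Sketch: merge frames with unknown columns one by one, coalescing duplicates
for other in [df1, df2, df3]:  # in practice, your loop over imported frames
    df = df.merge(other, how='left', on='date_time', suffixes=('', '_dup'))
    dup_cols = [c for c in df.columns if c.endswith('_dup')]
    for c in dup_cols:
        base = c[:-len('_dup')]
        df[base] = df[base].fillna(df[c])  # keep existing values, fill gaps from the new frame
    df = df.drop(columns=dup_cols)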

Related

Keep column when resampling hourly to daily data in pandas

I have a dataset of hourly weather observations in this format:
import pandas as pd

df = pd.DataFrame({'date': ['2019-01-01 09:30:00', '2019-01-01 10:00',
                            '2019-01-02 04:30:00', '2019-01-02 05:00:00',
                            '2019-07-04 02:00:00'],
                   'windSpeedHigh': [155, 90, 35, 45, 15],
                   'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S']})
My goal is to find the highest wind speed each day and the wind direction associated with that maximum daily wind speed.
Using resample, I have successfully found the maximum wind speed for each day, but not its associated direction:
df['date'] = pd.to_datetime(df['date'])
df['windSpeedHigh'] = pd.to_numeric(df['windSpeedHigh'])
df_daily = df.resample('D', on='date')[['windSpeedHigh_Dir','windSpeedHigh']].max()
df_daily
Results in:
windSpeedHigh_Dir windSpeedHigh
date
2019-01-01 NNW 155.0
2019-01-02 W 45.0
2019-01-03 NaN NaN
2019-01-04 NaN NaN
2019-01-05 NaN NaN
... ... ...
2019-06-30 NaN NaN
2019-07-01 NaN NaN
2019-07-02 NaN NaN
2019-07-03 NaN NaN
2019-07-04 S 15.0
This is incorrect, as the resample also grabs the max() of 'windSpeedHigh_Dir' independently. For 2019-01-01 the direction associated with the maximum wind speed should be 'NE', not 'NNW', because the wind direction was 'NE' when the maximum wind speed occurred.
So my question is, is it possible for me to resample this dataset from half-hourly to daily maximum wind speed while keeping the wind direction associated with that speed?
Use DataFrameGroupBy.idxmax to get the index of each day's maximum first:
df_daily = df.loc[df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()]
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
0 2019-01-01 09:30:00 155 NE
3 2019-01-02 05:00:00 45 W
4 2019-07-04 02:00:00 15 S
Then, to add the daily DatetimeIndex, use DataFrame.set_index with Series.dt.normalize and DataFrame.asfreq:
df_daily = df_daily.set_index(df_daily['date'].dt.normalize().rename('day')).asfreq('d')
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
day
2019-01-01 2019-01-01 09:30:00 155.0 NE
2019-01-02 2019-01-02 05:00:00 45.0 W
2019-01-03 NaT NaN NaN
2019-01-04 NaT NaN NaN
2019-01-05 NaT NaN NaN
... ... ...
2019-06-30 NaT NaN NaN
2019-07-01 NaT NaN NaN
2019-07-02 NaT NaN NaN
2019-07-03 NaT NaN NaN
2019-07-04 2019-07-04 02:00:00 15.0 S
[185 rows x 3 columns]
Your resample-based solution should work with a custom function, because idxmax fails for empty bins; aggregate the indices and join the rows back with DataFrame.join:
import numpy as np

f = lambda x: x.idxmax() if len(x) > 0 else np.nan
df_daily = df.resample('D', on='date')['windSpeedHigh'].agg(f).to_frame('idx').join(df, on='idx')
print (df_daily)
idx date windSpeedHigh windSpeedHigh_Dir
date
2019-01-01 0.0 2019-01-01 09:30:00 155.0 NE
2019-01-02 3.0 2019-01-02 05:00:00 45.0 W
2019-01-03 NaN NaT NaN NaN
2019-01-04 NaN NaT NaN NaN
2019-01-05 NaN NaT NaN NaN
... ... ... ...
2019-06-30 NaN NaT NaN NaN
2019-07-01 NaN NaT NaN NaN
2019-07-02 NaN NaT NaN NaN
2019-07-03 NaN NaT NaN NaN
2019-07-04 4.0 2019-07-04 02:00:00 15.0 S
[185 rows x 4 columns]
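A compact alternative sketch (assuming the same df as above, with date already converted by pd.to_datetime): sort by speed, keep the largest row per day, and restore the daily grid with asfreq:
# Sketch: per-day argmax via sort + drop_duplicates instead of groupby.idxmax
daily = (df.assign(day=df['date'].dt.normalize())
           .sort_values('windSpeedHigh')
           .drop_duplicates('day', keep='last')  # keep the fastest row per day
           .set_index('day')
           .sort_index()
           .asfreq('D'))                         # reinsert missing days as NaN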

Outer merge in pandas with more than two data frames [duplicate]

This question already has answers here:
How to merge multiple dataframes
(13 answers)
Closed 1 year ago.
I have 3 dfs as shown below:
df1:
ID March_Number March_Amount
A 10 200
B 4 300
C 2 100
df2:
ID Feb_Number Feb_Amount
A 1 100
B 8 500
E 4 400
F 8 100
H 4 200
df3:
ID Jan_Number Jan_Amount
A 6 800
H 3 500
B 1 50
G 8 100
I tried the code below and it worked well:
df_outer = pd.merge(df1, df2, on='ID', how='outer')
df_outer = pd.merge(df_outer , df3, on='ID', how='outer')
But I would like to pass all the dfs together and merge them in one shot. I tried the code below, with the error shown:
df_outer = pd.merge(df1, df2, df3, on='ID', how='outer')
Please guide me on how to merge 12 months of data, i.e. 12 dfs.
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-32-a63627da7233> in <module>
----> 1 df_outer = pd.merge(df1, df2, df3, on='ID', how='outer')
TypeError: merge() got multiple values for argument 'how'
Expected output:
ID March_Number March_Amount Feb_Number Feb_Amount Jan_Number Jan_Amount
A 10.0 200.0 1.0 100.0 6.0 800.0
B 4.0 300.0 8.0 500.0 1.0 50.0
C 2.0 100.0 NaN NaN NaN NaN
E NaN NaN 4.0 400.0 NaN NaN
F NaN NaN 8.0 100.0 NaN NaN
H NaN NaN 4.0 200.0 3.0 500.0
G NaN NaN NaN NaN 8.0 100.0
We can create a list of the dfs we want to merge, here dfl, and then merge them together with reduce.
We can add as many dfs as we want: dfl = [df1, df2, df3, ..., dfn]
from functools import reduce

dfl = [df1, df2, df3]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['ID'], how='outer'), dfl)
Output
ID March_Number March_Amount Feb_Number Feb_Amount Jan_Number Jan_Amount
0 A 10.0 200.0 1.0 100.0 6.0 800.0
1 B 4.0 300.0 8.0 500.0 1.0 50.0
2 C 2.0 100.0 NaN NaN NaN NaN
3 E NaN NaN 4.0 400.0 NaN NaN
4 F NaN NaN 8.0 100.0 NaN NaN
5 H NaN NaN 4.0 200.0 3.0 500.0
6 G NaN NaN NaN NaN 8.0 100.0
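An equivalent sketch without functools.reduce: index each frame by ID and let a single outer concat along the columns axis line them all up (the row order may differ from the merge version):
# Sketch: one outer concat instead of repeated merges
dfl = [df1, df2, df3]  # extend with the remaining months as needed
df_merged = pd.concat([d.set_index('ID') for d in dfl], axis=1).reset_index()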

Pandas set_index creates NoneType object without inplace=True

I'm facing weird behavior with the pandas set_index function. I initially have this dataframe:
Unnamed: 0 Timestamps PM10
0 NaN NaT PM10
1 NaN NaT µg/m³
2 NaN 2018-12-31 23:00:00 10.76
3 NaN 2018-12-31 22:00:00 9.46
4 NaN 2018-12-31 21:00:00 8.67
... ... ... ...
8682 NaN 2018-01-01 04:00:00 25.14
8683 NaN 2018-01-01 03:00:00 31.34
8684 NaN 2018-01-01 02:00:00 36.28
8685 NaN 2018-01-01 01:00:00 21.78
8686 NaN 2018-01-01 00:00:00 20.59
I want to drop the first two rows and set Timestamps as the index, so I do this:
df_final = df.drop([0,1]).set_index('Timestamps', drop=True)
and I get this dataframe:
Unnamed: 0 PM10
Timestamps
2018-12-31 23:00:00 NaN 10.76
2018-12-31 22:00:00 NaN 9.46
2018-12-31 21:00:00 NaN 8.67
2018-12-31 20:00:00 NaN 10.42
2018-12-31 19:00:00 NaN 10.04
... ... ...
2018-01-01 04:00:00 NaN 25.14
2018-01-01 03:00:00 NaN 31.34
2018-01-01 02:00:00 NaN 36.28
2018-01-01 01:00:00 NaN 21.78
2018-01-01 00:00:00 NaN 20.59
So far so good, but finally I want to re-index the PM10 column by a new time index I have created called t_index, so I do this:
data_write = df_final.PM10[-1::-1].reindex(t_index)
That is where I get an error:
TypeError: 'NoneType' object is not iterable
After some debugging I have concluded that set_index is causing this, but I can't figure out why. Any help is appreciated!
After some trial and error I managed to make this work and here is the code that does it:
df = df.drop([0,1]).drop("Unnamed: 0", axis=1).set_index('Timestamps', drop=True)
df = df.sort_values(by="Timestamps", ascending=True)
year = 2018
start_index = '{}-01-01 00:00:00'.format(year) # define start of the year
end_index = '{}-12-31 23:00:00'.format(year) # define end of the year
t_index = pd.DatetimeIndex(start=start_index, end=end_index, freq='1h').strftime("%Y-%m-%d %H:%M:%S")
df_final = pd.to_numeric(df.PM10).resample('H').mean().reindex(t_index)
Still not sure what was causing the error, or why the .asfreq method did not work.
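For reference, on pandas >= 1.0 the DatetimeIndex(start=..., end=...) constructor no longer exists; pd.date_range is its replacement. A sketch of the same final step on current pandas, assuming the Timestamps index already holds datetimes:
# Sketch: build the hourly grid with pd.date_range instead of DatetimeIndex
t_index = pd.date_range(start=start_index, end=end_index, freq='h')
df_final = pd.to_numeric(df['PM10']).resample('h').mean().reindex(t_index)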

Transpose DF columns based on column values - Pandas

My df looks like this:
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, the param column has repeating values, and the transposed column names are created from the per values. Also, a new record gets created each time per starts over at 1. How can I achieve this?
The main problem here is the NaNs in the last LKG group: first replace the missing values with a counter built per group from the cumulative sum of the NaN mask, and assign it to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print (df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then create a MultiIndex: compare per1 to 1 and take the cumulative sum to label each block of records, then reshape with unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1',g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print (df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN

How To create Multiple Columns From Values of The Same Column?

I have a DataFrame, and I want to create new columns based on the values of one of its columns; in each of these new columns the values should be the count of repetitions of Plate over time.
So I have this DataFrame:
Val_Tra.head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
And the EURO column's value counts look like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
And this is my desired output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
So basically, what I want is, for each timestamp, a count of how often each plate appears under a specific EURO type.
Any suggestions would be much appreciated, thank you.
This is more like a get_dummies problem:
s = df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df = pd.concat([df, s], axis=1, sort=True)
df
Out[259]:
Plate EURO EURO0 EURO6
2013-11-01 00:00:00 NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0 1.0 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0 0.0 1.0
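If the end goal from the question is counts per plate and EURO class rather than row-level 0/1 indicators, a crosstab sketch over the same two columns may be closer (crosstab drops NaN pairs by default):
# Sketch: count how often each plate appears under each EURO class
counts = (pd.crosstab(df['Plate'], df['EURO'])
            .rename(columns=lambda c: f'EURO_{int(c)}'))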
