How to fill data into a column after merging dataframes with pandas merge? - python-3.x

I am using Python 3 and I have three dataframes:
df1:
  PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B
  P1      NaN              NaN
  P2      NaN              NaN
  P3      NaN              NaN
df2:
  PEOPLE  AMOUNT
  P1      1.0
  P2      1.0
df3:
  PEOPLE  AMOUNT
  P2      1.0
  P3      4.0
df_1 = pd.merge(df1, df2, on='PEOPLE', how='outer')  # (Step 1)
df_1 = pd.merge(df_1, df3, on='PEOPLE', how='outer')  # (Step 2)
df_1 = df_1.loc[:, ~df_1.columns.str.contains('^Unnamed')]
Output (actual):
  PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B  AMOUNT_X  AMOUNT_Y
  P1      NaN              NaN              1.0       NaN
  P2      NaN              NaN              1.0       1.0
  P3      NaN              NaN              NaN       4.0
Question
How can I fill the data from (Step 1) into column AMOUNT_custom_A and the data from (Step 2) into column AMOUNT_custom_B?
Output (expected):
  PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B
  P1      1.0              NaN
  P2      1.0              1.0
  P3      NaN              4.0
Thank you!

Use Series.fillna with DataFrame.pop:
df['AMOUNT_custom_A'] = df['AMOUNT_custom_A'].fillna(df.pop('AMOUNT_X'))
df['AMOUNT_custom_B'] = df['AMOUNT_custom_B'].fillna(df.pop('AMOUNT_Y'))
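For reference, a minimal runnable sketch with the sample data from the question. Note that after the two outer merges the clashing columns get pandas' default _x/_y suffixes, so they are popped under those names here:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'PEOPLE': ['P1', 'P2', 'P3'],
                    'AMOUNT_custom_A': np.nan, 'AMOUNT_custom_B': np.nan})
df2 = pd.DataFrame({'PEOPLE': ['P1', 'P2'], 'AMOUNT': [1.0, 1.0]})
df3 = pd.DataFrame({'PEOPLE': ['P2', 'P3'], 'AMOUNT': [1.0, 4.0]})

# the first merge keeps df2's column as AMOUNT (no clash yet); the second
# merge renames the clashing pair to AMOUNT_x / AMOUNT_y
df = df1.merge(df2, on='PEOPLE', how='outer').merge(df3, on='PEOPLE', how='outer')

# fill the empty custom columns and remove the merged ones in one step
df['AMOUNT_custom_A'] = df['AMOUNT_custom_A'].fillna(df.pop('AMOUNT_x'))
df['AMOUNT_custom_B'] = df['AMOUNT_custom_B'].fillna(df.pop('AMOUNT_y'))
print(df)
```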
If the columns AMOUNT_custom_A and AMOUNT_custom_B are always missing, first select only the PEOPLE column from df1 and rename the columns inside the merge:
df_1 = pd.merge(df1[['PEOPLE']], df2.rename(columns={'AMOUNT': 'AMOUNT_custom_A'}), on='PEOPLE', how='outer')  # (Step 1)
df_1 = pd.merge(df_1, df3.rename(columns={'AMOUNT': 'AMOUNT_custom_B'}), on='PEOPLE', how='outer')  # (Step 2)
df_1 = df_1.loc[:, ~df_1.columns.str.contains('^Unnamed')]

Related

merge dataframes with the same column names

Hi, I have a dataframe that looks like this:
  Unnamed: 0   X1  Unnamed: 1   X2  Unnamed: 1   X3  Unnamed: 2   X4
  1970-01-31  5.0  1970-01-31  1.0  1970-01-31  1.0  1980-01-30  1.0
  1970-02-26  6.0  1970-02-26  3.0  1970-02-26  3.0  1980-02-26  3.0
I have many columns (631) that look like that.
I would like to have:
  date        X1   X2   X3   X4
  1970-01-31  5.0  1.0  1.0  na
  1970-02-26  6.0  3.0  3.0  na
  1980-01-30  na   na   na   1.0
  1980-02-26  na   na   na   3.0
I tried:
res_df = pd.concat(
    df2[[date, X]].rename(columns={date: "date"})
    for date, X in zip(df2.columns[::2], df2.columns[1::2])
).pivot_table(index="date")
It works for small data but does not work for mine, maybe because I have duplicated column names ('Unnamed: 1') in my df.
I have a message error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Create an index from the date variable and use axis=1 in concat:
res_df = (pd.concat((df2[[date, X]].set_index(date)
                     for date, X in zip(df2.columns[::2], df2.columns[1::2])),
                    axis=1)
            .rename_axis('date')
            .reset_index())
print (res_df)
date X1 X2 X3 X4
0 1970-01-31 5.0 1.0 1.0 NaN
1 1970-02-26 6.0 3.0 3.0 NaN
2 1980-01-30 NaN NaN NaN 1.0
3 1980-02-26 NaN NaN NaN 3.0
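The solution can be checked end to end with the sample data. The construction below is an assumption for illustration and uses already-deduplicated headers (Unnamed: 3 instead of a second Unnamed: 1); real data with duplicated headers needs the deduplication described in the EDIT:

```python
import pandas as pd

df2 = pd.DataFrame({
    'Unnamed: 0': ['1970-01-31', '1970-02-26'], 'X1': [5.0, 6.0],
    'Unnamed: 1': ['1970-01-31', '1970-02-26'], 'X2': [1.0, 3.0],
    'Unnamed: 2': ['1970-01-31', '1970-02-26'], 'X3': [1.0, 3.0],
    'Unnamed: 3': ['1980-01-30', '1980-02-26'], 'X4': [1.0, 3.0],
})

# each (date, value) pair becomes a one-column frame indexed by its own
# dates; concat with axis=1 outer-joins them all on the date index
res_df = (pd.concat((df2[[date, X]].set_index(date)
                     for date, X in zip(df2.columns[::2], df2.columns[1::2])),
                    axis=1)
            .rename_axis('date')
            .reset_index())
print(res_df)
```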
EDIT: The error is caused by duplicated column names in your DataFrame; a possible solution is to deduplicate them before applying the solution above:
df = pd.DataFrame(columns=['a','a','b'], index=[0])
#you can test if duplicated columns names
print (df.columns[df.columns.duplicated(keep=False)])
Index(['a', 'a'], dtype='object')
#https://stackoverflow.com/a/43792894/2901002
df.columns = pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
print (df.columns)
Index(['a', 'a.1', 'b'], dtype='object')
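Note that ParserBase._maybe_dedup_names is a private pandas API and may no longer exist in recent releases; a small hand-rolled sketch of the same renaming:

```python
import pandas as pd

def dedup_columns(cols):
    # appends .1, .2, ... to repeated names, like pandas' CSV reader does
    seen = {}
    out = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            out.append(f'{c}.{seen[c]}')
        else:
            seen[c] = 0
            out.append(c)
    return out

df = pd.DataFrame(columns=['a', 'a', 'b'], index=[0])
df.columns = dedup_columns(df.columns)
print(df.columns)  # Index(['a', 'a.1', 'b'], dtype='object')
```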

Create a new dataframe from specific columns

I have a dataframe and I want to use columns to create new rows in a new dataframe.
>>> df_1
mix_id ngs phr d mp1 mp2 mp1_wt mp2_wt mp1_phr mp2_phr
2 M01 SBR2353 100.0 NaN MES/HPD SBR2353 0.253731 0.746269 25.373134 74.626866
3 M02 SBR2054 80.0 NaN TDAE SBR2054 0.264706 0.735294 21.176471 58.823529
I would like to have a dataframe like this.
>>> df_2
mix_id ngs phr d
1 M01 MES/HPD 25.373134 NaN
2 M01 SBR2353 74.626866 NaN
3 M02 TDAE 21.176471 NaN
4 M02 SBR2054 58.823529 NaN
IIUC, you can use pd.wide_to_long. It does, however, need the repeating columns to have numbers as a suffix, so the first part of the solution just renames the columns to move the number to the end:
import re

df.columns = (list(df.columns[:6])
              + [re.sub(r'\d', '', col) + re.search(r'\d', col).group(0)
                 for col in df.columns[6:]])
# this renames mp1_wt to mp_wt1, to support pd.wide_to_long
df2 = (pd.wide_to_long(df, stubnames=['mp', 'mp_wt', 'mp_phr'],
                       i=['mix_id', 'ngs', 'd'], j='val')
         .reset_index().drop(columns='val'))
df2.drop(columns=['ngs', 'phr', 'mp_wt'], inplace=True)
df2.rename(columns={'mp': 'ngs', 'mp_phr': 'phr'}, inplace=True)
df2
mix_id d ngs phr
0 M01 NaN MES/HPD 25.373134
1 M01 NaN SBR2353 74.626866
2 M02 NaN TDAE 21.176471
3 M02 NaN SBR2054 58.823529
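A runnable version of the same steps, with the two sample rows copied from the question (the constructed frame is an assumption for illustration):

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mix_id': ['M01', 'M02'], 'ngs': ['SBR2353', 'SBR2054'],
    'phr': [100.0, 80.0], 'd': [np.nan, np.nan],
    'mp1': ['MES/HPD', 'TDAE'], 'mp2': ['SBR2353', 'SBR2054'],
    'mp1_wt': [0.253731, 0.264706], 'mp2_wt': [0.746269, 0.735294],
    'mp1_phr': [25.373134, 21.176471], 'mp2_phr': [74.626866, 58.823529],
})

# move the digit to the end (mp1_wt -> mp_wt1) so wide_to_long can match it
df.columns = (list(df.columns[:6])
              + [re.sub(r'\d', '', col) + re.search(r'\d', col).group(0)
                 for col in df.columns[6:]])
df2 = (pd.wide_to_long(df, stubnames=['mp', 'mp_wt', 'mp_phr'],
                       i=['mix_id', 'ngs', 'd'], j='val')
         .reset_index().drop(columns='val'))
df2.drop(columns=['ngs', 'phr', 'mp_wt'], inplace=True)
df2.rename(columns={'mp': 'ngs', 'mp_phr': 'phr'}, inplace=True)
print(df2)
```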

calculate difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a counter per col 1 group and reshape by DataFrame.set_index with Series.unstack, then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days per column, and change the column names with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
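A self-contained sketch of this first approach, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'col 1': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'col 2': ['2020-07-13', '2020-07-15', '2020-07-18', '2020-07-19',
                             '2020-07-13', '2020-07-19', '2020-07-13', '2020-07-18']})
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()            # one datetime column per visit number
        .diff(axis=1)         # timedelta between consecutive dates
        .iloc[:, 1:]          # drop the first, all-NaT column
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print(df)
```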
Or use DataFrameGroupBy.diff with a counter for the new columns via DataFrame.assign, reshape with DataFrame.pivot, and remove the NaNs in c1 with DataFrame.dropna:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g=df.groupby('col 1').cumcount(),
                c1=df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first (NaN) difference
df2 = df[["col 1", "diff", "cumcount"]]\
    .pivot(index="col 1", columns="cumcount")\
    .drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Summing up two columns of pandas dataframe ignoring NaN

I have a pandas dataframe as below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A"], 'col1':[np.nan, np.nan], 'col2':[np.nan, 5]})
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as the sum of col1 and col2, ignoring NaN when only one of the columns is NaN.
If both columns are NaN, it should return NaN, as below.
I tried the code below and it works fine. Is there any way to achieve the same with just one line of code?
df['new'] = df[['col1', 'col2']].sum(axis = 1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0
Use sum with min_count=1:
df['new'] = df[['col1','col2']].sum(axis=1,min_count=1)
Out[78]:
0 NaN
1 5.0
dtype: float64
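min_count=1 makes sum return NaN unless at least one non-NaN value is present in the row; a quick runnable check with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ORDER': ['A', 'A'],
                   'col1': [np.nan, np.nan],
                   'col2': [np.nan, 5]})
# row 0 has no non-NaN values, so it stays NaN; row 1 sums to 5.0
df['new'] = df[['col1', 'col2']].sum(axis=1, min_count=1)
print(df)
```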
Use the add function on the two columns, which takes a fill_value argument that lets you replace NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64
Is this ok?
df['new'] = df[['col1', 'col2']].sum(axis = 1).replace(0,np.nan)

Copying columns that have NaN values in them and adding a prefix

I have x number of columns that contain NaN values.
With the following code I can check that:
for index, value in df.iteritems():
    if value.isnull().values.any():
This shows me, with Boolean values, which columns have NaN.
If True, I need to create a new column that has the prefix 'Interpolation' plus the name of that column in its name.
So to make it clear: if the column with the name 'XXX' has NaN, I need to create a new column with the name 'Interpolation XXX'.
Any ideas how to do this?
Something like this:
In [80]: df = pd.DataFrame({'XXX':[1,2,np.nan,4], 'YYY':[1,2,3,4], 'ZZZ':[1,np.nan, np.nan, 4]})
In [81]: df
Out[81]:
XXX YYY ZZZ
0 1.0 1 1.0
1 2.0 2 NaN
2 NaN 3 NaN
3 4.0 4 4.0
In [92]: nan_cols = df.columns[df.isna().any()].tolist()
In [94]: for col in df.columns:
    ...:     if col in nan_cols:
    ...:         df['Interpolation ' + col] = df[col]
    ...:
In [95]: df
Out[95]:
XXX YYY ZZZ Interpolation XXX Interpolation ZZZ
0 1.0 1 1.0 1.0 1.0
1 2.0 2 NaN 2.0 NaN
2 NaN 3 NaN NaN NaN
3 4.0 4 4.0 4.0 4.0
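The intermediate list and the membership test can be collapsed by iterating directly over the NaN-containing columns; a compact equivalent sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'XXX': [1, 2, np.nan, 4],
                   'YYY': [1, 2, 3, 4],
                   'ZZZ': [1, np.nan, np.nan, 4]})

# df.isna().any() gives a boolean mask per column; indexing df.columns
# with it keeps only the columns that contain at least one NaN
for col in df.columns[df.isna().any()]:
    df['Interpolation ' + col] = df[col]

print(df.columns.tolist())
# ['XXX', 'YYY', 'ZZZ', 'Interpolation XXX', 'Interpolation ZZZ']
```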
