How to fill data into a column after merging dataframes with pandas merge? - python-3.x

I am using Python 3 and I have three dataframes:
df1:
  PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B
  P1      NaN              NaN
  P2      NaN              NaN
  P3      NaN              NaN
df2:
  PEOPLE  AMOUNT
  P1      1.0
  P2      1.0
df3:
  PEOPLE  AMOUNT
  P2      1.0
  P3      4.0
df_1 = pd.merge(df1, df2, on='PEOPLE', how='outer')  # (Step 1)
df_1 = pd.merge(df_1, df3, on='PEOPLE', how='outer')  # (Step 2)
df_1 = df_1.loc[:, ~df_1.columns.str.contains('^Unnamed')]
Output (actual):
  PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B  AMOUNT_X  AMOUNT_Y
  P1      NaN              NaN              1.0       NaN
  P2      NaN              NaN              1.0       1.0
  P3      NaN              NaN              NaN       4.0
Question
How can I fill the data from (Step 1) into column AMOUNT_custom_A and the data from (Step 2) into column AMOUNT_custom_B?
Output (expected):
  PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B
  P1      1.0              NaN
  P2      1.0              1.0
  P3      NaN              4.0
Thank you!

Use Series.fillna with DataFrame.pop:
df['AMOUNT_custom_A'] = df['AMOUNT_custom_A'].fillna(df.pop('AMOUNT_X'))
df['AMOUNT_custom_B'] = df['AMOUNT_custom_B'].fillna(df.pop('AMOUNT_Y'))
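For reference, a minimal runnable sketch with the sample data from the question. Note that after the two outer merges the clashing columns get pandas' default _x/_y suffixes, so they are popped under those names here:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'PEOPLE': ['P1', 'P2', 'P3'],
                    'AMOUNT_custom_A': np.nan, 'AMOUNT_custom_B': np.nan})
df2 = pd.DataFrame({'PEOPLE': ['P1', 'P2'], 'AMOUNT': [1.0, 1.0]})
df3 = pd.DataFrame({'PEOPLE': ['P2', 'P3'], 'AMOUNT': [1.0, 4.0]})

# the first merge keeps df2's column as AMOUNT (no clash yet); the second
# merge renames the clashing pair to AMOUNT_x / AMOUNT_y
df = df1.merge(df2, on='PEOPLE', how='outer').merge(df3, on='PEOPLE', how='outer')

# fill the empty custom columns and remove the merged ones in one step
df['AMOUNT_custom_A'] = df['AMOUNT_custom_A'].fillna(df.pop('AMOUNT_x'))
df['AMOUNT_custom_B'] = df['AMOUNT_custom_B'].fillna(df.pop('AMOUNT_y'))
print(df)
```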
If the columns AMOUNT_custom_A and AMOUNT_custom_B are always missing, first select only the PEOPLE column from df1 and rename the columns inside the merge:
df_1 = pd.merge(df1[['PEOPLE']], df2.rename(columns={'AMOUNT': 'AMOUNT_custom_A'}), on='PEOPLE', how='outer')  # (Step 1)
df_1 = pd.merge(df_1, df3.rename(columns={'AMOUNT': 'AMOUNT_custom_B'}), on='PEOPLE', how='outer')  # (Step 2)
df_1 = df_1.loc[:, ~df_1.columns.str.contains('^Unnamed')]

Related

merge dataframes with the same column names

Hi, I have a dataframe that looks like this:
  Unnamed: 0   X1  Unnamed: 1   X2  Unnamed: 1   X3  Unnamed: 2   X4
  1970-01-31  5.0  1970-01-31  1.0  1970-01-31  1.0  1980-01-30  1.0
  1970-02-26  6.0  1970-02-26  3.0  1970-02-26  3.0  1980-02-26  3.0
I have many columns (631) that look like that.
I would like to have:
  date        X1   X2   X3   X4
  1970-01-31  5.0  1.0  1.0  na
  1970-02-26  6.0  3.0  3.0  na
  1980-01-30  na   na   na   1.0
  1980-02-26  na   na   na   3.0
I tried:
res_df = pd.concat(
    df2[[date, X]].rename(columns={date: "date"})
    for date, X in zip(df2.columns[::2], df2.columns[1::2])
).pivot_table(index="date")
It works for small data but does not work for mine, maybe because I have duplicated column names ('Unnamed: 1') in my df.
I have a message error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Create an index from the date variable and use axis=1 in concat:
res_df = (pd.concat((df2[[date, X]].set_index(date)
                     for date, X in zip(df2.columns[::2], df2.columns[1::2])),
                    axis=1)
            .rename_axis('date')
            .reset_index())
print (res_df)
date X1 X2 X3 X4
0 1970-01-31 5.0 1.0 1.0 NaN
1 1970-02-26 6.0 3.0 3.0 NaN
2 1980-01-30 NaN NaN NaN 1.0
3 1980-02-26 NaN NaN NaN 3.0
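The solution can be checked end to end with the sample data. The construction below is an assumption for illustration and uses already-deduplicated headers (Unnamed: 3 instead of a second Unnamed: 1); real data with duplicated headers needs the deduplication described in the EDIT:

```python
import pandas as pd

df2 = pd.DataFrame({
    'Unnamed: 0': ['1970-01-31', '1970-02-26'], 'X1': [5.0, 6.0],
    'Unnamed: 1': ['1970-01-31', '1970-02-26'], 'X2': [1.0, 3.0],
    'Unnamed: 2': ['1970-01-31', '1970-02-26'], 'X3': [1.0, 3.0],
    'Unnamed: 3': ['1980-01-30', '1980-02-26'], 'X4': [1.0, 3.0],
})

# each (date, value) pair becomes a one-column frame indexed by its own
# dates; concat with axis=1 outer-joins them all on the date index
res_df = (pd.concat((df2[[date, X]].set_index(date)
                     for date, X in zip(df2.columns[::2], df2.columns[1::2])),
                    axis=1)
            .rename_axis('date')
            .reset_index())
print(res_df)
```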
EDIT: The error is caused by duplicated column names in your DataFrame; a possible solution is to deduplicate them before applying the solution above:
df = pd.DataFrame(columns=['a','a','b'], index=[0])
#you can test if duplicated columns names
print (df.columns[df.columns.duplicated(keep=False)])
Index(['a', 'a'], dtype='object')
#https://stackoverflow.com/a/43792894/2901002
df.columns = pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
print (df.columns)
Index(['a', 'a.1', 'b'], dtype='object')
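Note that ParserBase._maybe_dedup_names is a private pandas API and may no longer exist in recent releases; a small hand-rolled sketch of the same renaming:

```python
import pandas as pd

def dedup_columns(cols):
    # appends .1, .2, ... to repeated names, like pandas' CSV reader does
    seen = {}
    out = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            out.append(f'{c}.{seen[c]}')
        else:
            seen[c] = 0
            out.append(c)
    return out

df = pd.DataFrame(columns=['a', 'a', 'b'], index=[0])
df.columns = dedup_columns(df.columns)
print(df.columns)  # Index(['a', 'a.1', 'b'], dtype='object')
```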

Create a new dataframe from specific columns

I have a dataframe and I want to use columns to create new rows in a new dataframe.
>>> df_1
mix_id ngs phr d mp1 mp2 mp1_wt mp2_wt mp1_phr mp2_phr
2 M01 SBR2353 100.0 NaN MES/HPD SBR2353 0.253731 0.746269 25.373134 74.626866
3 M02 SBR2054 80.0 NaN TDAE SBR2054 0.264706 0.735294 21.176471 58.823529
I would like to have a dataframe like this.
>>> df_2
mix_id ngs phr d
1 M01 MES/HPD 25.373134 NaN
2 M01 SBR2353 74.626866 NaN
3 M02 TDAE 21.176471 NaN
4 M02 SBR2054 58.823529 NaN
IIUC, you can use pd.wide_to_long. It does, however, need the repeating columns to have numbers as a suffix, so the first part of the solution just renames the columns to move the number to the end:
import re

df.columns = (list(df.columns[:6])
              + [re.sub(r'\d', '', col) + re.search(r'\d', col).group(0)
                 for col in df.columns[6:]])
# this renames mp1_wt to mp_wt1, to support pd.wide_to_long
df2 = (pd.wide_to_long(df, stubnames=['mp', 'mp_wt', 'mp_phr'],
                       i=['mix_id', 'ngs', 'd'], j='val')
         .reset_index().drop(columns='val'))
df2.drop(columns=['ngs', 'phr', 'mp_wt'], inplace=True)
df2.rename(columns={'mp': 'ngs', 'mp_phr': 'phr'}, inplace=True)
df2
mix_id d ngs phr
0 M01 NaN MES/HPD 25.373134
1 M01 NaN SBR2353 74.626866
2 M02 NaN TDAE 21.176471
3 M02 NaN SBR2054 58.823529
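A runnable version of the same steps, with the two sample rows copied from the question (the constructed frame is an assumption for illustration):

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mix_id': ['M01', 'M02'], 'ngs': ['SBR2353', 'SBR2054'],
    'phr': [100.0, 80.0], 'd': [np.nan, np.nan],
    'mp1': ['MES/HPD', 'TDAE'], 'mp2': ['SBR2353', 'SBR2054'],
    'mp1_wt': [0.253731, 0.264706], 'mp2_wt': [0.746269, 0.735294],
    'mp1_phr': [25.373134, 21.176471], 'mp2_phr': [74.626866, 58.823529],
})

# move the digit to the end (mp1_wt -> mp_wt1) so wide_to_long can match it
df.columns = (list(df.columns[:6])
              + [re.sub(r'\d', '', col) + re.search(r'\d', col).group(0)
                 for col in df.columns[6:]])
df2 = (pd.wide_to_long(df, stubnames=['mp', 'mp_wt', 'mp_phr'],
                       i=['mix_id', 'ngs', 'd'], j='val')
         .reset_index().drop(columns='val'))
df2.drop(columns=['ngs', 'phr', 'mp_wt'], inplace=True)
df2.rename(columns={'mp': 'ngs', 'mp_phr': 'phr'}, inplace=True)
print(df2)
```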

calculate difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a counter per col 1 group and reshape by DataFrame.set_index with Series.unstack, then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days per column, and change the column names with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
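A self-contained sketch of this first approach, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'col 1': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'col 2': ['2020-07-13', '2020-07-15', '2020-07-18', '2020-07-19',
                             '2020-07-13', '2020-07-19', '2020-07-13', '2020-07-18']})
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()            # one datetime column per visit number
        .diff(axis=1)         # timedelta between consecutive dates
        .iloc[:, 1:]          # drop the first, all-NaT column
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print(df)
```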
Or use DataFrameGroupBy.diff with a counter for the new columns via DataFrame.assign, reshape with DataFrame.pivot, and remove the NaNs in c1 with DataFrame.dropna:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g=df.groupby('col 1').cumcount(),
                c1=df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first (NaN) difference
df2 = df[["col 1", "diff", "cumcount"]]\
    .pivot(index="col 1", columns="cumcount")\
    .drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Summing up two columns of pandas dataframe ignoring NaN

I have a pandas dataframe as below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A"], 'col1':[np.nan, np.nan], 'col2':[np.nan, 5]})
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as the sum of col1 and col2, ignoring NaN when only one of the columns is NaN.
If both columns are NaN, it should return NaN, as below.
I tried the code below and it works fine. Is there any way to achieve the same with just one line of code?
df['new'] = df[['col1', 'col2']].sum(axis = 1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0
Use sum with min_count=1:
df['new'] = df[['col1','col2']].sum(axis=1,min_count=1)
Out[78]:
0 NaN
1 5.0
dtype: float64
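min_count=1 makes sum return NaN unless at least one non-NaN value is present in the row; a quick runnable check with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ORDER': ['A', 'A'],
                   'col1': [np.nan, np.nan],
                   'col2': [np.nan, 5]})
# row 0 has no non-NaN values, so it stays NaN; row 1 sums to 5.0
df['new'] = df[['col1', 'col2']].sum(axis=1, min_count=1)
print(df)
```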
Use the add function on the two columns, which takes a fill_value argument that lets you replace NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64
Is this ok?
df['new'] = df[['col1', 'col2']].sum(axis = 1).replace(0,np.nan)

Copying columns that have NaN values in them and adding a prefix

I have x number of columns that contain NaN values.
With the following code I can check that:
for index, value in df.iteritems():
    if value.isnull().values.any():
This shows me, with Boolean values, which columns have NaN.
If True, I need to create a new column that has the prefix 'Interpolation' plus the name of that column in its name.
So to make it clear: if the column with the name 'XXX' has NaN, I need to create a new column with the name 'Interpolation XXX'.
Any ideas how to do this?
Something like this:
In [80]: df = pd.DataFrame({'XXX':[1,2,np.nan,4], 'YYY':[1,2,3,4], 'ZZZ':[1,np.nan, np.nan, 4]})
In [81]: df
Out[81]:
XXX YYY ZZZ
0 1.0 1 1.0
1 2.0 2 NaN
2 NaN 3 NaN
3 4.0 4 4.0
In [92]: nan_cols = df.columns[df.isna().any()].tolist()
In [94]: for col in df.columns:
    ...:     if col in nan_cols:
    ...:         df['Interpolation ' + col] = df[col]
    ...:
In [95]: df
Out[95]:
XXX YYY ZZZ Interpolation XXX Interpolation ZZZ
0 1.0 1 1.0 1.0 1.0
1 2.0 2 NaN 2.0 NaN
2 NaN 3 NaN NaN NaN
3 4.0 4 4.0 4.0 4.0
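The intermediate list and the membership test can be collapsed by iterating directly over the NaN-containing columns; a compact equivalent sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'XXX': [1, 2, np.nan, 4],
                   'YYY': [1, 2, 3, 4],
                   'ZZZ': [1, np.nan, np.nan, 4]})

# df.isna().any() gives a boolean mask per column; indexing df.columns
# with it keeps only the columns that contain at least one NaN
for col in df.columns[df.isna().any()]:
    df['Interpolation ' + col] = df[col]

print(df.columns.tolist())
# ['XXX', 'YYY', 'ZZZ', 'Interpolation XXX', 'Interpolation ZZZ']
```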
