merge dataframe with the same columns name - python-3.x

Hi, I have a dataframe that looks like this:
Unnamed: 0   X1  Unnamed: 1   X2  Unnamed: 1   X3  Unnamed: 2   X4
1970-01-31  5.0  1970-01-31  1.0  1970-01-31  1.0  1980-01-30  1.0
1970-02-26  6.0  1970-02-26  3.0  1970-02-26  3.0  1980-02-26  3.0
I have many columns (631) that look like this.
I would like to have:
date        X1   X2   X3   X4
1970-01-31  5.0  1.0  1.0  na
1970-02-26  6.0  3.0  3.0  na
1980-01-30  na   na   na   1.0
1980-02-26  na   na   na   3.0
I tried:
res_df = pd.concat(
    df2[[date, X]].rename(columns={date: "date"})
    for date, X in zip(df2.columns[::2], df2.columns[1::2])
).pivot_table(index="date")
It works for small data but does not work for mine, maybe because I have the duplicated column name 'Unnamed: 1' in my df.
I get an error message:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Create an index from the date variable and use axis=1 in concat:
res_df = (pd.concat((df2[[date, X]].set_index(date)
                     for date, X in zip(df2.columns[::2], df2.columns[1::2])), axis=1)
            .rename_axis('date')
            .reset_index())
print (res_df)
date X1 X2 X3 X4
0 1970-01-31 5.0 1.0 1.0 NaN
1 1970-02-26 6.0 3.0 3.0 NaN
2 1980-01-30 NaN NaN NaN 1.0
3 1980-02-26 NaN NaN NaN 3.0
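For reference, a minimal runnable sketch of this answer; the frame below hand-codes the question's layout with already-unique headers (a dict literal cannot express the duplicated 'Unnamed: 1' pair, so the X3 pair is omitted):
import pandas as pd

# alternating date/value column pairs, as in the question (X3's pair omitted)
df2 = pd.DataFrame({'Unnamed: 0': ['1970-01-31', '1970-02-26'], 'X1': [5.0, 6.0],
                    'Unnamed: 1': ['1970-01-31', '1970-02-26'], 'X2': [1.0, 3.0],
                    'Unnamed: 2': ['1980-01-30', '1980-02-26'], 'X4': [1.0, 3.0]})

# each (date, value) pair becomes a one-column frame indexed by its dates;
# concat with axis=1 outer-joins the date indexes, leaving NaN where a date is missing
res_df = (pd.concat((df2[[date, X]].set_index(date)
                     for date, X in zip(df2.columns[::2], df2.columns[1::2])), axis=1)
            .rename_axis('date')
            .reset_index())
print(res_df)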
EDIT: The error points to duplicated column names in your DataFrame; a possible solution is to deduplicate them before applying the solution above:
df = pd.DataFrame(columns=['a','a','b'], index=[0])
#you can test if duplicated columns names
print (df.columns[df.columns.duplicated(keep=False)])
Index(['a', 'a'], dtype='object')
#https://stackoverflow.com/a/43792894/2901002
df.columns = pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
print (df.columns)
Index(['a', 'a.1', 'b'], dtype='object')
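Note that ParserBase is a private pandas API and has moved between versions; if it is unavailable, a plain-Python dedup loop (a sketch, not from the linked answer) does the same job:
from collections import Counter

def dedup_columns(cols):
    # rename duplicates 'a', 'a' -> 'a', 'a.1', like read_csv's name mangling
    seen = Counter()
    out = []
    for c in cols:
        out.append(c if seen[c] == 0 else f"{c}.{seen[c]}")
        seen[c] += 1
    return out

df.columns = dedup_columns(df.columns)
print (df.columns)
Index(['a', 'a.1', 'b'], dtype='object')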

Related

How to fill data into a column after merging dataframes using pandas merge?

I'm using Python 3 and I have three dataframes:
df1
PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B
P1      NaN              NaN
P2      NaN              NaN
P3      NaN              NaN
df2:
PEOPLE  AMOUNT
P1      1.0
P2      1.0
df3
PEOPLE  AMOUNT
P2      1.0
P3      4.0
df_1 = pd.merge(df_1, df2, on='PEOPLE', how='outer')  # (Step 1)
df_1 = pd.merge(df_1, df3, on='PEOPLE', how='outer')  # (Step 2)
df_1 = df_1.loc[:, ~df_1.columns.str.contains('^Unnamed')]
Actual output:
PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B  AMOUNT_X  AMOUNT_Y
P1      NaN              NaN              1.0       NaN
P2      NaN              NaN              1.0       1.0
P3      NaN              NaN              NaN       4.0
Question
How to fill the data from (Step 1) into column AMOUNT_custom_A and the data from (Step 2) into column AMOUNT_custom_B?
Expected output:
PEOPLE  AMOUNT_custom_A  AMOUNT_custom_B
P1      1.0              NaN
P2      1.0              1.0
P3      NaN              4.0
Thank you !
Use Series.fillna with DataFrame.pop:
df['AMOUNT_custom_A'] = df['AMOUNT_custom_A'].fillna(df.pop('AMOUNT_X'))
df['AMOUNT_custom_B'] = df['AMOUNT_custom_B'].fillna(df.pop('AMOUNT_Y'))
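A quick runnable sketch, re-typing the merged frame from the question by hand:
import numpy as np
import pandas as pd

# merged result from the question, reconstructed by hand
df = pd.DataFrame({'PEOPLE': ['P1', 'P2', 'P3'],
                   'AMOUNT_custom_A': [np.nan] * 3,
                   'AMOUNT_custom_B': [np.nan] * 3,
                   'AMOUNT_X': [1.0, 1.0, np.nan],
                   'AMOUNT_Y': [np.nan, 1.0, 4.0]})

# pop removes the helper column; fillna moves its values into the target
df['AMOUNT_custom_A'] = df['AMOUNT_custom_A'].fillna(df.pop('AMOUNT_X'))
df['AMOUNT_custom_B'] = df['AMOUNT_custom_B'].fillna(df.pop('AMOUNT_Y'))
print (df)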
If columns AMOUNT_custom_A and AMOUNT_custom_B are always completely missing, first select only the PEOPLE column from df_1 and rename the column names within the merge:
df_1 = pd.merge(df_1[['PEOPLE']], df2.rename(columns={'AMOUNT':'AMOUNT_custom_A'}), on='PEOPLE', how='outer')  # (Step 1)
df_1 = pd.merge(df_1, df3.rename(columns={'AMOUNT':'AMOUNT_custom_B'}), on='PEOPLE', how='outer')  # (Step 2)
df_1 = df_1.loc[:, ~df_1.columns.str.contains('^Unnamed')]

calculate difference between consecutive date records at an ID level

I have a dataframe as:
col 1  col 2
A      2020-07-13
A      2020-07-15
A      2020-07-18
A      2020-07-19
B      2020-07-13
B      2020-07-19
C      2020-07-13
C      2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a counter per group of col 1 and reshape by DataFrame.set_index with Series.unstack, then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days for all columns, and change the column names with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
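For reference, the sample frame used above can be built like this (the question's data re-typed by hand):
import pandas as pd

df = pd.DataFrame({'col 1': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'col 2': ['2020-07-13', '2020-07-15', '2020-07-18', '2020-07-19',
                             '2020-07-13', '2020-07-19', '2020-07-13', '2020-07-18']})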
Or use DataFrameGroupBy.diff with a counter for the new columns via DataFrame.assign, reshape by DataFrame.pivot, and remove the NaNs in c1 with DataFrame.dropna:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
                c1 = df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference (always NaN)
df2 = df[["col 1", "diff", "cumcount"]] \
        .pivot(index="col 1", columns="cumcount") \
        .drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Copying columns that have NaN values in them and adding a prefix

I have x number of columns that contain NaN values.
With the following code I can check that:
for col, value in df.items():
    if value.isnull().values.any():
        print(col)
This shows me which columns have NaN.
If true, I need to create a new column whose name is the prefix 'Interpolation' plus the name of that column.
So to make it clear: if the column with the name 'XXX' has NaN, I need to create a new column with the name 'Interpolation XXX'.
Any ideas how to do this?
Something like this:
In [80]: df = pd.DataFrame({'XXX':[1,2,np.nan,4], 'YYY':[1,2,3,4], 'ZZZ':[1,np.nan, np.nan, 4]})
In [81]: df
Out[81]:
XXX YYY ZZZ
0 1.0 1 1.0
1 2.0 2 NaN
2 NaN 3 NaN
3 4.0 4 4.0
In [92]: nan_cols = df.columns[df.isna().any()].tolist()
In [94]: for col in df.columns:
...: if col in nan_cols:
...: df['Interpolation ' + col ] = df[col]
...:
In [95]: df
Out[95]:
XXX YYY ZZZ Interpolation XXX Interpolation ZZZ
0 1.0 1 1.0 1.0 1.0
1 2.0 2 NaN 2.0 NaN
2 NaN 3 NaN NaN NaN
3 4.0 4 4.0 4.0 4.0
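As an alternative to the loop, starting from the original df, the same columns can be copied in one step with add_prefix and join (a sketch, not from the answer above):
# columns that contain at least one NaN
nan_cols = df.columns[df.isna().any()]
# copy them all at once under the 'Interpolation ' prefix
df = df.join(df[nan_cols].add_prefix('Interpolation '))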

How to fill NaN with user defined value in pandas dataframe

For text columns like A and B, user-defined text like 'Missing' should be imputed. For discrete numeric variables like C and D, the median value should be imputed. I have many columns like these and would like to apply the rule to all variables in the dataframe.
DF
A B C D
A0A1 Railway 10 NaN
A1A1 Shipping NaN 1
NaN Shipping 3 2
B1A1 NaN 1 7
DF out:
A B C D
A0A1 Railway 10 2
A1A1 Shipping 3 1
Missing Shipping 3 2
B1A1 Missing 1 7
You can fillna by passing a dict:
df.fillna({'A':'Miss','B':"Your2",'C':df.C.median(),'D':df.D.mean()})
Out[373]:
A B C D
0 A0A1 Railway 10.0 3.333333
1 A1A1 Shipping 3.0 1.000000
2 Miss Shipping 3.0 2.000000
3 B1A1 Your2 1.0 7.000000
Fun way!
d = {np.dtype('O'): 'Missing'}
df.fillna(df.dtypes.map(d).fillna(df.median()))
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
First replace NaN with the median for numeric columns, then fillna for the non-numeric ones:
df = df.fillna(df.median()).fillna('Missing')
print (df)
A B C D
0 A0A1 Railway 10.0 2.0
1 A1A1 Shipping 3.0 1.0
2 Missing Shipping 3.0 2.0
3 B1A1 Missing 1.0 7.0
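To spell the rule out for many columns at once, a select_dtypes-based sketch (assuming the discrete variables are all of numeric dtype):
num_cols = df.select_dtypes(include='number').columns
# numeric columns: impute the median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
# everything else: impute the user-defined text
df = df.fillna('Missing')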

Pandas Pivot and Summarize For Multiple Rows Vertically

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site': ['a','a','a','b','b','b'],
                   'x': [1,1,0,1,0,0],
                   'y': [1,np.nan,0,1,1,0]})
df
Site y x
0 a 1.0 1
1 a NaN 1
2 a 0.0 0
3 b 1.0 1
4 b 1.0 0
5 b 0.0 0
I am looking for the most efficient way, for each numerical column (y and x), to produce a percent per group, label the column name, and stack them in one column.
Here's how I accomplish this for 'y':
df=df.loc[~np.isnan(df['y'])] #do not count non-numbers
t=pd.pivot_table(df,index='Site',values='y',aggfunc=[np.sum,len])
t['Item']='y'
t['Perc']=round(t['sum']/t['len']*100,1)
t
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
Now all I need is a way to add 2 more rows to this; the results for 'x' if I had pivoted with its values above, like this:
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1 2 x 50.0
b 1 3 x 33.3
In reality, I have 48 such numerical data columns that need to be stacked as such.
Thanks in advance!
First you can use notnull. Then omit the values parameter in pivot_table, stack, and sort_values by the new column Item. Last, you can use the pandas round function:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print (t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
Another solution, if it is necessary to define the values columns in pivot_table:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', values=['y', 'x'], aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print (t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
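With 48 numeric columns, a melt/groupby alternative avoids pivot_table entirely (a sketch, not from the answers above):
# long format: one row per (Site, Item, value); dropna so NaNs are not counted
long = df.melt(id_vars='Site', var_name='Item', value_name='val').dropna(subset=['val'])
t = long.groupby(['Site', 'Item'])['val'].agg(sum='sum', len='count').reset_index('Item')
t['Perc'] = (t['sum']/t['len']*100).round(1)
print (t)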
