add row to dataframe pandas - python-3.x

I want to add a median row to the top. Based on this stack answer I do the following:
pd.concat([df.median(),df],axis=0, ignore_index=True)
Shape of DF: 50000 x 226
Shape expected: 50001 x 226
Shape of modified DF: 500213 x 227 ???
What am I doing wrong? I am unable to understand what is going on.

Maybe what you wanted is like this:
dfn = pd.concat([df.median().to_frame().T, df], ignore_index=True)
Create some sample data:
df = pd.DataFrame(np.arange(20).reshape(4,5), columns= list('ABCDE'))
dfn = pd.concat([df.median().to_frame().T, df])
df
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df.median().to_frame().T
A B C D E
0 7.5 8.5 9.5 10.5 11.5
dfn
A B C D E
0 7.5 8.5 9.5 10.5 11.5
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df.median() is a Series with an index of A, B, C, D, E, so when you concat df.median() with df along axis=0, the result is this:
pd.concat([df.median(),df], axis=0)
0 A B C D E
A 7.5 NaN NaN NaN NaN NaN
B 8.5 NaN NaN NaN NaN NaN
C 9.5 NaN NaN NaN NaN NaN
D 10.5 NaN NaN NaN NaN NaN
E 11.5 NaN NaN NaN NaN NaN
0 NaN 0.0 1.0 2.0 3.0 4.0
1 NaN 5.0 6.0 7.0 8.0 9.0
2 NaN 10.0 11.0 12.0 13.0 14.0
3 NaN 15.0 16.0 17.0 18.0 19.0

pd.concat([df.median(),df],axis=0, ignore_index=True)
df.median() creates the row for you, but it is a Series, not a DataFrame, so you need to convert the Series to a DataFrame. You can apply
.to_frame().T
and then your code becomes:
pd.concat([df.median().to_frame().T,df],axis=0, ignore_index=True)
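Not part of the original answer, but here is a minimal end-to-end sketch on the sample data above; wrapping the median Series in a list is an equivalent spelling of the same fix:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('ABCDE'))

# promote the median Series to a one-row DataFrame, then prepend it
dfn = pd.concat([df.median().to_frame().T, df], axis=0, ignore_index=True)

# equivalent spelling: the constructor builds one row from a list containing the Series
dfn_alt = pd.concat([pd.DataFrame([df.median()]), df], axis=0, ignore_index=True)

print(dfn.shape)            # (5, 5): one extra row, same number of columns
print(dfn.equals(dfn_alt))  # True: both spellings give the same frame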

Related

Outer merge in pandas with more than two data frames [duplicate]

I have 3 dfs as shown below:
df1:
ID March_Number March_Amount
A 10 200
B 4 300
C 2 100
df2:
ID Feb_Number Feb_Amount
A 1 100
B 8 500
E 4 400
F 8 100
H 4 200
df3:
ID Jan_Number Jan_Amount
A 6 800
H 3 500
B 1 50
G 8 100
I tried the code below and it worked well.
df_outer = pd.merge(df1, df2, on='ID', how='outer')
df_outer = pd.merge(df_outer , df3, on='ID', how='outer')
But I would like to pass all the dfs together and merge them in one shot. I tried the code below and got the error shown.
df_outer = pd.merge(df1, df2, df3, on='ID', how='outer')
Please guide me on how to merge if I have 12 months of data, i.e. I have to merge 12 dfs.
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-32-a63627da7233> in <module>
----> 1 df_outer = pd.merge(df1, df2, df3, on='ID', how='outer')
TypeError: merge() got multiple values for argument 'how'
Expected output:
ID March_Number March_Amount Feb_Number Feb_Amount Jan_Number Jan_Amount
A 10.0 200.0 1.0 100.0 6.0 800.0
B 4.0 300.0 8.0 500.0 1.0 50.0
C 2.0 100.0 NaN NaN NaN NaN
E NaN NaN 4.0 400.0 NaN NaN
F NaN NaN 8.0 100.0 NaN NaN
H NaN NaN 4.0 200.0 3.0 500.0
G NaN NaN NaN NaN 8.0 100.0
We can create a list of the dfs we want to merge, in this case dfl, and then merge them together with reduce.
We can add as many dfs as we want: dfl = [df1, df2, df3, ..., dfn]
from functools import reduce
dfl=[df1, df2, df3]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['ID'], how='outer'), dfl)
Output
ID March_Number March_Amount Feb_Number Feb_Amount Jan_Number Jan_Amount
0 A 10.0 200.0 1.0 100.0 6.0 800.0
1 B 4.0 300.0 8.0 500.0 1.0 50.0
2 C 2.0 100.0 NaN NaN NaN NaN
3 E NaN NaN 4.0 400.0 NaN NaN
4 F NaN NaN 8.0 100.0 NaN NaN
5 H NaN NaN 4.0 200.0 3.0 500.0
6 G NaN NaN NaN NaN 8.0 100.0
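As a hedged alternative sketch (not part of the original answer), the same outer combination over any number of monthly frames can also be written with pd.concat on an 'ID' index; row and column order may come out slightly different from the merge-based result:
import pandas as pd

dfl = [df1, df2, df3]  # extend with all 12 monthly dataframes
df_merged = pd.concat([d.set_index('ID') for d in dfl], axis=1, join='outer').reset_index()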

How to read data from excel and concatenate columns vertically?

I'm reading this data from an excel file:
a b
0 x y x y
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
4 0 1 2 3
5 0 1 2 3
For each of the a and b categories (a.k.a. samples), there are two columns of x and y values. I want to convert this excel data into a dataframe that looks like this (concatenating the data from samples a and b vertically):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)      # create a variable that allows selecting the even columns
sample_df = pd.DataFrame()  # create an empty DataFrame
for i in x:                 # loop through the excel data
    sample = pd.read_excel(xls2, usecols=[i, i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
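A plausible fix, sketched here as an assumption rather than a confirmed answer: the stray x.1/y.1 columns come from pandas deduplicating the repeated x/y headers of the second sample, so renaming each chunk's value columns to a common ['x', 'y'] before concatenating should align them:
x = np.arange(0, 4, 2)
sample_df = pd.DataFrame()
for i in x:
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)        # sample name (a or b)
    values_df = pd.read_excel(xls2, usecols=[i, i + 1], nrows=5, header=1)
    values_df.columns = ['x', 'y']                                      # normalize x.1/y.1 back to x/y
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)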

Read dataframe split by nan rows and reshape them into multiple dataframes in Python

I have an example excel file data1.xlsx from here, which has a single Sheet1.
Now I want to read it with openpyxl or pandas and convert it into new df1 and df2, which I will finally save as a price sheet and a quantity sheet.
Code I have used:
df = pd.read_excel('./data1.xlsx', sheet_name = 'Sheet1')
df_list = np.split(df, df[df.isnull().all(1)].index)
for df in df_list:
    print(df, '\n')
Out:
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 year 2018.0 2019.0 2020.0 sum
1 price 12.0 4.0 5.0 21
2 quantity 5.0 5.0 3.0 13
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
3 NaN NaN NaN NaN NaN
4 sh NaN NaN NaN NaN
5 year 2018.0 2019.0 2020.0 sum
6 price 5.0 6.0 7.0 18
7 quantity 7.0 5.0 4.0 16
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
8 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
9 NaN NaN NaN NaN NaN
10 gz NaN NaN NaN NaN
11 year 2018.0 2019.0 2020.0 sum
12 price 2.0 3.0 1.0 6
13 quantity 6.0 9.0 3.0 18
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
14 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
15 NaN NaN NaN NaN NaN
16 sz NaN NaN NaN NaN
17 year 2018.0 2019.0 2020.0 sum
18 price 8.0 2.0 3.0 13
19 quantity 5.0 4.0 3.0 12
How could I do that in Python? Thanks a lot.
Use:
#add header=None to get default column names
df = pd.read_excel('./data1.xlsx', sheet_name = 'Sheet1', header=None)
#use the second row as the column names
df.columns = df.iloc[1].rename(None)
#create new column `city`: keep the first-column value only on city rows (where the second column is NaN), then forward fill
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())
#convert float column names to integers
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]
#set column `year` as the index
df = df.set_index('year')
print (df)
city 2018 2019 2020 sum
year
bj bj NaN NaN NaN NaN
year bj 2018.0 2019.0 2020.0 sum
price bj 12.0 4.0 5.0 21
quantity bj 5.0 5.0 3.0 13
NaN bj NaN NaN NaN NaN
sh sh NaN NaN NaN NaN
year sh 2018.0 2019.0 2020.0 sum
price sh 5.0 6.0 7.0 18
quantity sh 7.0 5.0 4.0 16
NaN sh NaN NaN NaN NaN
NaN sh NaN NaN NaN NaN
gz gz NaN NaN NaN NaN
year gz 2018.0 2019.0 2020.0 sum
price gz 2.0 3.0 1.0 6
quantity gz 6.0 9.0 3.0 18
NaN gz NaN NaN NaN NaN
NaN gz NaN NaN NaN NaN
sz sz NaN NaN NaN NaN
year sz 2018.0 2019.0 2020.0 sum
price sz 8.0 2.0 3.0 13
quantity sz 5.0 4.0 3.0 12
df1 = df.loc['price'].reset_index(drop=True)
print (df1)
city 2018 2019 2020 sum
0 bj 12.0 4.0 5.0 21
1 sh 5.0 6.0 7.0 18
2 gz 2.0 3.0 1.0 6
3 sz 8.0 2.0 3.0 13
df2 = df.loc['quantity'].reset_index(drop=True)
print (df2)
city 2018 2019 2020 sum
0 bj 5.0 5.0 3.0 13
1 sh 7.0 5.0 4.0 16
2 gz 6.0 9.0 3.0 18
3 sz 5.0 4.0 3.0 12
Last, writing the DataFrames to the existing file is possible with the mode='a' parameter (see the pandas ExcelWriter documentation):
with pd.ExcelWriter('data1.xlsx', mode='a') as writer:
    df1.to_excel(writer, sheet_name='price')
    df2.to_excel(writer, sheet_name='quantity')
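As a further hedged sketch (not part of the original answer), the np.split approach from the question can also be pushed through to the same df1/df2, assuming every non-empty block has the layout: city row, year header row, price row, quantity row:
import numpy as np
import pandas as pd

df = pd.read_excel('./data1.xlsx', sheet_name='Sheet1', header=None)

price_rows, qty_rows = [], []
for chunk in np.split(df, df[df.isnull().all(axis=1)].index):
    chunk = chunk.dropna(how='all')        # drop the blank separator rows
    if chunk.empty:
        continue
    city = chunk.iloc[0, 0]                # first row of a block holds the city name
    body = chunk.iloc[2:].copy()           # price and quantity rows
    body.columns = chunk.iloc[1].tolist()  # year / 2018 / 2019 / 2020 / sum
    body.insert(0, 'city', city)
    price_rows.append(body[body['year'] == 'price'].drop(columns='year'))
    qty_rows.append(body[body['year'] == 'quantity'].drop(columns='year'))

df1 = pd.concat(price_rows, ignore_index=True)
df2 = pd.concat(qty_rows, ignore_index=True)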

Concatenate two dataframes of different sizes (pandas)

I have two dataframes with unique ids. They share some columns but not all. I need to create a combined dataframe which includes the rows for the ids that only appear in the second dataframe. I tried merge and concat with no luck. It's probably too late and my brain stopped working. I will appreciate your help!
df1 = pd.DataFrame({
    'id': ['a','b','c','d','f','g','h','j','k','l','m'],
    'metric1': [123,22,356,412,54,634,72,812,129,110,200],
    'metric2': [1,2,3,4,5,6,7,8,9,10,11]
})
df2 = pd.DataFrame({
    'id': ['a','b','c','d','f','g','h','q','z','w'],
    'metric1': [123,22,356,412,54,634,72,812,129,110]
})
df2
The result should look like this:
id metric1 metric2
0 a 123 1.0
1 b 22 2.0
2 c 356 3.0
3 d 412 4.0
4 f 54 5.0
5 g 634 6.0
6 h 72 7.0
7 j 812 8.0
8 k 129 9.0
9 l 110 10.0
10 m 200 11.0
11 q 812 NaN
12 z 129 NaN
13 w 110 NaN
In this case, use combine_first:
df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[766]:
id metric1 metric2
0 a 123.0 1.0
1 b 22.0 2.0
2 c 356.0 3.0
3 d 412.0 4.0
4 f 54.0 5.0
5 g 634.0 6.0
6 h 72.0 7.0
7 j 812.0 8.0
8 k 129.0 9.0
9 l 110.0 10.0
10 m 200.0 11.0
11 q 812.0 NaN
12 w 110.0 NaN
13 z 129.0 NaN
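A hedged alternative sketch (not from the original answer): since the shared ids carry identical metric1 values in both frames here, an outer merge on both columns gives the same combined frame, with the df1 rows first and the df2-only ids appended:
combined = pd.merge(df1, df2, on=['id', 'metric1'], how='outer')
print(combined)  # 14 rows: a..m from df1, then q, z, w with NaN metric2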

merging multiple columns into one columns in pandas

I have a dataframe called ref (first dataframe) with columns c1, c2, c3 and c4.
ref= pd.DataFrame([[1,3,.3,7],[0,4,.5,4.5],[2,5,.6,3]], columns=['c1','c2','c3','c4'])
print(ref)
c1 c2 c3 c4
0 1 3 0.3 7.0
1 0 4 0.5 4.5
2 2 5 0.6 3.0
I want to create a new column, i.e. c5 (second dataframe), that has all the values from columns c1, c2, c3 and c4.
I tried concat and merging columns, but I cannot get it to work.
Please let me know if you have a solution.
You can use unstack to create a Series from the DataFrame and then concat it to the original:
print (pd.concat([ref, ref.unstack().reset_index(drop=True).rename('c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
An alternative solution for creating the Series is to convert the df to a numpy array with .values and then flatten it with ravel:
print (pd.concat([ref, pd.Series(ref.values.ravel('F'), name='c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
using join + ravel('F')
ref.join(pd.Series(ref.values.ravel('F')).to_frame('c5'), how='right')
using join + T.ravel()
ref.join(pd.Series(ref.values.T.ravel()).to_frame('c5'), how='right')
pd.concat + T.stack() + rename
pd.concat([ref, ref.T.stack().reset_index(drop=True).rename('c5')], axis=1)
way too many transposes + append
ref.T.append(ref.T.stack().reset_index(drop=True).rename('c5')).T
combine_first + ravel('F') <--- my favorite
ref.combine_first(pd.Series(ref.values.ravel('F')).to_frame('c5'))
All yield
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
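One more hedged sketch, not among the original answers: melt also stacks the columns one after another (column-major, like ravel('F')), so it produces the same c5:
c5 = ref.melt(value_name='c5')['c5']   # 12 values: the c1 block, then c2, c3, c4
out = pd.concat([ref, c5], axis=1)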
Use list(zip()) as follows:
d=list(zip(df1.c1,df1.c2,df1.c3,df1.c4))
df2['c5']=pd.Series(d)
Try this one; it works as you expected:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
print(df)
r = len(df['c1'])                 # number of rows
c = len(list(df))                 # number of columns
ndata = list(df.c1) + list(df.c2) + list(df.c3) + list(df.c4)  # all values, column by column
r = len(ndata) - r                # extra rows needed to hold the stacked values
t = r*c
dfnan = pd.DataFrame(np.reshape([np.nan]*t, (r,c)), columns=list(df))  # NaN filler rows
df = df.append(dfnan)             # DataFrame.append is deprecated in newer pandas; pd.concat([df, dfnan]) also works
df['c5'] = ndata
print(df)
The output is the original three rows followed by the NaN filler rows, with c5 holding all the values column by column.
This could be a fast option and maybe you can use it inside a loop.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
df['c5'] = df.iloc[:,0].astype(str) + df.iloc[:,1].astype(str) + df.iloc[:,2].astype(str) + df.iloc[:,3].astype(str)  # concatenates the string representations of the four columns row-wise
Greetings
