Replace multiple columns' NaNs with other columns' values in Pandas

Given a dataframe as follows:
date city gdp gdp1 gdp2 gross domestic product pop pop1 pop2
0 2001-03 bj 3.0 NaN NaN NaN 7.0 NaN NaN
1 2001-06 bj 5.0 NaN NaN NaN 6.0 6.0 NaN
2 2001-09 bj 8.0 NaN NaN 8.0 4.0 4.0 NaN
3 2001-12 bj 7.0 NaN 7.0 NaN 2.0 NaN 2.0
4 2001-03 sh 4.0 4.0 NaN NaN 3.0 NaN NaN
5 2001-06 sh 5.0 NaN NaN 5.0 5.0 5.0 NaN
6 2001-09 sh 9.0 NaN NaN NaN 4.0 4.0 NaN
7 2001-12 sh 3.0 3.0 NaN NaN 6.0 NaN 6.0
I want to replace the NaNs in gdp and pop with the values of gdp1, gdp2, gross domestic product and pop1, pop2, respectively:
date city gdp pop
0 2001-03 bj 3 7
1 2001-06 bj 5 6
2 2001-09 bj 8 4
3 2001-12 bj 7 2
4 2001-03 sh 4 3
5 2001-06 sh 5 5
6 2001-09 sh 9 4
7 2001-12 sh 3 6
The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns.
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
df = df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)

The idea is to back-fill missing values across the columns selected by DataFrame.filter. If there are multiple non-missing values per group, .bfill(axis=1).iloc[:, 0] prioritizes columns from the left side; change it to .ffill(axis=1).iloc[:, -1] to prioritize columns from the right side:
#if the first columns are gdp and pop
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']
#works regardless of which column comes first
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
If there is at most one non-missing value per row, it is also possible to use max, min, ...:
df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
If you need to specify the column names explicitly as lists:
gdp_c = ['gdp1','gdp2','gross domestic product']
pop_c = ['pop1','pop2']
df['gdp'] = df[gdp_c].bfill(axis=1).iloc[:, 0]
df['pop'] = df[pop_c].bfill(axis=1).iloc[:, 0]
df = df[['date','city','gdp','pop']]
print (df)
date city gdp pop
0 2001-03 bj 3.0 7.0
1 2001-06 bj 5.0 6.0
2 2001-09 bj 8.0 4.0
3 2001-12 bj 7.0 2.0
4 2001-03 sh 4.0 3.0
5 2001-06 sh 5.0 5.0
6 2001-09 sh 9.0 4.0
7 2001-12 sh 3.0 6.0
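Since the question mentions many similar column groups, the same pattern generalizes with a small loop. A minimal sketch, assuming a hypothetical col_map dict from each target column to its fallback columns in priority order:
#map each target column to its fallback columns, highest priority first
col_map = {'gdp': ['gdp1', 'gdp2', 'gross domestic product'],
           'pop': ['pop1', 'pop2']}
for target, sources in col_map.items():
    #keep the target's own value first, then fall back left to right
    df[target] = df[[target, *sources]].bfill(axis=1).iloc[:, 0]
#drop all the fallback columns in one go
df = df.drop(columns=[c for cols in col_map.values() for c in cols])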

Related

How to read data from excel and concatenate columns vertically?

I'm reading this data from an excel file:
a b
0 x y x y
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
4 0 1 2 3
5 0 1 2 3
For each of the a and b categories (a.k.a. samples), there are two columns of x and y values. I want to convert this excel data into a dataframe that looks like this (concatenating vertically the data from samples a and b):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)  # even column offsets, one per sample
sample_df = pd.DataFrame()  # create an empty accumulator DataFrame
for i in x:  # loop over the samples in the excel data
    sample = pd.read_excel(xls2, usecols=[i, i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
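The extra x.1 and y.1 columns appear because pandas de-duplicates the repeated header row into x, y, x.1, y.1, so the second sample's frame no longer aligns with the first when concatenated. A minimal fix (a sketch, assuming xls2 is the ExcelFile object from the question) is to normalize the column names before concatenating:
frames = []
for i in np.arange(0, 4, 2):  # one (x, y) column pair per sample
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.columns = ['x', 'y']  # strip the de-duplication suffixes
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    frames.append(values_df)
sample_df = pd.concat(frames, ignore_index=True)
Collecting the frames in a list and concatenating once at the end is also cheaper than concatenating inside the loop.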

add row to dataframe pandas

I want to add a median row to the top. Based on this stack answer I do the following:
pd.concat([df.median(),df],axis=0, ignore_index=True)
Shape of DF: 50000 x 226
Shape expected: 50001 x 226
Shape of modified DF: 500213 x 227 ???
What am I doing wrong? I'm unable to understand what is going on.
Maybe what you want is something like this:
dfn = pd.concat([df.median().to_frame().T, df], ignore_index=True)
create some sample data:
df = pd.DataFrame(np.arange(20).reshape(4,5), columns= list('ABCDE'))
dfn = pd.concat([df.median().to_frame().T, df])
df
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df.median().to_frame().T
A B C D E
0 7.5 8.5 9.5 10.5 11.5
dfn
A B C D E
0 7.5 8.5 9.5 10.5 11.5
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df.median() is a Series with an index of A, B, C, D, E, so when you concat df.median() with df, the result is this:
pd.concat([df.median(),df], axis=0)
0 A B C D E
A 7.5 NaN NaN NaN NaN NaN
B 8.5 NaN NaN NaN NaN NaN
C 9.5 NaN NaN NaN NaN NaN
D 10.5 NaN NaN NaN NaN NaN
E 11.5 NaN NaN NaN NaN NaN
0 NaN 0.0 1.0 2.0 3.0 4.0
1 NaN 5.0 6.0 7.0 8.0 9.0
2 NaN 10.0 11.0 12.0 13.0 14.0
3 NaN 15.0 16.0 17.0 18.0 19.0
pd.concat([df.median(),df],axis=0, ignore_index=True)
This code creates a row for you, but it is a Series, not a DataFrame. To convert the Series to a DataFrame, append
.to_frame().T
and your code becomes
pd.concat([df.median().to_frame().T, df], axis=0, ignore_index=True)
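As a quick sanity check (a sketch reusing the sample frame from above), the fixed version adds exactly one row and no extra column:
dfn = pd.concat([df.median().to_frame().T, df], axis=0, ignore_index=True)
assert dfn.shape == (df.shape[0] + 1, df.shape[1])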

Read dataframe split by nan rows and reshape them into multiple dataframes in Python

I have an example excel file data1.xlsx from here, which has a Sheet1 as follows:
Now I want to read it with openpyxl or pandas, convert it into new dataframes df1 and df2, and finally save them as price and quantity sheets:
price sheet:
and quantity sheet:
Code I have used:
df = pd.read_excel('./data1.xlsx', sheet_name = 'Sheet1')
df_list = np.split(df, df[df.isnull().all(1)].index)
for df in df_list:
    print(df, '\n')
Out:
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 year 2018.0 2019.0 2020.0 sum
1 price 12.0 4.0 5.0 21
2 quantity 5.0 5.0 3.0 13
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
3 NaN NaN NaN NaN NaN
4 sh NaN NaN NaN NaN
5 year 2018.0 2019.0 2020.0 sum
6 price 5.0 6.0 7.0 18
7 quantity 7.0 5.0 4.0 16
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
8 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
9 NaN NaN NaN NaN NaN
10 gz NaN NaN NaN NaN
11 year 2018.0 2019.0 2020.0 sum
12 price 2.0 3.0 1.0 6
13 quantity 6.0 9.0 3.0 18
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
14 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
15 NaN NaN NaN NaN NaN
16 sz NaN NaN NaN NaN
17 year 2018.0 2019.0 2020.0 sum
18 price 8.0 2.0 3.0 13
19 quantity 5.0 4.0 3.0 12
How could I do that in Python? Thanks a lot.
Use:
#add header=None for default column names
df = pd.read_excel('./data1.xlsx', sheet_name = 'Sheet1', header=None)
#set the column names from the second row
df.columns = df.iloc[1].rename(None)
#create a new 'city' column by forward filling the city names (rows where the second column is missing)
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())
#convert floats to integers
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]
#convert column year to index
df = df.set_index('year')
print (df)
city 2018 2019 2020 sum
year
bj bj NaN NaN NaN NaN
year bj 2018.0 2019.0 2020.0 sum
price bj 12.0 4.0 5.0 21
quantity bj 5.0 5.0 3.0 13
NaN bj NaN NaN NaN NaN
sh sh NaN NaN NaN NaN
year sh 2018.0 2019.0 2020.0 sum
price sh 5.0 6.0 7.0 18
quantity sh 7.0 5.0 4.0 16
NaN sh NaN NaN NaN NaN
NaN sh NaN NaN NaN NaN
gz gz NaN NaN NaN NaN
year gz 2018.0 2019.0 2020.0 sum
price gz 2.0 3.0 1.0 6
quantity gz 6.0 9.0 3.0 18
NaN gz NaN NaN NaN NaN
NaN gz NaN NaN NaN NaN
sz sz NaN NaN NaN NaN
year sz 2018.0 2019.0 2020.0 sum
price sz 8.0 2.0 3.0 13
quantity sz 5.0 4.0 3.0 12
df1 = df.loc['price'].reset_index(drop=True)
print (df1)
city 2018 2019 2020 sum
0 bj 12.0 4.0 5.0 21
1 sh 5.0 6.0 7.0 18
2 gz 2.0 3.0 1.0 6
3 sz 8.0 2.0 3.0 13
df2 = df.loc['quantity'].reset_index(drop=True)
print (df2)
city 2018 2019 2020 sum
0 bj 5.0 5.0 3.0 13
1 sh 7.0 5.0 4.0 16
2 gz 6.0 9.0 3.0 18
3 sz 5.0 4.0 3.0 12
Finally, it is possible to write the DataFrames to the existing file with the mode='a' parameter:
with pd.ExcelWriter('data1.xlsx', mode='a') as writer:
    df1.to_excel(writer, sheet_name='price')
    df2.to_excel(writer, sheet_name='quantity')
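An alternative that stays closer to the question's np.split approach (a sketch, assuming every block follows the city / year header / price / quantity layout shown above):
df = pd.read_excel('./data1.xlsx', sheet_name='Sheet1', header=None)
blocks = [b.dropna(how='all') for b in np.split(df, df[df.isnull().all(1)].index)]
rows = {'price': [], 'quantity': []}
for b in blocks:
    city = b.iloc[0, 0]              #first row of each block holds the city name
    b = b.iloc[1:]                   #remaining rows: year header + data
    b.columns = b.iloc[0]            #'year', 2018, 2019, 2020, 'sum'
    b = b.iloc[1:].set_index('year')
    for kind in rows:
        rows[kind].append([city, *b.loc[kind]])
df1 = pd.DataFrame(rows['price'], columns=['city', 2018, 2019, 2020, 'sum'])
df2 = pd.DataFrame(rows['quantity'], columns=['city', 2018, 2019, 2020, 'sum'])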

Replacing values in a string with NaN

I'm faced with a simple task but cannot solve it. There is a table in df:
Date X1 X2
02.03.2019 2 2
03.03.2019 1 1
04.03.2019 2 3
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
And for rows where Date < 05.03.2019 I need to set X1=NaN, X2=NaN:
Date X1 X2
02.03.2019 NaN NaN
03.03.2019 NaN NaN
04.03.2019 NaN NaN
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
First convert the Date column to datetimes and then set the values with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
df.loc[df['Date'] < '2019-03-05', ['X1','X2']] = np.nan
print (df)
Date X1 X2
0 2019-03-02 NaN NaN
1 2019-03-03 NaN NaN
2 2019-03-04 NaN NaN
3 2019-03-05 1.0 12.0
4 2019-03-06 2.0 2.0
5 2019-03-07 3.0 3.0
6 2019-03-08 4.0 1.0
7 2019-03-09 1.0 2.0
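The string on the right-hand side works because pandas coerces it to a Timestamp before comparing; the explicit equivalent is:
df.loc[df['Date'] < pd.Timestamp('2019-03-05'), ['X1','X2']] = np.nan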
If there is a DatetimeIndex:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
#select all rows up to 2019-03-04 (inclusive)
df.loc[:'2019-03-04'] = np.nan
print (df)
X1 X2
Date
2019-03-02 NaN NaN
2019-03-03 NaN NaN
2019-03-04 NaN NaN
2019-03-05 1.0 12.0
2019-03-06 2.0 2.0
2019-03-07 3.0 3.0
2019-03-08 4.0 1.0
2019-03-09 1.0 2.0
Or:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
df.loc[df.index < '2019-03-05'] = np.nan
Don't use this solution, it's just another possible approach (-: (it will affect all columns, and it compares the raw DD.MM.YYYY strings lexicographically, which only works here because all the dates share the same month and year):
df.mask(df.Date < '05.03.2019').combine_first(df[['Date']])
Date X1 X2
0 02.03.2019 NaN NaN
1 03.03.2019 NaN NaN
2 04.03.2019 NaN NaN
3 05.03.2019 1.0 12.0
4 06.03.2019 2.0 2.0
5 07.03.2019 3.0 3.0
6 08.03.2019 4.0 1.0
7 09.03.2019 1.0 2.0

Stack two pandas dataframes with different columns, keeping source dataframe as column, also

I have a couple of toy dataframes I can stack using df.append, but I need to keep the source dataframe names as a column as well. I can't seem to find anything about how to do that. Here's what I do have:
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2005
A B C G
0 1 2 3 7
1 2 4 5 8
2 3 5 7 9
3 4 6 8 10
d2006
A B D F
0 2 3 a 7
1 1 1 c 8
2 4 5 d 10
3 5 6 e 12
Then I can stack them like this:
d_combined = d2005.append(d2006, ignore_index = True, sort = True)
d_combined
A B C D F G
0 1 2 3.0 NaN NaN 7.0
1 2 4 5.0 NaN NaN 8.0
2 3 5 7.0 NaN NaN 9.0
3 4 6 8.0 NaN NaN 10.0
4 2 3 NaN a 7.0 NaN
5 1 1 NaN c 8.0 NaN
6 4 5 NaN d 10.0 NaN
7 5 6 NaN e 12.0 NaN
But what I really need is another column naming the source dataframe, added at the right end of d_combined. Something like this:
A B C D G F From
0 1 2 3.0 NaN 7.0 NaN d2005
1 2 4 5.0 NaN 8.0 NaN d2005
2 3 5 7.0 NaN 9.0 NaN d2005
3 4 6 8.0 NaN 10.0 NaN d2005
4 2 3 NaN a NaN 7.0 d2006
5 1 1 NaN c NaN 8.0 d2006
6 4 5 NaN d NaN 10.0 d2006
7 5 6 NaN e NaN 12.0 d2006
Hopefully someone has a quick trick they can share.
Thanks.
This gets what you want but there should be a more elegant way:
df_list = [d2005, d2006]
name_list = ['2005', '2006']
for df, name in zip(df_list, name_list):
    df['from'] = name
Then
d_combined = d2005.append(d2006, ignore_index=True)
d_combined
A B C D F G from
0 1 2 3.0 NaN NaN 7.0 2005
1 2 4 5.0 NaN NaN 8.0 2005
2 3 5 7.0 NaN NaN 9.0 2005
3 4 6 8.0 NaN NaN 10.0 2005
4 2 3 NaN a 7.0 NaN 2006
5 1 1 NaN c 8.0 NaN 2006
6 4 5 NaN d 10.0 NaN 2006
7 5 6 NaN e 12.0 NaN 2006
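A variant that avoids mutating the source frames is pd.concat with the keys parameter, which labels each source in the index (a sketch; note that DataFrame.append was removed in pandas 2.0, so this is also the more future-proof spelling):
d_combined = (pd.concat([d2005, d2006], keys=['d2005', 'd2006'], sort=True)
                .rename_axis(['From', None])   #name the outer index level
                .reset_index(level='From')     #move the labels into a column
                .reset_index(drop=True))
The From column lands on the left here; reorder the columns afterwards if you want it on the right.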
Alternatively, you can set df.name at the time of creation of the df and use it in the for loop.
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]} )
d2005.name = 2005
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2006.name = 2006
df_list = [d2005, d2006]
for df in df_list:
    df['from'] = df.name
I believe this can be achieved simply by adding the From column to the original dataframes themselves.
So effectively,
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
Then,
d2005['From'] = 'd2005'
d2006['From'] = 'd2006'
And then you append:
d_combined = d2005.append(d2006, ignore_index = True, sort = True)
gives you something like this:
A B C D F From G
0 1 2 3.0 NaN NaN d2005 7.0
1 2 4 5.0 NaN NaN d2005 8.0
2 3 5 7.0 NaN NaN d2005 9.0
3 4 6 8.0 NaN NaN d2005 10.0
4 2 3 NaN a 7.0 d2006 NaN
5 1 1 NaN c 8.0 d2006 NaN
6 4 5 NaN d 10.0 d2006 NaN
7 5 6 NaN e 12.0 d2006 NaN
