pandas groupby after nan value - python-3.x

I want to collect the rows that follow a NaN value in var1, up to the next NaN value, and group them under the var2 value of that NaN row. How can I do that?
The table below is just head(20):
var1 var2
2 NaN ADIYAMAN ÜNİVERSİTESİ (Devlet Üniversitesi)
3 NaN Besni Ali Erdemoğlu Meslek Yüksekokulu
4 100290102 Bankacılık ve Sigortacılık
5 100290109 Bilgi Yönetimi
6 100290116 Bilgisayar Programcılığı
7 100290123 Büro Yönetimi ve Yönetici Asistanlığı
8 100290130 İşletme Yönetimi
9 100290137 Mekatronik
10 100290144 Muhasebe ve Vergi Uygulamaları
11 NaN Gölbaşı Meslek Yüksekokulu
12 100290070 Bankacılık ve Sigortacılık
13 100250476 Bilgisayar Programcılığı
14 100250591 Büro Yönetimi ve Yönetici Asistanlığı
15 100290056 İş Sağlığı ve Güvenliği
16 100250767 Lojistik
17 100250555 Yerel Yönetimler
18 NaN Kahta Meslek Yüksekokulu
19 100250713 Bahçe Tarımı
20 100250652 Bankacılık ve Sigortacılık
21 100250485 Bilgisayar Programcılığı
.
df["var1"].isnull().sum()
var1 1185

Are you looking to select all var2 values where var1 is not null? In that case, you'd need:
df[df['var1'].notnull()]['var2']

df
###
value1 value2
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
4 5.0 5.0
5 NaN NaN
6 7.0 7.0
7 NaN NaN
8 9.0 9.0
9 NaN 10.0
df.query('value1.isnull() & value2.isnull()')
###
value1 value2
5 NaN NaN
7 NaN NaN
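For the original question (grouping the rows between NaN markers under the var2 label of the marker row), one possible approach is to forward-fill the marker labels and group on them. This is only a minimal sketch on a few rows copied from the table; the column name category is made up for illustration, and when two NaN rows follow each other (university, then school) the later one wins:
```python
import numpy as np
import pandas as pd

# Small stand-in for the table in the question (values copied from its head).
df = pd.DataFrame({
    "var1": [np.nan, 100290102, 100290109, np.nan, 100290070, 100250476],
    "var2": [
        "Besni Ali Erdemoğlu Meslek Yüksekokulu",
        "Bankacılık ve Sigortacılık",
        "Bilgi Yönetimi",
        "Gölbaşı Meslek Yüksekokulu",
        "Bankacılık ve Sigortacılık",
        "Bilgisayar Programcılığı",
    ],
})

# Rows with NaN in var1 act as headers; their var2 is the group label.
is_header = df["var1"].isna()

# "category" is a hypothetical helper column: keep var2 only on header
# rows, then carry it down to the rows that follow.
df["category"] = df["var2"].where(is_header).ffill()

# Drop the header rows and collect the remaining var2 values per label.
rows = df[~is_header]
grouped = rows.groupby("category", sort=False)["var2"].agg(list)
print(grouped)
```
Any other aggregation (count, a sub-DataFrame per group via groupby iteration, and so on) can be used in place of agg(list).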

Related

Fill one column value to another one randomly selected from multiple columns in Python

Given a dataset as follows:
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN NaN NaN 15 NaN NaN NaN
1 sh 8 NaN NaN NaN 13 NaN NaN NaN
2 gz 9 NaN NaN NaN 9 NaN NaN NaN
3 sz 6 NaN NaN NaN 16 NaN NaN NaN
I would like to copy value1 into one randomly selected column among 'March', 'April', 'May', and value2 into one randomly selected column among 'Jun', 'Jul', 'Aut'.
Output desired:
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN 12.0 NaN 15 NaN 15.0 NaN
1 sh 8 8.0 NaN NaN 13 NaN NaN 13.0
2 gz 9 NaN NaN 9.0 9 NaN 9.0 NaN
3 sz 6 NaN 6.0 NaN 16 16.0 NaN NaN
How could I do that in Python? Thanks.
Here is one way: define a function that randomly picks one column position per row from the columns passed in cols, then writes the corresponding value from the value column (val_col) into that position:
import numpy as np
def fill(df, val_col, cols):
    # one random column position (0 .. len(cols)-1) for every row
    i = np.random.choice(len(cols), len(df))
    # work on a copy so the original frame's data is not modified in place
    vals = df[cols].to_numpy(copy=True)
    vals[range(len(df)), i] = list(df[val_col])
    return df.assign(**dict(zip(cols, vals.T)))
>>> df = fill(df, 'value1', ['March', 'April', 'May'])
>>> df
city value1 March April May value2 Jun Jul Aut
0 bj 12 12.0 NaN NaN 15 NaN NaN NaN
1 sh 8 NaN NaN 8.0 13 NaN NaN NaN
2 gz 9 NaN 9.0 NaN 9 NaN NaN NaN
3 sz 6 NaN 6.0 NaN 16 NaN NaN NaN
>>> df = fill(df, 'value2', ['Jun', 'Jul', 'Aut'])
>>> df
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN NaN 12.0 15 NaN NaN 15.0
1 sh 8 NaN NaN 8.0 13 13.0 NaN NaN
2 gz 9 NaN NaN 9.0 9 NaN NaN 9.0
3 sz 6 NaN 6.0 NaN 16 NaN NaN 16.0
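Because the column choice is random, the printed tables change from run to run. If reproducible output is wanted, a seeded generator can be used. The following is a sketch only; fill_seeded and its seed parameter are hypothetical names, not part of the answer above:
```python
import numpy as np
import pandas as pd

def fill_seeded(df, val_col, cols, seed=None):
    """Hypothetical variant of fill() that takes a seed for repeatable picks."""
    rng = np.random.default_rng(seed)
    # one random column position per row
    pick = rng.integers(len(cols), size=len(df))
    out = df.copy()
    vals = out[cols].to_numpy(dtype=float, copy=True)
    vals[np.arange(len(df)), pick] = out[val_col].to_numpy()
    out[cols] = vals
    return out

df = pd.DataFrame({
    "city": ["bj", "sh", "gz", "sz"],
    "value1": [12, 8, 9, 6],
    "March": np.nan, "April": np.nan, "May": np.nan,
})
months = ["March", "April", "May"]
result = fill_seeded(df, "value1", months, seed=0)
print(result)
```
Calling it twice with the same seed gives the same table; leaving seed as None behaves like the unseeded version.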

How to get the last row with null value

I have a table:
a b c
1 11 21
2 12 22
3 3 3
NaN 14 24
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
I want to remove all rows before the last row that has a null value in column a. The output that I want is:
a b c
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
The only things I found were first_valid_index and last_valid_index, and I don't think they are what I need.
BONUS
I also want to add a new column that takes the row's value when all values in that row are the same; the rows that follow should carry the same value:
new a b c
NaN NaN 15 NaN
4 4 4 4
4 5 15 25
6 6 6 6
6 7 17 27
Thank you!
Use isna with idxmax on the reversed column to get the label of the last NaN, then slice from it:
new_df = df.loc[df["a"].isna()[::-1].idxmax():].copy()
Output:
a b c
4 NaN 15 NaN
5 4.0 4 4.0
6 5.0 15 25.0
7 6.0 6 6.0
8 7.0 17 27.0
Then use pandas.Series.where with nunique:
new_df["new"] = new_df["a"].where(new_df.nunique(axis=1).eq(1)).ffill()
print(new_df)
Final output:
a b c new
4 NaN 15 NaN NaN
5 4.0 4 4.0 4.0
6 5.0 15 25.0 4.0
7 6.0 6 6.0 6.0
8 7.0 17 27.0 6.0
Find the rows that contain an NaN:
nanrows = df['a'].isnull()
Find the index of the last of them:
nanmax = df[nanrows].index.max()
Slice from there (passing the label to iloc works here because the index is the default 0..n-1 range):
df.iloc[nanmax:]
# a b c
#4 NaN 15 NaN
#5 4.0 4 4.0
#6 5.0 15 25.0
#7 6.0 6 6.0
#8 7.0 17 27.0
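If the index is not the default range, the same idea can be written purely by position. A minimal sketch, rebuilding the table from the question; nan_positions and tail are illustrative names:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, np.nan, np.nan, 4, 5, 6, 7],
    "b": [11, 12, 3, 14, 15, 4, 15, 6, 17],
    "c": [21, 22, 3, 24, np.nan, 4, 25, 6, 27],
})

# Integer positions (not labels) of the rows where column a is NaN.
nan_positions = np.flatnonzero(df["a"].isna().to_numpy())

# Keep everything from the last NaN row onward; if there is no NaN at all,
# keep the whole frame.
if nan_positions.size:
    tail = df.iloc[nan_positions[-1]:].copy()
else:
    tail = df.copy()
print(tail)
```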

Read dataframe split by nan rows and reshape them into multiple dataframes in Python

I have an example Excel file data1.xlsx from here, which has a Sheet1 as follows:
Now I want to read it with openpyxl or pandas and convert it into two new DataFrames, df1 and df2, which I will finally save as price and quantity sheets:
price sheet:
and quantity sheet
Code I have used:
df = pd.read_excel('./data1.xlsx', sheet_name = 'Sheet1')
df_list = np.split(df, df[df.isnull().all(1)].index)
for df in df_list:
    print(df, '\n')
Out:
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 year 2018.0 2019.0 2020.0 sum
1 price 12.0 4.0 5.0 21
2 quantity 5.0 5.0 3.0 13
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
3 NaN NaN NaN NaN NaN
4 sh NaN NaN NaN NaN
5 year 2018.0 2019.0 2020.0 sum
6 price 5.0 6.0 7.0 18
7 quantity 7.0 5.0 4.0 16
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
8 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
9 NaN NaN NaN NaN NaN
10 gz NaN NaN NaN NaN
11 year 2018.0 2019.0 2020.0 sum
12 price 2.0 3.0 1.0 6
13 quantity 6.0 9.0 3.0 18
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
14 NaN NaN NaN NaN NaN
bj Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
15 NaN NaN NaN NaN NaN
16 sz NaN NaN NaN NaN
17 year 2018.0 2019.0 2020.0 sum
18 price 8.0 2.0 3.0 13
19 quantity 5.0 4.0 3.0 12
How could I do that in Python? Thanks a lot.
Use:
#add header=None for default columns names
df = pd.read_excel('./data1.xlsx', sheet_name = 'Sheet1', header=None)
#convert columns by second row
df.columns = df.iloc[1].rename(None)
#create new column `city` by forward filling non missing values by second column
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())
#convert floats to integers
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]
#convert column year to index
df = df.set_index('year')
print (df)
city 2018 2019 2020 sum
year
bj bj NaN NaN NaN NaN
year bj 2018.0 2019.0 2020.0 sum
price bj 12.0 4.0 5.0 21
quantity bj 5.0 5.0 3.0 13
NaN bj NaN NaN NaN NaN
sh sh NaN NaN NaN NaN
year sh 2018.0 2019.0 2020.0 sum
price sh 5.0 6.0 7.0 18
quantity sh 7.0 5.0 4.0 16
NaN sh NaN NaN NaN NaN
NaN sh NaN NaN NaN NaN
gz gz NaN NaN NaN NaN
year gz 2018.0 2019.0 2020.0 sum
price gz 2.0 3.0 1.0 6
quantity gz 6.0 9.0 3.0 18
NaN gz NaN NaN NaN NaN
NaN gz NaN NaN NaN NaN
sz sz NaN NaN NaN NaN
year sz 2018.0 2019.0 2020.0 sum
price sz 8.0 2.0 3.0 13
quantity sz 5.0 4.0 3.0 12
df1 = df.loc['price'].reset_index(drop=True)
print (df1)
city 2018 2019 2020 sum
0 bj 12.0 4.0 5.0 21
1 sh 5.0 6.0 7.0 18
2 gz 2.0 3.0 1.0 6
3 sz 8.0 2.0 3.0 13
df2 = df.loc['quantity'].reset_index(drop=True)
print (df2)
city 2018 2019 2020 sum
0 bj 5.0 5.0 3.0 13
1 sh 7.0 5.0 4.0 16
2 gz 6.0 9.0 3.0 18
3 sz 5.0 4.0 3.0 12
Finally, the DataFrames can be written to the existing file with the mode='a' parameter of pd.ExcelWriter (append mode needs the openpyxl engine):
with pd.ExcelWriter('data1.xlsx', engine='openpyxl', mode='a') as writer:
    df1.to_excel(writer, sheet_name='price')
    df2.to_excel(writer, sheet_name='quantity')
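The np.split attempt in the question keeps the blank rows inside the pieces. If separate blocks are still wanted, one alternative is to drop the all-NaN rows and group on a running count of them. This is a sketch on a tiny made-up frame, not the real data1.xlsx; raw, block_id and blocks are illustrative names:
```python
import numpy as np
import pandas as pd

# Toy frame with one fully blank separator row (illustrative values only).
raw = pd.DataFrame({
    0: ["bj", "price", np.nan, "sh", "price"],
    1: [np.nan, 12.0, np.nan, np.nan, 5.0],
})

# True for rows where every cell is missing.
blank = raw.isna().all(axis=1)

# The running count of blank rows gives every block its own id.
block_id = blank.cumsum()

# Group only the non-blank rows by that id.
blocks = [g for _, g in raw[~blank].groupby(block_id[~blank])]
for block in blocks:
    print(block, "\n")
```
Consecutive blank rows simply bump the id twice, so no empty pieces are produced.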

Replace multiple columns' NaNs with other columns' values in Pandas

Given a dataframe as follows:
date city gdp gdp1 gdp2 gross domestic product pop pop1 pop2
0 2001-03 bj 3.0 NaN NaN NaN 7.0 NaN NaN
1 2001-06 bj 5.0 NaN NaN NaN 6.0 6.0 NaN
2 2001-09 bj 8.0 NaN NaN 8.0 4.0 4.0 NaN
3 2001-12 bj 7.0 NaN 7.0 NaN 2.0 NaN 2.0
4 2001-03 sh 4.0 4.0 NaN NaN 3.0 NaN NaN
5 2001-06 sh 5.0 NaN NaN 5.0 5.0 5.0 NaN
6 2001-09 sh 9.0 NaN NaN NaN 4.0 4.0 NaN
7 2001-12 sh 3.0 3.0 NaN NaN 6.0 NaN 6.0
I want to replace NaNs from gdp and pop with values of gdp1, gdp2, gross domestic product and pop1, pop2 respectively.
date city gdp pop
0 2001-03 bj 3 7
1 2001-06 bj 5 6
2 2001-09 bj 8 4
3 2001-12 bj 7 2
4 2001-03 sh 4 3
5 2001-06 sh 5 5
6 2001-09 sh 9 4
7 2001-12 sh 3 6
The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns?
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)
The idea is to back-fill missing values across the columns selected by DataFrame.filter. If a row has more than one non-missing value in a group, the leftmost column takes priority; changing .bfill(axis=1).iloc[:, 0] to .ffill(axis=1).iloc[:, -1] gives priority to the rightmost column instead. Note that filter(like='gdp') only matches column names containing 'gdp', so it does not pick up 'gross domestic product':
# if the first matching column is gdp / pop
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']
# if the first matching column can be any column
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
But if only one non-missing value per row is possible, you can use max, min, ...:
df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
If you need to specify the column names with a list (put the target column first so its existing values are kept):
gdp_c = ['gdp','gdp1','gdp2','gross domestic product']
pop_c = ['pop','pop1','pop2']
df['gdp'] = df[gdp_c].bfill(axis=1).iloc[:, 0]
df['pop'] = df[pop_c].bfill(axis=1).iloc[:, 0]
df = df[['date','city','gdp','pop']]
print (df)
date city gdp pop
0 2001-03 bj 3.0 7.0
1 2001-06 bj 5.0 6.0
2 2001-09 bj 8.0 4.0
3 2001-12 bj 7.0 2.0
4 2001-03 sh 4.0 3.0
5 2001-06 sh 5.0 5.0
6 2001-09 sh 9.0 4.0
7 2001-12 sh 3.0 6.0
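With many such column groups, the per-group lines can be driven from a dict. A minimal sketch on made-up values (not the table from the question); fallbacks is a hypothetical mapping from each target column to the columns it may be filled from:
```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "city": ["bj", "sh"],
    "gdp": [np.nan, 4.0],
    "gdp1": [np.nan, 1.0],
    "gdp2": [7.0, np.nan],
    "pop": [np.nan, np.nan],
    "pop1": [6.0, np.nan],
    "pop2": [np.nan, 2.0],
})

# Hypothetical mapping: target column -> fallback columns, in priority order.
fallbacks = {"gdp": ["gdp1", "gdp2"], "pop": ["pop1", "pop2"]}

for target, others in fallbacks.items():
    # Target first, so an existing value wins over any fallback.
    df[target] = df[[target, *others]].bfill(axis=1).iloc[:, 0]

# Drop the helper columns once they have been merged in.
df = df.drop(columns=[c for others in fallbacks.values() for c in others])
print(df)
```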

How to remove rows in a dataframe with more than x number of Null values? [duplicate]

I am trying to remove the rows in the data frame with more than 7 null values. Please suggest something that is efficient to achieve this.
If I understand correctly, you need to remove rows only if the total number of NaNs in a row is more than 7:
df = df[df.isnull().sum(axis=1) <= 7]
This keeps only the rows with at most 7 NaNs and removes every row with more than 7.
dropna has a thresh argument. Subtract your desired number from the number of columns.
thresh : int, optional Require that many non-NA values.
df.dropna(thresh=df.shape[1]-7, axis=0)
Sample Data:
print(df)
0 1 2 3 4 5 6 7
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN 5.0
2 6.0 7.0 8.0 9.0 NaN NaN NaN NaN
3 NaN NaN 11.0 12.0 13.0 14.0 15.0 16.0
df.dropna(thresh=df.shape[1]-7, axis=0)
0 1 2 3 4 5 6 7
1 NaN NaN NaN NaN NaN NaN NaN 5.0
2 6.0 7.0 8.0 9.0 NaN NaN NaN NaN
3 NaN NaN 11.0 12.0 13.0 14.0 15.0 16.0
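As a quick check, the boolean-mask form and the thresh form can be compared on the sample data above. This is a sketch; max_nans, by_mask and by_thresh are illustrative names:
```python
import numpy as np
import pandas as pd

# Same shape as the sample data: 4 rows, 8 columns.
df = pd.DataFrame([
    [np.nan] * 8,
    [np.nan] * 7 + [5.0],
    [6.0, 7.0, 8.0, 9.0] + [np.nan] * 4,
    [np.nan, np.nan, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
])

max_nans = 7  # rows with more NaNs than this are removed

# Count NaNs per row and keep rows at or below the limit.
by_mask = df[df.isna().sum(axis=1) <= max_nans]

# thresh counts non-NA values, so convert the NaN limit accordingly.
by_thresh = df.dropna(thresh=df.shape[1] - max_nans)

print(by_mask.equals(by_thresh))
```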
