I have the following extract from a larger DataFrame:
df = pd.DataFrame({'Type': ['DM','DM','DM','DS','DS','DI','DI','DM','DI','DS','DM','DM','DM','DM','DI','DM','DM','DS','DS','DS','DM']})
The objective is to create a column "D" where, for every index greater than x, the program looks at the previous n values of the column "Type" and counts how many of them are 'DM'.
For example, if x and n were both 5, I expect something like this:
Type D
0 DM NaN
1 DM NaN
2 DM NaN
3 DS NaN
4 DS NaN
5 DI NaN
6 DI 2.0
7 DM 1.0
8 DI 1.0
9 DS 1.0
10 DM 1.0
11 DM 2.0
12 DM 3.0
13 DM 3.0
14 DI 4.0
15 DM 4.0
16 DM 4.0
17 DS 4.0
18 DS 3.0
19 DS 2.0
20 DM 2.0
I tried with ".tail", but it counts the 'DM' values throughout the entire column, not just the ones in the last n values.
Use pandas.Series.shift and rolling with Series.where. The shift() excludes the current row, so each rolling window covers the previous n rows:
x = 5
n = 5
# boolean mask of 'DM', shifted so the current row is excluded,
# then summed over a rolling window of the previous n rows
s = df["Type"].eq("DM").shift().rolling(n).sum()
# keep the count only where the index exceeds x
df["D"] = s.where(s.index > x)
Output:
Type D
0 DM NaN
1 DM NaN
2 DM NaN
3 DS NaN
4 DS NaN
5 DI NaN
6 DI 2.0
7 DM 1.0
8 DI 1.0
9 DS 1.0
10 DM 1.0
11 DM 2.0
12 DM 3.0
13 DM 3.0
14 DI 4.0
15 DM 4.0
16 DM 4.0
17 DS 4.0
18 DS 3.0
19 DS 2.0
20 DM 2.0
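If the current row should count toward the window as well, drop the shift(); a variant sketch (note this no longer matches the expected output above, which excludes the current row):
s = df["Type"].eq("DM").rolling(n).sum()  # window includes the current row
df["D"] = s.where(s.index > x)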
I'm reading this data from an Excel file:
      a       b
0   x   y   x   y
1   0   1   2   3
2   0   1   2   3
3   0   1   2   3
4   0   1   2   3
5   0   1   2   3
For each of the a and b categories (a.k.a. samples), there are two columns of x and y values. I want to convert this Excel data into a dataframe that looks like this (concatenating the data from samples a and b vertically):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)      # selects the even columns (0 and 2), one per sample
sample_df = pd.DataFrame()  # empty dataframe to collect the results
for i in x:                 # loop over the sample blocks in the Excel data
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
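The misalignment happens because the full header row is x, y, x, y, so pandas deduplicates the second pair into x.1 and y.1, and concat then sees four distinct columns. Renaming the columns before concatenating should fix it; a minimal sketch, assuming xls2 is the Excel file from the code above:
x = np.arange(0, 4, 2)
sample_df = pd.DataFrame()
for i in x:
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.columns = ['x', 'y']  # undo the x.1/y.1 deduplication
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)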
I have a table:
a b c
1 11 21
2 12 22
3 3 3
NaN 14 24
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
I want to remove all the rows before the last row that has a null value in column a. The output that I want is:
a b c
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
I couldn't find a better solution for this than first_valid_index and last_valid_index, and I don't think I really need those.
BONUS
I also want to add a new column to the dataframe: when all the values in a row are the same, the new column gets that value, and the following rows inherit it until the next such row:
new a b c
NaN NaN 15 NaN
4 4 4 4
4 5 15 25
6 6 6 6
6 7 17 27
Thank you!
Use isna with idxmax (taking a copy so that the column assignment below doesn't raise a SettingWithCopyWarning):
new_df = df.iloc[df["a"].isna().idxmax() + 1:].copy()
Output:
a b c
4 NaN 15 NaN
5 4.0 4 4.0
6 5.0 15 25.0
7 6.0 6 6.0
8 7.0 17 27.0
Then use pandas.Series.where with nunique:
# rows whose values are all equal have exactly one unique value;
# take column a there and forward-fill everywhere else
new_df["new"] = new_df["a"].where(new_df.nunique(axis=1).eq(1)).ffill()
print(new_df)
Final output:
a b c new
4 NaN 15 NaN NaN
5 4.0 4 4.0 4.0
6 5.0 15 25.0 4.0
7 6.0 6 6.0 6.0
8 7.0 17 27.0 6.0
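Note that isna().idxmax() returns the position of the first NaN; it lands on the right slice here only because the NaNs in column a are contiguous. If they might not be, scanning in reverse finds the last NaN directly; a sketch, assuming the default RangeIndex:
last_nan = df["a"].isna()[::-1].idxmax()  # label of the last NaN in column a
new_df = df.loc[last_nan:].copy()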
Find the rows where column a contains a NaN:
nanrows = df['a'].isnull()
Find the index of the last of them:
nanmax = df[nanrows].index.max()
Slice from there (with the default RangeIndex, labels and positions coincide, so iloc works):
df.iloc[nanmax:]
# a b c
#4 NaN 15 NaN
#5 4.0 4 4.0
#6 5.0 15 25.0
#7 6.0 6 6.0
#8 7.0 17 27.0
Given a dataframe as follows:
date city gdp gdp1 gdp2 gross domestic product pop pop1 pop2
0 2001-03 bj 3.0 NaN NaN NaN 7.0 NaN NaN
1 2001-06 bj 5.0 NaN NaN NaN 6.0 6.0 NaN
2 2001-09 bj 8.0 NaN NaN 8.0 4.0 4.0 NaN
3 2001-12 bj 7.0 NaN 7.0 NaN 2.0 NaN 2.0
4 2001-03 sh 4.0 4.0 NaN NaN 3.0 NaN NaN
5 2001-06 sh 5.0 NaN NaN 5.0 5.0 5.0 NaN
6 2001-09 sh 9.0 NaN NaN NaN 4.0 4.0 NaN
7 2001-12 sh 3.0 3.0 NaN NaN 6.0 NaN 6.0
I want to replace the NaNs in gdp with values from gdp1, gdp2 and gross domestic product, and the NaNs in pop with values from pop1 and pop2, keeping only these columns:
date city gdp pop
0 2001-03 bj 3 7
1 2001-06 bj 5 6
2 2001-09 bj 8 4
3 2001-12 bj 7 2
4 2001-03 sh 4 3
5 2001-06 sh 5 5
6 2001-09 sh 9 4
7 2001-12 sh 3 6
The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns?
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
df = df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)
The idea is to back-fill the missing values row-wise across the columns selected by DataFrame.filter. When a row has several candidate values, .bfill(axis=1).iloc[:, 0] prioritizes the columns on the left; switching to .ffill(axis=1).iloc[:, -1] prioritizes the columns on the right. (Note that filter(like='gdp') does not match the 'gross domestic product' column; use the explicit lists below if that column must be considered.)
# if the first matching column is gdp / pop
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']
# if the first matching column can be any column
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
If at most one non-missing value per row is possible, max or min works as well:
df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
If you need to specify the column names by list, include the target column first so that its existing values win:
gdp_c = ['gdp', 'gdp1', 'gdp2', 'gross domestic product']
pop_c = ['pop', 'pop1', 'pop2']
df['gdp'] = df[gdp_c].bfill(axis=1).iloc[:, 0]
df['pop'] = df[pop_c].bfill(axis=1).iloc[:, 0]
df = df[['date','city','gdp','pop']]
print (df)
date city gdp pop
0 2001-03 bj 3.0 7.0
1 2001-06 bj 5.0 6.0
2 2001-09 bj 8.0 4.0
3 2001-12 bj 7.0 2.0
4 2001-03 sh 4.0 3.0
5 2001-06 sh 5.0 5.0
6 2001-09 sh 9.0 4.0
7 2001-12 sh 3.0 6.0
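Since the question mentions many similar columns, the same pattern can be driven by a mapping from each target column to its fallback columns; a sketch using the column names above:
fallbacks = {
    'gdp': ['gdp1', 'gdp2', 'gross domestic product'],
    'pop': ['pop1', 'pop2'],
}
for col, extras in fallbacks.items():
    # back-fill row-wise; the target column comes first, so its values win
    df[col] = df[[col, *extras]].bfill(axis=1).iloc[:, 0]
df = df.drop(columns=[c for extras in fallbacks.values() for c in extras])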
I have two dataframes with unique ids. They share some columns, but not all. I need to create a combined dataframe that also includes rows for the ids that exist only in the second dataframe. I tried merge and concat with no luck. It's probably too late and my brain has stopped working. I will appreciate your help!
df1 = pd.DataFrame({
    'id': ['a','b','c','d','f','g','h','j','k','l','m'],
    'metric1': [123,22,356,412,54,634,72,812,129,110,200],
    'metric2': [1,2,3,4,5,6,7,8,9,10,11]
})
df2 = pd.DataFrame({
    'id': ['a','b','c','d','f','g','h','q','z','w'],
    'metric1': [123,22,356,412,54,634,72,812,129,110]
})
The result should look like this:
id metric1 metric2
0 a 123 1.0
1 b 22 2.0
2 c 356 3.0
3 d 412 4.0
4 f 54 5.0
5 g 634 6.0
6 h 72 7.0
7 j 812 8.0
8 k 129 9.0
9 l 110 10.0
10 m 200 11.0
11 q 812 NaN
12 z 129 NaN
13 w 110 NaN
In this case, use combine_first, which aligns the two frames on id and fills df1's gaps from df2 (note that the result comes back sorted by id):
df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Output:
id metric1 metric2
0 a 123.0 1.0
1 b 22.0 2.0
2 c 356.0 3.0
3 d 412.0 4.0
4 f 54.0 5.0
5 g 634.0 6.0
6 h 72.0 7.0
7 j 812.0 8.0
8 k 129.0 9.0
9 l 110.0 10.0
10 m 200.0 11.0
11 q 812.0 NaN
12 w 110.0 NaN
13 z 129.0 NaN
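combine_first also upcasts metric1 to float because of the index alignment. If keeping the original dtype and row order matters, an outer merge on the shared columns is an alternative; a sketch, assuming matching ids always carry the same metric1:
combined = df1.merge(df2, on=['id', 'metric1'], how='outer')
# 14 rows: df1's rows first, then q, z, w with NaN in metric2;
# metric1 stays integer because it is a join key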
I have data in the following form:
A B C
1 2 3
2 5 6
7 8 9
I want to change the dataframe into:
A  B  C
   2  3
1  5  6
2  8  9
7
One way would be to add a blank row to the dataframe and then use shift:
# input df:
A B C
0 1 2 3
1 2 5 6
2 7 8 9
df.loc[len(df.index), :] = None  # append a blank (all-NaN) row at the bottom
df['A'] = df.A.shift(1)          # shift column A down by one
print (df)
A B C
0 NaN 2.0 3.0
1 1.0 5.0 6.0
2 2.0 8.0 9.0
3 7.0 NaN NaN
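Starting from the input df, reindex can append the blank row in a single step; a sketch that produces the same result:
df = df.reindex(range(len(df) + 1))  # adds row 3, filled with NaN
df['A'] = df['A'].shift(1)           # then shift A down by one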