I have the following extract from a larger DataFrame:
df = pd.DataFrame({'Type': ['DM','DM','DM','DS','DS','DI','DI','DM','DI','DS','DM','DM','DM','DM','DI','DM','DM','DS','DS','DS','DM']})
The objective is to create a column "D" where, for every index greater than x, the program looks at the previous n values of the column "Type" and counts how many of them are 'DM'.
For example, if x and n were both 5, I expect something like this:
Type D
0 DM NaN
1 DM NaN
2 DM NaN
3 DS NaN
4 DS NaN
5 DI NaN
6 DI 2.0
7 DM 1.0
8 DI 1.0
9 DS 1.0
10 DM 1.0
11 DM 2.0
12 DM 3.0
13 DM 3.0
14 DI 4.0
15 DM 4.0
16 DM 4.0
17 DS 4.0
18 DS 3.0
19 DS 2.0
20 DM 2.0
I tried with ".tail", but it counts the 'DM' values throughout the entire column, not just the ones in the last n values.
Use pandas.Series.shift and rolling with Series.where. The shift() excludes the current row, so each rolling window covers the previous n rows:
x = 5
n = 5
# boolean mask of 'DM', shifted so the current row is excluded,
# then summed over a rolling window of the previous n rows
s = df["Type"].eq("DM").shift().rolling(n).sum()
# keep the count only where the index exceeds x
df["D"] = s.where(s.index > x)
Output:
Type D
0 DM NaN
1 DM NaN
2 DM NaN
3 DS NaN
4 DS NaN
5 DI NaN
6 DI 2.0
7 DM 1.0
8 DI 1.0
9 DS 1.0
10 DM 1.0
11 DM 2.0
12 DM 3.0
13 DM 3.0
14 DI 4.0
15 DM 4.0
16 DM 4.0
17 DS 4.0
18 DS 3.0
19 DS 2.0
20 DM 2.0
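If the current row should count toward the window as well, drop the shift(); a variant sketch (note this no longer matches the expected output above, which excludes the current row):
s = df["Type"].eq("DM").rolling(n).sum()  # window includes the current row
df["D"] = s.where(s.index > x)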
I'm reading this data from an Excel file:
      a       b
0   x   y   x   y
1   0   1   2   3
2   0   1   2   3
3   0   1   2   3
4   0   1   2   3
5   0   1   2   3
For each of the a and b categories (a.k.a. samples), there are two columns of x and y values. I want to convert this Excel data into a dataframe that looks like this (concatenating the data from samples a and b vertically):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)      # selects the even columns (0 and 2), one per sample
sample_df = pd.DataFrame()  # empty dataframe to collect the results
for i in x:                 # loop over the sample blocks in the Excel data
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
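The misalignment happens because the full header row is x, y, x, y, so pandas deduplicates the second pair into x.1 and y.1, and concat then sees four distinct columns. Renaming the columns before concatenating should fix it; a minimal sketch, assuming xls2 is the Excel file from the code above:
x = np.arange(0, 4, 2)
sample_df = pd.DataFrame()
for i in x:
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.columns = ['x', 'y']  # undo the x.1/y.1 deduplication
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)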
I have a table:
a b c
1 11 21
2 12 22
3 3 3
NaN 14 24
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
I want to remove all the rows before the last row that has a null value in column a. The output that I want is:
a b c
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
I couldn't find a better solution for this than first_valid_index and last_valid_index, and I don't think I really need those.
BONUS
I also want to add a new column to the dataframe: when all the values in a row are the same, the new column gets that value, and the following rows inherit it until the next such row:
new a b c
NaN NaN 15 NaN
4 4 4 4
4 5 15 25
6 6 6 6
6 7 17 27
Thank you!
Use isna with idxmax (taking a copy so that the column assignment below doesn't raise a SettingWithCopyWarning):
new_df = df.iloc[df["a"].isna().idxmax() + 1:].copy()
Output:
a b c
4 NaN 15 NaN
5 4.0 4 4.0
6 5.0 15 25.0
7 6.0 6 6.0
8 7.0 17 27.0
Then use pandas.Series.where with nunique:
# rows whose values are all equal have exactly one unique value;
# take column a there and forward-fill everywhere else
new_df["new"] = new_df["a"].where(new_df.nunique(axis=1).eq(1)).ffill()
print(new_df)
Final output:
a b c new
4 NaN 15 NaN NaN
5 4.0 4 4.0 4.0
6 5.0 15 25.0 4.0
7 6.0 6 6.0 6.0
8 7.0 17 27.0 6.0
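Note that isna().idxmax() returns the position of the first NaN; it lands on the right slice here only because the NaNs in column a are contiguous. If they might not be, scanning in reverse finds the last NaN directly; a sketch, assuming the default RangeIndex:
last_nan = df["a"].isna()[::-1].idxmax()  # label of the last NaN in column a
new_df = df.loc[last_nan:].copy()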
Find the rows where column a contains a NaN:
nanrows = df['a'].isnull()
Find the index of the last of them:
nanmax = df[nanrows].index.max()
Slice from there (with the default RangeIndex, labels and positions coincide, so iloc works):
df.iloc[nanmax:]
# a b c
#4 NaN 15 NaN
#5 4.0 4 4.0
#6 5.0 15 25.0
#7 6.0 6 6.0
#8 7.0 17 27.0
Given a dataframe as follows:
date city gdp gdp1 gdp2 gross domestic product pop pop1 pop2
0 2001-03 bj 3.0 NaN NaN NaN 7.0 NaN NaN
1 2001-06 bj 5.0 NaN NaN NaN 6.0 6.0 NaN
2 2001-09 bj 8.0 NaN NaN 8.0 4.0 4.0 NaN
3 2001-12 bj 7.0 NaN 7.0 NaN 2.0 NaN 2.0
4 2001-03 sh 4.0 4.0 NaN NaN 3.0 NaN NaN
5 2001-06 sh 5.0 NaN NaN 5.0 5.0 5.0 NaN
6 2001-09 sh 9.0 NaN NaN NaN 4.0 4.0 NaN
7 2001-12 sh 3.0 3.0 NaN NaN 6.0 NaN 6.0
I want to replace the NaNs in gdp with values from gdp1, gdp2 and gross domestic product, and the NaNs in pop with values from pop1 and pop2, keeping only these columns:
date city gdp pop
0 2001-03 bj 3 7
1 2001-06 bj 5 6
2 2001-09 bj 8 4
3 2001-12 bj 7 2
4 2001-03 sh 4 3
5 2001-06 sh 5 5
6 2001-09 sh 9 4
7 2001-12 sh 3 6
The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns?
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
df = df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)
The idea is to back-fill the missing values row-wise across the columns selected by DataFrame.filter. When a row has several candidate values, .bfill(axis=1).iloc[:, 0] prioritizes the columns on the left; switching to .ffill(axis=1).iloc[:, -1] prioritizes the columns on the right. (Note that filter(like='gdp') does not match the 'gross domestic product' column; use the explicit lists below if that column must be considered.)
# if the first matching column is gdp / pop
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']
# if the first matching column can be any column
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
If at most one non-missing value per row is possible, max or min works as well:
df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
If you need to specify the column names by list, include the target column first so that its existing values win:
gdp_c = ['gdp', 'gdp1', 'gdp2', 'gross domestic product']
pop_c = ['pop', 'pop1', 'pop2']
df['gdp'] = df[gdp_c].bfill(axis=1).iloc[:, 0]
df['pop'] = df[pop_c].bfill(axis=1).iloc[:, 0]
df = df[['date','city','gdp','pop']]
print (df)
date city gdp pop
0 2001-03 bj 3.0 7.0
1 2001-06 bj 5.0 6.0
2 2001-09 bj 8.0 4.0
3 2001-12 bj 7.0 2.0
4 2001-03 sh 4.0 3.0
5 2001-06 sh 5.0 5.0
6 2001-09 sh 9.0 4.0
7 2001-12 sh 3.0 6.0
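Since the question mentions many similar columns, the same pattern can be driven by a mapping from each target column to its fallback columns; a sketch using the column names above:
fallbacks = {
    'gdp': ['gdp1', 'gdp2', 'gross domestic product'],
    'pop': ['pop1', 'pop2'],
}
for col, extras in fallbacks.items():
    # back-fill row-wise; the target column comes first, so its values win
    df[col] = df[[col, *extras]].bfill(axis=1).iloc[:, 0]
df = df.drop(columns=[c for extras in fallbacks.values() for c in extras])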
I have two dataframes with unique ids. They share some columns, but not all. I need to create a combined dataframe that also includes rows for the ids that exist only in the second dataframe. I tried merge and concat with no luck. It's probably too late and my brain has stopped working. I will appreciate your help!
df1 = pd.DataFrame({
    'id': ['a','b','c','d','f','g','h','j','k','l','m'],
    'metric1': [123,22,356,412,54,634,72,812,129,110,200],
    'metric2': [1,2,3,4,5,6,7,8,9,10,11]
})
df2 = pd.DataFrame({
    'id': ['a','b','c','d','f','g','h','q','z','w'],
    'metric1': [123,22,356,412,54,634,72,812,129,110]
})
The result should look like this:
id metric1 metric2
0 a 123 1.0
1 b 22 2.0
2 c 356 3.0
3 d 412 4.0
4 f 54 5.0
5 g 634 6.0
6 h 72 7.0
7 j 812 8.0
8 k 129 9.0
9 l 110 10.0
10 m 200 11.0
11 q 812 NaN
12 z 129 NaN
13 w 110 NaN
In this case, use combine_first, which aligns the two frames on id and fills df1's gaps from df2 (note that the result comes back sorted by id):
df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Output:
id metric1 metric2
0 a 123.0 1.0
1 b 22.0 2.0
2 c 356.0 3.0
3 d 412.0 4.0
4 f 54.0 5.0
5 g 634.0 6.0
6 h 72.0 7.0
7 j 812.0 8.0
8 k 129.0 9.0
9 l 110.0 10.0
10 m 200.0 11.0
11 q 812.0 NaN
12 w 110.0 NaN
13 z 129.0 NaN
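combine_first also upcasts metric1 to float because of the index alignment. If keeping the original dtype and row order matters, an outer merge on the shared columns is an alternative; a sketch, assuming matching ids always carry the same metric1:
combined = df1.merge(df2, on=['id', 'metric1'], how='outer')
# 14 rows: df1's rows first, then q, z, w with NaN in metric2;
# metric1 stays integer because it is a join key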
I have data in the following form:
A B C
1 2 3
2 5 6
7 8 9
I want to change the dataframe into:
A  B  C
   2  3
1  5  6
2  8  9
7
One way would be to add a blank row to the dataframe and then use shift:
# input df:
A B C
0 1 2 3
1 2 5 6
2 7 8 9
df.loc[len(df.index), :] = None  # append a blank (all-NaN) row at the bottom
df['A'] = df.A.shift(1)          # shift column A down by one
print (df)
A B C
0 NaN 2.0 3.0
1 1.0 5.0 6.0
2 2.0 8.0 9.0
3 7.0 NaN NaN
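Starting from the input df, reindex can append the blank row in a single step; a sketch that produces the same result:
df = df.reindex(range(len(df) + 1))  # adds row 3, filled with NaN
df['A'] = df['A'].shift(1)           # then shift A down by one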