Concatenate two dataframes of different sizes (pandas) - python-3.x

I have two dataframes with unique ids. They share some columns but not all. I need to create a combined dataframe that keeps the rows of the first dataframe and adds the rows for ids that appear only in the second dataframe. I tried merge and concat with no luck. It's probably too late and my brain stopped working. I will appreciate your help!
df1 = pd.DataFrame({
'id': ['a','b','c','d','f','g','h','j','k','l','m'],
'metric1': [123,22,356,412,54,634,72,812,129,110,200],
'metric2':[1,2,3,4,5,6,7,8,9,10,11]
})
df2 = pd.DataFrame({
'id': ['a','b','c','d','f','g','h','q','z','w'],
'metric1': [123,22,356,412,54,634,72,812,129,110]
})
The result should look like this:
id metric1 metric2
0 a 123 1.0
1 b 22 2.0
2 c 356 3.0
3 d 412 4.0
4 f 54 5.0
5 g 634 6.0
6 h 72 7.0
7 j 812 8.0
8 k 129 9.0
9 l 110 10.0
10 m 200 11.0
11 q 812 NaN
12 z 129 NaN
13 w 110 NaN

In this case, use combine_first:
df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Output:
id metric1 metric2
0 a 123.0 1.0
1 b 22.0 2.0
2 c 356.0 3.0
3 d 412.0 4.0
4 f 54.0 5.0
5 g 634.0 6.0
6 h 72.0 7.0
7 j 812.0 8.0
8 k 129.0 9.0
9 l 110.0 10.0
10 m 200.0 11.0
11 q 812.0 NaN
12 w 110.0 NaN
13 z 129.0 NaN
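An outer merge on the shared columns gives the same combined frame; a minimal sketch, assuming metric1 agrees for the ids the two frames share (as it does here):
# keep every row from both frames; ids only in df2 (q, z, w) get NaN for metric2
combined = pd.merge(df1, df2, on=['id', 'metric1'], how='outer')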

Related

Replace only leading NaN values in Pandas dataframe

I have a dataframe of time series data in which data reporting starts at different times (columns) for different observation units (rows). Prior to the first reported datapoint for each unit, the dataframe contains NaN values, e.g.
0 1 2 3 4 ...
A NaN NaN 4 5 6 ...
B NaN 7 8 NaN 10...
C NaN 2 11 24 17...
I want to replace the leading (left-side) NaN values with 0, but only the leading ones (i.e. leaving the internal missing ones as NaN). So the result on the example above would be:
0 1 2 3 4 ...
A 0 0 4 5 6 ...
B 0 7 8 NaN 10...
C 0 2 11 24 17...
(Note the retained NaN for row B col 3)
I could iterate through the dataframe row-by-row, identify the first index of a non-NaN value in each row, and replace everything left of that with 0. But is there a way to do this as a whole-array operation?
Use notna + cumsum along rows; the cells where the cumulative count is still zero are the leading NaNs:
df[df.notna().cumsum(1) == 0] = 0
df
0 1 2 3 4
A 0.0 0.0 4 5.0 6
B 0.0 7.0 8 NaN 10
C 0.0 2.0 11 24.0 17
Here is another way, using cumprod() and apply():
# count the leading NaNs in each row (cumprod stays 1 only while every cell so far is NaN)
s = df.isna().cumprod(axis=1).sum(axis=1)
# fill at most that many NaNs per row, so internal NaNs survive
# (fillna requires limit > 0, so this assumes every row has at least one leading NaN)
df.apply(lambda x: x.fillna(0, limit=s.loc[x.name]), axis=1)
Output:
0 1 2 3 4
A 0.0 0.0 4.0 5.0 6.0
B 0.0 7.0 8.0 NaN 10.0
C 0.0 2.0 11.0 24.0 17.0
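For reference, a self-contained version of the mask approach (a sketch assuming the sample frame above; the trailing "..." columns are omitted):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, np.nan, 4, 5, 6],
     [np.nan, 7, 8, np.nan, 10],
     [np.nan, 2, 11, 24, 17]],
    index=list('ABC'),
)
df[df.notna().cumsum(1) == 0] = 0  # zero out only the leading NaNs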

How to read data from excel and concatenate columns vertically?

I'm reading this data from an excel file:
a b
0 x y x y
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
4 0 1 2 3
5 0 1 2 3
For each of the a and b categories (a.k.a. samples), there are two columns of x and y values. I want to convert this excel data into a dataframe that looks like this (concatenating the data from samples a and b vertically):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)  # offsets of the even columns, one per sample block
sample_df = pd.DataFrame()  # create an empty accumulator DataFrame
for i in x:  # loop over the sample blocks in the excel data
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But, this is the Output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
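The misalignment happens because read_excel deduplicates the repeated headers: the second block comes back as x.1 and y.1, so the chunks no longer share column names and concat places them side by side. One way to fix it (a sketch, assuming xls2 is the ExcelFile from the question) is to normalize each chunk's column names before concatenating:
x = np.arange(0, 4, 2)
sample_df = pd.DataFrame()
for i in x:
    sample = pd.read_excel(xls2, usecols=[i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i+1], nrows=5, header=1)
    values_df.columns = ['x', 'y']  # undo the 'x.1'/'y.1' mangling so the chunks align
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)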

add row to dataframe pandas

I want to add a median row to the top. Based on this Stack Overflow answer, I do the following:
pd.concat([df.median(),df],axis=0, ignore_index=True)
Shape of DF: 50000 x 226
Shape expected: 50001 x 226
Shape of modified DF: 500213 x 227 ???
What am I doing wrong? I am unable to understand what is going on.
Maybe what you want is this:
dfn = pd.concat([df.median().to_frame().T, df], ignore_index=True)
Create some sample data:
df = pd.DataFrame(np.arange(20).reshape(4,5), columns= list('ABCDE'))
dfn = pd.concat([df.median().to_frame().T, df])
df
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df.median().to_frame().T
A B C D E
0 7.5 8.5 9.5 10.5 11.5
dfn
A B C D E
0 7.5 8.5 9.5 10.5 11.5
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df.median() is a Series with a row index of A, B, C, D, E, so when you concat df.median() with df, the result is this:
pd.concat([df.median(),df], axis=0)
0 A B C D E
A 7.5 NaN NaN NaN NaN NaN
B 8.5 NaN NaN NaN NaN NaN
C 9.5 NaN NaN NaN NaN NaN
D 10.5 NaN NaN NaN NaN NaN
E 11.5 NaN NaN NaN NaN NaN
0 NaN 0.0 1.0 2.0 3.0 4.0
1 NaN 5.0 6.0 7.0 8.0 9.0
2 NaN 10.0 11.0 12.0 13.0 14.0
3 NaN 15.0 16.0 17.0 18.0 19.0
pd.concat([df.median(),df],axis=0, ignore_index=True)
This code creates a row for you, but df.median() is a Series, not a DataFrame. Convert the Series to a single-row DataFrame by appending
.to_frame().T
and then your code becomes:
pd.concat([df.median().to_frame().T, df], axis=0, ignore_index=True)
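Applied to the sample frame above, this yields a fresh 0..4 index instead of the duplicated 0 label:
A B C D E
0 7.5 8.5 9.5 10.5 11.5
1 0.0 1.0 2.0 3.0 4.0
2 5.0 6.0 7.0 8.0 9.0
3 10.0 11.0 12.0 13.0 14.0
4 15.0 16.0 17.0 18.0 19.0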

How to get the last row with null value

I have a table:
a b c
1 11 21
2 12 22
3 3 3
NaN 14 24
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
I want to remove all the rows in column a before the last row with the null value. The output that I want is:
a b c
NaN 15 NaN
4 4 4
5 15 25
6 6 6
7 17 27
The only relevant tools I could find were first_valid_index and last_valid_index, but I don't think they are what I need here.
BONUS
I also want to add a new column to the dataframe: when all the values in a row are the same, the new column takes that value, and the rows that follow carry it forward:
new a b c
NaN NaN 15 NaN
4 4 4 4
4 5 15 25
6 6 6 6
6 7 17 27
Thank you!
Use isna with idxmax:
# idxmax returns the position of the first NaN; +1 lands on the last NaN here
# because this data has exactly two consecutive NaN rows
new_df = df.iloc[df["a"].isna().idxmax()+1:]
Output:
a b c
4 NaN 15 NaN
5 4.0 4 4.0
6 5.0 15 25.0
7 6.0 6 6.0
8 7.0 17 27.0
Then use pandas.Series.where with nunique (nunique skips NaN, so a row counts as uniform when all of its non-NaN values match):
new_df["new"] = new_df["a"].where(new_df.nunique(axis=1).eq(1)).ffill()
print(new_df)
Final output:
a b c new
4 NaN 15 NaN NaN
5 4.0 4 4.0 4.0
6 5.0 15 25.0 4.0
7 6.0 6 6.0 6.0
8 7.0 17 27.0 6.0
Find the rows that contain a NaN:
nanrows = df['a'].isnull()
Find the index of the last of them:
nanmax = df[nanrows].index.max()
Then slice from that index (with the default RangeIndex, the label doubles as a position):
df.iloc[nanmax:]
# a b c
#4 NaN 15 NaN
#5 4.0 4 4.0
#6 5.0 15 25.0
#7 6.0 6 6.0
#8 7.0 17 27.0
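For reference, the sample frame from the question can be built like this (a sketch matching the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, np.nan, np.nan, 4, 5, 6, 7],
    'b': [11, 12, 3, 14, 15, 4, 15, 6, 17],
    'c': [21, 22, 3, 24, np.nan, 4, 25, 6, 27],
})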

Create Column with different conditions

I have the following extract from a larger Data Frame:
df = pd.DataFrame({'Tipe': ['DM','DM','DM','DS','DS','DI','DI','DM','DI','DS','DM','DM','DM','DM','DI','DM','DM','DS','DS','DS','DM']})
The objective is to create a column "D" where, for rows with index greater than x, the program must look at the last n values of the column "Tipe" and count only those that equal 'DM'.
for example, if "x" and "n" were 5, I expect something like this:
Tipe D
0 DM NaN
1 DM NaN
2 DM NaN
3 DS NaN
4 DS NaN
5 DI NaN
6 DI 2.0
7 DM 1.0
8 DI 1.0
9 DS 1.0
10 DM 1.0
11 DM 2.0
12 DM 3.0
13 DM 3.0
14 DI 4.0
15 DM 4.0
16 DM 4.0
17 DS 4.0
18 DS 3.0
19 DS 2.0
20 DM 2.0
I tried with .tail, but it counts the 'DM' values throughout the entire column, not just the ones in the last n values.
Use pandas.Series.shift and rolling with where; shift excludes the current row, so each value counts the 'DM' occurrences among the previous n rows (note the column is named "Tipe" in your frame):
x = 5
n = 5
s = df["Tipe"].eq("DM").shift().rolling(n).sum()
df["D"] = s.where(s.index > x)
Output:
Tipe D
0 DM NaN
1 DM NaN
2 DM NaN
3 DS NaN
4 DS NaN
5 DI NaN
6 DI 2.0
7 DM 1.0
8 DI 1.0
9 DS 1.0
10 DM 1.0
11 DM 2.0
12 DM 3.0
13 DM 3.0
14 DI 4.0
15 DM 4.0
16 DM 4.0
17 DS 4.0
18 DS 3.0
19 DS 2.0
20 DM 2.0
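Step by step, the chain does the following (a sketch on the same frame; mask and prev are just intermediate names for illustration):
mask = df["Tipe"].eq("DM")       # True where the row is 'DM'
prev = mask.shift()              # shift down so the current row is excluded
s = prev.rolling(n).sum()        # 'DM' count among the previous n rows
df["D"] = s.where(s.index > x)   # keep values only past index x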
