How to merge pandas dataframes with different column names - python-3.x

Can someone please tell me how I can achieve results like the image above, but with the following differences:
import pandas as pd

# Note the column names
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
# Note the column names
df2 = pd.DataFrame(
    {
        "AA": ["A4", "A5", "A6", "A7"],
        "BB": ["B4", "B5", "B6", "B7"],
        "CC": ["C4", "C5", "C6", "C7"],
        "DD": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
# Note the column names
df3 = pd.DataFrame(
    {
        "AAA": ["A8", "A9", "A10", "A11"],
        "BBB": ["B8", "B9", "B10", "B11"],
        "CCC": ["C8", "C9", "C10", "C11"],
        "DDD": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)
Every kind of merge I do results in this:
Here's what I'm trying to accomplish:
I'm doing my Capstone Project, and the use case uses the SpaceX data set. I've web-scraped the tables found here: SpaceX Falcon 9 Wikipedia.
Now I'm trying to combine them into one large table. However, there are slight differences in the column names between the tables, so I need extra logic to merge them properly. There are 10 tables in total and I've checked 5 of them. 3 of those have unique column names, so a simple merge doesn't work.
I've searched around at the other questions, but the use case is different than mine, so I haven't found an answer that works for me.
I'd really appreciate someone's help, or pointing me where I can find more info on the subject. So far I've had no luck in my searches.

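Let us just use np.concatenate (this needs import numpy as np). Note that it discards the original indexes and reuses the column names of df1:
out = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)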
Out[346]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

IIUC, you could just modify the column names and concatenate:
df2.columns = df2.columns.str[0]
df3.columns = df3.columns.str[0]
out = pd.concat([df1, df2, df3])
or if you're into one-liners, you could do:
out = pd.concat([df1, df2.rename(columns=lambda x:x[0]), df3.rename(columns=lambda x:x[0])])
Output:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
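For the scraped tables, where the names may not differ by a simple repeated letter, another option is to relabel every frame by position before concatenating. The following is only a sketch under the assumption that all tables have the same number of columns in the same order; the shortened frames and the names frames and target are illustrative, not from the question.

```python
import pandas as pd

# Shortened stand-ins for the frames in the question.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"AA": ["A2", "A3"], "BB": ["B2", "B3"]}, index=[2, 3])
df3 = pd.DataFrame({"AAA": ["A4", "A5"], "BBB": ["B4", "B5"]}, index=[4, 5])

frames = [df1, df2, df3]
target = df1.columns  # assumption: the first table carries the names to keep

# Guard against tables whose shape does not line up positionally.
for frame in frames:
    if frame.shape[1] != len(target):
        raise ValueError("column counts differ; positional relabeling is unsafe")

# set_axis returns a relabeled copy, so the original frames stay untouched.
out = pd.concat([frame.set_axis(target, axis=1) for frame in frames])
print(out)
```

Unlike the np.concatenate version, this keeps each frame's original index. If the scraped tables have their columns in different orders, an explicit rename mapping per table is safer than relabeling by position.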

Related

Combine 2 related DataFrames into one multiple indexes dataFrame

I have 2 related DataFrames. Is there an easy way to combine them into a single multi-index DataFrame?
import pandas as pd

df = pd.DataFrame(
    [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 1]],
    columns=["c1", "c2", "c3", "c4"],
)
idx = pd.Index(["p1", "p2", "p3", "p4"])
df = df.set_index(idx)
df output is:
c1 c2 c3 c4
p1 1 0 1 0
p2 1 1 0 0
p3 1 0 0 1
p4 0 1 0 1
df2 = pd.DataFrame(
    [[0, 10, 30, 0],
     [20, 10, 0, 0],
     [0, 10, 0, 6],
     [15, 0, 18, 5]],
    columns=["c1", "c2", "c3", "c4"],
)
idx2 = pd.Index(["a1", "a2", "a3", "a4"])
df2 = df2.set_index(idx2)
df2 output is:
c1 c2 c3 c4
a1 0 10 30 0
a2 20 10 0 0
a3 0 10 0 6
a4 15 0 18 5
The final DataFrame should have a three-level index (p, c, a) and a single column (value):
          value
p1 c1 a2     20
      a4     15
   c3 a1     30
      a4     18
p2 c1 a2     20
      a4     15
   c2 a1     10
      a2     10
      a3     10
p3 c1 a2     20
      a4     15
   c4 a3      6
      a4      5
p4 c2 a1     10
      a2     10
      a3     10
   c4 a3      6
      a4      5
You can reshape and merge:
(df.reset_index().melt('index')
   .loc[lambda x: x.pop('value').eq(1)]
   .merge(df2.reset_index().melt('index').query('value != 0'),
          on='variable')
   .set_index(['index_x', 'variable', 'index_y'])
   .rename_axis([None, None, None])
)
output:
          value
p1 c1 a2     20
      a4     15
p2 c1 a2     20
      a4     15
p3 c1 a2     20
      a4     15
p2 c2 a1     10
      a2     10
      a3     10
p4 c2 a1     10
      a2     10
      a3     10
p1 c3 a1     30
      a4     18
p3 c4 a3      6
      a4      5
p4 c4 a3      6
      a4      5
If order matters:
(df.stack().reset_index()
   .loc[lambda x: x.pop(0).eq(1)]
   .set_axis(['index', 'variable'], axis=1)
   .merge(df2.reset_index().melt('index').query('value != 0'),
          on='variable', how='left')
   .set_index(['index_x', 'variable', 'index_y'])
   .rename_axis([None, None, None])
)
output:
          value
p1 c1 a2     20
      a4     15
   c3 a1     30
      a4     18
p2 c1 a2     20
      a4     15
   c2 a1     10
      a2     10
      a3     10
p3 c1 a2     20
      a4     15
   c4 a3      6
      a4      5
p4 c2 a1     10
      a2     10
      a3     10
   c4 a3      6
      a4      5

How to sort data in Excel to have the same mean?

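I have a set of data in Excel that I need to rearrange so that the column means are as close to each other as possible.
I need to redistribute the data (obviously mixing it across columns) into columns of six values each, with the means of the columns as close to each other as possible. The last row below shows the current mean of each column.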
DATA  VALUE     DATA  VALUE     DATA  VALUE
B4    9         B1    32        C1    3
A2    5         B2    5         C2    2
B3    56        C6    7         C3    155
A4    5         C5    56        B5    3
A5    79        C4    6         A1    1
A6    5         B6    45        A3    4
      26,5            25,16667        28
Thank you!

Merging 2 data frames on 3 columns where data sometimes exists

I am attempting to merge two data frames and fill in missing values in one from the other. Hopefully this isn't too long of an explanation; I have just been racking my brain over this for too long. I am working with 2 huge CSV files, so I made a small example here. I have included the entire code at the end in case you were curious to assist. THANK YOU SO MUCH IN ADVANCE. Here we go!
print(df1)
   A  B   C   D   E
0  1  B1      D1  E1
1         C1  D1  E1
2  1  B1      D1  E1
3  2  B2      D2  E2
4     B2  C2  D2  E2
5  3          D3  E3
6  3  B3  C3  D3  E3
7  4      C4  D4  E4
print(df2)
   A  B   C   F   G
0  1      C1  F1  G1
1     B2  C2  F2  G2
2  3  B3      F3  G3
3  4  B4  C4  F4  G4
I would essentially like to merge df2 into df1 on 3 different columns. I understand that you can merge on multiple column names, but that does not seem to give me the desired result. I would like to KEEP all data in df1 and fill in the data from df2, so I use how='left'.
I am fairly new to Python and have done a lot of research, but I am stuck. Here is what I have tried.
data3 = df1.merge(df2, how='left', on=['A'])
print(data3)
   A  B_x  C_x  D    E    B_y  C_y  F    G
0  1  B1        D1   E1        C1   F1   G1
1          C1   D1   E1   B2   C2   F2   G2
2  1  B1        D1   E1        C1   F1   G1
3  2  B2        D2   E2   NaN  NaN  NaN  NaN
4     B2   C2   D2   E2   B2   C2   F2   G2
5  3            D3   E3   B3        F3   G3
6  3  B3   C3   D3   E3   B3        F3   G3
7  4       C4   D4   E4   B4   C4   F4   G4
As you can see, it sort of worked with just A. However, since this is a CSV file with blank values, the blank values get matched with each other, which I do not want. Because df2 has a blank A in its second row, that row's data was filled in wherever df1 also had a blank A. The result should be NaN if no match could be found.
Whenever I add more columns to the keys, e.g. on=['A', 'B'], it does not help. In fact, A no longer merges.
data3 = df1.merge(df2, how='left', on=['A', 'B'])
print(data3)
   A  B    C_x  D    E    C_y  F    G
0  1  B1        D1   E1   NaN  NaN  NaN
1          C1   D1   E1   NaN  NaN  NaN
2  1  B1        D1   E1   NaN  NaN  NaN
3  2  B2        D2   E2   NaN  NaN  NaN
4     B2   C2   D2   E2   C2   F2   G2
5  3            D3   E3   NaN  NaN  NaN
6  3  B3   C3   D3   E3        F3   G3
7  4       C4   D4   E4   NaN  NaN  NaN
Columns A, B, and C hold the values I want to correlate and merge on. Between the two data frames there should be enough information to fill in all the gaps. My final df should look like this:
print(desired_output)
A B C D E F G
0 1 B1 C1 D1 E1 F1 G1
1 1 B1 C1 D1 E1 F1 G1
2 1 B1 C1 D1 E1 F1 G1
3 2 B2 C2 D2 E2 F2 G2
4 2 B2 C2 D2 E2 F2 G2
5 3 B3 C3 D3 E3 F3 G3
6 3 B3 C3 D3 E3 F3 G3
7 4 B4 C4 D4 E4 F4 G4
Even though A, B, and C have repeating rows, I want to keep ALL the data and just fill in the data from df2 where it fits, even if it is repeated data. I also do not want all of the _x and _y suffixes from merging. I know how to rename, but doing 3 different merges and then merging those merges gets really complicated really fast with repeated rows and suffixes...
Long story short, how can I merge both data frames by A, then B, then C? The order in which it happens is irrelevant.
Here is a sample of the actual data. I have my own data set with additional fields, and I relate it to this data by certain identifiers, basically MMSI, VesselName and IMO. I want to keep duplicates because they aren't actually duplicates, just additional data points for each vessel.
MMSI BaseDateTime LAT LON VesselName IMO CallSign
366940480.0 2017-01-04T11:39:36 52.48730 -174.02316 EARLY DAWN 7821130 WDB7319
366940480.0 2017-01-04T13:51:07 52.41575 -174.60041 EARLY DAWN 7821130 WDB7319
273898000.0 2017-01-06T16:55:33 63.83668 -174.41172 MYS CHUPROVA NaN UAEZ
352844000.0 2017-01-31T22:51:31 51.89778 -176.59334 JACHA 8512920 3EFC4
352844000.0 2017-01-31T23:06:31 51.89795 -176.59333 JACHA 8512920 3EFC4
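One possible approach, sketched here and not taken from the thread, is to avoid a multi-key merge altogether and fill the gaps one key at a time with a lookup table. pandas merge matches missing keys with each other (blank with blank, and also NaN with NaN), which explains the unwanted matches above; mapping through a lookup built from rows where the key is present avoids that. The helper name fill_from is hypothetical, and the sketch assumes that each value of A, B or C identifies one entity consistently; where sources conflict, the first non-null value seen wins.

```python
import numpy as np
import pandas as pd

# Small example from the question; blanks stand for empty CSV cells.
df1 = pd.DataFrame({
    "A": ["1", "", "1", "2", "", "3", "3", "4"],
    "B": ["B1", "", "B1", "B2", "B2", "", "B3", ""],
    "C": ["", "C1", "", "", "C2", "", "C3", "C4"],
    "D": ["D1", "D1", "D1", "D2", "D2", "D3", "D3", "D4"],
    "E": ["E1", "E1", "E1", "E2", "E2", "E3", "E3", "E4"],
})
df2 = pd.DataFrame({
    "A": ["1", "", "3", "4"],
    "B": ["", "B2", "B3", "B4"],
    "C": ["C1", "C2", "", "C4"],
    "F": ["F1", "F2", "F3", "F4"],
    "G": ["G1", "G2", "G3", "G4"],
})

KEYS = ["A", "B", "C"]


def fill_from(target, source, keys):
    """Fill gaps in target using source, looking rows up by one key at a time."""
    out = target.copy()
    for key in keys:
        # groupby drops rows whose key is missing; first() keeps the first
        # non-null value of every other column for each key value.
        lookup = source.groupby(key).first()
        for col in lookup.columns:
            found = out[key].map(lookup[col])  # rows with a missing key map to NaN
            # Only fill gaps in existing columns; create columns that are new.
            out[col] = out[col].fillna(found) if col in out.columns else found
    return out


left = df1.replace("", np.nan)
right = df2.replace("", np.nan)

step1 = fill_from(left, right, KEYS)     # bring in what df2 knows
result = fill_from(step1, step1, KEYS)   # let complete rows fill incomplete ones
print(result)
```

The second pass is needed because some gaps, such as the missing A for B2, can only be filled from other rows of df1 itself. On the real vessel data the same idea would apply with MMSI, VesselName and IMO as the keys, but it is worth checking first that those identifiers never contradict each other.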

How to convert values of a column into a new rows [duplicate]

I have a huge pandas DataFrame such as
A B C D E
0: a0 b0 c0 d0 e0
1: a1 b1 c1 d1 e1
2: a2 b2 c2 d2 e2
I want to change it into:
A B C X Y
0: a0 b0 c0 D d0
0: a0 b0 c0 E e0
1: a1 b1 c1 D d1
1: a1 b1 c1 E e1
2: a2 b2 c2 D d2
2: a2 b2 c2 E e2
How can I do that?
So far I am creating a new dataFrame and populating it with a for-loop.
You can do this with set_index and stack:
df = (df.set_index(list('ABC'))
        .stack()
        .reset_index()
        .rename(columns={'level_3': 'X', 0: 'Y'})
      )
A B C X Y
0 a0 b0 c0 D d0
1 a0 b0 c0 E e0
2 a1 b1 c1 D d1
3 a1 b1 c1 E e1
4 a2 b2 c2 D d2
5 a2 b2 c2 E e2
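The same reshape can also be written with melt, which names the two new columns directly. This is a sketch on the example data from the question; the name long_df is only illustrative. melt emits all D rows before all E rows, so the original index is kept and stably sorted to interleave them as in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a0", "a1", "a2"],
    "B": ["b0", "b1", "b2"],
    "C": ["c0", "c1", "c2"],
    "D": ["d0", "d1", "d2"],
    "E": ["e0", "e1", "e2"],
})

long_df = (
    df.melt(id_vars=["A", "B", "C"], var_name="X", value_name="Y",
            ignore_index=False)       # keep the original row labels
      .sort_index(kind="mergesort")   # stable sort: D stays ahead of E per row
      .reset_index(drop=True)
)
print(long_df)
```

If row order does not matter, the sort_index and reset_index steps can be dropped.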

Change index of crosstab in pandas dataframe

I have the following data frame df. I extracted a subset of df that has no NaN in its first four columns.
#df is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D2 E3 F2 NaN UNKNOWN
2 D1 E3 NaN S2 UNKNOWN
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
9 D2 E2 NaN NaN good
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
df_subset = df[~df.iloc[:, 0:4].isnull().any(axis=1)]
print(df_subset)
#df_subset is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
After this I try to make cross-tab from both df and df_subset data-frames, 'C_Step' for index and 'RE' for column
Cross-tab from df:
c1 = pd.crosstab([df.C_Step],[df.RE],dropna=True)
print(c1)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 4
fair 0 1 3
good 0 1 0
poor 2 0 0
Cross tab from df_subset:
c1 = pd.crosstab([df_subset.C_Step],[df_subset.RE],dropna=False)
print(c1)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
poor 2 0 0
Question: The two cross-tabs have different indexes. How can I give the cross-tab generated from df_subset the same index as the one from df? The category 'good' is missing from the cross-tab of df_subset.
The desired cross-tab of df_subset is:
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0
Use reindex with the parameter fill_value=0. Here c1 is the cross-tab built from the full df, and c2 is the one built from df_subset:
c2 = pd.crosstab([df_subset.C_Step], [df_subset.RE], dropna=False)
c2 = c2.reindex(c1.index, fill_value=0)
print(c2)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0
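A self-contained sketch of the same idea on made-up data (the six rows and the names full and part are illustrative only). It also reindexes the columns, in case a value of RE disappears from the subset as well:

```python
import pandas as pd

df = pd.DataFrame({
    "RE": ["E1", "E3", "E3", None, "E2", "E2"],
    "SE": ["S1", None, "S2", "S1", None, "S1"],
    "C_Step": ["poor", "UNKNOWN", "fair", "poor", "good", "fair"],
})

# Cross-tab of the full frame defines the labels we want to keep.
full = pd.crosstab(df["C_Step"], df["RE"])

# The subset without missing values loses some categories entirely.
subset = df.dropna(subset=["RE", "SE"])
part = pd.crosstab(subset["C_Step"], subset["RE"])

# Align rows and columns to the full cross-tab; absent combinations become 0.
part = part.reindex(index=full.index, columns=full.columns, fill_value=0)
print(part)
```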
