How to reorganize/restructure values in a dataframe with no column header by referring to a master dataframe in python? - python-3.x

Master Dataframe:
B       D       E
b1      d1      e1
b2      d2      e2
b3      d3
        d4
        d5
Dataframe with no column name:
b1
d3      e1
d2
b2      e2
e1
d5      e1
How do I convert the dataframe above to something like the table below (with column names) by referring to the master dataframe?
B       D       E
b1
        d3      e1
        d2
b2              e2
                e1
        d5      e1
Thank you in advance for your help!

One way would be to make a mapping dict, then reindex each row:
# Mapping dict
d = {}
for k, v in df.to_dict("list").items():
    d.update(**dict.fromkeys(set(v) - {np.nan}, k))

# or pandas approach
d = df.melt().dropna().set_index("value")["variable"].to_dict()

def reorganize(ser):
    data = [i for i in ser if pd.notna(i)]
    ind = [d.get(i, i) for i in data]
    return pd.Series(data, index=ind)

df2.apply(reorganize, axis=1)
Output:
     B    D    E
0   b1  NaN  NaN
1  NaN   d3   e1
2  NaN   d2  NaN
3   b2  NaN   e2
4  NaN  NaN   e1
5  NaN   d5   e1
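For reference, the whole answer runs end to end; the two input frames below are a reconstruction from the question (not the asker's actual data), and the lookup uses the answer's pandas variant:

```python
import numpy as np
import pandas as pd

# Reconstruction of the master frame: each column lists its own values
df = pd.DataFrame({"B": ["b1", "b2", "b3", np.nan, np.nan],
                   "D": ["d1", "d2", "d3", "d4", "d5"],
                   "E": ["e1", "e2", np.nan, np.nan, np.nan]})

# The frame with no meaningful column names
df2 = pd.DataFrame([["b1", None], ["d3", "e1"], ["d2", None],
                    ["b2", "e2"], ["e1", None], ["d5", "e1"]])

# value -> master-column lookup (the answer's melt approach)
d = df.melt().dropna().set_index("value")["variable"].to_dict()

def reorganize(ser):
    data = [i for i in ser if pd.notna(i)]
    return pd.Series(data, index=[d.get(i, i) for i in data])

out = df2.apply(reorganize, axis=1)
print(out)
```

Because each row returns a Series indexed by master-column names, `apply(..., axis=1)` assembles them into a frame whose columns are the union B/D/E, with NaN where a row has no value for that column.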

It's not a beautiful answer, but I think I was able to do it by using .loc. I don't think you need to use the Master Dataframe.
import pandas as pd

df = pd.DataFrame({'col1': ['b1', 'd3', 'd2', 'b2', 'e1', 'd5'],
                   'col2': ['', 'e1', '', 'e2', '', 'e1']},
                  columns=['col1', 'col2'])
df
#   col1 col2
# 0   b1
# 1   d3   e1
# 2   d2
# 3   b2   e2
# 4   e1
# 5   d5   e1

df_reshaped = pd.DataFrame()
for index, row in df.iterrows():
    for col in df.columns:
        i = row[col]
        j = i[0] if i != '' else ''  # first letter decides the target column
        if j != '':
            df_reshaped.loc[index, j] = i
df_reshaped.columns = df_reshaped.columns.str.upper()
df_reshaped
#      B    D    E
# 0   b1  NaN  NaN
# 1  NaN   d3   e1
# 2  NaN   d2  NaN
# 3   b2  NaN   e2
# 4  NaN  NaN   e1
# 5  NaN   d5   e1

Related

Check if dictionaries are equal in df

I have a df, in which a column contains dictionaries:
a b c d
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'}
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
and another dict:
di = {0.0: 'a', 1.0: 'b'}
I want to add a new column containing 'yes' when d equals di, and 'no' (or NaN, or just empty) when it doesn't. I tried the below but it doesn't work:
df.loc[df['d'] == di, 'e'] = 'yes'
The result would be:
a b c d e
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'} yes
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
Can anyone help me here?
Thank you in advance!
You can try
df['new'] = df['d'].eq(di).map({True:'yes',False:''})
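A runnable sketch of that answer, with the frame reconstructed from the question (columns b and c omitted for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a1", "a2", "a3"],
    "d": [{0.0: "a", 1.0: "b"}, np.nan,
          {0.0: "cs", 1.0: "ef", 2.0: "efg"}],
})
di = {0.0: "a", 1.0: "b"}

# .eq compares each cell with di as a plain Python object (dict == dict),
# so only a dict that is exactly equal maps to 'yes'; NaN compares unequal
df["new"] = df["d"].eq(di).map({True: "yes", False: ""})
print(df["new"].tolist())
```

Mapping `False` to `''` leaves the non-matching rows visually empty, as the asker requested; mapping to `np.nan` instead would give real missing values.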

Merging 2 data frames on 3 columns where data sometimes exists

I am attempting to merge two data frames and fill in missing values in one from the other. Hopefully this isn't too long of an explanation; I have just been wracking my brain around this for too long. I am working with 2 huge CSV files, so I made a small example here. I have included the entire code at the end in case you are curious to assist. THANK YOU SO MUCH IN ADVANCE. Here we go!
print(df1)
   A   B   C   D   E
0  1  B1      D1  E1
1         C1  D1  E1
2  1  B1      D1  E1
3  2  B2      D2  E2
4     B2  C2  D2  E2
5  3          D3  E3
6  3  B3  C3  D3  E3
7  4      C4  D4  E4
print(df2)
   A   B   C   F   G
0  1      C1  F1  G1
1     B2  C2  F2  G2
2  3  B3      F3  G3
3  4  B4  C4  F4  G4
I would essentially like to merge df2 into df1 by 3 different columns. I understand that you can merge on multiple column names, but that does not seem to give the desired result. I would like to KEEP all data in df1 and fill in the data from df2, so I use how='left'.
I am fairly new to Python and have done a lot of research but have hit a stuck point. Here is what I have tried:
data3 = df1.merge(df2, how='left', on=['A'])
print(data3)
   A B_x C_x   D   E  B_y  C_y    F    G
0  1  B1       D1  E1        C1   F1   G1
1         C1   D1  E1   B2   C2   F2   G2
2  1  B1       D1  E1        C1   F1   G1
3  2  B2       D2  E2  NaN  NaN  NaN  NaN
4     B2  C2   D2  E2   B2   C2   F2   G2
5  3           D3  E3   B3        F3   G3
6  3  B3  C3   D3  E3   B3        F3   G3
7  4      C4   D4  E4   B4   C4   F4   G4
As you can see, it sort of worked with just A. However, since this is a CSV file with blank values, the blank values merge with each other, which I do not want: because df2 was blank in row 2, it filled in data wherever it saw blanks. It should be NaN if it could not find a match.
Whenever I put additional columns into my on=['A', 'B'], it does not do anything different; in fact, A no longer merges.
data3 = df1.merge(df2, how='left', on=['A', 'B'])
print(data3)
   A   B C_x   D   E  C_y    F    G
0  1  B1      D1  E1  NaN  NaN  NaN
1         C1  D1  E1  NaN  NaN  NaN
2  1  B1      D1  E1  NaN  NaN  NaN
3  2  B2      D2  E2  NaN  NaN  NaN
4     B2  C2  D2  E2   C2   F2   G2
5  3          D3  E3  NaN  NaN  NaN
6  3  B3  C3  D3  E3        F3   G3
7  4      C4  D4  E4  NaN  NaN  NaN
Columns A, B, and C are the values I want to correlate and merge on. Using both data frames, it should know enough to fill in all the gaps. My ending df should look like:
print(desired_output):
A B C D E F G
0 1 B1 C1 D1 E1 F1 G1
1 1 B1 C1 D1 E1 F1 G1
2 1 B1 C1 D1 E1 F1 G1
3 2 B2 C2 D2 E2 F2 G2
4 2 B2 C2 D2 E2 F2 G2
5 3 B3 C3 D3 E3 F3 G3
6 3 B3 C3 D3 E3 F3 G3
7 4 B4 C4 D4 E4 F4 G4
Even though A, B, and C have repeating rows, I want to keep ALL the data and just fill in the data from df2 where it might fit, even if it is repeat data. I also do not want all of the _x and _y suffixes from merging. I know how to rename, but doing 3 different merges and then merging those merges gets really complicated really fast, with repeated rows and suffixes...
Long story short: how can I merge both data frames by A, and then B, and then C? The order in which it happens is irrelevant.
Here is a sample of the actual data. I have my own data with additional columns, and I relate it to this data by certain identifiers, basically MMSI, Name and IMO. I want to keep duplicates because they aren't actually duplicates, just additional data points for each vessel.
MMSI BaseDateTime LAT LON VesselName IMO CallSign
366940480.0 2017-01-04T11:39:36 52.48730 -174.02316 EARLY DAWN 7821130 WDB7319
366940480.0 2017-01-04T13:51:07 52.41575 -174.60041 EARLY DAWN 7821130 WDB7319
273898000.0 2017-01-06T16:55:33 63.83668 -174.41172 MYS CHUPROVA NaN UAEZ
352844000.0 2017-01-31T22:51:31 51.89778 -176.59334 JACHA 8512920 3EFC4
352844000.0 2017-01-31T23:06:31 51.89795 -176.59333 JACHA 8512920 3EFC4
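No answer appears here, but one possible sketch (frames reconstructed from the question; it assumes each key value occurs at most once in df2): treat blank cells as NaN so they can never match, merge once per key column, and patch the gaps with combine_first. This recovers every F/G value; cells that are blank in both frames' matching rows still stay NaN and would need a further within-df1 fill.

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's frames; '' from the CSV replaced by NaN
df1 = pd.DataFrame({
    "A": [1, np.nan, 1, 2, np.nan, 3, 3, 4],
    "B": ["B1", np.nan, "B1", "B2", "B2", np.nan, "B3", np.nan],
    "C": [np.nan, "C1", np.nan, np.nan, "C2", np.nan, "C3", "C4"],
    "D": ["D1", "D1", "D1", "D2", "D2", "D3", "D3", "D4"],
    "E": ["E1", "E1", "E1", "E2", "E2", "E3", "E3", "E4"],
})
df2 = pd.DataFrame({
    "A": [1, np.nan, 3, 4],
    "B": [np.nan, "B2", "B3", "B4"],
    "C": ["C1", "C2", np.nan, "C4"],
    "F": ["F1", "F2", "F3", "F4"],
    "G": ["G1", "G2", "G3", "G4"],
})

combined = df1.copy()
for key in ["A", "B", "C"]:
    # Merge on this key only, keeping df1's row numbers; rows where
    # either side has NaN in the key are excluded, so blanks never match
    m = (df1[[key]].reset_index()
           .dropna(subset=[key])
           .merge(df2.dropna(subset=[key]), on=key, how="inner")
           .set_index("index"))
    combined = combined.combine_first(m)  # fill only the gaps, add F/G

print(combined)
```

Each pass only ever fills cells that are still NaN, so the three single-key merges never overwrite each other; combine_first also adds the F and G columns on the first pass.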

Change index of crosstab in pandas dataframe

I have the following data-frame df. I retrieved a subset of df without NaN.
#df is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D2 E3 F2 NaN UNKNOWN
2 D1 E3 NaN S2 UNKNOWN
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
9 D2 E2 NaN NaN good
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
df_subset = df[~(df.iloc[:, 0:4].isnull().any(axis=1))]
print(df_subset)
#df_subset is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
After this I try to make a cross-tab from both the df and df_subset data-frames, with 'C_Step' as the index and 'RE' as the columns.
Cross-tab from df:
c1 = pd.crosstab([df.C_Step],[df.RE],dropna=True)
print(c1)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 4
fair 0 1 3
good 0 1 0
poor 2 0 0
Cross tab from df_subset:
c2 = pd.crosstab([df_subset.C_Step],[df_subset.RE],dropna=False)
print(c2)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
poor 2 0 0
Question: The indexes of the two crosstabs differ; the category 'good' is missing from the cross-tab of df_subset. How can I give the cross-tab generated from df_subset the same index as the one from df?
The desired cross-tab of df_subset is:
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0
Use reindex with parameter fill_value=0:
c2 = pd.crosstab([df_subset.C_Step], [df_subset.RE], dropna=False)
c2 = c2.reindex(c1.index, fill_value=0)
print(c2)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0
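The same idea on a small made-up frame (names here are illustrative, not from the question), also reindexing the columns in case an entire RE level disappears from the subset:

```python
import pandas as pd

full = pd.DataFrame({"grade": ["poor", "fair", "good", "poor"],
                     "re": ["E1", "E2", "E2", "E1"]})
subset = full[full["re"] != "E2"]  # drops every 'fair'/'good' row

c1 = pd.crosstab(full["grade"], full["re"])
c2 = pd.crosstab(subset["grade"], subset["re"])

# Force the subset crosstab onto the full table's labels; categories
# missing from the subset become rows/columns of zeros
c2 = c2.reindex(index=c1.index, columns=c1.columns, fill_value=0)
print(c2)
```

Without fill_value=0, the reintroduced 'good' row would be NaN rather than a zero count.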

Retrieve the rows of data-frames with NaN values and without NaN values

I have a data frame df:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D2 E3 F2 NaN NaN
2 D1 E3 NaN S2 good
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 NaN
I want to retrieve df1 with no NaN values in the first four columns of the data frame df, and df2 with NaN values in the first four columns of df.
Desired output:
df1 =
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D1 E3 F1 S2 fair
2 D2 E2 F1 S1 NaN
df2 =
DT RE FE SE C_Step
0 D2 E3 F2 NaN NaN
1 D1 E3 NaN S2 good
2 D1 NaN F1 S1 poor
3 D2 NaN F1 S2 poor
4 D2 E3 NaN S1 fair
Using dropna
df1 = df.dropna(subset = ['DT','RE','FE','SE'])
df2 = df.loc[~df.index.isin(df.dropna(subset = ['DT','RE','FE','SE']).index)]
df1
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 NaN
df2
DT RE FE SE C_Step
1 D2 E3 F2 NaN NaN
2 D1 E3 NaN S2 good
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
Option 2: to find the rows with nulls
null_idx = df.index.difference(df.dropna(subset = ['DT','RE','FE','SE']).index)
df.loc[null_idx]
Create a mask with isnull + any:
m = df.iloc[:, 0:4].isnull().any(axis=1)
df1 = df[~m]
# DT RE FE SE C_Step
#0 D1 E1 F1 S1 poor
#6 D1 E3 F1 S2 fair
#7 D2 E2 F1 S1 NaN
df2 = df[m]
# DT RE FE SE C_Step
#1 D2 E3 F2 NaN NaN
#2 D1 E3 NaN S2 good
#3 D1 NaN F1 S1 poor
#4 D2 NaN F1 S2 poor
#5 D2 E3 NaN S1 fair
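Both splits can be checked end to end; reconstructing the question's frame, the mask variant (with the axis keyword spelled out, as newer pandas requires) gives:

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's frame
df = pd.DataFrame({
    "DT": ["D1", "D2", "D1", "D1", "D2", "D2", "D1", "D2"],
    "RE": ["E1", "E3", "E3", np.nan, np.nan, "E3", "E3", "E2"],
    "FE": ["F1", "F2", np.nan, "F1", "F1", np.nan, "F1", "F1"],
    "SE": ["S1", np.nan, "S2", "S1", "S2", "S1", "S2", "S1"],
    "C_Step": ["poor", np.nan, "good", "poor", "poor", "fair", "fair", np.nan],
})

# True where any of the first four columns is NaN
m = df.iloc[:, 0:4].isnull().any(axis=1)
df1, df2 = df[~m], df[m]
```

Note that row 7 lands in df1 even though its C_Step is NaN: only the first four columns are inspected.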

Create a frame with both single and multi-level columns and add data to it

I have 2 dataframes like this
frame1 = pd.DataFrame(columns=['A', 'B', 'C'])
a = ['d1', 'd2', 'd3']
b = ['d4', 'd5']
tups = ([('T1', x) for x in a] +
        [('T2', x) for x in b])
cols = pd.MultiIndex.from_tuples(tups, names=['Trial', 'Data'])
frame2 = pd.DataFrame(columns=cols)
My goal is to have both DataFrames in one, and then add some rows of data. The resulting DataFrame would be like
Trial   A   B   C  T1          T2
Data               d1  d2  d3  d4  d5
0       1   2   3   4   5   6   7   8
1     ...
...
That could somehow be achieved if I did
frame2['A']=1
frame2['B']=2
frame2['C']=3
But this is not a clean solution, and I can't create the frame first and then add data, for I would be required to insert at least the first row manually.
I tried
frame3=frame1.join(frame2)
>> A1 A2 A3 (T1, d1) (T1, d2) (T1, d3) (T2, d4) (T2, d5)
The result, I think, does not have proper multi-level columns.
My second trial
tup2 = ([('A1',), ('A2',), ('A3',)] +
        [('T1', x) for x in a] +
        [('T2', x) for x in b])
cols2 = pd.MultiIndex.from_tuples(tup2, names=['Trial', 'Data'])
data = [1, 2, 3, 4, 5, 6, 7, 8]
frame20 = pd.DataFrame(data, index=cols2).T
Trial  A1  A2  A3  T1          T2
Data  NaN NaN NaN  d1  d2  d3  d4  d5
0       1   2   3   4   5   6   7   8
This one works fine when querying it, e.g. frame20.loc[0, 'A1'][0], but if for example I do
frame20['Peter']=1234
> Trial  A1  A2  A3  T1          T2  Peter
> Data  NaN NaN NaN  d1  d2  d3  d4  d5
> 0       1   2   3   4   5   6   7   8   1234
the column 'Peter' being what I desire, as opposed to, for example, A1, which is what I get.
My third trial
tup3 = (['A', 'B', 'C'] + [('T1', x) for x in a] +
        [('T2', x) for x in b])
cols3 = pd.MultiIndex.from_tuples(tup3, names=['Trial', 'Data'])
frame21 = pd.DataFrame(data, index=cols3).T
returned exactly the same as the second one.
So, what I'm looking for, is a way to do
pd.DataFrame(rows_of_data,index=alfa).T #or
pd.DataFrame(rows_of_data,columns=beta)
where either alfa or beta are in a correct format.
Also, as a bonus, let's say I finally came up with a way to do
finalframe=pd.DataFrame(columns=beta)
How do I have to use concat,append or join so I can add a random row of data such as data=[1,2,3,4,5,6,7,8] to my empty but perfectly created finalframe?
Thank you, best regards
You want to add a level to frame1 with empty strings
pandas.MultiIndex.from_tuples
idx = pd.MultiIndex.from_tuples([(c, '') for c in frame1])
f1 = frame1.set_axis(idx, axis=1)
frame3 = pd.concat([f1, frame2], axis=1)
frame3.reindex([0, 1])
     A    B    C   T1             T2
                   d1   d2   d3   d4   d5
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
pandas.concat
frame3 = pd.concat([
    pd.concat([frame1], keys=[''], axis=1).swaplevel(0, 1, 1),
    frame2
], axis=1)
frame3.reindex([0, 1])
     A    B    C   T1             T2
                   d1   d2   d3   d4   d5
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
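The "bonus" part of the question (adding data=[1,2,3,4,5,6,7,8] to the empty finalframe) went unanswered; one hedged sketch, reusing the answer's trick of pairing single-level labels with '', is to enlarge the frame with .loc:

```python
import pandas as pd

a = ["d1", "d2", "d3"]
b = ["d4", "d5"]
# Pair the single-level labels with '' so every column lives in one MultiIndex
tups = ([(c, "") for c in ["A", "B", "C"]]
        + [("T1", x) for x in a]
        + [("T2", x) for x in b])
beta = pd.MultiIndex.from_tuples(tups, names=["Trial", "Data"])
finalframe = pd.DataFrame(columns=beta)

# Enlarging with .loc: assigning to the next integer label appends one row
finalframe.loc[len(finalframe)] = [1, 2, 3, 4, 5, 6, 7, 8]
finalframe.loc[len(finalframe)] = [9, 10, 11, 12, 13, 14, 15, 16]
print(finalframe)
```

Single-level access still works (finalframe['A'] selects the ('A', '') column), and no manual first row is needed before appending.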
