Change index of crosstab in pandas dataframe - python-3.x

I have the following DataFrame df. I retrieved a subset of df with no NaN values.
#df is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D2 E3 F2 NaN UNKNOWN
2 D1 E3 NaN S2 UNKNOWN
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
9 D2 E2 NaN NaN good
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
df_subset = df[~(df.iloc[:, 0:4].isnull().any(axis=1))]
print(df_subset)
#df_subset is:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 UNKNOWN
8 D2 E2 F1 S1 fair
10 D2 E2 F1 S1 UNKNOWN
11 D1 E3 F2 S1 UNKNOWN
12 D2 E1 F2 S2 UNKNOWN
13 D2 E1 F1 S2 poor
14 D2 E3 F1 S1 fair
15 D1 E3 F1 S2 UNKNOWN
After this I try to make a cross-tab from both the df and df_subset DataFrames, with 'C_Step' as the index and 'RE' as the columns.
Cross-tab from df:
c1 = pd.crosstab([df.C_Step], [df.RE], dropna=True)
print(c1)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 4
fair 0 1 3
good 0 1 0
poor 2 0 0
Cross-tab from df_subset:
c2 = pd.crosstab([df_subset.C_Step], [df_subset.RE], dropna=False)
print(c2)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
poor 2 0 0
Question: The indexes of the two crosstabs are different. How can I make the index of the cross-tab generated from df_subset the same as that of df? The category 'good' is missing from the cross-tab of df_subset.
The desired cross-tab of df_subset is:
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0

Use reindex with parameter fill_value=0:
c2 = pd.crosstab([df_subset.C_Step], [df_subset.RE], dropna=False)
c2 = c2.reindex(c1.index, fill_value=0)  # align to the index of the full crosstab c1
print(c2)
RE E1 E2 E3
C_Step
UNKNOWN 1 2 2
fair 0 1 2
good 0 0 0
poor 2 0 0
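Alternatively, a sketch that avoids the reindex step (my own variant, not from the original answer): cast C_Step to a categorical dtype whose categories are taken from the full df. With a categorical index, pd.crosstab(..., dropna=False) should keep the unused 'good' category as an all-zero row:
# assumption: category list taken from the full df, sorted to match c1's index order
cstep = df_subset.C_Step.astype(pd.CategoricalDtype(sorted(df.C_Step.unique())))
c2 = pd.crosstab(cstep, df_subset.RE, dropna=False)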

Related

Combine 2 related DataFrames into one multiple indexes dataFrame

I have 2 related DataFrames. Is there any easy way to combine them into one multi-index DataFrame?
import pandas as pd
df = pd.DataFrame([[1, 0, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 1],
                   [0, 1, 0, 1]], columns=["c1", "c2", "c3", "c4"])
idx = pd.Index(['p1', 'p2', 'p3', 'p4'])
df = df.set_index(idx)
df output is:
c1 c2 c3 c4
p1 1 0 1 0
p2 1 1 0 0
p3 1 0 0 1
p4 0 1 0 1
df2 = pd.DataFrame([[0, 10, 30, 0],
                    [20, 10, 0, 0],
                    [0, 10, 0, 6],
                    [15, 0, 18, 5]], columns=["c1", "c2", "c3", "c4"])
idx2 = pd.Index(['a1', 'a2', 'a3', 'a4'])
df2 = df2.set_index(idx2)
df2 output is:
c1 c2 c3 c4
a1 0 10 30 0
a2 20 10 0 0
a3 0 10 0 6
a4 15 0 18 5
The final DataFrame is multi-indexed by (p, c, a) with a single column (value):
          value
p1 c1 a2     20
      a4     15
   c3 a1     30
      a4     18
p2 c1 a2     20
      a4     15
   c2 a1     10
      a2     10
      a3     10
p3 c1 a2     20
      a4     15
   c4 a3      6
      a4      5
p4 c2 a1     10
      a2     10
      a3     10
   c4 a3      6
      a4      5
You can reshape and merge:
(df.reset_index().melt('index')                                 # long format: index, variable, value
   .loc[lambda x: x.pop('value').eq(1)]                         # keep only the (p, c) pairs flagged 1
   .merge(df2.reset_index().melt('index').query('value != 0'),  # long df2, non-zero values only
          on='variable')
   .set_index(['index_x', 'variable', 'index_y'])
   .rename_axis([None, None, None])
)
output:
          value
p1 c1 a2     20
      a4     15
p2 c1 a2     20
      a4     15
p3 c1 a2     20
      a4     15
p2 c2 a1     10
      a2     10
      a3     10
p4 c2 a1     10
      a2     10
      a3     10
p1 c3 a1     30
      a4     18
p3 c4 a3      6
      a4      5
p4 c4 a3      6
      a4      5
If order matters:
(df.stack().reset_index()                     # stack keeps df's row-major order
   .loc[lambda x: x.pop(0).eq(1)]             # keep only the cells equal to 1
   .set_axis(['index', 'variable'], axis=1)
   .merge(df2.reset_index().melt('index').query('value != 0'),
          on='variable', how='left')
   .set_index(['index_x', 'variable', 'index_y'])
   .rename_axis([None, None, None])
)
output:
          value
p1 c1 a2     20
      a4     15
   c3 a1     30
      a4     18
p2 c1 a2     20
      a4     15
   c2 a1     10
      a2     10
      a3     10
p3 c1 a2     20
      a4     15
   c4 a3      6
      a4      5
p4 c2 a1     10
      a2     10
      a3     10
   c4 a3      6
      a4      5
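A note on the two snippets: if you already have the unordered result from the first one, sorting its index lexicographically gives the same (p, c, a) ordering as the second (out is an assumed name for that result, not from the original answer):
out = out.sort_index()  # sorts all three index levels; here this restores the p1..p4 order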

How to reorganize/restructure values in a dataframe with no column header by refering to a master dataframe in python?

Master Dataframe:
B    D    E
b1   d1   e1
b2   d2   e2
b3   d3
     d4
     d5
Dataframe with no column name:
b1
d3   e1
d2
b2   e2
e1
d5   e1
How do I convert the DataFrame above to something like the table below (with column names) by referring to the master DataFrame?
B    D    E
b1
     d3   e1
     d2
b2        e2
          e1
     d5   e1
Thank you in advance for your help!
One way would be to make a mapping dict, then reindex each row:
import numpy as np
import pandas as pd

# Mapping dict: value -> column name, built from the master df
d = {}
for k, v in df.to_dict("list").items():
    d.update(**dict.fromkeys(set(v) - {np.nan}, k))

# or the pandas approach
d = df.melt().dropna().set_index("value")["variable"].to_dict()

def reorganize(ser):
    # keep the non-null values and index them by their target column
    data = [i for i in ser if pd.notna(i)]
    ind = [d.get(i, i) for i in data]
    return pd.Series(data, index=ind)

df2.apply(reorganize, axis=1)
Output:
B D E
0 b1 NaN NaN
1 NaN d3 e1
2 NaN d2 NaN
3 b2 NaN e2
4 NaN NaN e1
5 NaN d5 e1
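For reference, a minimal construction of the two inputs assumed by this answer; the names df/df2 and the NaN padding of the ragged master columns are my assumptions, not part of the original post:
import numpy as np
import pandas as pd

# master frame: shorter columns padded with NaN
df = pd.DataFrame({"B": ["b1", "b2", "b3", np.nan, np.nan],
                   "D": ["d1", "d2", "d3", "d4", "d5"],
                   "E": ["e1", "e2", np.nan, np.nan, np.nan]})

# frame with no column names, empty cells as NaN
df2 = pd.DataFrame([["b1", np.nan], ["d3", "e1"], ["d2", np.nan],
                    ["b2", "e2"], ["e1", np.nan], ["d5", "e1"]])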
It's not a beautiful answer, but I think I was able to do it by using .loc. I don't think you need the master DataFrame at all.
import pandas as pd

df = pd.DataFrame({'col1': ['b1', 'd3', 'd2', 'b2', 'e1', 'd5'],
                   'col2': ['', 'e1', '', 'e2', '', 'e1']},
                  columns=['col1', 'col2'])
df
#   col1 col2
# 0   b1
# 1   d3   e1
# 2   d2
# 3   b2   e2
# 4   e1
# 5   d5   e1
df_reshaped = pd.DataFrame()
for index, row in df.iterrows():
    for col in df.columns:
        i = row[col]
        j = i[0] if i != '' else ''   # first letter decides the target column
        if j != '':
            df_reshaped.loc[index, j] = i
df_reshaped.columns = df_reshaped.columns.str.upper()
df_reshaped
# B D E
# 0 b1 NaN NaN
# 1 NaN d3 e1
# 2 NaN d2 NaN
# 3 b2 NaN e2
# 4 NaN NaN e1
# 5 NaN d5 e1

Merging 2 data frames on 3 columns where data sometimes exists

I am attempting to merge and fill in missing values in one data frame from another. Hopefully this isn't too long of an explanation; I have been wracking my brain around this for too long. I am working with 2 huge CSV files, so I made a small example here. I have included the entire code at the end in case you are curious to assist. Thank you so much in advance! Here we go!
print(df1)
   A   B   C   D   E
0  1  B1      D1  E1
1         C1  D1  E1
2  1  B1      D1  E1
3  2  B2      D2  E2
4     B2  C2  D2  E2
5  3          D3  E3
6  3  B3  C3  D3  E3
7  4      C4  D4  E4
print(df2)
   A   B   C   F   G
0  1      C1  F1  G1
1      B2  C2  F2  G2
2  3  B3      F3  G3
3  4  B4  C4  F4  G4
I would essentially like to merge df2 into df1 by 3 different columns. I understand that you can merge on multiple column names, but it seems not to give the desired result. I would like to KEEP all data in df1 and fill in the data from df2, so I use how='left'.
I am fairly new to Python and have done a lot of research but have hit a stuck point. Here is what I have tried.
data3 = df1.merge(df2, how='left', on=['A'])
print(data3)
   A B_x C_x   D   E  B_y  C_y    F    G
0  1  B1      D1  E1        C1   F1   G1
1         C1  D1  E1   B2   C2   F2   G2
2  1  B1      D1  E1        C1   F1   G1
3  2  B2      D2  E2  NaN  NaN  NaN  NaN
4     B2  C2  D2  E2   B2   C2   F2   G2
5  3          D3  E3   B3        F3   G3
6  3  B3  C3  D3  E3   B3        F3   G3
7  4      C4  D4  E4   B4   C4   F4   G4
As you can see, it sort of worked with just A. However, since this is a CSV file with blank values, the blank values seem to merge together, which I do not want: because df2 has a blank A in one row, that row was filled in wherever df1 also had a blank. It should be NaN if it could not find a match.
Whenever I add additional columns to my on=['A', 'B'] list, it does not do anything different. In fact, A no longer merges on its own.
data3 = df1.merge(df2, how='left', on=['A', 'B'])
print(data3)
   A   B C_x   D   E  C_y    F    G
0  1  B1      D1  E1  NaN  NaN  NaN
1         C1  D1  E1  NaN  NaN  NaN
2  1  B1      D1  E1  NaN  NaN  NaN
3  2  B2      D2  E2  NaN  NaN  NaN
4     B2  C2  D2  E2   C2   F2   G2
5  3          D3  E3  NaN  NaN  NaN
6  3  B3  C3  D3  E3        F3   G3
7  4      C4  D4  E4  NaN  NaN  NaN
Columns A, B, and C are the values I want to correlate and merge on. Using both data frames, it should know enough to fill in all the gaps. My ending df should look like:
print(desired_output):
A B C D E F G
0 1 B1 C1 D1 E1 F1 G1
1 1 B1 C1 D1 E1 F1 G1
2 1 B1 C1 D1 E1 F1 G1
3 2 B2 C2 D2 E2 F2 G2
4 2 B2 C2 D2 E2 F2 G2
5 3 B3 C3 D3 E3 F3 G3
6 3 B3 C3 D3 E3 F3 G3
7 4 B4 C4 D4 E4 F4 G4
Even though A, B, and C have repeating rows, I want to keep ALL the data and just fill in the data from df2 where it fits, even if it is repeat data. I also do not want all of the _x and _y suffixes from merging. I know how to rename, but doing 3 different merges and then merging those merges gets really complicated really fast, with repeated rows and suffixes...
Long story short: how can I merge both data frames by A, and then B, and then C? The order in which it happens is irrelevant.
Here is a sample of the actual data. I have my own data with additional fields, and I relate it to this data by certain identifiers: basically by MMSI, name and IMO. I want to keep duplicates because they aren't actually duplicates, just additional data points for each vessel.
MMSI BaseDateTime LAT LON VesselName IMO CallSign
366940480.0 2017-01-04T11:39:36 52.48730 -174.02316 EARLY DAWN 7821130 WDB7319
366940480.0 2017-01-04T13:51:07 52.41575 -174.60041 EARLY DAWN 7821130 WDB7319
273898000.0 2017-01-06T16:55:33 63.83668 -174.41172 MYS CHUPROVA NaN UAEZ
352844000.0 2017-01-31T22:51:31 51.89778 -176.59334 JACHA 8512920 3EFC4
352844000.0 2017-01-31T23:06:31 51.89795 -176.59333 JACHA 8512920 3EFC4
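No answer is recorded for this question here, so what follows is only a minimal sketch of one possible approach, under two assumptions that are not stated in the post: blank cells are read (or replaced) as NaN, and each B or C value identifies exactly one A group. The idea is to back-fill missing A keys via B-to-A and C-to-A lookups, collapse everything to one reference row per A, and merge that onto df1:
import numpy as np
import pandas as pd

# assumption: blank CSV cells are missing values
df1 = df1.replace('', np.nan)
df2 = df2.replace('', np.nan)

# learn B -> A and C -> A from every row where A is present
keys = pd.concat([df1[['A', 'B', 'C']], df2[['A', 'B', 'C']]])
b_to_a = keys.dropna(subset=['A', 'B']).set_index('B')['A'].to_dict()
c_to_a = keys.dropna(subset=['A', 'C']).set_index('C')['A'].to_dict()

# back-fill missing A keys from the B or C lookups
for frame in (df1, df2):
    frame['A'] = (frame['A']
                  .fillna(frame['B'].map(b_to_a))
                  .fillna(frame['C'].map(c_to_a)))

# one reference row per A, taking the first non-null value per column
ref = pd.concat([df1[['A', 'B', 'C']], df2]).groupby('A').first()

# keep every df1 row, pull B/C/F/G from the reference table
out = (df1[['A', 'D', 'E']]
       .merge(ref.reset_index(), on='A', how='left')
       [['A', 'B', 'C', 'D', 'E', 'F', 'G']])
On the sample frames this reproduces the desired output row for row, including the repeated rows per A; with real data like the MMSI/IMO/VesselName sample, the same pattern would apply with those columns as the keys.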

How to convert values of a column into a new rows [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 3 years ago.
I have a huge pandas DataFrame such as
A B C D E
0: a0 b0 c0 d0 e0
1: a1 b1 c1 d1 e1
2: a2 b2 c2 d2 e2
I want to change it into:
A B C X Y
0: a0 b0 c0 D d0
0: a0 b0 c0 E e0
1: a1 b1 c1 D d1
1: a1 b1 c1 E e1
2: a2 b2 c2 D d2
2: a2 b2 c2 E e2
How can I do that?
So far I am creating a new DataFrame and populating it with a for loop.
You can do it via set_index and stack:
df = (df.set_index(list('ABC'))
        .stack()
        .reset_index()
        .rename(columns={'level_3': 'X', 0: 'Y'})
)
A B C X Y
0 a0 b0 c0 D d0
1 a0 b0 c0 E e0
2 a1 b1 c1 D d1
3 a1 b1 c1 E e1
4 a2 b2 c2 D d2
5 a2 b2 c2 E e2
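An equivalent sketch using melt; note that melt emits all D rows before all E rows, so a final sort is needed to interleave them per original row as above:
out = df.melt(id_vars=['A', 'B', 'C'], var_name='X', value_name='Y')
out = out.sort_values(['A', 'B', 'C', 'X']).reset_index(drop=True)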

Retrieve the rows of data-frames with NAN values and not NAN values

I have a data frame df:
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D2 E3 F2 NaN NaN
2 D1 E3 NaN S2 good
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 NaN
I want to retrieve df1 with no NaN values in the first four columns of the data frame df, and df2 with NaN values in the first four columns of df.
Desired output:
df1 =
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
1 D1 E3 F1 S2 fair
2 D2 E2 F1 S1 NaN
df2 =
DT RE FE SE C_Step
0 D2 E3 F2 NaN NaN
1 D1 E3 NaN S2 good
2 D1 NaN F1 S1 poor
3 D2 NaN F1 S2 poor
4 D2 E3 NaN S1 fair
Using dropna
df1 = df.dropna(subset = ['DT','RE','FE','SE'])
df2 = df.loc[~df.index.isin(df.dropna(subset = ['DT','RE','FE','SE']).index)]
df1
DT RE FE SE C_Step
0 D1 E1 F1 S1 poor
6 D1 E3 F1 S2 fair
7 D2 E2 F1 S1 NaN
df2
DT RE FE SE C_Step
1 D2 E3 F2 NaN NaN
2 D1 E3 NaN S2 good
3 D1 NaN F1 S1 poor
4 D2 NaN F1 S2 poor
5 D2 E3 NaN S1 fair
Option 2: to find the rows with nulls (use loc here, since null_idx holds index labels, not positions):
null_idx = df.index.difference(df.dropna(subset=['DT', 'RE', 'FE', 'SE']).index)
df.loc[null_idx]
Create a mask with isnull + any:
m = df.iloc[:, 0:4].isnull().any(axis=1)
df1 = df[~m]
# DT RE FE SE C_Step
#0 D1 E1 F1 S1 poor
#6 D1 E3 F1 S2 fair
#7 D2 E2 F1 S1 NaN
df2 = df[m]
# DT RE FE SE C_Step
#1 D2 E3 F2 NaN NaN
#2 D1 E3 NaN S2 good
#3 D1 NaN F1 S1 poor
#4 D2 NaN F1 S2 poor
#5 D2 E3 NaN S1 fair
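Equivalently, a minor variant of the mask (my phrasing, not from the original answers) states the condition positively with notna/all and avoids the deprecated positional axis argument:
m = df.iloc[:, 0:4].notna().all(axis=1)  # True where the first four columns are all present
df1, df2 = df[m], df[~m]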
