Merged dataframe seems to be missing two rows - python-3.x

I ran the code below:
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
print(pd.merge(df1, df3, on='HPI'))
I am getting this output:
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 85 3 55 52 8
2 85 3 55 53 6
3 85 2 55 52 8
4 85 2 55 53 6
5 88 2 65 50 9
My questions are:
1) Why am I getting such a big dataframe? HPI has only 4 values, but 6 rows have been generated in the output.
2) If merge takes all the values from HPI, why haven't the values 80 and 88 been taken twice each?

You get 85 four times because it is duplicated in the joined column HPI in both df1 and df3. The values 80 and 88 are unique, so the inner join returns only one row for each of them.
An inner join matches every row on one side with every matching row on the other side, so duplicated keys multiply: the two 85s in df1 times the two 85s in df3 give four rows.
So if you want one row per key, remove the duplicates before merging:
df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')
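A minimal, self-contained sketch (re-creating the question's dataframes) showing that deduplicating the join key first collapses the merge back to one row per unique HPI value:

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])

# Keep only the first row per HPI value on each side, then merge:
merged = pd.merge(df1.drop_duplicates('HPI'),
                  df3.drop_duplicates('HPI'),
                  on='HPI')
print(merged)  # one row each for 80, 85 and 88
```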
Samples with duplicated values in the HPI column, and their outputs:
# 85 appears twice in df1
df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 85 appears twice in df3
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# four rows for 85 (2 x 2), because 85 is duplicated on both sides
print(pd.merge(df1, df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 85 3 55 52 8
2 85 3 55 53 6
3 85 2 55 52 8
4 85 2 55 53 6
5 88 2 65 50 9
# 80 appears twice and 85 appears twice in df1
df1 = pd.DataFrame({'HPI': [80, 85, 80, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 85 appears twice in df3, 80 is unique
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# two rows for 80 (2 x 1) and four rows for 85 (2 x 2)
print(pd.merge(df1, df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 2 65 50 7
2 85 3 55 52 8
3 85 3 55 53 6
4 85 2 55 52 8
5 85 2 55 53 6
# 80 appears twice in df1
df1 = pd.DataFrame({'HPI': [80, 80, 82, 83],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 85 appears twice in df3
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# two rows for 80 (2 x 1); the other keys have no match
print(pd.merge(df1, df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 3 55 50 7
# 80 appears four times in df1
df1 = pd.DataFrame({'HPI': [80, 80, 80, 80],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 80 appears three times in df3
df3 = pd.DataFrame({'HPI': [80, 80, 80, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# twelve rows for 80 (4 x 3)
print(pd.merge(df1, df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 2 50 52 8
2 80 2 50 50 9
3 80 3 55 50 7
4 80 3 55 52 8
5 80 3 55 50 9
6 80 2 65 50 7
7 80 2 65 52 8
8 80 2 65 50 9
9 80 2 55 50 7
10 80 2 55 52 8
11 80 2 55 50 9
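As a side note (assuming pandas 0.21+), merge's `validate` argument can catch unexpected duplicate keys up front instead of silently multiplying rows; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Int_rate': [2, 3, 2, 2]})
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Unemployment': [7, 8, 9, 6]})

try:
    # 85 is duplicated on both sides, so a one-to-one check fails fast
    pd.merge(df1, df3, on='HPI', validate='one_to_one')
except pd.errors.MergeError as exc:
    print('duplicate join keys:', exc)
```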

As jezrael wrote, you have 6 rows because the value HPI=85 is not unique in df1 and df3. By contrast, df1 and df3 each have only one row with HPI=80 and one with HPI=88.
If I also take your index into account, I can guess that what you want is something like this:
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
index
2001 80 2 50 50 7
2002 85 3 55 52 8
2003 88 2 65 50 9
2004 85 2 55 53 6
If you want something like this, you can merge on the index instead. Note that pd.merge does not accept on together with left_index/right_index (passing both raises a MergeError), so drop the duplicated HPI column from one side and align on the index:
pd.merge(df1, df3.drop(columns='HPI'), left_index=True, right_index=True)
But I am just making an assumption, so I don't know if this is the output you would like.

Related

Form a list of pairwise metrics (distances) using a matrix lookup coming from a python pandas dataframe

I have a distance matrix as a dataframe:
data_map = {
    'startNode': ["0","0","0","0","0","455","455","455","455","455","10","10","10","10","10","30","30","30","30","30","2","2","2","2","2"],
    'EndNode': ["0","455","30","10","2","0","455","30","10","2","0","455","30","10","2","0","455","10","2","30","0","455","30","10","2"],
    'Dmeters': ["0","19481","94","90","10","19481","0","750","75","20","90","75","1013","0","200","94","750","1013","50","0","10","20","50","200","0"]
}
df_map_mat = pd.DataFrame.from_dict(data_map)
Input data frames:
df_map_mat
Out[141]:
startNode EndNode Dmeters
0 0 0 0
1 0 455 19481
2 0 30 94
3 0 10 90
4 0 2 10
5 455 0 19481
6 455 455 0
7 455 30 750
8 455 10 75
9 455 2 20
10 10 0 90
11 10 455 75
12 10 30 1013
13 10 10 0
14 10 2 200
15 30 0 94
16 30 455 750
17 30 10 1013
18 30 2 50
19 30 30 0
20 2 0 10
21 2 455 20
22 2 30 50
23 2 10 200
24 2 2 0
I need to query the df_map_mat dataframe and populate the list column shown below. The list column is formed by looking up each Nid against df_map_mat.
For example: the distance from startNode 0 to EndNode 0 is 0; likewise 10 -> 0 is 90, and similarly 30 -> 455 is 750 meters.
df_dist_mat = {
    'Nid': ["0", "10", "2", "30", "455"],
    'NName': ["Q-CH", "ANGC", "AmOR", "ANAGER", "RPURAM"],
    'D_list': ["[0,90,10,94,19481]", "[90,0,200,1013,75]", "[10,200,0,50,20]",
               "[94,1013,50,0,750]", "[19481,75,20,750,0]"]
}
df_dist_mat = pd.DataFrame.from_dict(df_dist_mat)
Expected DataFrame:
df_dist_mat
Out[142]:
Nid NName D_list
0 0 Q-CH [0,90,10,94,19481]
1 10 ANGC [90,0,200,1013,75]
2 2 AmOR [10,200,0,50,20]
3 30 ANAGER [94,1013,50,0,750]
4 455 RPURAM [19481,75,20,750,0]
You can use DataFrame.pivot with DataFrame.reindex:
arr = np.array([0, 10, 2, 30, 455])
df = (df_map_mat.astype({'startNode': int, 'EndNode': int, 'Dmeters': int})
                .pivot(index='startNode', columns='EndNode', values='Dmeters')
                .reindex(index=arr, columns=arr))
print(df)
EndNode 0 10 2 30 455
startNode
0 0 90 10 94 19481
10 90 0 200 1013 75
2 10 200 0 50 20
30 94 1013 50 0 750
455 19481 75 20 750 0
and for lists use:
out = df.to_numpy().tolist()
print (out)
[[0, 90, 10, 94, 19481], [90, 0, 200, 1013, 75],
[10, 200, 0, 50, 20], [94, 1013, 50, 0, 750],
[19481, 75, 20, 750, 0]]
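If the goal is the expected df_dist_mat itself, the row lists of the pivoted matrix can be attached directly. A self-contained sketch, assuming the Nid-to-NName mapping is exactly the one given in the question:

```python
import numpy as np
import pandas as pd

data_map = {
    'startNode': ["0","0","0","0","0","455","455","455","455","455","10","10","10","10","10","30","30","30","30","30","2","2","2","2","2"],
    'EndNode': ["0","455","30","10","2","0","455","30","10","2","0","455","30","10","2","0","455","10","2","30","0","455","30","10","2"],
    'Dmeters': ["0","19481","94","90","10","19481","0","750","75","20","90","75","1013","0","200","94","750","1013","50","0","10","20","50","200","0"],
}
df_map_mat = pd.DataFrame(data_map).astype(int)

# Pivot into a distance matrix, ordered by the desired Nid sequence
arr = np.array([0, 10, 2, 30, 455])
dist = (df_map_mat.pivot(index='startNode', columns='EndNode', values='Dmeters')
                  .reindex(index=arr, columns=arr))

names = {0: 'Q-CH', 10: 'ANGC', 2: 'AmOR', 30: 'ANAGER', 455: 'RPURAM'}
df_dist_mat = pd.DataFrame({'Nid': arr,
                            'NName': [names[n] for n in arr],
                            'D_list': dist.to_numpy().tolist()})
print(df_dist_mat)
```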
I have encoded the Nid column in two np arrays. It might not be an efficient solution, but it gives the answer.
import numpy as np

x = np.array([[0], [10], [2], [30], [455]])
y = np.array([[0], [10], [2], [30], [455]])

def calc_dist(x, y):
    dfm = df_map_mat.astype(int)  # the question's data is stored as strings
    d_list = []
    for i in x:
        d_inner_list = []
        for j in y:
            match = dfm[(dfm["startNode"] == int(i[0]))
                        & (dfm["EndNode"] == int(j[0]))]
            d_inner_list.append(int(match['Dmeters'].iloc[0]))
        d_list.append(d_inner_list)
    print(d_list)

calc_dist(x, y)
solution:
calc_dist(x,y)
[[0, 90, 10, 94, 19481], [90, 0, 200, 1013, 75], [10, 200, 0, 50, 20], [94, 1013, 50, 0, 750], [19481, 75, 20, 750, 0]]

Need help with the agg function after groupby to compute (last - first)

I have the below pandas dataframe.
group A B C D E
0 g1 12 14 26 68 83
1 g1 56 58 67 34 97
2 g1 47 87 23 87 90
3 g2 43 76 98 32 78
4 g2 32 56 36 87 65
5 g2 54 12 24 45 95
I wish to group by the 'group' column and apply aggregate functions, getting (last - first) for column 'E'.
The expected output:
group A B C D E
0 g1 12 87 116 34 7
1 g2 43 12 158 32 17
I've written the below code, but it is not working:
import pandas as pd
df = pd.DataFrame([["g1", 12, 14, 26, 68, 83], ["g1", 56, 58, 67, 34, 97], ["g1", 47, 87, 23, 87, 90], ["g2", 43, 76, 98, 32, 78], ["g2", 32, 56, 36, 87, 65], ["g2", 54, 12, 24, 45, 95]], columns=["group", "A", "B", "C", "D", "E"])
ndf = df.groupby(["group"], as_index=False).agg({"A": 'first', "B": 'last', "C": 'sum', "D": 'min', "E": 'last - first'})
print(df)
print(ndf)
You can use a lambda function for this.
ndf = (
    df.groupby(["group"], as_index=False)
      .agg({"A": 'first',
            "B": 'last',
            "C": 'sum',
            "D": 'min',
            "E": lambda x: x.iat[-1] - x.iat[0]})
)
which will output:
group A B C D E
0 g1 12 87 116 34 7
1 g2 43 12 158 32 17
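Since pandas 0.25, the same aggregation can also be written with named aggregation, which some find more readable; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([["g1", 12, 14, 26, 68, 83], ["g1", 56, 58, 67, 34, 97],
                   ["g1", 47, 87, 23, 87, 90], ["g2", 43, 76, 98, 32, 78],
                   ["g2", 32, 56, 36, 87, 65], ["g2", 54, 12, 24, 45, 95]],
                  columns=["group", "A", "B", "C", "D", "E"])

ndf = df.groupby("group", as_index=False).agg(
    A=("A", "first"),
    B=("B", "last"),
    C=("C", "sum"),
    D=("D", "min"),
    E=("E", lambda x: x.iat[-1] - x.iat[0]),  # last - first
)
print(ndf)
```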

Pandas JOIN/MERGE/CONCAT Data Frame On Specific Indices

I want to join two data frames on specific indices as per the map (dictionary) I have created. What is an efficient way to do this?
Data:
df = pd.DataFrame({"a": [10, 34, 24, 40, 56, 44],
                   "b": [95, 63, 74, 85, 56, 43]})
print(df)
a b
0 10 95
1 34 63
2 24 74
3 40 85
4 56 56
5 44 43
df1 = pd.DataFrame({"c": [1, 2, 3, 4],
                    "d": [5, 6, 7, 8]})
print(df1)
c d
0 1 5
1 2 6
2 3 7
3 4 8
d = {
    (1, 0): 0.67,
    (1, 2): 0.9,
    (2, 1): 0.2,
    (2, 3): 0.34,
    (4, 0): 0.7,
    (4, 2): 0.5
}
Desired Output:
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.9
...
5 56 56 3 7 0.5
I'm able to achieve this but it takes a lot of time since my original data frames' map has about 4.7M rows to map. I'd love to know if there is a way to MERGE, JOIN or CONCAT these data frames on different indices.
My Approach:
matched_rows = []
for key in d.keys():
    s = df.iloc[key[0]].tolist() + df1.iloc[key[1]].tolist() + [d[key]]
    matched_rows.append(s)
df_matched = pd.DataFrame(matched_rows,
                          columns=df.columns.tolist() + df1.columns.tolist() + ['ratio'])
I would highly appreciate your help. Thanks a lot in advance.
Create a Series from the dictionary and reset its index to get a DataFrame, DataFrame.join both frames to it, and finally remove the first 2 columns by position:
df = (pd.Series(d).reset_index(name='ratio')
        .join(df, on='level_0')
        .join(df1, on='level_1')
        .iloc[:, 2:])
print(df)
ratio a b c d
0 0.67 34 63 1 5
1 0.90 34 63 3 7
2 0.20 24 74 2 6
3 0.34 24 74 4 8
4 0.70 56 56 1 5
5 0.50 56 56 3 7
And then if necessary reorder columns:
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
print (df)
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.90
2 24 74 2 6 0.20
3 24 74 4 8 0.34
4 56 56 1 5 0.70
5 56 56 3 7 0.50
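For millions of pairs, another option (a sketch, not benchmarked) is to index both frames positionally with NumPy arrays of the keys, avoiding any Python-level loop:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10, 34, 24, 40, 56, 44],
                   "b": [95, 63, 74, 85, 56, 43]})
df1 = pd.DataFrame({"c": [1, 2, 3, 4], "d": [5, 6, 7, 8]})
d = {(1, 0): 0.67, (1, 2): 0.9, (2, 1): 0.2,
     (2, 3): 0.34, (4, 0): 0.7, (4, 2): 0.5}

keys = np.array(list(d))                            # shape (n_pairs, 2)
left = df.iloc[keys[:, 0]].reset_index(drop=True)   # rows of df by position
right = df1.iloc[keys[:, 1]].reset_index(drop=True) # rows of df1 by position
out = pd.concat([left, right], axis=1)
out["ratio"] = list(d.values())
print(out)
```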

Data partition on known and unknown rows

I have a dataset with known and unknown (missing) values in one column. I'd like to separate the rows into two sets: the first with rows where all values are known, and the second with rows that contain missing (unknown) values.
df = {'Id': [1, 2, 3, 4, 5],
      'First': [30, 22, 18, 49, 22],
      'Second': [80, 28, 16, 56, 30],
      'Third': [14, None, None, 30, 27],
      'Fourth': [14, 85, 17, 22, 14],
      'Fifth': [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns=['Id', 'First', 'Second', 'Third', 'Fourth'])
df
Expected: two separate sets of rows, one with all known values and another with unknown (missing) values.
Let me know if this helps:
df['TF']= df.isnull().any(axis=1)
df_without_none = df[df['TF'] == 0]
df_with_none = df[df['TF'] == 1]
print(df_without_none.head())
print(df_with_none.head())
#### Input ####
   Id  First  Second  Third  Fourth     TF
0   1     30      80   14.0      14  False
1   2     22      28    NaN      85   True
2   3     18      16    NaN      17   True
3   4     49      56   30.0      22  False
4   5     22      30   27.0      14  False
#### Output ####
   Id  First  Second  Third  Fourth     TF
0   1     30      80   14.0      14  False
3   4     49      56   30.0      22  False
4   5     22      30   27.0      14  False
   Id  First  Second  Third  Fourth     TF
1   2     22      28    NaN      85   True
2   3     18      16    NaN      17   True
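The same split can also be done without a helper column, using a boolean mask directly; a minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'First': [30, 22, 18, 49, 22],
                   'Second': [80, 28, 16, 56, 30],
                   'Third': [14, None, None, 30, 27],
                   'Fourth': [14, 85, 17, 22, 14]})

mask = df.isnull().any(axis=1)  # True where a row has at least one missing value
known = df[~mask]               # rows with all values present
unknown = df[mask]              # rows with missing values
print(known)
print(unknown)
```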

Transpose a pandas dataframe with headers as column and not index

When I transpose a dataframe, the headers are considered the "index" by default, but I want them to be a column and not an index. How do I achieve this?
import pandas as pd

data = {'col-a': [97, 98, 99],
        'col-b': [34, 35, 36],
        'col-c': [24, 25, 26]}
df = pd.DataFrame(data)
print(df.T)
0 1 2
col-a 97 98 99
col-b 34 35 36
col-c 24 25 26
Desired Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Try T with reset_index:
df = df.T.reset_index()
print(df)
Note that df.T.reset_index(inplace=True) would not work: df.T returns a new object, so resetting its index in place does not modify the original df.
Output:
index 0 1 2
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
If you care about the column names, add this to the code:
df.columns = range(4)
Or:
it = iter(range(4))
df = df.rename(columns=lambda x: next(it))
Or, if you don't know the number of columns:
df.columns = range(len(df.columns))
Or:
it = iter(range(len(df.columns)))
df = df.rename(columns=lambda x: next(it))
All Output:
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Use reset_index and then set default column names:
df1 = df.T.reset_index()
df1.columns = np.arange(len(df1.columns))
print (df1)
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
Another solution:
print (df.rename_axis(0, axis=1).rename(lambda x: x + 1).T.reset_index())
#alternative
#print (df.T.rename_axis(0).rename(columns = lambda x: x + 1).reset_index())
0 1 2 3
0 col-a 97 98 99
1 col-b 34 35 36
2 col-c 24 25 26
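The transpose, reset, and renumbering can also be chained in a single expression with set_axis; a sketch, assuming pandas 1.0+ (where set_axis returns a copy by default):

```python
import pandas as pd

df = pd.DataFrame({'col-a': [97, 98, 99],
                   'col-b': [34, 35, 36],
                   'col-c': [24, 25, 26]})

# Transpose, move the headers into a column, then renumber all columns 0..n
out = df.T.reset_index().set_axis(range(df.shape[0] + 1), axis=1)
print(out)
```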
