Merged dataframe seems to be missing two rows - python-3.x
I ran the code below:
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
print(pd.merge(df1, df3, on='HPI'))
I am getting the following output:
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9
My questions are:
1) Why is the dataframe so big? HPI has only 4 values, but 6 rows have been generated in the output.
2) If merge takes all the values from HPI, why weren't 80 and 88 taken twice each as well?
You get 85 four times because it is duplicated in the join column HPI in both df1 and df3. 80 and 88 are unique, so the inner join returns only one row for each of them.
An inner join matches every row with a given key on one side against every row with the same key on the other side, so duplicated keys multiply the number of output rows.
So remove the duplicates before merging to get the expected output:
df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')
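If the duplicates are unexpected, merge's validate parameter can catch them up front instead of silently multiplying rows. A minimal sketch using the question's frames (the try/except wrapper is just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]})
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]})

# validate='one_to_one' raises MergeError if either side has duplicate keys
try:
    pd.merge(df1, df3, on='HPI', validate='one_to_one')
    duplicates_found = False
except pd.errors.MergeError:
    duplicates_found = True

# after dropping duplicates, the same check passes and we get one row per key
merged = pd.merge(df1.drop_duplicates('HPI'),
                  df3.drop_duplicates('HPI'),
                  on='HPI', validate='one_to_one')
print(duplicates_found, len(merged))  # True 3
```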
Samples with duplicate values in the HPI columns, and their outputs:
# 2 duplicates of 85
df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 2 duplicates of 85
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 4 rows for 85 (2x2), since 85 is duplicated in both frames
print(pd.merge(df1, df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9
# 2 duplicates of 80, 2 duplicates of 85
df1 = pd.DataFrame({'HPI': [80, 85, 80, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 2 duplicates of 85, 80 unique
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 2 rows for 80 (2x1) and 4 rows for 85 (2x2), since both values appear in both frames
print(pd.merge(df1, df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         2                65            50             7
2   85         3                55            52             8
3   85         3                55            53             6
4   85         2                55            52             8
5   85         2                55            53             6
# 2 duplicates of 80
df1 = pd.DataFrame({'HPI': [80, 80, 82, 83],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 80 unique, 2 duplicates of 85
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 2 rows for 80 (2x1), the only value present in both frames
print(pd.merge(df1, df3, on='HPI'))
   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   80         3                55            50             7
# 4 duplicates of 80
df1 = pd.DataFrame({'HPI': [80, 80, 80, 80],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
# 3 duplicates of 80
df3 = pd.DataFrame({'HPI': [80, 80, 80, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])
# 12 rows for 80 (4x3), since 80 appears four times in df1 and three times in df3
print(pd.merge(df1, df3, on='HPI'))
    HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0    80         2                50            50             7
1    80         2                50            52             8
2    80         2                50            50             9
3    80         3                55            50             7
4    80         3                55            52             8
5    80         3                55            50             9
6    80         2                65            50             7
7    80         2                65            52             8
8    80         2                65            50             9
9    80         2                55            50             7
10   80         2                55            52             8
11   80         2                55            50             9
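The row counts in these samples follow directly from the key counts: an inner merge emits n_left x n_right rows for every shared key value. A small sketch (trimmed to the key column plus one value column each) that predicts the size of the merged frame from value_counts:

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Int_rate': [2, 3, 2, 2]})
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85], 'Unemployment': [7, 8, 9, 6]})

left_counts = df1['HPI'].value_counts()
right_counts = df3['HPI'].value_counts()

# keys present on both sides contribute n_left * n_right rows;
# keys on only one side align to NaN and are dropped (inner join)
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = pd.merge(df1, df3, on='HPI')
print(expected_rows, len(merged))  # 6 6: 1*1 (80) + 2*2 (85) + 1*1 (88)
```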
As jezrael wrote, you get 6 rows because the value HPI=85 is not unique in df1 and df3. By contrast, both df1 and df3 contain only one row each for HPI=80 and HPI=88.
If I also take your index into account, I can guess that what you want is something like this:
       HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
index
2001    80         2                50            50             7
2002    85         3                55            52             8
2003    88         2                65            50             9
2004    85         2                55            53             6
If you want something like this, then you can do:
pd.merge(df1, df3, left_index=True, right_index=True, on='HPI')
But I am just making an assumption, so I don't know if this is the output you would like.
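Note that recent pandas versions raise a MergeError when on is combined with left_index/right_index, so a version-safe way to get the same 4-row result is to join purely on the index and keep a single copy of HPI. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Int_rate': [2, 3, 2, 2],
                    'US_GDP_Thousands': [50, 55, 65, 55]},
                   index=[2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI': [80, 85, 88, 85],
                    'Unemployment': [7, 8, 9, 6],
                    'Low_tier_HPI': [50, 52, 50, 53]},
                   index=[2001, 2002, 2003, 2004])

# align rows by index only; HPI is identical in both frames, so drop one copy
result = df1.join(df3.drop(columns='HPI'))
print(result.shape)  # (4, 5): one row per year, no duplication
```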
Related
Form a list of pairwise metrics (distances) using a matrix lookup from a python pandas dataframe
I have a distance matrix as a dataframe:

data_map = {
    'startNode': ["0","0","0","0","0","455","455","455","455","455","10","10","10","10","10","30","30","30","30","30","2","2","2","2","2"],
    'EndNode':   ["0","455","30","10","2","0","455","30","10","2","0","455","30","10","2","0","455","10","2","30","0","455","30","10","2"],
    'Dmeters':   ["0","19481","94","90","10","19481","0","750","75","20","90","75","1013","0","200","94","750","1013","50","0","10","20","50","200","0"]
}
df_map_mat = pd.DataFrame.from_dict(data_map)

Input dataframe:

df_map_mat
Out[141]:
   startNode EndNode Dmeters
0          0       0       0
1          0     455   19481
2          0      30      94
3          0      10      90
4          0       2      10
5        455       0   19481
6        455     455       0
7        455      30     750
8        455      10      75
9        455       2      20
10        10       0      90
11        10     455      75
12        10      30    1013
13        10      10       0
14        10       2     200
15        30       0      94
16        30     455     750
17        30      10    1013
18        30       2      50
19        30      30       0
20         2       0      10
21         2     455      20
22         2      30      50
23         2      10     200
24         2       2       0

I need to query the df_map_mat dataframe and populate the D_list column shown below. The list column is formed by querying the Nid column against df_map_mat, e.g. 0 -> 0 gives distance 0, 10 -> 0 gives 90, and similarly 30 -> 455 gives 750 meters.

df_dist_mat = {
    'Nid':    ["0", "10", "2", "30", "455"],
    'NName':  ["Q-CH", "ANGC", "AmOR", "ANAGER", "RPURAM"],
    'D_list': ["[0,90,10,94,19481]", "[90,0,200,1013,75]", "[10,200,0,50,20]", "[94,1013,50,0,750]", "[19481,75,20,750,0]"]
}
df_dist_mat = pd.DataFrame.from_dict(df_dist_mat)

Expected DataFrame:

df_dist_mat
Out[142]:
   Nid   NName               D_list
0    0    Q-CH   [0,90,10,94,19481]
1   10    ANGC   [90,0,200,1013,75]
2    2    AmOR     [10,200,0,50,20]
3   30  ANAGER   [94,1013,50,0,750]
4  455  RPURAM  [19481,75,20,750,0]
You can use DataFrame.pivot with DataFrame.reindex (keyword arguments for pivot are required in pandas 2.0+, and Dmeters is stored as strings, so convert it too):

import numpy as np

arr = np.array([0, 10, 2, 30, 455])
df = (df_map_mat.astype({'startNode': int, 'EndNode': int, 'Dmeters': int})
                .pivot(index='startNode', columns='EndNode', values='Dmeters')
                .reindex(index=arr, columns=arr))
print(df)

EndNode        0    10   2    30    455
startNode
0              0    90   10    94  19481
10            90     0  200  1013     75
2             10   200    0    50     20
30            94  1013   50     0    750
455        19481    75   20   750      0

and for lists use:

out = df.to_numpy().tolist()
print(out)
[[0, 90, 10, 94, 19481], [90, 0, 200, 1013, 75], [10, 200, 0, 50, 20], [94, 1013, 50, 0, 750], [19481, 75, 20, 750, 0]]
I have encoded the node-id column in two numpy arrays. It might not be the most efficient solution, but it gives the answer. Note that startNode/EndNode hold strings, so the lookup compares against str(i) and str(j):

import numpy as np

x = np.array([0, 10, 2, 30, 455])
y = np.array([0, 10, 2, 30, 455])

def calc_dist(x, y):
    d_list = []
    for i in x:
        d_inner_list = []
        for j in y:
            match = df_map_mat[(df_map_mat["startNode"] == str(i)) &
                               (df_map_mat["EndNode"] == str(j))]
            d_inner_list.append(int(match['Dmeters'].iloc[0]))
        d_list.append(d_inner_list)
    print(d_list)

calc_dist(x, y)

Solution:

[[0, 90, 10, 94, 19481], [90, 0, 200, 1013, 75], [10, 200, 0, 50, 20], [94, 1013, 50, 0, 750], [19481, 75, 20, 750, 0]]
Need help on agg function after groupby for doing operation last - first
I have the pandas dataframe below.

  group   A   B   C   D   E
0    g1  12  14  26  68  83
1    g1  56  58  67  34  97
2    g1  47  87  23  87  90
3    g2  43  76  98  32  78
4    g2  32  56  36  87  65
5    g2  54  12  24  45  95

I wish to apply groupby on the 'group' column and apply an aggregate function that computes (last - first) for column 'E'. The expected output:

  group   A   B    C   D   E
0    g1  12  87  116  34   7
1    g2  43  12  158  32  17

I've written the code below, but it is not working.

import pandas as pd

df = pd.DataFrame([["g1", 12, 14, 26, 68, 83],
                   ["g1", 56, 58, 67, 34, 97],
                   ["g1", 47, 87, 23, 87, 90],
                   ["g2", 43, 76, 98, 32, 78],
                   ["g2", 32, 56, 36, 87, 65],
                   ["g2", 54, 12, 24, 45, 95]],
                  columns=["group", "A", "B", "C", "D", "E"])

ndf = df.groupby(["group"], as_index=False).agg({"A": 'first', "B": 'last', "C": 'sum',
                                                 "D": 'min', "E": 'last - first'})
print(df)
print(ndf)
You can use a lambda function for this.

ndf = (df.groupby(["group"], as_index=False)
         .agg({"A": 'first', "B": 'last', "C": 'sum', "D": 'min',
               "E": lambda x: x.iat[-1] - x.iat[0]}))

will output

  group   A   B    C   D   E
0    g1  12  87  116  34   7
1    g2  43  12  158  32  17
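An alternative that keeps the output column names explicit is named aggregation (available since pandas 0.25); a sketch reproducing the same result:

```python
import pandas as pd

df = pd.DataFrame([["g1", 12, 14, 26, 68, 83],
                   ["g1", 56, 58, 67, 34, 97],
                   ["g1", 47, 87, 23, 87, 90],
                   ["g2", 43, 76, 98, 32, 78],
                   ["g2", 32, 56, 36, 87, 65],
                   ["g2", 54, 12, 24, 45, 95]],
                  columns=["group", "A", "B", "C", "D", "E"])

# named aggregation: output_column=(input_column, aggfunc)
ndf = df.groupby("group", as_index=False).agg(
    A=("A", "first"),
    B=("B", "last"),
    C=("C", "sum"),
    D=("D", "min"),
    E=("E", lambda x: x.iat[-1] - x.iat[0]),  # last - first
)
print(ndf)
```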
Pandas JOIN/MERGE/CONCAT Data Frame On Specific Indices
I want to join two data frames on specific indices as per the map (dictionary) I have created. What is an efficient way to do this?

Data:

df = pd.DataFrame({"a": [10, 34, 24, 40, 56, 44],
                   "b": [95, 63, 74, 85, 56, 43]})
print(df)

    a   b
0  10  95
1  34  63
2  24  74
3  40  85
4  56  56
5  44  43

df1 = pd.DataFrame({"c": [1, 2, 3, 4], "d": [5, 6, 7, 8]})
print(df1)

   c  d
0  1  5
1  2  6
2  3  7
3  4  8

d = {(1, 0): 0.67,
     (1, 2): 0.9,
     (2, 1): 0.2,
     (2, 3): 0.34,
     (4, 0): 0.7,
     (4, 2): 0.5}

Desired Output:

    a   b  c  d  ratio
0  34  63  1  5   0.67
1  34  63  3  7   0.90
...
5  56  56  3  7   0.50

I'm able to achieve this, but it takes a lot of time since my original map has about 4.7M rows. I'd love to know if there is a way to MERGE, JOIN or CONCAT these data frames on different indices.

My approach:

matched_rows = []
for key in d.keys():
    s = df.iloc[key[0]].tolist() + df1.iloc[key[1]].tolist() + [d[key]]
    matched_rows.append(s)
df_matched = pd.DataFrame(matched_rows,
                          columns=df.columns.tolist() + df1.columns.tolist() + ['ratio'])

I would highly appreciate your help. Thanks a lot in advance.
Create a Series from the dictionary, then a DataFrame; DataFrame.join both and finally remove the first 2 columns by position:

df = (pd.Series(d).reset_index(name='ratio')
        .join(df, on='level_0')
        .join(df1, on='level_1')
        .iloc[:, 2:])
print(df)

   ratio   a   b  c  d
0   0.67  34  63  1  5
1   0.90  34  63  3  7
2   0.20  24  74  2  6
3   0.34  24  74  4  8
4   0.70  56  56  1  5
5   0.50  56  56  3  7

And then, if necessary, reorder the columns:

df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
print(df)

    a   b  c  d  ratio
0  34  63  1  5   0.67
1  34  63  3  7   0.90
2  24  74  2  6   0.20
3  24  74  4  8   0.34
4  56  56  1  5   0.70
5  56  56  3  7   0.50
Data partition on known and unknown rows
I have a dataset with known and unknown variables (just one column). I'd like to separate the rows into 2 lists: the first with rows where all variables are known, and the second with rows that have missing (unknown) variables.

df = {'Id': [1, 2, 3, 4, 5],
      'First': [30, 22, 18, 49, 22],
      'Second': [80, 28, 16, 56, 30],
      'Third': [14, None, None, 30, 27],
      'Fourth': [14, 85, 17, 22, 14],
      'Fifth': [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns=['Id', 'First', 'Second', 'Third', 'Fourth'])
df

How do I get two separate lists, one with all known variables and another one with the unknown variables?
Let me know if this helps:

df['TF'] = df.isnull().any(axis=1)
df_without_none = df[df['TF'] == 0]
df_with_none = df[df['TF'] == 1]
print(df_without_none.head())
print(df_with_none.head())

#### Input ####
   Id  First  Second  Third  Fourth  Fruit Total     TF
0   1     30      80   14.0      14        124.0  False
1   2     22      28    NaN      85         50.0   True
2   3     18      16    NaN      17         34.0   True
3   4     49      56   30.0      22        135.0  False
4   5     22      30   27.0      14         79.0  False

#### Output ####
   Id  First  Second  Third  Fourth  Fruit Total     TF
0   1     30      80   14.0      14        124.0  False
3   4     49      56   30.0      22        135.0  False
4   5     22      30   27.0      14         79.0  False

   Id  First  Second  Third  Fourth  Fruit Total     TF
1   2     22      28    NaN      85         50.0   True
2   3     18      16    NaN      17         34.0   True
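The helper column isn't strictly needed; a shorter sketch of the same partition using a boolean mask directly, on the question's own data:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'First': [30, 22, 18, 49, 22],
                   'Second': [80, 28, 16, 56, 30],
                   'Third': [14, None, None, 30, 27],
                   'Fourth': [14, 85, 17, 22, 14]})

has_na = df.isna().any(axis=1)   # True where the row has at least one NaN

df_known = df[~has_na]    # rows with all values present
df_unknown = df[has_na]   # rows with at least one missing value
print(len(df_known), len(df_unknown))  # 3 2
```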
Transpose a pandas dataframe with headers as column and not index
When I transpose a dataframe, the headers are considered as "index" by default. But I want it to be a column and not an index. How do I achieve this ? import pandas as pd dict = {'col-a': [97, 98, 99], 'col-b': [34, 35, 36], 'col-c': [24, 25, 26]} df = pd.DataFrame(dict) print(df.T) 0 1 2 col-a 97 98 99 col-b 34 35 36 col-c 24 25 26 Desired Output: 0 1 2 3 0 col-a 97 98 99 1 col-b 34 35 36 2 col-c 24 25 26
Try T with reset_index:

df = df.T.reset_index()
print(df)

Or, for the in-place form, assign the transpose first (calling reset_index(inplace=True) directly on df.T would only modify a temporary copy and be discarded):

df = df.T
df.reset_index(inplace=True)
print(df)

Both output:

   index   0   1   2
0  col-a  97  98  99
1  col-b  34  35  36
2  col-c  24  25  26

If you care about the column names, add this to the code:

df.columns = range(4)

Or:

it = iter(range(4))
df = df.rename(columns=lambda x: next(it))

Or, if you don't know the number of columns:

df.columns = range(len(df.columns))

Or:

it = iter(range(len(df.columns)))
df = df.rename(columns=lambda x: next(it))

All output:

       0   1   2   3
0  col-a  97  98  99
1  col-b  34  35  36
2  col-c  24  25  26
Use reset_index and then set default column names:

import numpy as np

df1 = df.T.reset_index()
df1.columns = np.arange(len(df1.columns))
print(df1)

       0   1   2   3
0  col-a  97  98  99
1  col-b  34  35  36
2  col-c  24  25  26

Another solution:

print(df.rename_axis(0, axis=1).rename(lambda x: x + 1).T.reset_index())
# alternative
# print(df.T.rename_axis(0).rename(columns=lambda x: x + 1).reset_index())

       0   1   2   3
0  col-a  97  98  99
1  col-b  34  35  36
2  col-c  24  25  26