Data partition on known and unknown rows

Data partition on known and unknown rows - python-3.x

I have a dataset with known and unknown variables (just one column). I'd like to separate rows for 2 lists - First list of rows with all known variables and the Second list of rows with all missed (unknown) variables.
df = {'Id' : [1, 2, 3, 4, 5],
'First' : [30, 22, 18, 49, 22],
'Second' : [80, 28, 16, 56, 30],
'Third' : [14, None, None, 30, 27],
'Fourth' : [14, 85, 17, 22, 14],
'Fifth' : [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns = ['Id', 'First', 'Second', 'Third', 'Fourth'])
df
Two separate lists with all Known variables and another one with Unknown variables

Let me know if this helps :
df['TF']= df.isnull().any(axis=1)
df_without_none = df[df['TF'] == 0]
df_with_none = df[df['TF'] == 1]
print(df_without_none.head())
print(df_with_none.head())
#### Input ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
#### Output ####
Id First Second Third Fourth Fruit Total TF
0 1 30 80 14.0 14 124.0 False
3 4 49 56 30.0 22 135.0 False
4 5 22 30 27.0 14 79.0 False
Id First Second Third Fourth Fruit Total TF
1 2 22 28 NaN 85 50.0 True
2 3 18 16 NaN 17 34.0 True

Related

Set upperbound in a column for a specific group by using Python

I have a dataset given as such in Python:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'ID': [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
'Salary': [1, 2, 3, 4, 5,6,7,8,9,10, 1, 2, 3,4,5,6, 1, 2, 3, 4,5,6,7,8],
'Children': ['No', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
'Days': [123, 128, 66, 120, 141,123, 128, 66, 120, 141, 52,96, 120, 141, 52,96, 120, 141,123,15,85,36,58,89],
}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The above dataframe looks as such :
Now, for every ID/group, I wish to set an upperbound for some value of 'Salary'.
For example,
For ID=1, the upperbound of 'Salary' should be set at 4
For ID=2, the upperbound of 'Salary' should be set at 3
For ID=3, the upperbound of 'Salary' should be set at 5
The net result needs to look as such:
Can somebody please let me know how to achieve this task in python?

Use custom function with mapping by helper dictionary in GroupBy.transform:
d = {1:4, 2:3, 3:5}
def f(x):
x.iloc[:d[x.name]] = d[x.name]
return x
df['Salary'] = df.groupby('ID')['Salary'].transform(f)
print (df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
Another idea is use GroupBy.cumcount for counter per ID, compared by mapped ID and if match set mapped Series by Series.mask:
d = {1:4, 2:3, 3:5}
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df.groupby('ID').cumcount().lt(s), s)
Or if counter column is in Salary is possible use:
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df['Salary'].le(s), s)
print (df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89

One option is to create a series from the dictionary, merge with the dataframe and then update the Salary column conditionally:
ser = pd.Series(d, name = 'd')
ser.index.name = 'ID'
(df
.merge(ser, on = 'ID')
.assign(Salary = lambda f: np.where(f.Salary.lt(f.d), f.d, f.Salary))
.drop(columns='d')
)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89

Need help on agg function after groupby for doing operation last - first

I've below pandas dataframe.
group A B C D E
0 g1 12 14 26 68 83
1 g1 56 58 67 34 97
2 g1 47 87 23 87 90
3 g2 43 76 98 32 78
4 g2 32 56 36 87 65
5 g2 54 12 24 45 95
I wish to apply groupby on same using column 'group' and wish to apply aggregate function to get (last - first) for column 'E'.
The expected output:
group A B C D E
0 g1 12 87 116 34 7
1 g2 43 12 158 32 17
I've written below code. But it is not working.
import pandas as pd
df = pd.DataFrame([["g1", 12, 14, 26, 68, 83], ["g1", 56, 58, 67, 34, 97], ["g1", 47, 87, 23, 87, 90], ["g2", 43, 76, 98, 32, 78], ["g2", 32, 56, 36, 87, 65], ["g2", 54, 12, 24, 45, 95]], columns=["group", "A", "B", "C", "D", "E"])
ndf = df.groupby(["group"], as_index=False).agg({"A": 'first', "B": 'last', "C": 'sum', "D": 'min', "E": 'last - first'})
print(df)
print(ndf)

You can use a lambda function for this.
ndf = (
df.groupby(["group"], as_index=False)
.agg({"A": 'first',
"B": 'last',
"C": 'sum',
"D": 'min',
"E": lambda x: x.iat[-1]-x.iat[0]})
)
will output
group A B C D E
0 g1 12 87 116 34 7
1 g2 43 12 158 32 17

update the same value of parent column as child list values in pandas data frame

Input dataframe
data = {
'IDs': ['A1','A10','A11','A12','A13','A14','A17','A10','A68','A7','A68','A34','A6','A24','A20','A21','A34','A14','A20','A68'],
'S_S' :['G001','','','','','','','','','','','','','','','','','','','',],
'St_s': ['Pa','','','','','','','','','','','','','','','','','','','',],
'SsFlag': ['Pr','','','','','','','','','','','','','','','','','','','',],
'org_id' :[32,10,11,12,11,12,17,10,68,7,68,34,6,24,20,21,34,14,20,68,],
'flag': [[ '32','68','7'],['10', '68'],['11', '12', '34', '6'],['12','24'],['11','20','21','34'],['12','14','20'],['17','10','68'],[],[],[],[],[],[],[],[],[],[],[],[],[]]
}
df = pd.DataFrame.from_dict(data)
df
output of the original dataframe:
Out[713]:
IDs S_S St_s SsFlag org_id flag
0 A1 G001 Pa Pr 32 [32, 68, 7]
1 A10 10 [10, 68]
2 A11 11 [11, 12, 34, 6]
3 A12 12 [12, 24]
4 A13 11 [11, 20, 21, 34]
5 A14 12 [12, 14, 20]
6 A17 17 [17, 10, 68]
7 A10 10 []
8 A68 68 []
9 A7 7 []
10 A68 68 []
11 A34 34 []
12 A6 6 []
13 A24 24 []
14 A20 20 []
15 A21 21 []
16 A34 34 []
17 A14 14 []
18 A20 20 []
19 A68 68 []
Required dataframe:
data = {
'IDs': ['A1','A10','A11','A12','A13','A14','A17','A10','A68','A7','A68','A34','A6','A24','A20','A21','A34','A14','A20','A68'],
'S_S' :['G001','','','','','','','','','','','','','','','','','','','',],
'St_s': ['Pa','','','','','','','','','','','','','','','','','','','',],
'SsFlag': ['Pr','','','','','','','','','','','','','','','','','','','',],
'org_id' :[32,10,11,12,11,12,17,10,68,7,68,34,6,24,20,21,34,14,20,68,],
'rel_id' : [32,10,11,11,11,12,17,17,32,32,10,11,11,12,11,11,11,12,12,17,],
'flag': [[ '32','68','7'],['10', '68'],['11', '12', '34', '6'],['12','24'],['11','20','21','34'],['12','14','20'],['17','10','68'],[],[],[],[],[],[],[],[],[],[],[],[],[]],
'Processed_first' :['','','','yes','','','','','','','','yes','yes','','yes','yes','yes','yes','yes','yes',]
}
df = pd.DataFrame.from_dict(data)
df
Out[679]:
IDs S_S St_s SsFlag org_id rel_id flag Processed_first
0 A1 G001 Pa Pr 32 32 [32, 68, 7]
1 A10 10 10 [10, 68]
2 A11 11 11 [11, 12, 34, 6]
3 A12 12 11 [12, 24] yes
4 A13 11 11 [11, 20, 21, 34]
5 A14 12 12 [12, 14, 20]
6 A17 17 17 [17, 10, 68]
7 A10 10 17 []
8 A68 68 32 []
9 A7 7 32 []
10 A68 68 10 []
11 A34 34 11 [] yes
12 A6 6 11 [] yes
13 A24 24 12 []
14 A20 20 11 [] yes
15 A21 21 11 [] yes
16 A34 34 11 [] yes
17 A14 14 12 [] yes
18 A20 20 12 [] yes
19 A68 68 17 [] yes
I need a column to have an updated id depending upon the parent(org_id) and the childs(rel_id) column whose child list is there in flag column, also added processed_first for reference to explain the logic as the alerts are processed first, hence need not process that column.
for each element in the flag_list update the rel_id column, for the first time it gets the self-id same as org_id, the second element the related org_id column with the should be updated by parent_id, in rel_id column, for example in first row 32 get 32 first as id, the second element 68 appears at the bottom 8th row hence gets an ID as 32 since 32 is its parent. Similarly, 2nd row 10 gets an id as 10 for the first time, and 68 appears again in 10th row and gets a related id as 10, process_firs indicates the alert is processed.

IIUC, this is one way to update rel_id:
df_map = df.set_index('org_id')['flag'].explode().rename_axis('rel_id').rename('org_id').reset_index()
df_map = df_map.set_index(['org_id', df_map.groupby('org_id').cumcount()]).reset_index().dropna()
df_map['org_id'] = df_map['org_id'].astype('int')
df.set_index(['org_id', df.groupby('org_id').cumcount()]).reset_index().merge(df_map)
Output:
org_id level_1 IDs S_S St_s SsFlag flag rel_id
0 32 0 A1 G001 Pa Pr [32, 68, 7] 32
1 10 0 A10 [10, 68] 10
2 11 0 A11 [11, 12, 34, 6] 11
3 12 0 A12 [12, 24] 11
4 11 1 A13 [11, 20, 21, 34] 11
5 12 1 A14 [12, 14, 20] 12
6 17 0 A17 [17, 10, 68] 17
7 10 1 A10 [] 17
8 68 0 A68 [] 32
9 7 0 A7 [] 32
10 68 1 A68 [] 10
11 34 0 A34 [] 11
12 6 0 A6 [] 11
13 24 0 A24 [] 12
14 20 0 A20 [] 11
15 21 0 A21 [] 11
16 34 1 A34 [] 11
17 14 0 A14 [] 12
18 20 1 A20 [] 12
19 68 2 A68 [] 17

python - Iterating over multi-index pandas dataframe

I´m trying to iterate over a huge pandas dataframe (over 370.000 rows) based on the index.
For each row the code should look back on the last 12 entries of this index (if available) and sum up based on (running) quarters / semesters / year.
If there is no information or not enough information (only last 3 months) then the code should consider the other months / quarters as 0.
Here is a sample of my dataframe:
This is the expected output:
So looking at DateID "1" we don´t have any other information for this row. DateID "1" is the last month in this case (month 12 so to say) and therefore in Q4 and H2. All other previous month are not existing and therefore not considered.
I already found a working solution but its very inefficient and takes a huge amount of time that is not acceptable.
Here is my code sample:
for company_name, c in df.groupby('Account Name'):
for i, row in c.iterrows():
i += 1
if i < 4:
q4 = c.iloc[:i]['Value$'].sum()
q3 = 0
q2 = 0
q1 = 0
h2 = q4 + q3
h1 = q2 + q1
year = q4 + q3 + q2 + q1
elif 3 < i < 7:
q4 = c.iloc[i-3:i]['Value$'].sum()
q3 = c.iloc[:i-3]['Value$'].sum()
q2 = 0
q1 = 0
h2 = q4 + q3
h1 = q2 + q1
year = q4 + q3 + q2 + q1
elif 6 < i < 10:
q4 = c.iloc[i-3:i]['Value$'].sum()
q3 = c.iloc[i-6:i-3]['Value$'].sum()
q2 = c.iloc[:i-6]['Value$'].sum()
q1 = 0
h2 = q4 + q3
h1 = q2 + q1
year = q4 + q3 + q2 + q1
elif 9 < i < 13:
q4 = c.iloc[i-3:i]['Value$'].sum()
q3 = c.iloc[i-6:i-3]['Value$'].sum()
q2 = c.iloc[i-9:i-6]['Value$'].sum()
q1 = c.iloc[:i-9]['Value$'].sum()
h2 = q4 + q3
h1 = q2 + q1
year = q4 + q3 + q2 + q1
else:
q4 = c.iloc[i-3:i]['Value$'].sum()
q3 = c.iloc[i-6:i-3]['Value$'].sum()
q2 = c.iloc[i-9:i-6]['Value$'].sum()
q1 = c.iloc[i-12:i-9]['Value$'].sum()
h2 = q4 + q3
h1 = q2 + q1
year = q4 + q3 + q2 + q1
new_df = new_df.append({'Account Name':row['Account Name'], 'DateID': row['DateID'],'Q4':q4,'Q3':q3,'Q2':q2,'Q1':q1,'H1':h1,'H2':h2,'Year':year},ignore_index=True)
As I said I´m looking for a more efficient way to calculate these numbers as I have almost 10.000 Account Names and 30 Date ID´s per Account.
Thanks a lot!

If I got you right, this should calculate your figures:
grouped= df.groupby('Account Name')['Value$']
last_3= grouped.apply(lambda ser: ser.rolling(window=3, min_periods=1).sum())
last_6= grouped.apply(lambda ser: ser.rolling(window=6, min_periods=1).sum())
last_9= grouped.apply(lambda ser: ser.rolling(window=9, min_periods=1).sum())
last_12= grouped.apply(lambda ser: ser.rolling(window=12, min_periods=1).sum())
df['Q4']= last_3
df['Q3']= last_6 - last_3
df['Q2']= last_9 - last_6
df['Q1']= last_12 - last_9
df['H1']= df['Q1'] + df['Q2']
df['H2']= df['Q3'] + df['Q4']
This outputs:
Out[19]:
Account Name DateID Value$ Q4 Q3 Q2 Q1 H1 H2
0 A 0 33 33.0 0.0 0.0 0.0 0.0 33.0
1 A 1 20 53.0 0.0 0.0 0.0 0.0 53.0
2 A 2 24 77.0 0.0 0.0 0.0 0.0 77.0
3 A 3 21 65.0 33.0 0.0 0.0 0.0 98.0
4 A 4 22 67.0 53.0 0.0 0.0 0.0 120.0
5 A 5 31 74.0 77.0 0.0 0.0 0.0 151.0
6 A 6 30 83.0 65.0 33.0 0.0 33.0 148.0
7 A 7 23 84.0 67.0 53.0 0.0 53.0 151.0
8 A 8 11 64.0 74.0 77.0 0.0 77.0 138.0
9 A 9 35 69.0 83.0 65.0 33.0 98.0 152.0
10 A 10 32 78.0 84.0 67.0 53.0 120.0 162.0
11 A 11 31 98.0 64.0 74.0 77.0 151.0 162.0
12 A 12 32 95.0 69.0 83.0 65.0 148.0 164.0
13 A 13 20 83.0 78.0 84.0 67.0 151.0 161.0
14 A 14 15 67.0 98.0 64.0 74.0 138.0 165.0
15 B 0 44 44.0 0.0 0.0 0.0 0.0 44.0
16 B 1 43 87.0 0.0 0.0 0.0 0.0 87.0
17 B 2 31 118.0 0.0 0.0 0.0 0.0 118.0
18 B 3 10 84.0 44.0 0.0 0.0 0.0 128.0
19 B 4 13 54.0 87.0 0.0 0.0 0.0 141.0
20 B 5 20 43.0 118.0 0.0 0.0 0.0 161.0
21 B 6 28 61.0 84.0 44.0 0.0 44.0 145.0
22 B 7 14 62.0 54.0 87.0 0.0 87.0 116.0
23 B 8 20 62.0 43.0 118.0 0.0 118.0 105.0
24 B 9 41 75.0 61.0 84.0 44.0 128.0 136.0
25 B 10 39 100.0 62.0 54.0 87.0 141.0 162.0
26 B 11 46 126.0 62.0 43.0 118.0 161.0 188.0
27 B 12 26 111.0 75.0 61.0 84.0 145.0 186.0
28 B 13 24 96.0 100.0 62.0 54.0 116.0 196.0
29 B 14 34 84.0 126.0 62.0 43.0 105.0 210.0
32 C 2 12 12.0 0.0 0.0 0.0 0.0 12.0
33 C 3 15 27.0 0.0 0.0 0.0 0.0 27.0
34 C 4 45 72.0 0.0 0.0 0.0 0.0 72.0
35 C 5 22 82.0 12.0 0.0 0.0 0.0 94.0
36 C 6 48 115.0 27.0 0.0 0.0 0.0 142.0
37 C 7 45 115.0 72.0 0.0 0.0 0.0 187.0
38 C 8 11 104.0 82.0 12.0 0.0 12.0 186.0
39 C 9 27 83.0 115.0 27.0 0.0 27.0 198.0
For the following test data:
data= {'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'DateID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
'Value$': [33, 20, 24, 21, 22, 31, 30, 23, 11, 35, 32, 31, 32, 20, 15, 44, 43, 31, 10, 13, 20, 28, 14, 20, 41, 39, 46, 26, 24, 34, 12, 15, 45, 22, 48, 45, 11, 27]
}
df= pd.DataFrame(data)
Edit:: If you want to count the unique entires over the same period, you can do that as follows:
def get_nunique(np_array):
unique, counts= np.unique(np_array, return_counts=True)
return len(unique)
df['Category'].rolling(window=3, min_periods=1).apply(get_nunique)

I didn't want to overload the answer above completely, so I add a new one for your second part:
# define a function that
# creates the unique counts
# by aggregating period_length times
# so 3 times for the quarter mapping
# and 6 times for the half year
# it's basically doing something like
# a sliding window aggregation
def get_mapping(df, period_lenght=3):
df_mapping= None
for offset in range(period_lenght):
quarter= (df['DateID']+offset) // period_lenght
aggregated= df.groupby([quarter, df['Account Name']]).agg({'DateID': max, 'Category': lambda ser: len(set(ser))})
incomplete_data= ((aggregated['DateID']+offset+1)//period_lenght <= aggregated.index.get_level_values(0)) & (aggregated.index.get_level_values(0) >= period_lenght)
aggregated.drop(aggregated.index[incomplete_data].to_list(), inplace=True)
aggregated.set_index('DateID', append=True, inplace=True)
aggregated= aggregated.droplevel(0, axis='index')
if df_mapping is None:
df_mapping= aggregated
else:
df_mapping= pd.concat([df_mapping, aggregated], axis='index')
return df_mapping
# apply it for 3 months and merge it to the source df
df_mapping= get_mapping(df, period_lenght=3)
df_mapping.columns= ['unique_3_months']
df_with_3_months= df.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)
# do the same for 6 months and merge it again
df_mapping= get_mapping(df, period_lenght=6)
df_mapping.columns= ['unique_6_months']
df_with_6_months= df_with_3_months.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)
This results in:
Out[305]:
Account Name DateID Value$ Category unique_3_months unique_6_months
0 A 0 10 1 1 1
1 A 1 12 2 2 2
2 A 1 38 1 2 2
3 A 2 20 3 3 3
4 A 3 25 3 3 3
5 A 4 24 4 2 4
6 A 5 27 8 3 5
7 A 6 30 5 3 6
8 A 7 47 7 3 5
9 A 8 30 4 3 5
10 A 9 17 7 2 4
11 A 10 20 8 3 4
12 A 11 33 8 2 4
13 A 12 45 9 2 4
14 A 13 19 2 3 5
15 A 14 24 10 3 3
15 A 14 24 10 3 4
15 A 14 24 10 3 4
15 A 14 24 10 3 5
15 A 14 24 10 3 1
15 A 14 24 10 3 2
16 B 0 41 2 1 1
17 B 1 13 9 2 2
18 B 2 17 6 3 3
19 B 3 45 7 3 4
20 B 4 11 6 2 4
21 B 5 38 8 3 5
22 B 6 44 8 2 4
23 B 7 15 8 1 3
24 B 8 50 2 2 4
25 B 9 27 7 3 4
26 B 10 38 10 3 4
27 B 11 25 6 3 5
28 B 12 25 8 3 5
29 B 13 14 7 3 5
30 B 14 25 9 3 3
30 B 14 25 9 3 4
30 B 14 25 9 3 5
30 B 14 25 9 3 5
30 B 14 25 9 3 1
30 B 14 25 9 3 2
31 C 2 31 9 1 1
32 C 3 31 7 2 2
33 C 4 26 5 3 3
34 C 5 11 2 3 4
35 C 6 15 8 3 5
36 C 7 22 2 2 5
37 C 8 33 2 2 4
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 1
38 C 9 16 5 2 2
38 C 9 16 5 2 2
The output is based on the following input data:
data= {
'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'DateID': [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
'Value$': [10, 12, 38, 20, 25, 24, 27, 30, 47, 30, 17, 20, 33, 45, 19, 24, 41, 13, 17, 45, 11, 38, 44, 15, 50, 27, 38, 25, 25, 14, 25, 31, 31, 26, 11, 15, 22, 33, 16],
'Category': [1, 2, 1, 3, 3, 4, 8, 5, 7, 4, 7, 8, 8, 9, 2, 10, 2, 9, 6, 7, 6, 8, 8, 8, 2, 7, 10, 6, 8, 7, 9, 9, 7, 5, 2, 8, 2, 2, 5]
}
df= pd.DataFrame(data)

Merged dataframe seems missing two rows

I had run the below code :
df1 = pd.DataFrame({'HPI':[80,85,88,85],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
print(pd.merge(df1,df3, on='HPI'))
I am getting the output as :
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 85 3 55 52 8
2 85 3 55 53 6
3 85 2 55 52 8
4 85 2 55 53 6
5 88 2 65 50 9
My Question here is
1) Why I am having so big dataframe. HPI has only 4 values but in output 6 rows has been generated.
2) If merge will take all the values from HPI then why the value 80 and 88 hasn't been taken twice each?

You get 85 4 times, because duplicated in df1 and df2 in joined columns HPI . And 88 with 80 are unique, so inner join return alo one row for each.
Apparently, the inner join means that if there is a match on the join column in both tables, every row will be matched the maximum number of times possible.
So before merge need remove duplicates for correct output.
df1 = df1.drop_duplicates('HPI')
df3 = df3.drop_duplicates('HPI')
Samples with dupes values in HPI columns and outputs:
#2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,88,85],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#2dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#4dupes 85 - 2x2, value 85 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 85 3 55 52 8
2 85 3 55 53 6
3 85 2 55 52 8
4 85 2 55 53 6
5 88 2 65 50 9
#2 dupes 80, 2dupes 85
df1 = pd.DataFrame({'HPI':[80,85,80,85],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#2dupes 85 , unique 80
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#4dupes 80, 2x1, 4dupes 85 - 2x2, values 80,85 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 2 65 50 7
2 85 3 55 52 8
3 85 3 55 53 6
4 85 2 55 52 8
5 85 2 55 53 6
#2dupes 80
df1 = pd.DataFrame({'HPI':[80,80,82,83],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#2 dupes 85
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#2dupes 80, 2x1value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 3 55 50 7
#4dupes 80
df1 = pd.DataFrame({'HPI':[80,80,80,80],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
#3 dupes 80
df3 = pd.DataFrame({'HPI':[80,80,80,85],
'Unemployment':[7, 8, 9, 6],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
#12dupes 80, 4x3, value 80 in both columns
print(pd.merge(df1,df3, on='HPI'))
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
0 80 2 50 50 7
1 80 2 50 52 8
2 80 2 50 50 9
3 80 3 55 50 7
4 80 3 55 52 8
5 80 3 55 50 9
6 80 2 65 50 7
7 80 2 65 52 8
8 80 2 65 50 9
9 80 2 55 50 7
10 80 2 55 52 8
11 80 2 55 50 9

As jezrael wrote, you have 6 rows because the values for HPI=85 in df1 and df3 are not unique. On the contrary on df1 and df3 you have only a value for HPI=80 and for HPI=88.
If I make an assumption and consider also your index, I can guess that what you want is something like this:
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment
index
2001 80 2 50 50 7
2002 85 3 55 52 8
2003 88 2 65 50 9
2004 85 2 55 53 6
If you want something like this, then you can do:
pd.merge(df1, df3, left_index=True, right_index=True, on='HPI')
But I am just making an assumption, so I dont know if this is the output you would like.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Data partition on known and unknown rows - python-3.x

Related

Set upperbound in a column for a specific group by using Python

Need help on agg function after groupby for doing operation last - first

update the same value of parent column as child list values in pandas data frame

python - Iterating over multi-index pandas dataframe

Merged dataframe seems missing two rows

Categories

Resources