Merge 2 dataframes using the first column as the index - python-3.x

df 1:
Condition Currency Total Hours
0 Used USD 100
1 Used USD 75
2 Used USD 13
3 Used USD NaN
df 2:
Condition Currency Total Hours
1 Used USD 99
3 New USD 1000
Desired Result:
Condition Currency Total Hours
0 Used USD 100
1 Used USD 99
2 Used USD 13
3 New USD 1000
How would I merge the two dataframes on the first column (the index) so that the values of df1 are overwritten with those of df2?
I have tried a number of variations and nothing seems to work. A few examples I tried:
pd.merge(df1, df2) -> the result is an empty dataframe
df1.combine_first(df2) -> the result is a dataframe, but with the same values as df1

Try update:
df1.update(df2)
print(df1)
Output:
Condition Currency Total Hours
0 Used USD 100.0
1 Used USD 99.0
2 Used USD 13.0
3 New USD 1000.0
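For a self-contained repro, here is a minimal sketch of the update approach (assuming both frames use the default integer index shown in the question):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Condition': ['Used', 'Used', 'Used', 'Used'],
                    'Currency': ['USD', 'USD', 'USD', 'USD'],
                    'Total Hours': [100, 75, 13, np.nan]})
df2 = pd.DataFrame({'Condition': ['Used', 'New'],
                    'Currency': ['USD', 'USD'],
                    'Total Hours': [99, 1000]},
                   index=[1, 3])

df1.update(df2)  # aligns on index/columns and overwrites df1 in place
print(df1)
combine_first can also work, but the caller's non-missing values always win, so the call order in the question was backwards: df2.combine_first(df1) keeps df2's rows 1 and 3 and falls back to df1 everywhere else.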

Related

How to exclude rows from a groupby operation

I am working on a groupby over the attribute column, but I want to exclude the desc_type1 and desc_type2 rows, which are instead used to calculate the total discount inside each attrib.
import pandas as pd

df = pd.DataFrame({'ID': [10, 10, 10, 20, 30, 30],
                   'attribute': ['attrib_1', 'desc_type1', 'desc_type2', 'attrib_1', 'attrib_2', 'desc_type1'],
                   'value': [100, 0, 0, 100, 30, 0],
                   'discount': [0, 6, 2, 0, 0, 13.3]})
output:
ID attribute value discount
10 attrib_1 100 0
10 desc_type1 0 6
10 desc_type2 0 2
20 attrib_1 100 0
30 attrib_2 30 0
30 desc_type1 0 13.3
I want to group this dataframe by attribute, but excluding the desc_type1 and desc_type2 rows.
The desired output:
attribute ID_count value_sum discount_sum
attrib_1 2 200 8
attrib_2 1 30 13.3
explanations:
attrib_1 has discount_sum=8 because ID 10, which belongs to attrib_1, has two desc_type rows (6 + 2).
attrib_2 has discount_sum=13.3 because ID 30 has one desc_type row.
ID=20 has no discount types.
What I did so far:
df.groupby('attribute').agg({'ID':'count','value':'sum','discount':'sum'})
But the line above does not exclude desc_type1 and desc_type2 from the groupby.
Important: an ID may have a discount or not.
link to the real dataset: realdataset
You can fill the attributes per ID, then groupby.agg:
m = df['attribute'].str.startswith('desc_type')
group = df['attribute'].mask(m).groupby(df['ID']).ffill()
out = (df
       .groupby(group, as_index=False)
       .agg(**{'ID_count': ('ID', 'nunique'),
               'value_sum': ('value', 'sum'),
               'discount_sum': ('discount', 'sum')})
      )
output:
ID_count value_sum discount_sum
0 2 200 8.0
1 1 30 13.3
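To see why this works, it helps to print the intermediate group series for the sample frame: masking turns each desc_type label into NaN, and the per-ID ffill replaces it with the attrib label that precedes it (a sketch; df is the frame built in the question):
m = df['attribute'].str.startswith('desc_type')
group = df['attribute'].mask(m).groupby(df['ID']).ffill()
print(group.tolist())
# ['attrib_1', 'attrib_1', 'attrib_1', 'attrib_1', 'attrib_2', 'attrib_2']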
Hello, I think this helps:
df.loc[(df['attribute'] != 'desc_type1') & (df['attribute'] != 'desc_type2')].groupby('attribute').agg({'ID': 'count', 'value': 'sum', 'discount': 'sum'})
Output :
ID value discount
attribute
attrib_1 2 200 0.0
attrib_2 1 30 0.0

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the closest combination of rows from df2's Sales column that sums up to the target value in df1's Total Sales column. The Name and Date columns must match across both dataframes when joining (as shown in the expected output).
For example: row 0 in df1 should be matched only with df2 rows 0 and 1, since Name and Date are the same (Name: John, Date: 2021-10-01).
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one row at a time, but I'm not sure how to apply it across the dataframes to get the expected output.
The numbers variable in the script below represents the Sales column of df2, and the target variable represents the Total Sales column of df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000
best_combination = ((None,))
best_result = math.inf
best_sum = 0

for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        sum = 0
        for number in combination:
            sum += number
        result = target - sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = sum

print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote which finds the best sum and turn it into a function (call it opt) that takes a target and a dataframe (a subset of df2) as parameters. It should return the list of IDs corresponding to the optimal combination.
Write another function (call it calc) which takes three arguments: name, date and target. This function filters df2 by name and date, passes the result along with the target to opt, and returns opt's result. Finally, iterate through the rows of df1 and call calc with each row's values (or, alternatively, use pandas.DataFrame.apply).
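A minimal sketch of that recipe (the helper names opt and calc come from the answer; everything else is illustrative, and the brute-force search is exponential in the group size, so it only suits small groups):
import itertools
import math

import pandas as pd

def opt(target, group):
    # brute-force search over all ID subsets for the Sales sum closest to target
    rows = list(zip(group['ID'], group['Sales']))
    best_ids, best_sum, best_diff = [], 0.0, math.inf
    for r in range(1, len(rows) + 1):
        for comb in itertools.combinations(rows, r):
            total = sum(sales for _, sales in comb)
            if abs(target - total) < best_diff:
                best_ids = [id_ for id_, _ in comb]
                best_sum = total
                best_diff = abs(target - total)
    return ', '.join(best_ids), best_sum

def calc(row):
    # filter df2 to the matching Name/Date, then delegate to opt
    subset = df2[(df2['Name'] == row['Name']) & (df2['Date'] == row['Date'])]
    ids, total = opt(row['Total Sales'], subset)
    return pd.Series({'Comb IDs': ids, 'Comb Total': total})

out = df1.join(df1.apply(calc, axis=1))
print(out)
Running this against the df1 and df2 defined above reproduces the expected output, e.g. row 0 gets Comb IDs 'JO1, JO2' with Comb Total 15000.0.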

how to update rows based on previous row of dataframe python

I have a time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
My data is actually high-dimensional; I have included a simplified version with just two value columns (price, amount). I am trying to transform it into relative changes along the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative change of each product along its time index. If no previous date exists for a given product, I insert NaN.
Is there a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
output :
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
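One caveat worth adding (my note, not part of the answer): .diff() compares consecutive rows positionally, so the frame must already be sorted by date within each product. A defensive sketch:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['product', 'date'])  # diff() works row-to-row, so order matters
df[['price', 'amount']] = df.groupby('product')[['price', 'amount']].diff()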

How to remove duplicate values in dataframe while preserving the rest of the row in Pandas?

I am working on some gross profit reports in a Jupyter notebook. I have exported the data out of our CRM as a CSV and am using Pandas to work with the data. Some of the data is duplicated in a couple of columns. I need to remove those duplicate values in those columns while preserving the rest of the row.
I have tried drop_duplicates on a subset of the two columns, but it removes the entire row.
INV INV SUB PO Number PO Subtotal
0 INV-002504 USD 350.00 PO-03977 240
1 INV-002507 USD 1,400.00 PO-03846 603.56
2 NaN NaN PO-03847 295
3 INV-002489 USD 891.25 PO-03861 658.31
4 INV-002453 USD 3,132.50 PO-03889 4751.19
5 INV-002537 USD 3,856.29 PO-03889 4751.19
6 INV-002420 USD 592.43 PO-03577 1188.46
7 INV-002415 USD 10,779.00 PO-03727 5389.21
Rows 4 & 5 are an example being duplicated in the PO Number & PO Subtotal columns.
I expect the output to remove the duplicate so the value is only shown once in all cases.
INV INV SUB PO Number PO Subtotal
0 INV-002504 USD 350.00 PO-03977 240
1 INV-002507 USD 1,400.00 PO-03846 603.56
2 NaN NaN PO-03847 295
3 INV-002489 USD 891.25 PO-03861 658.31
4 INV-002453 USD 3,132.50 PO-03889 4751.19
5 INV-002537 USD 3,856.29
6 INV-002420 USD 592.43 PO-03577 1188.46
7 INV-002415 USD 10,779.00 PO-03727 5389.21
Use DataFrame.duplicated to check which rows contain duplicates based on PO Number & PO Subtotal, then conditionally replace those values with '' via np.where:
import numpy as np

m = df.duplicated(['PO Number', 'PO Subtotal'])
df['PO Number'] = np.where(m, '', df['PO Number'])
df['PO Subtotal'] = np.where(m, '', df['PO Subtotal'])
Or using .loc to select the correct rows and columns and replace those rows with '':
m = df.duplicated(['PO Number', 'PO Subtotal'])
df.loc[m, ['PO Number', 'PO Subtotal']] = ''
Output
INV INV SUB PO Number PO Subtotal
0 INV-002504 USD 350.00 PO-03977 240.0
1 INV-002507 USD 1,400.00 PO-03846 603.56
2 NaN NaN PO-03847 295.0
3 INV-002489 USD 891.25 PO-03861 658.31
4 INV-002453 USD 3,132.50 PO-03889 4751.19
5 INV-002537 USD 3,856.29
6 INV-002420 USD 592.43 PO-03577 1188.46
7 INV-002415 USD 10,779.00 PO-03727 5389.21
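One side note (my observation, not from the answer): blanking with '' coerces PO Subtotal to object dtype. If the column must stay numeric for later arithmetic, blank with NaN instead:
import numpy as np

m = df.duplicated(['PO Number', 'PO Subtotal'])
# NaN keeps 'PO Subtotal' numeric, unlike the empty string
df.loc[m, ['PO Number', 'PO Subtotal']] = np.nan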

Sorting and Grouping in Pandas data frame column alphabetically

I want to sort a pandas data frame column alphabetically so that equal values end up grouped together.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" such that it is in alphabetical order and is grouped as well i.e., the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
It will sort your dataframe by column a, and the rows end up grouped as a natural consequence of the sorting. See ya!
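Note that sort_values alone will not blank out the repeated labels shown in the desired output; that part is display formatting. A small sketch that approximates it (it overwrites column a with strings, so only do this for presentation):
out = df.sort_values('a')
out['a'] = out['a'].mask(out['a'].duplicated(), '')  # blank repeated group labels
print(out)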
