How to remove duplicate values in dataframe while preserving the rest of the row in Pandas? - python-3.x

I am working on some gross profit reports in a Jupyter notebook. I exported the data out of our CRM as a CSV and am using Pandas to work with the data. Some of the data is duplicated in a couple of columns. I need to remove those duplicate values in those columns but preserve the rest of the row.
I have tried drop_duplicates on a subset of the two columns, but it removes the entire row.
INV INV SUB PO Number PO Subtotal
0 INV-002504 USD 350.00 PO-03977 240
1 INV-002507 USD 1,400.00 PO-03846 603.56
2 NaN NaN PO-03847 295
3 INV-002489 USD 891.25 PO-03861 658.31
4 INV-002453 USD 3,132.50 PO-03889 4751.19
5 INV-002537 USD 3,856.29 PO-03889 4751.19
6 INV-002420 USD 592.43 PO-03577 1188.46
7 INV-002415 USD 10,779.00 PO-03727 5389.21
Rows 4 & 5 are an example of the duplication in the PO Number & PO Subtotal columns.
I expect the output to remove the duplicate so the value is only shown once in all cases.
INV INV SUB PO Number PO Subtotal
0 INV-002504 USD 350.00 PO-03977 240
1 INV-002507 USD 1,400.00 PO-03846 603.56
2 NaN NaN PO-03847 295
3 INV-002489 USD 891.25 PO-03861 658.31
4 INV-002453 USD 3,132.50 PO-03889 4751.19
5 INV-002537 USD 3,856.29
6 INV-002420 USD 592.43 PO-03577 1188.46
7 INV-002415 USD 10,779.00 PO-03727 5389.21

Use DataFrame.duplicated to build a mask of the rows that contain duplicates based on PO Number & PO Subtotal, then conditionally replace the values with '' using np.where:
import numpy as np
m = df.duplicated(['PO Number', 'PO Subtotal'])
df['PO Number'] = np.where(m, '', df['PO Number'])
df['PO Subtotal'] = np.where(m, '', df['PO Subtotal'])
Or using .loc to select the correct rows and columns and replace those rows with '':
m = df.duplicated(['PO Number', 'PO Subtotal'])
df.loc[m, ['PO Number', 'PO Subtotal']] = ''
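As a self-contained sketch of the second approach (the frame below is an abbreviated stand-in for the data above; note that writing '' into the numeric PO Subtotal column upcasts it to object, so substitute np.nan if it needs to stay numeric):
import pandas as pd

df = pd.DataFrame({
    'INV': ['INV-002453', 'INV-002537'],
    'INV SUB': ['USD 3,132.50', 'USD 3,856.29'],
    'PO Number': ['PO-03889', 'PO-03889'],
    'PO Subtotal': [4751.19, 4751.19],
})
# mark every repeat of a (PO Number, PO Subtotal) pair after its first occurrence
m = df.duplicated(['PO Number', 'PO Subtotal'])
# (newer pandas versions may warn about the dtype upcast here)
df.loc[m, ['PO Number', 'PO Subtotal']] = ''
print(df)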
Output
INV INV SUB PO Number PO Subtotal
0 INV-002504 USD 350.00 PO-03977 240.0
1 INV-002507 USD 1,400.00 PO-03846 603.56
2 NaN NaN PO-03847 295.0
3 INV-002489 USD 891.25 PO-03861 658.31
4 INV-002453 USD 3,132.50 PO-03889 4751.19
5 INV-002537 USD 3,856.29
6 INV-002420 USD 592.43 PO-03577 1188.46
7 INV-002415 USD 10,779.00 PO-03727 5389.21

Related

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the closest combination of rows from the df2 column Sales that sums up to the target value in the df1 column Total Sales. The columns Name & Date in both dataframes should match when joining (as shown in the expected output).
For example: in df1, row 0 should be matched only with df2 rows 0 & 1, since the Name & Date columns are the same (Name: John, Date: 2021-10-01).
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
                    "Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
                          "7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
                    "Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
                            "8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
                            "8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
                    "Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
                             "7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one row at a time; I'm not sure how to apply it to the dataframes to get the expected output.
The numbers variable in the script below stands in for the Sales column of df2, and the target variable stands in for the Total Sales column of df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000
best_combination = (None,)
best_result = math.inf
best_sum = 0
# try every subset of numbers and keep the one whose sum is closest to target
for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        combo_sum = 0
        for number in combination:
            combo_sum += number
        result = target - combo_sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = combo_sum
print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote which finds the best sum and turn it into a function (let's call it opt) with parameters for the target and a dataframe (which will be a subset of df2). It needs to return a list of IDs which correspond to the optimal combination.
Write another function (let's call it calc) which takes three arguments: name, date and target. This function will filter df2 based on name and date, pass the result, along with the target, to the opt function, and return that function's result. Finally, iterate through the rows of df1 and call calc with the row arguments (or alternatively use pandas.DataFrame.apply), as sketched below.
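A minimal sketch of those two functions, under the assumptions above (the names opt and calc come from this answer; df1 and df2 are the frames defined in the question; returning a pd.Series from apply is one wiring choice among several). The exhaustive search is exponential in the number of rows per Name/Date group, which is fine for a handful of rows:
import itertools
import math
import pandas as pd

def opt(target, group):
    # exhaustively search subsets of this group's (ID, Sales) pairs for the
    # combination whose sum lands closest to the target
    pairs = list(zip(group['ID'], group['Sales']))
    best_ids, best_total, best_diff = [], 0, math.inf
    for r in range(len(pairs) + 1):
        for combo in itertools.combinations(pairs, r):
            total = sum(sales for _, sales in combo)
            if abs(target - total) < best_diff:
                best_ids = [id_ for id_, _ in combo]
                best_total, best_diff = total, abs(target - total)
    return best_ids, best_total

def calc(name, date, target):
    # filter df2 down to the matching Name/Date, then optimise
    subset = df2[(df2['Name'] == name) & (df2['Date'] == date)]
    ids, total = opt(target, subset)
    return pd.Series({'Comb IDs': ', '.join(ids), 'Comb Total': float(total)})

res = df1.apply(lambda row: calc(row['Name'], row['Date'], row['Total Sales']), axis=1)
print(df1.join(res))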

Display row with False values in validated pandas dataframe column [duplicate]

This question already has answers here:
Display rows with one or more NaN values in pandas dataframe
(5 answers)
Closed 2 years ago.
I was validating the 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
True 17984
False 13
Name: Price, dtype: int64
I'd love to display the rows with False values so I know what's wrong with them. How do I do that?
Thank you.
You could print the rows where Price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean mask, which can be used to select the rows that cause errors.
This catches both NaN values and rows containing non-numeric strings.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
Updating the column in place with .to_numeric() and then checking for NaNs will not tell you why the rows had to be coerced, because the original values are overwritten.
# update the Price row
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64
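As a follow-up (the cleanup below is a guess about this toy data, not part of the question): once the problem rows reveal stray spaces, you could strip them before coercing:
# hypothetical cleanup: drop stray spaces, then coerce; 'test' still becomes NaN,
# while '99. 0' -> 99.0 and '98 0' -> 980.0
df['Price'] = pd.to_numeric(df['Price'].str.replace(' ', '', regex=False),
                            errors='coerce')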

update columns based on id pandas

df_2:
order_id date amount name interval is_sent
123 2020-01-02 3 white today false
456 NaT 2 blue weekly false
789 2020-10-11 0 red monthly false
135 2020-6-01 3 orange weekly false
I am merging two dataframes and locating rows where the date is greater than the previous value, or where a previously missing date has been filled in:
df_1['date'] = pd.to_datetime(df_1['date'])
df_2['date'] = pd.to_datetime(df_2['date'])
res = df_1.merge(df_2, on='order_id', suffixes=['_orig', ''])
m = res['date'].gt(res['date_orig']) | (res['date_orig'].isnull() & res['date'].notnull())
changes_df = res.loc[m, ['order_id', 'date', 'amount', 'name', 'interval', 'is_sent']]
After locating all my entities I am changing changes_df['is_sent'] to true:
changes_df['is_sent'] = True
After the above is run, changes_df is:
order_id date amount name interval is_sent
123 2020-01-03 3 white today true
456 2020-12-01 2 blue weekly true
135 2020-6-02 3 orange weekly true
I want to then update only the values in df_2['date'] and df_2['is_sent'] to equal changes_df['date'] and changes_df['is_sent']
Any insight is greatly appreciated.
Let us try update with set_index:
cf = changes_df[['order_id','date','is_sent']].set_index('order_id')
df_2 = df_2.set_index('order_id')
df_2.update(cf)
df_2.reset_index(inplace=True)
df_2
order_id date amount name interval is_sent
0 123 2020-01-03 3 white today True
1 456 2020-12-01 2 blue weekly True
2 789 2020-10-11 0 red monthly False
3 135 2020-6-02 3 orange weekly True
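One caveat worth knowing about update (a small illustration, assuming nothing beyond pandas itself): it modifies the caller in place, aligns on the index, and by default skips NaN values in the updating frame, which is why rows missing from cf (like order 789 above) keep their original values:
import pandas as pd

left = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'x': [10, None]}, index=['a', 'b'])

left.update(right)  # in place; the NaN at 'b' does not overwrite
print(left)         # x: a -> 10.0, b -> 2.0, c -> 3.0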
This is my solution:
df3 = df2.combine_first(cap_df1).reindex(df.index)

Merge 2 dataframes using the first column as the index

df 1:
Condition Currency Total Hours
0 Used USD 100
1 Used USD 75
2 Used USD 13
3 Used USD NaN
df 2:
Condition Currency Total Hours
1 Used USD 99
3 New USD 1000
Desired Result:
Condition Currency Total Hours
0 Used USD 100
1 Used USD 99
2 Used USD 13
3 New USD 1000
How would I merge the two dataframes using the first column as the index and overwrite the values of df1 with those of df2?
I have tried a variety of variations and nothing seems to work. A few examples I tried:
pd.merge(df, df1) — the result is an empty dataframe
df.combine_first(df1) — the result is a dataframe, but with the same values as df1
Try update:
df.update(df2)
print(df)
Output:
Condition Currency Total Hours
0 Used USD 100.0
1 Used USD 99.0
2 Used USD 13.0
3 New USD 1000.0
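For completeness (an observation about the attempts above, not part of this answer): combine_first keeps the values of the calling frame and only fills its gaps from the argument, so the call direction matters. Reversed, it also produces the desired result here, assuming the frames are named df and df2 as in the answer:
# df2's values win; df only fills in the rows and cells df2 lacks
result = df2.combine_first(df)
print(result)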

Sorting and Grouping in Pandas data frame column alphabetically

I want to sort and group a pandas data frame by a column, alphabetically.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" so that it is in alphabetical order and grouped as well, i.e. the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
It will sort your dataframe by column a, and the rows will end up grouped as a natural side effect of the sorting. See ya!
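If you also want to blank the repeated labels, as in the expected output, here is a short sketch (assuming the frame is named df; kind='mergesort' is the stable sort, which preserves the original row order within each group):
result = df.sort_values('a', kind='mergesort')
# blank every repeat of a label after its first occurrence (display only)
result['a'] = result['a'].mask(result['a'].duplicated(), '')
print(result)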
