Join two dataframes based on closest combination that sums up to a target value - python-3.x

I'm trying to join the two dataframes below based on the closest combination of rows from the df2 column Sales that sums up to the target value in the df1 column Total Sales. The columns Name and Date must match in both dataframes when joining (as shown in the expected output).
For example, row 0 of df1 should be matched only with rows 0 and 1 of df2, since they share the same Name (John) and Date (2021-10-01).
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one target at a time, and I'm not sure how to apply it to pandas dataframes to get the expected output.
The variable numbers in the script below represents the Sales column in df2, and the variable target represents the Total Sales column in df1.
import itertools
import math
numbers = [1000, 5000, 3000]
target = 6000
best_combination = ((None,))
best_result = math.inf
best_sum = 0
for L in range(0, len(numbers)+1):
    for combination in itertools.combinations(numbers, L):
        sum = 0
        for number in combination:
            sum += number
        result = target - sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = sum
print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000

Take the code you wrote which finds the best sum and turn it into a function (let's call it opt) that has parameters for target and a dataframe (which will be a subset of df2). It needs to return a list of IDs which correspond to the optimal combination.
Write another function which takes 3 arguments: name, date and target (let's call it calc). This function will filter df2 based on name and date, pass the result along with the target to the opt function, and return the result of that function. Finally, iterate through the rows of df1 and call calc with the row values (or alternatively use pandas.DataFrame.apply).
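The two functions described above could be sketched roughly as follows. This is only one possible reading of the suggestion: here opt returns the combination total alongside the IDs (the expected output needs both), calc takes df2 as an explicit fourth argument instead of reading a global, and the dataframes are rebuilt from plain lists. The search is brute force over all combinations, so it is only practical for small groups.

```python
import itertools

import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "John", "Jack", "Nancy", "Ahmed"],
    "Date": ["2021-10-01", "2021-11-01", "2021-10-10", "2021-10-12", "2021-10-30"],
    "Total Sales": [15500, 5500, 17600, 20700, 12000],
})
df2 = pd.DataFrame({
    "ID": ["JO1", "JO2", "JO3", "JO4", "JA1", "JA2", "NA1", "NA2", "NA3", "NA4",
           "AH1", "AH2", "AH3", "AH3"],
    "Name": ["John"] * 4 + ["Jack"] * 2 + ["Nancy"] * 4 + ["Ahmed"] * 4,
    "Date": ["2021-10-01"] * 2 + ["2021-11-01"] * 2 + ["2021-10-10"] * 2
            + ["2021-10-12"] * 4 + ["2021-10-30"] * 3 + ["2021-10-29"],
    "Sales": [10000, 5000, 1000, 5500, 10000, 7000, 20000, 100, 500, 100,
              5000, 7000, 10000, 12000],
})


def opt(target, subset):
    """Return (ids, total) for the rows of subset whose Sales sum is closest to target."""
    rows = list(zip(subset["ID"], subset["Sales"]))
    best_ids, best_sum, best_diff = [], 0, float("inf")
    for size in range(1, len(rows) + 1):
        for combo in itertools.combinations(rows, size):
            total = sum(sales for _, sales in combo)
            diff = abs(target - total)
            # Strict "<" keeps the first (smallest) combination on ties.
            if diff < best_diff:
                best_ids, best_sum, best_diff = [i for i, _ in combo], total, diff
    return best_ids, float(best_sum)


def calc(name, date, target, df2):
    """Filter df2 to one Name/Date group and search it with opt."""
    subset = df2[(df2["Name"] == name) & (df2["Date"] == date)]
    return opt(target, subset)


pairs = [calc(r["Name"], r["Date"], r["Total Sales"], df2) for _, r in df1.iterrows()]
result = df1.copy()
result["Comb IDs"] = [", ".join(ids) for ids, _ in pairs]
result["Comb Total"] = [total for _, total in pairs]
print(result)
```

With many rows per Name/Date group the number of combinations grows exponentially, so a dedicated subset-sum approach would be needed at that point.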

Related

How to exclude rows from a groupby operation

I am working on a groupby operation using the attribute column, but I want to exclude the desc_type1 and desc_type2 rows from the groups; they should only be used to calculate the total discount inside each attrib.
df = pd.DataFrame({'ID':[10,10,10,20,30,30],'attribute':['attrib_1','desc_type1','desc_type2','attrib_1','attrib_2','desc_type1'],'value':[100,0,0,100,30,0],'discount':[0,6,2,0,0,13.3]})
output:
ID attribute value discount
10 attrib_1 100 0
10 desc_type1 0 6
10 desc_type2 0 2
20 attrib_1 100 0
30 attrib_2 30 0
30 desc_type1 0 13.3
I want to groupby this dataframe by attribute but excluding the desc_type1 and desc_type2.
The desired output:
attribute ID_count value_sum discount_sum
attrib_1 2 200 8
attrib_2 1 30 13.3
explanations:
attrib_1 has discount_sum=8 because ID 10, which belongs to attrib_1, has two desc_type rows (6 + 2).
attrib_2 has discount_sum=13.3 because ID 30, which belongs to attrib_2, has one desc_type row.
ID=20 has no discount types.
What I did so far:
df.groupby('attribute').agg({'ID':'count','value':'sum','discount':'sum'})
But the line above does not exclude the desc_type 1 and 2 from the groupby
Important: an ID may have a discount or not.
link to the realdataset: realdataset
You can fill the attributes per ID, then groupby.agg:
m = df['attribute'].str.startswith('desc_type')
group = df['attribute'].mask(m).groupby(df['ID']).ffill()
out = (df
   .groupby(group, as_index=False)
   .agg(**{'ID_count': ('ID', 'nunique'),
           'value_sum': ('value', 'sum'),
           'discount_sum': ('discount', 'sum')
          })
)
output:
ID_count value_sum discount_sum
0 2 200 8.0
1 1 30 13.3
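The output above does not show the attribute labels. A minimal variation, assuming the same sample data, is to group with the default as_index=True and call reset_index() afterwards, so the group key comes back as a regular attribute column:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 10, 10, 20, 30, 30],
    "attribute": ["attrib_1", "desc_type1", "desc_type2", "attrib_1", "attrib_2", "desc_type1"],
    "value": [100, 0, 0, 100, 30, 0],
    "discount": [0, 6, 2, 0, 0, 13.3],
})

# Blank out the desc_type labels, then fill them from the attrib row of the same ID.
m = df["attribute"].str.startswith("desc_type")
group = df["attribute"].mask(m).groupby(df["ID"]).ffill()

out = (
    df.groupby(group)
      .agg(ID_count=("ID", "nunique"),
           value_sum=("value", "sum"),
           discount_sum=("discount", "sum"))
      .reset_index()  # the index is named "attribute" after the grouping Series
)
print(out)
```

This relies on each ID's attrib row appearing before its desc_type rows, as in the sample data, because ffill only fills downwards.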
Hello I think this helps :
df.loc[(df['attribute'] != 'desc_type1') &( df['attribute'] != 'desc_type2')].groupby('attribute').agg({'ID':'count','value':'sum','discount':'sum'})
Output :
ID value discount
attribute
attrib_1 2 200 0.0
attrib_2 1 30 0.0

How to aggregate Python Pandas dataframe such that value of a variable corresponds to the row a variable is selected in aggfunc?

I have the following data
ID DATE AGE COUNT
1 NaT 16 1
1 2021-06-06 19 2
1 2020-01-05 20 3
2 NaT 23 3
2 NaT 16 3
2 2019-02-04 36 12
I want to aggregate this so that the DATE will be the earliest valid date (in time), while AGE will be extracted from the corresponding row the earliest date is selected. The output should be
ID DATE AGE COUNT
1 2021-06-06 19 1
2 2019-02-04 36 3
My code, which gives the error TypeError: Must provide 'func' or named aggregation **kwargs.:
df_agg = pd.pivot_table(df, index=['ID'],
                        values=['DATE', 'AGE'],
                        aggfunc={'DATE': np.min, 'AGE': None, 'COUNT': np.min})
I don't want to use 'AGE': np.min since for ID=1, AGE=16 will be extracted which is not what I want.
///////////// Edits ///////////////
Edits made to provide a more generic example.
You can try .first_valid_index():
x = df.loc[df.groupby("ID").apply(lambda x: x["DATE"].first_valid_index())]
print(x)
Prints:
ID DATE AGE
1 1 2021-06-06 19
5 2 2019-02-04 36
EDIT: Using .pivot_table(). You can extract the "DATE"/"AGE" together as a list, for "COUNT" you can use np.min or "min". The second step would be explode the "DATE"/"AGE" list to separate columns:
df_agg = pd.pivot_table(
    df,
    index=["ID"],
    values=["DATE", "AGE", "COUNT"],
    aggfunc={
        "DATE": lambda x: df.loc[x.first_valid_index()][
            ["DATE", "AGE"]
        ].tolist(),
        "COUNT": "min",
    },
)
df_agg[["DATE", "AGE"]] = pd.DataFrame(df_agg["DATE"].apply(pd.Series))
print(df_agg)
Prints:
COUNT DATE AGE
ID
1 1 2021-06-06 19
2 3 2019-02-04 36
You can sort values and drop the duplicates (sort_index is optional)
df.sort_values(['DATE']).drop_duplicates('ID').sort_index()
ID DATE AGE
1 1 2021-06-06 19
5 2 2019-02-04 36
With groupby and transform:
df[df['DATE'] == df.groupby("ID")['DATE'].transform('min')]
Assuming you have an index, a simple solution would be:
def min_val(group):
    # idxmin() must be called; it returns the index label of the earliest DATE
    group = group.loc[group.DATE.idxmin()]
    return group
df.groupby(['ID']).apply(min_val)
If you do not have an index you can use:
df.reset_index().groupby(['ID']).apply(min_val).drop(columns=['ID'])
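For reference, here is a self-contained sketch of the idxmin idea on the sample data, assuming DATE has been parsed with pd.to_datetime so the missing values are real NaT. Note that on the data as posted the earliest date for ID 1 is 2020-01-05 (AGE 20), whereas first_valid_index picks the first non-missing row in row order (2021-06-06, AGE 19), so the two approaches answer slightly different questions.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "DATE": pd.to_datetime([None, "2021-06-06", "2020-01-05", None, None, "2019-02-04"]),
    "AGE": [16, 19, 20, 23, 16, 36],
    "COUNT": [1, 2, 3, 3, 3, 12],
})

# Index label of the earliest non-missing DATE within each ID (NaT is skipped).
idx = df.groupby("ID")["DATE"].idxmin()

# Take DATE and AGE from that row, then attach the per-ID minimum of COUNT.
earliest = df.loc[idx, ["ID", "DATE", "AGE"]].set_index("ID")
earliest["COUNT"] = df.groupby("ID")["COUNT"].min()
print(earliest)
```

A group in which every DATE is missing has no valid minimum, so such IDs would need to be filtered out or handled separately before calling idxmin.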

taking top 3 in a groupby, and lumping rest into 'other category'

I am currently doing a groupby in pandas like this:
df.groupby(['grade'])['students'].nunique()
and the result I get is this:
grade
grade 1 12
grade 2 8
grade 3 30
grade 4 2
grade 5 600
grade 6 90
Is there a way to get the output such that I see the groups of the top 3, and everything else is classified under other?
this is what I am looking for
grade
grade 3 30
grade 5 600
grade 6 90
other (3 other grades) 22
I think you can add a helper column to the df and call it something like "Grouping".
Label the rows of the top 3 grades with their original grade name and the remaining rows as "other", and then just group by the "Grouping" column.
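A minimal sketch of the helper-column idea, using made-up raw data (the grade and students values below are hypothetical, since the original input was not posted):

```python
import pandas as pd

# Hypothetical raw data: one row per (grade, student) pair.
df = pd.DataFrame({
    "grade": ["grade 1"] * 2 + ["grade 2"] * 3 + ["grade 3"] * 4
             + ["grade 4"] * 1 + ["grade 5"] * 5,
    "students": list(range(15)),
})

# Find the three grades with the most unique students.
counts = df.groupby("grade")["students"].nunique()
top3 = counts.nlargest(3).index

# Keep the grade name for the top 3, label everything else "other".
df["Grouping"] = df["grade"].where(df["grade"].isin(top3), "other")
result = df.groupby("Grouping")["students"].nunique()
print(result)
```

One design point to be aware of: the "other" row counts unique students across all lumped grades, which can be smaller than the sum of the per-grade counts if a student appears in more than one grade.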
Can't do much without the actual input data, but if this is your starting dataframe (df) after your groupby -
grade unique
0 grade_1 12
1 grade_2 8
2 grade_3 30
3 grade_4 2
4 grade_5 600
5 grade_6 90
You can do a few more steps to get to your table -
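ddf = df.nlargest(3, 'unique')
# DataFrame.append was removed in pandas 2.0, so build the extra row and concatenate it
other = pd.DataFrame({'grade': ['Other'], 'unique': [df['unique'].sum() - ddf['unique'].sum()]})
ddf = pd.concat([ddf, other], ignore_index=True)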
grade unique
0 grade_5 600
1 grade_6 90
2 grade_3 30
3 Other 22

pandas get rows from one dataframe which are existed in other dataframe

I have two dataframes. The dataframes as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6).
In df1, the numbers column contains multiple unique numbers. These numbers also appear in df2, where they are repeated depending on the duration. I want to get all rows of df2 whose numbers value is present in df1.
Here is the code I've tried, but I can't figure out how to solve this using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique values of the numbers column that are in df2. But I need to get all the data, including the other columns from the second dataframe.
It would be great if anyone could help me with this. Thanks in advance.
Here is a sample dataframe I created that keeps the gist the same.
df1=pd.DataFrame({"numbers":[123,1234,12345,5421]})
df2=pd.DataFrame({"numbers":[123,1234,12345,123,123,45643],"B":[1,2,3,4,5,6],"C":[2,3,4,5,6,7]})
final_df=df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: every row of df2 whose numbers value is present in df1 is returned, including repeated numbers.
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
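As a cross-check, the same filter can also be written as an inner merge. This is only an alternative sketch on the sample frames above; the drop_duplicates call is an added precaution so that a number repeated in df1 would not multiply the rows of df2.

```python
import pandas as pd

df1 = pd.DataFrame({"numbers": [123, 1234, 12345, 5421]})
df2 = pd.DataFrame({"numbers": [123, 1234, 12345, 123, 123, 45643],
                    "B": [1, 2, 3, 4, 5, 6],
                    "C": [2, 3, 4, 5, 6, 7]})

# Boolean-mask version: keep df2 rows whose number appears in df1.
final_df = df2[df2["numbers"].isin(df1["numbers"])]

# Merge version: an inner join on the de-duplicated keys of df1.
merged = df2.merge(df1.drop_duplicates(), on="numbers", how="inner")

print(final_df)
print(merged)
```

Both versions require the numbers columns to have the same dtype in the two frames; a column stored as strings will not match one stored as integers.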

Find occurrences of conditional value from one column and count values from another column in a dataframe

I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to group by the userIds if X is greater than 3 for 3 weeks.
I have tried using groupby and lambda in pandas but I am stuck
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds Week X
123 14 3
123 15 4
123 16 7
123 17 2
123 18 1
456 14 4
456 15 5
456 16 11
456 17 2
456 18 6
The result I am aiming for is a dataframe containing user 456 and how many weeks the condition occurred.
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1,'UserIds'])]
Count the values greater than 3 per user with an aggregate sum, and then keep the users whose count is greater than 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print (out)
UserIds count
0 456 4
