I have the following input:
import pandas as pd

samples = [('001', 'RENAL', 'CHROMOPHOBE', 'KICH'),
           ('002', 'OVARIAN', 'HIGH_GRADE_SEROUS_CARCINOMA', 'LGSOC'),
           ('003', 'OVARIAN', 'OTHER', 'NaN'),
           ('001', 'COLORECTAL', 'ADENOCARCINOMA', 'KICH')]
labels = ['id', 'disease_type', 'disease_sub_type', 'study_abbreviation']
df = pd.DataFrame.from_records(samples, columns=labels)
df
   id   disease_type  disease_sub_type             study_abbreviation
0  001  RENAL         CHROMOPHOBE                  KICH
1  002  OVARIAN       HIGH_GRADE_SEROUS_CARCINOMA  LGSOC
2  003  OVARIAN       OTHER                        NaN
3  001  COLORECTAL    ADENOCARCINOMA               KICH
I want to collapse rows that share an id (001 in this case) into a single row, with the disease_type, disease_sub_type and study_abbreviation values each merged into one cell (nested).
Desired output:
   id   disease_type      disease_sub_type             study_abbreviation
0  001  RENAL,COLORECTAL  CHROMOPHOBE,ADENOCARCINOMA   KICH, KICH
1  002  OVARIAN           HIGH_GRADE_SEROUS_CARCINOMA  LGSOC
2  003  OVARIAN           OTHER                        NaN
This is only for admin work, hence the silly ask, but it would help greatly when I need to merge with other datasets. Thanks again.
You could group by your 'id' column and use ','.join as the aggregation:
df.groupby('id',as_index=False).agg(','.join)
   id   disease_type      disease_sub_type             study_abbreviation
0  001  RENAL,COLORECTAL  CHROMOPHOBE,ADENOCARCINOMA   KICH,KICH
1  002  OVARIAN           HIGH_GRADE_SEROUS_CARCINOMA  LGSOC
2  003  OVARIAN           OTHER                        NaN
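One caveat: in the sample above the missing study abbreviation is the string 'NaN', so ','.join only ever sees strings. If the real data holds actual missing values (float NaN), str.join raises a TypeError. Below is a minimal sketch of one way around that, assuming missing entries should simply be skipped. The helper name join_non_null is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

samples = [('001', 'RENAL', 'CHROMOPHOBE', 'KICH'),
           ('002', 'OVARIAN', 'HIGH_GRADE_SEROUS_CARCINOMA', 'LGSOC'),
           ('003', 'OVARIAN', 'OTHER', np.nan),  # assumption: a real missing value, not the string 'NaN'
           ('001', 'COLORECTAL', 'ADENOCARCINOMA', 'KICH')]
labels = ['id', 'disease_type', 'disease_sub_type', 'study_abbreviation']
df = pd.DataFrame.from_records(samples, columns=labels)


def join_non_null(values):
    """Join the non-missing values of one group with commas (hypothetical helper)."""
    parts = values.dropna().astype(str)
    # Keep a missing value when the group has nothing to join.
    return ','.join(parts) if len(parts) else np.nan


out = df.groupby('id', as_index=False).agg(join_non_null)
print(out)
```

If a real nested list per cell is preferred over a comma-separated string, df.groupby('id', as_index=False).agg(list) is the usual alternative.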
I initially tried to do this directly in SQL Server, but it does not seem to be possible in a query, so I want to calculate this "Distribute" column in Excel. The details of the question are below. I'd appreciate it if someone can help.
I have the following columns in Excel and want to calculate the values in the "Distribute" column.
Item  Qty   Customer  Rank  Min  Max  Distribute
001   1500  0101      1     250  600  ????
001   1500  0104      2     0    500  ????
001   1500  0103      3     100  300  ????
001   1500  0105      4     200  300  ????
002   2000  0104      1     200  600  ????
002   2000  0105      2     150  700  ????
002   2000  0101      3     100  200  ????
002   2000  0103      4     100  500  ????
002   2000  0102      5     50   200  ????
003   800   0103      1     100  500  ????
003   800   0102      2     50   200  ????
003   800   0101      2     50   100  ????
003   800   0104      3     50   80   ????
There are multiple items (Item), and each item has a fixed quantity available (Qty).
Each item is distributed among different customers (Customer) based on their rank (Rank). Ranks are assigned per item, and the data is already sorted by the Rank column within every item. Multiple customers for an item can have the same rank.
From the total quantity (Qty) of each item, every customer must get the minimum quantity given in the (Min) column, irrespective of rank.
The remaining quantity of every item must then be distributed in rank order, making sure that no customer exceeds the maximum quantity given in the (Max) column.
It is OK if the total quantity of an item is not fully consumed after every customer has received their maximum quantity.
What I am after is a result like this:
Item  Qty   Customer  Rank  Min  Max  Distribute
001   1500  0101      1     250  600  600
001   1500  0104      2     0    500  500
001   1500  0103      3     100  300  200
001   1500  0105      4     200  300  200
002   2000  0104      1     200  600  600
002   2000  0105      2     150  700  700
002   2000  0101      3     100  200  200
002   2000  0103      4     100  500  450
002   2000  0102      5     50   200  50
003   800   0103      1     100  500  500
003   800   0102      2     50   200  200
003   800   0101      2     50   100  50
003   800   0104      3     50   80   50
I'm looking forward to a formula or any other solution. Thanks for your help.
FORMULA BASED SOLUTION
Here is a possible formula-based solution that uses several helper cells per row. It assumes the table is already properly sorted (by item in any order, then by rank from smallest to largest) and will stay that way:
Columns (headers in row 1, data in rows 2 to 14):
A  Item
B  Qty
C  Customer
D  Rank
E  Min
F  Max
G  [Cumulative] Qty - Min
H  Basic
I  [Cumulative] Remain
J  Extra
K  Distribute

Input data in columns A to F:
Row  Item  Qty   Customer  Rank  Min  Max
2    1     1500  101       1     250  600
3    1     1500  104       2     0    500
4    1     1500  103       3     100  300
5    1     1500  105       4     200  300
6    2     2000  104       1     200  600
7    2     2000  105       2     150  700
8    2     2000  101       3     100  200
9    2     2000  103       4     100  500
10   2     2000  102       5     50   200
11   3     800   103       1     100  500
12   3     800   102       2     50   200
13   3     800   101       2     50   100
14   3     800   104       3     50   80

Formulas in row 2:
G2: =MAX(0,IF(A1<>A2,B2-E2,G1-E2))
H2: =IF(A2<>A1,MIN(B2,E2),MIN(G1,E2))
I2: =IF(A1<>A2,AGGREGATE(15,6,G:G/(A:A=A2),1),MAX(0,I1-(F1-E1)))
J2: =MIN(I2,F2-E2)
K2: =H2+J2

Rows 3 to 14 hold the same formulas filled down, so every relative row reference increases by one per row. For example, row 14 reads:
G14: =MAX(0,IF(A13<>A14,B14-E14,G13-E14))
H14: =IF(A14<>A13,MIN(B14,E14),MIN(G13,E14))
I14: =IF(A13<>A14,AGGREGATE(15,6,G:G/(A:A=A14),1),MAX(0,I13-(F13-E13)))
J14: =MIN(I14,F14-E14)
K14: =H14+J14
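To see what the helper columns G to K evaluate to, the same logic can be mirrored outside Excel. The following is a minimal Python sketch and not part of the original answer. The function name helper_columns and the dictionary keys are made up for illustration, and it assumes the rows are already sorted by item and then by rank, just as the formulas do.

```python
# (item, qty, min, max) per row, sorted by item and then by rank
rows = [
    (1, 1500, 250, 600), (1, 1500, 0, 500), (1, 1500, 100, 300), (1, 1500, 200, 300),
    (2, 2000, 200, 600), (2, 2000, 150, 700), (2, 2000, 100, 200), (2, 2000, 100, 500),
    (2, 2000, 50, 200),
    (3, 800, 100, 500), (3, 800, 50, 200), (3, 800, 50, 100), (3, 800, 50, 80),
]


def helper_columns(rows):
    """Mirror helper columns G to K of the formula solution (illustrative sketch)."""
    table = []
    prev = None
    # First pass: G (quantity left after the minimums so far) and H (basic allocation).
    for item, qty, lo, hi in rows:
        start = qty if prev is None or prev["item"] != item else prev["G"]
        cur = {"item": item, "min": lo, "max": hi,
               "G": max(0, start - lo), "H": min(start, lo)}
        table.append(cur)
        prev = cur
    # Second pass: I (quantity still available for top-ups), J (extra), K (total).
    prev = None
    for cur in table:
        if prev is None or prev["item"] != cur["item"]:
            # Smallest G within the item = what is left once every minimum is served.
            cur["I"] = min(r["G"] for r in table if r["item"] == cur["item"])
        else:
            cur["I"] = max(0, prev["I"] - (prev["max"] - prev["min"]))
        cur["J"] = min(cur["I"], cur["max"] - cur["min"])
        cur["K"] = cur["H"] + cur["J"]
        prev = cur
    return table


table = helper_columns(rows)
distribute = [r["K"] for r in table]
print(distribute)
```

For the sample data, the K values match the Distribute column requested in the question. Customers that share a rank are served in row order, which is also how the formulas treat them.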
VBA SOLUTION
Here is a possible VBA solution that assumes the table is already properly sorted (by item in any order, then by rank from smallest to largest) and will stay that way:
Sub SubDistribution()

    'Layout assumed from the question: headers in row 1, data from row 2,
    'A = Item, B = Qty, E = Min, F = Max; the result is written to column G.
    Dim RngData As Range
    Dim RngItem As Range
    Dim RngQty As Range
    Dim RngMin As Range
    Dim RngMax As Range
    Dim RngDistribute As Range
    Dim VarArray() As Variant
    Dim LngRow As Long
    Dim LngCounter01 As Long
    Dim DblQuantity As Double
    Dim BlnInGroup As Boolean
    Dim BlnLastRow As Boolean

    'First data cell of each column.
    Set RngData = Range("A2")
    Set RngItem = Range("A2")
    Set RngQty = Range("B2")
    Set RngMin = Range("E2")
    Set RngMax = Range("F2")
    Set RngDistribute = Range("G2")

    'Extend RngData over the whole table to get the number of data rows.
    Set RngData = Range(RngData, RngData.End(xlToRight).End(xlDown))
    ReDim VarArray(1 To RngData.Rows.Count)

    For LngRow = 1 To RngData.Rows.Count

        'On the first row of an item, start from the item's full quantity.
        If Not BlnInGroup Then
            DblQuantity = RngQty.Offset(LngRow - 1).Value
            BlnInGroup = True
        End If

        'The current row is the last of its item when the next row holds a different item.
        BlnLastRow = (RngItem.Offset(LngRow).Value <> RngItem.Offset(LngRow - 1).Value)

        'Pass 1: give the minimum, or whatever is left if that is less.
        VarArray(LngRow) = Excel.WorksheetFunction.Min(DblQuantity, RngMin.Offset(LngRow - 1).Value)
        DblQuantity = Excel.WorksheetFunction.Max(0, DblQuantity - RngMin.Offset(LngRow - 1).Value)

        If Not BlnLastRow Then
            'Count the rows of this item that precede its last row.
            LngCounter01 = LngCounter01 + 1
        Else
            'Pass 2: walk the item's rows in rank order and top each one up towards its maximum.
            For LngCounter01 = LngCounter01 To 0 Step -1
                VarArray(LngRow - LngCounter01) = VarArray(LngRow - LngCounter01) + Excel.WorksheetFunction.Min(DblQuantity, RngMax.Offset(LngRow - 1 - LngCounter01).Value - RngMin.Offset(LngRow - 1 - LngCounter01).Value)
                DblQuantity = Excel.WorksheetFunction.Max(0, DblQuantity - (RngMax.Offset(LngRow - 1 - LngCounter01).Value - RngMin.Offset(LngRow - 1 - LngCounter01).Value))
            Next
            LngCounter01 = 0
            BlnInGroup = False
        End If

    Next

    'Write the results to the Distribute column in one go.
    RngDistribute.Resize(UBound(VarArray)).Value = Excel.WorksheetFunction.Transpose(VarArray)

End Sub
For the following dataframe, I need to calculate the change in 'uid_count' for each set of date, location_id and uid, and include the set in the results.
# Sample DataFrame
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02'],
                   'location_id': [1001, 2001, 3001, 1001, 2001, 3001],
                   'uid': ['001', '003', '002', '001', '004', '002'],
                   'uid_count': [1, 2, 3, 2, 2, 4]})
   date        location_id  uid  uid_count
0  2021-01-01  1001         001  1
1  2021-01-01  2001         003  2
2  2021-01-01  3001         002  3
3  2021-01-02  1001         001  2
4  2021-01-02  2001         004  2
5  2021-01-02  3001         002  4
My desired results would look like:
# Desired Results
date        location_id  uid
2021-01-01  1001         001    0
            2001         003    0
            3001         002    0
2021-01-02  1001         001    1
            2001         004    0
            3001         002    1
I thought I could do this with groupby as follows, but it doesn't produce the desired calculation:
# Current code:
df.groupby(['date', 'location_id', 'uid'], sort=False).apply(lambda x: (x['uid_count'].values[-1] - x['uid_count'].values[0]))
# Current results:
date        location_id  uid
2021-01-01  1001         001    0
            2001         003    0
            3001         002    0
2021-01-02  1001         001    0
            2001         004    0
            3001         002    0
How can I get the desired results?
The following code works with the test dataframe; I'm not certain how it behaves on a larger dataframe.
.transform() is used to calculate the difference between consecutive occurrences of 'uid_count' for each location_id and uid pair, and it returns a result with the same index as df.
The issue with .groupby(['date','location_id','uid']) is that each group contains only a single row, so the last value minus the first value is always 0.
Remove 'uid_count' at the end with .drop(columns='uid_count'), if desired.
import pandas as pd
# sort the dataframe
df = df.sort_values(['date', 'location_id', 'uid'])
# groupby and transform based on the difference in uid_count
uid_count_diff = df.groupby(['location_id', 'uid']).uid_count.transform(lambda x: x.diff()).fillna(0).astype(int)
# create a column in df
df['uid_count_diff'] = uid_count_diff
# set the index
df = df.set_index(['date', 'location_id', 'uid'])
# result
                                uid_count  uid_count_diff
date        location_id  uid
2021-01-01  1001         001    1          0
            2001         003    2          0
            3001         002    3          0
2021-01-02  1001         001    2          1
            2001         004    2          0
            3001         002    4          1
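The point about single-row groups can be checked directly. The sketch below is a self-contained illustration rather than part of the original answer: it rebuilds the sample dataframe, confirms that every (date, location_id, uid) group holds exactly one row, and then computes the same difference with the built-in groupby .diff(), which should behave like the transform(lambda x: x.diff()) call above.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-01',
                            '2021-01-02', '2021-01-02', '2021-01-02'],
                   'location_id': [1001, 2001, 3001, 1001, 2001, 3001],
                   'uid': ['001', '003', '002', '001', '004', '002'],
                   'uid_count': [1, 2, 3, 2, 2, 4]})

# Grouping on all three keys leaves one row per group, so "last minus first" is always 0.
sizes = df.groupby(['date', 'location_id', 'uid']).size()
print(sizes.max())

# Grouping without 'date' lets each (location_id, uid) pair span several dates.
# The result is aligned on the original index, so it can be assigned straight back.
df['uid_count_diff'] = (df.sort_values('date')
                          .groupby(['location_id', 'uid'])['uid_count']
                          .diff()
                          .fillna(0)
                          .astype(int))

print(df.set_index(['date', 'location_id', 'uid']))
```

The first occurrence of each pair has no earlier row to compare against, which is why its NaN is filled with 0 here; whether 0 is the right value for a first occurrence depends on how the result will be used.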