Performing Calculations From A Pandas Data Frame with Multiple Conditions - python-3.x

Forgive the question as I'm a science major, not computer science and I'm teaching myself Python to help with a class project.
I have a Pandas data frame that I've imported from a .csv that looks like:
Item_ID Event_ID Value
27 83531 2533501.8
28 83531 1616262
31 83531 269829
32 83531 55.8
33 83531 269829
34 83531 4882
35 83531 269829
36 83531 4882
37 83531 55.8
38 83531 55.8
27 83532 7137904.8
28 83532 5873877.6
31 83532 497381
32 83532 55.7
33 83532 497381
34 83532 7568
35 83532 497381
36 83532 7568
37 83532 55.7
38 83532 55.7
This data comes from manual entries made several times a day, where Item_ID is the type of measurement, Event_ID is the unique identifier for each "data entry event" by the user, and Value is the value of the measurement.
I need to perform a number of calculations on each unique Event_ID.
Calc1 = ([28]/[27])*(([31]*[32])/[28])*(([33]-[34])/[33])
Calc2 = [36]/[35]
Calc3 = ([35]-[113])/[35]
Calc4 = [37]
Calc5 = [38]
Each number in the above formulas represents an Item_ID. I want to replace each Item_ID in a formula with the Value from that Item_ID's row for the given Event_ID.
This project was started a month ago and will run for 6 more weeks. By then, there will be too many data points to perform the calculations by hand.
Because these calculations cannot be performed across Event_IDs, the formulas for Event_ID 83531 would look like:
Calc1_Data = ([1616262]/[2533501.8])*(([269829]*[55.8])/[1616262])*(([269829]-[4882])/[269829])
Calc2_Data = [4882]/[269829]
Calc3_Data = ([269829]-[0])/[269829] ***0 would be placed here, as Item_ID 113 does not exist for this Event_ID
Calc4_Data = [55.8]
Calc5_Data = [55.8]
The results would then be put into a new data frame that I could then perform my analysis on.
Event_ID Calc1_Result Calc2_Result Calc3_Result Calc4_Result Calc5_Result
85829
85830
85331 RESULTS HERE
85332 RESULTS HERE
85833
85834
This is my first go at asking a question here since I've been able to find all of my other answers in the library docs or previously asked questions. If I didn't provide enough information let me know and I'll clarify if possible.
Thanks

You can use groupby followed by agg to do that.
First, define your calculations as functions:
# Define calculations. Each function receives the Value column for one
# Event_ID as a Series indexed by Item_ID, so x[28] is the value of Item_ID 28.
def Calc1(x):
    return (x[28]/x[27])*((x[31]*x[32])/x[28])*((x[33]-x[34])/x[33])
def Calc2(x):
    return x[36]/x[35]
# Calc3 = lambda x: (x[35]-x[113])/x[35] # commented out because there's no 113 in the provided example
def Calc4(x):
    return x[37]
def Calc5(x):
    return x[38]
Then, perform the calculations using groupby and agg:
df = df.set_index('Item_ID') # make 'Item_ID' the index so the functions can look values up by Item_ID with less code
df = df.groupby('Event_ID').agg([Calc1, Calc2, Calc4, Calc5]) # group by Event_ID and run each calculation per group
df.columns = df.columns.droplevel(0) # drop the outer 'Value' level so the columns are just the calculation names
Output:
Calc1 Calc2 Calc4 Calc5
Event_ID
83531 5.835418 0.018093 55.8 55.8
83532 3.822212 0.015216 55.7 55.7
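Calc3 was left out above because Item_ID 113 never appears in the sample. One possible way to include it is sketched below: pivot the data so that each Item_ID becomes a column, then treat a missing 113 as 0, as the question asks. This is an alternative to the groupby/agg approach rather than part of the original answer. The names `needed`, `wide` and `results` are illustrative, and `pivot` assumes each (Event_ID, Item_ID) pair occurs only once.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Item_ID": [27, 28, 31, 32, 33, 34, 35, 36, 37, 38] * 2,
    "Event_ID": [83531] * 10 + [83532] * 10,
    "Value": [2533501.8, 1616262, 269829, 55.8, 269829, 4882, 269829, 4882, 55.8, 55.8,
              7137904.8, 5873877.6, 497381, 55.7, 497381, 7568, 497381, 7568, 55.7, 55.7],
})

# Item_IDs used by the five formulas (hypothetical helper list).
needed = [27, 28, 31, 32, 33, 34, 35, 36, 37, 38, 113]

# One row per Event_ID, one column per Item_ID.
# Assumption: no Event_ID contains the same Item_ID twice.
wide = df.pivot(index="Event_ID", columns="Item_ID", values="Value")
wide = wide.reindex(columns=needed)   # adds 113 as an all-NaN column if absent
wide[113] = wide[113].fillna(0)       # the question asks for 0 when 113 is missing

results = pd.DataFrame({
    "Calc1_Result": (wide[28] / wide[27]) * ((wide[31] * wide[32]) / wide[28])
                    * ((wide[33] - wide[34]) / wide[33]),
    "Calc2_Result": wide[36] / wide[35],
    "Calc3_Result": (wide[35] - wide[113]) / wide[35],
    "Calc4_Result": wide[37],
    "Calc5_Result": wide[38],
})
print(results)
```

Only 113 is defaulted to 0 here; any other missing Item_ID stays NaN, so an incomplete entry shows up as a missing result instead of a silently wrong number.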

Related

How to add new calculations to existing table based on text field with multiple entries?

Question: How can I populate the [score [TXT]] columns with the specified calculation? Sometimes the calculation will be based on multiple rows, depending on the value in the [game] column.
I have a table with Metascores and game names, and want to apply some sort of formula that automatically calculates the AVG, MAX, and MIN for each entry. The second table below shows my desired output. I am using Office 365 - Excel.
Current table
Metascore  score AVG  score MAX  score MIN  game
87                                          Assassin's Creed Odyssey
86                                          Assassin's Creed Odyssey
83                                          Assassin's Creed Odyssey
66                                          Bleeding Edge
62                                          Bleeding Edge
Desired output
Metascore  score AVG  score MAX  score MIN  game
87         85.3       87         83         Assassin's Creed Odyssey
86         85.3       87         83         Assassin's Creed Odyssey
83         85.3       87         83         Assassin's Creed Odyssey
66         64         66         62         Bleeding Edge
62         64         66         62         Bleeding Edge
Some titles occur only once, some several times. Is there a formula or script I can apply that loops through the table and applies the calculation, or can you suggest a different output?
Thanks for your help!!
Put =UNIQUE(E2:E6) in, for instance, E10.
Then put =AVERAGEIF($E$2:$E$6,$E$10#,A2:A6) in A10 and copy it to the right.
Or in one go using LET:
=LET(data,A2:E6,
game,INDEX(data,,5),
unique,UNIQUE(game),
CHOOSE({1,2,3,4,5},
AVERAGEIF(game, unique,INDEX(data,,1)),
AVERAGEIF(game, unique,INDEX(data,,2)),
AVERAGEIF(game, unique,INDEX(data,,3)),
AVERAGEIF(game, unique,INDEX(data,,4)),
unique))
Edit:
After the changed description, this is what you need:
In B2 use: =AVERAGEIF(E2:E6,E2:E6,A2:A6)
In C2 use: =MAXIFS(A2:A6,E2:E6,E2:E6)
In D2 use: =MINIFS(A2:A6,E2:E6,E2:E6)
If you had your data in two columns (let's say A2:B6, with the score in column A and the game in column B) and wanted a summary elsewhere, you could use the following:
=LET(data,A2:B6,
game,INDEX(data,,2),
score,INDEX(data,,1),
unique,UNIQUE(game),
CHOOSE({1,2,3,4},
unique,
AVERAGEIF(game,unique,score),
MAXIFS(score,game,unique),
MINIFS(score,game,unique)))
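For readers who do the same job outside Excel, the per-row AVG/MAX/MIN columns can be sketched in pandas. This is only an illustration of the same grouping idea, not part of the Excel answer; the DataFrame name `scores` is made up, and the values are the five rows from the question.

```python
import pandas as pd

# The five rows from the question's table.
scores = pd.DataFrame({
    "Metascore": [87, 86, 83, 66, 62],
    "game": ["Assassin's Creed Odyssey"] * 3 + ["Bleeding Edge"] * 2,
})

# transform() returns one value per original row, so every row of a game
# receives that game's aggregate, like the spilled AVERAGEIF/MAXIFS/MINIFS.
by_game = scores.groupby("game")["Metascore"]
scores["score AVG"] = by_game.transform("mean").round(1)
scores["score MAX"] = by_game.transform("max")
scores["score MIN"] = by_game.transform("min")
print(scores)
```

Using `scores.groupby("game")["Metascore"].agg(["mean", "max", "min"])` instead would give one row per game, which corresponds to the UNIQUE-based summary.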

Cohort in Excel with aggregated monthly data

I'm trying to make a cohort in Excel Pivot with a dataset having:
the aggregated number of monthly sign-ups (month by month), the aggregated number of users who completed the next step, and the number of months between sign-up and the next action taken.
What I can't figure out is what to put into the value field when I build the pivot for the cohort. Normally I would use the Customer IDs as the value, but since I only have data at an aggregated monthly level, I'm not sure whether to use the number of sign-ups or the number of next steps completed.
Also, how do I get the sum of each cohort so I can calculate the retention rate?
I hope this makes sense.
Signup month, Action completed month, Months between sign up and action completed, signups, conversion to Action completed
Jan-17 Sep-18 20 95 71
Jan-17 Jan-18 12 95 77
Jan-17 Jun-18 17 96 72
Jan-17 Jan-18 12 92 78
Jan-17 Dec-18 23 91 78
Jan-17 Jul-18 18 100 73
Jan-17 Oct-18 21 92 79
Jan-17 Feb-18 13 95 70
Jan-17 Jan-18 12 91 79
Jan-17 May-18 16 93 71
Jan-17 Jun-18 17 95 72
Is this what you are looking to achieve?
REVISION #1
This layout shows the total number of signups, by the month in which the signup occurred, distributed by the number of months between the signup and the completed action. The action completed month may be omitted and the result will be the same; it is there FYI only.
REVISION #2
This is an example of the average months between the signup and action. Is this what you are looking for?
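The two layouts described in the revisions can also be sketched in pandas, using the sample rows from the question. This is an illustration under assumptions, not a reproduction of the Excel pivot: the short column names are made up, and the average in the second step is a plain unweighted mean over rows, which may or may not be what the original pivot computed.

```python
import pandas as pd

# Sample rows from the question (all belong to the Jan-17 signup cohort).
cohort = pd.DataFrame({
    "signup_month": ["Jan-17"] * 11,
    "months_between": [20, 12, 17, 12, 23, 18, 21, 13, 12, 16, 17],
    "signups": [95, 95, 96, 92, 91, 100, 92, 95, 91, 93, 95],
    "completed": [71, 77, 72, 78, 78, 73, 79, 70, 79, 71, 72],
})

# Revision #1: signups per signup month, spread across "months between".
signups_by_age = cohort.pivot_table(
    index="signup_month",
    columns="months_between",
    values="signups",
    aggfunc="sum",
)

# Revision #2: average months between signup and action per signup month
# (assumption: unweighted mean over rows).
avg_months = cohort.groupby("signup_month")["months_between"].mean()

print(signups_by_age)
print(avg_months)
```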

DAX help: % monthly share of another table

I have a DAX formula for my Power Pivot that I cannot get to work and was hoping for help.
I have two pivot tables that are already connected:
1. A cohort of actions taken within Month 1,...,X of the sign-up month
2. Total sign-ups on a monthly basis
I tried to attach the sheet here but could not, so I have added a screenshot of the sheet instead.
What I have so far is:
=DIVIDE(
SUM(Range[conversion to KYC completed]),
SUM('Range 1'[Sum of signups]))
But this does not give me what I want as I think I’m missing the monthly grouping somehow.
Question 1:
What I want is the share of actions completed within 1,...,X months out of the total sign-ups for the given month (e.g. Jan), i.e. the data from Table 2.
Question 2:
Ideally I would also like to show the total sign-ups at the beginning of the cohort to make it easier to understand, i.e. the monthly total sign-ups on which the cohort is calculated. But so far I cannot get just the totals month by month. Is there any way to add a monthly total column to the pivot without applying these numbers as a value across all columns?
Something like this is the ultimate outcome for me.
UPDATED WITH SAMPLE DATA
Signup month, KYC completed month, Age by month, signups, conversion to KYC completed
Jan-17 Jul-18 18 97 75
Jan-17 Jul-18 18 99 79
Jan-17 Dec-18 23 95 80
Feb-17 May-18 15 99 74
Feb-17 Jul-18 17 90 75
Feb-17 Jul-18 17 95 76
Feb-17 Aug-18 18 92 71
Mar-17 May-18 14 94 73
Apr-17 Jul-18 15 93 75
May-17 Sep-18 16 94 70
May-17 Oct-18 17 98 72
Jun-17 May-18 11 95 79
Jul-17 Oct-18 15 97 74
Jul-17 Jul-18 12 94 78
Aug-17 Sep-18 13 96 74
Sep-17 Nov-18 14 95 80
Sep-17 Oct-18 13 94 79
DESIRED OUTCOME
The % for Month 1....X is calculated KYC Completed / Monthly Sign up
OUTPUT WITH THIS CODE
=VAR SignUpMonth = IF(HASONEVALUE('Range 1'[Row Labels]), BLANK())
RETURN
DIVIDE(CALCULATE(SUM([conversion to KYC completed])),
CALCULATE(SUM('Range 1'[Sum of signups]),
FILTER(ALL(Range), Range[Signup month (Month Index)] = SignUpMonth)))
Thanks for the sample data Franzi. Still not too clear what you're asking for, but perhaps this will help a little.
Signed Up to Signed In Ratio =
VAR SignUpMonth = SELECTEDVALUE(Table1[Signup month], BLANK())
RETURN
DIVIDE(CALCULATE(SUM([conversion to KYC completed])),
CALCULATE(SUM(Table1[ signups]),
FILTER(ALL(Table1), Table1[Signup month] = SignUpMonth)))
So. Let's break it down.
If I understand correctly, you want the number of sign-ins for each combination of month (x axis) and signup month (y axis), divided by the total signups for that signup month (y axis).
Number of sign-ins for a given month (x axis) and signup month (y axis):
CALCULATE(SUM([conversion to KYC completed]))
Total signups (y axis) per signup month:
CALCULATE(SUM(Table1[ signups]),
FILTER(ALL(Table1), Table1[Signup month] = SignUpMonth))
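To check the numbers the measure should produce, the same ratio can be sketched in pandas on the Jan-17 and Feb-17 rows of the sample data. This is a cross-check under assumptions, not DAX: the DataFrame and column names are made up, and the denominator is taken to be the sum of the signups column over all rows of a signup month, which is what the FILTER(ALL(...)) expression computes.

```python
import pandas as pd

# Jan-17 and Feb-17 rows from the sample data in the question.
table1 = pd.DataFrame({
    "signup_month": ["Jan-17"] * 3 + ["Feb-17"] * 4,
    "age_months": [18, 18, 23, 15, 17, 17, 18],
    "signups": [97, 99, 95, 99, 90, 95, 92],
    "kyc_completed": [75, 79, 80, 74, 75, 76, 71],
})

# Numerator: KYC completions per (signup month, age in months) cell.
kyc = table1.pivot_table(
    index="signup_month",
    columns="age_months",
    values="kyc_completed",
    aggfunc="sum",
)

# Denominator: total signups per signup month, ignoring the age column.
total_signups = table1.groupby("signup_month")["signups"].sum()

# Divide each row of the cohort grid by that month's total signups.
share = kyc.div(total_signups, axis=0)
print(share)
```

Cells with no completions for a given age come out as NaN, which corresponds to the blank cells in a pivot.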

Adding Unique values and excluding a constant value

I need to add unique values and make sure that I am excluding a constant value (eg 10)
Acct # Value
9xxx123 50
9xxx123 50
9xxx123 10
9xxx123 15
9xxx234 10
9xxx234 25
9xxx234 25
9xxx234 30
The answer should be: 9xxx123 = 65 and for 9xxx234 = 55
On a different thread someone already suggested using the following:
=SUMPRODUCT((($A$2:$A$9=E2)*$B$2:$B$9)/(COUNTIFS($A$2:$A$9,E2,$B$2:$B$9,$B$2:$B$9)+($A$2:$A$9<>E2)))
But now I need to exclude the constant value.
Thanks!
Leo
To make the existing formula skip the value 10:
=SUMPRODUCT((($A$2:$A$9=E2)*($B$2:$B$9<>10)*$B$2:$B$9)/(COUNTIFS($A$2:$A$9,E2,$B$2:$B$9,$B$2:$B$9,$B$2:$B$9,"<>" & 10)+($A$2:$A$9<>E2)+($B$2:$B$9 = 10)))
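The expected totals can be double-checked with a short pandas sketch that applies the same rule: drop the constant, keep each value once per account, then sum. The column names `acct` and `value` and the constant name `EXCLUDED` are illustrative choices, not taken from the spreadsheet.

```python
import pandas as pd

# Data from the question.
accounts = pd.DataFrame({
    "acct": ["9xxx123"] * 4 + ["9xxx234"] * 4,
    "value": [50, 50, 10, 15, 10, 25, 25, 30],
})

EXCLUDED = 10  # the constant value that must not be counted

totals = (
    accounts[accounts["value"] != EXCLUDED]  # drop the constant
    .drop_duplicates()                       # keep each (acct, value) pair once
    .groupby("acct")["value"]
    .sum()
)
print(totals)
```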

I have a spreadsheet with rows of text in a single column

IE:
23 HL*3*2*23*0
24 PAT*19
25 NM1*QC*1*CUSTOMER*COLE
26 N3*228 PINEAPPLE CIRCLE
27 N4*CORA*PA*15108
28 DMG*D8*19940921*M
29 CLM*945405*5332.54***12>B>1*Y*A*Y*Y*P
30 HI*BK>2533
31 LX*1
32 SV1*HC>J2941*5332.54*UN*84***1
33 DTP*472*RD8*20110511-20110511
34 REF*6R*1099999731
35 NTE*ADD*GENERIC 12MG CARTRIDGE
36 LIN**N4*00013264681
37 CTP****7*UN
I want to populate column C with the claim number from row 29 ("945405"), starting at the CLM row and continuing down to row 37 (the row containing "CTP"). I cannot do this in VBA due to permissions. Is there a formula that will grab this value (the row always has the form CLM * xxxxxx *...) and assign it to column C, using the CLM row as the first row and the CTP row as the last, for every such block throughout the spreadsheet? For example:
23 HL*3*2*23*0
24 PAT*19
25 NM1*QC*1*CUSTOMER*COLE
26 N3*228 PINEAPPLE CIRCLE
27 N4*CORA*PA*15108
28 DMG*D8*19940921*M
29 CLM*945405*5332.54***12>B>1*Y*A*Y*Y*P 945405
30 HI*BK>2533 945405
31 LX*1 945405
32 SV1*HC>J2941*5332.54*UN*84***1 945405
33 DTP*472*RD8*20110511-20110511 945405
34 REF*6R*1099999731 945405
35 NTE*ADD*GENERIC 12MG CARTRIDGE 945405
36 LIN**N4*00013264681 945405
37 CTP****7*UN 945405
38 NM1*DK*1*PATIENT*DEBORAH****XX*1
39 N3*123 MAIN ST*APT B
****Update*****
I was given permissions in VBA. How would I loop this?
You can use the =MID(Source_Cell, Start_Position, Desired_Length) function to pull the substring. In your case it would be:
=MID(B29, 5, 6)
You can then put this formula in all of the cells you'd like it to be in.
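The fill-down logic itself (start at a CLM row, stop after the next CTP row) can be sketched as a small loop. It is written in Python here purely to show the logic, since the question's update asks how to loop; it is not VBA and not part of the answer above. The function name `tag_claim_ids` is made up, and the sketch assumes every claim block starts with a CLM segment and ends with a CTP segment, as in the sample.

```python
def tag_claim_ids(segments):
    """Pair each segment with the claim ID of the CLM..CTP block it belongs to.

    Segments outside any block are paired with an empty string.
    """
    tagged = []
    current = ""
    for segment in segments:
        fields = segment.split("*")
        if fields[0] == "CLM" and len(fields) > 1:
            current = fields[1]          # e.g. "945405" from "CLM*945405*..."
        tagged.append((segment, current))
        if fields[0] == "CTP":
            current = ""                 # the block ends after the CTP row
    return tagged


# Rows 28-38 from the question.
rows = [
    "DMG*D8*19940921*M",
    "CLM*945405*5332.54***12>B>1*Y*A*Y*Y*P",
    "HI*BK>2533",
    "LX*1",
    "SV1*HC>J2941*5332.54*UN*84***1",
    "DTP*472*RD8*20110511-20110511",
    "REF*6R*1099999731",
    "NTE*ADD*GENERIC 12MG CARTRIDGE",
    "LIN**N4*00013264681",
    "CTP****7*UN",
    "NM1*DK*1*PATIENT*DEBORAH****XX*1",
]

for segment, claim_id in tag_claim_ids(rows):
    print(segment, claim_id)
```

Splitting on `*` rather than taking a fixed-width slice means the claim ID does not have to be exactly six characters long.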
