How to dynamically create a cumulative overall total based on a non-cumulative categorical column in Excel

Slightly wordy title but here goes
I have a grid in Excel with 3 columns (media spend, marginal revenue returns and media channel invested in), and I want to create the column below called desired cumulative spend.
The reason the grid is structured the way it is: it represents an optimised spend laydown, ordered by how much of each media channel's budget should be invested before its marginal returns diminish to the point where it should be substituted for another channel.
It is possible that this substitution is later reversed back to the original channel if the new channel has a sharply diminishing curve, such that all marginal benefit associated with the new channel diminishes and, at the total spend level reached, it is still mathematically sensible to switch back to the original curve (it may have a lower base level but reduce less sharply). It is also possible that, at the point where the marginal benefit associated with the new channel diminishes, the best next step is to invest in a third channel.
The desired new spend column has two elements to it:

- it is a simple accumulation of spend from row to row when the media channel is constant from row to row;
- it is a slightly trickier accumulation of spend when the media channel changes: it then needs to reference back to the last spend level associated with the channel being substituted back in. For row 4, the logic I am struggling with would need the running total from row 3, plus the new spend level in row 4, minus the spend level from the last time this channel was used (row 2).
| row | spend | mar return | media | desired cumulative spend |
| --- | ----- | ---------- | ----- | ------------------------ |
| 1 | £580 | 128 | chan1 | 580 |
| 2 | £620 | 121 | chan1 | 580+(620-580) |
| 3 | £900 | 115.8 | chan2 | 580+(620-580)+900 |
| 4 | £660 | 115.1 | chan1 | 580+(620-580)+900+(660-620) |
| 5 | £920 | 114 | chan2 | 580+(620-580)+900+(660-620)+(920-900) |
| 6 | £940 | 112 | chan2 | 580+(620-580)+900+(660-620)+(920-900)+(940-920) |

If my comment is the correct suggestion, then something like this should do it (£580 is in A2, so the first output is in D2):
D2 =A2
D3 =D2+A3-IF(COUNTIF($C$2:C2,C3),INDEX(A:A,MAX(IF($C$2:C2=C3,ROW($A$2:A2)))))
D3 contains an array formula and must be confirmed with Ctrl+Shift+Enter.
Now you can simply copy down from D3.
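For checking the formula's logic outside Excel, the same running total can be sketched in Python (a minimal sketch; spend/channel pairs taken from the question's table):

```python
def cumulative_spend(rows):
    """rows: list of (spend, channel) in optimised order.
    Each step adds only the incremental spend for that channel,
    i.e. the new spend level minus the channel's last spend level."""
    last = {}    # last spend level seen per channel
    total = 0
    out = []
    for spend, chan in rows:
        total += spend - last.get(chan, 0)
        last[chan] = spend
        out.append(total)
    return out

laydown = [(580, "chan1"), (620, "chan1"), (900, "chan2"),
           (660, "chan1"), (920, "chan2"), (940, "chan2")]
# cumulative_spend(laydown) -> [580, 620, 1520, 1560, 1580, 1600]
```

Row 4 illustrates the tricky case: the total goes from 1520 to 1560 because only 660-620 is added, matching the INDEX/MAX lookup in the formula above.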


How to reference another cell in a chart based on an aggregating total in another cell

So, the title might be confusing, so I'll outline it like this:
I am making a weight-loss chart. One of the clients gets to open a bag of Legos as a reward for every 2 lbs that he loses, as long as he does it following a goal progression. For instance, if he weighs 260 and loses 2 lbs, he gets his reward. However, if he gains a lb, he now has to lose 3 lbs to get his reward.
Currently, I have charts that look like this:
| Column O | Column P |
| -------- | -------- |
| Current Weight | Amount Lost |
| 263 | 8 |

| Column L | Column M |
| -------- | -------- |
| Next Lego Bag | 261 |
| Lbs until next bag | 2 |
After he hits 261, I want that cell that says 261 in Col M to say "259". So if he weighs in again, I want it to look like this automatically.
| Column O | Column P |
| -------- | -------- |
| Current Weight | Amount Lost |
| 260.5 | 10.5 |

| Column L | Column M |
| -------- | -------- |
| Next Lego Bag | 259 |
| Lbs until next bag | 1.5 |
What is the best way to automatically make that cell in Column M change when he hits the 2 lb goal? I have a table that lists all the goal weights he needs to hit for each reward. It looks like this:
| Column Z | Column AA | Column AB |
| -------- | -------- | -------- |
| Bag | Target Weight | Amount Lost |
| Bag 5 | 261 | 8 |
| Bag 6 | 259 | 10 |
| Bag 7 | 257 | 12 |
| Bag 8 | 255 | 14 |
| Bag 9 | 253 | 16 |
etc
I've tried a few things, but I'm coming up blank, because the amount he loses won't always be in whole numbers, so matching it to the target weight has been tough.
In really, really simple terms, I need it to basically say this:
If current weight > goal 1, then A1 = goal 1. If current weight < Goal 1, then A1 = Goal 2, and all the way to Goal 21. However, A1 can't change to the next goal until current weight is less than that goal.
Thanks all
I have tried IF statements and FLOOR statements to get the value to update automatically, but it's not working.
In M2: =IF(MOD(O2+1,2)=0,2,MOD(O2+1,2))
In M1:
=O2-M2
Or using O365 in M1:
=LET(m,MOD(O2+1,2),
lbs,IF(m=0,2,m),
VSTACK(O2-lbs,lbs))
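The MOD trick above can be checked outside Excel. A minimal Python sketch of the same logic, assuming (as in the answer) that the goals sit on the odd numbers below the current weight (261, 259, ...):

```python
def next_goal(current_weight):
    """Return (next reward weight, lbs until next bag),
    mirroring M2 = IF(MOD(O2+1,2)=0, 2, MOD(O2+1,2)) and M1 = O2-M2."""
    m = (current_weight + 1) % 2   # distance past the last odd goal
    lbs = 2 if m == 0 else m       # exactly on a goal -> full 2 lbs to the next
    return current_weight - lbs, lbs

# next_goal(263)   -> (261, 2)
# next_goal(260.5) -> (259.0, 1.5)
```

Both results match the two chart snapshots in the question: at 263 the next bag is at 261 with 2 lbs to go; at 260.5 it is 259 with 1.5 lbs to go.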

EXCEL: How to automatically create groups based on sum being less than X and not greater than Y

I have table in Excel with some information, the main column is Weight (in KG).
I need Excel to group Rows into groups, where each group's sum of Weight (in KG) is less than 24000 kg and greater than 23500 kg.
To do so manually is very time consuming, since there are thousands of rows with different Weight values.
table example:
ID | Weight (KG)
1 | 11360
2 | 22570
3 | 10440
4 | 20850
5 | 9980
6 | 9950
7 | 19930
8 | 9930
9 | 9616
10 | 9580
... and so on
The closest I got to solving the problem is adding 3 new columns: Total, Starts Group and Group Number.
Total function: =IF(SUM(B3+C2)>24000,B3,SUM(B3+C2)) - calculates current sum of Weight values in the current group
Starts group function: =SUM(B3+C2)>24000 - returns TRUE when the current row starts a new group
Group number function: =IF(D3,E2+1,E2) - all rows that contain same number are in the same group
The problem with this is that it doesn't ensure groups are greater than 23500 kg too, only that they are less than 24000 kg.
It doesn't have to be in Excel, any app/script would work too, it just has to get the job done.
Desired output:
| ID | Weight (KG) | Group ID |
| -- | ----------- | -------- |
| 1 | 11360 | 1 |
| 2 | 2570 | 2 |
| 3 | 10440 | 1 |
| 4 | 20850 | 2 |
| 5 | 180 | 2 |
| 6 | 1950 | 1 |
So I want to get groups similar to these:
Group number 1 - Total 23750 kg
Group number 2 - Total 23600 kg
Url to my example table with functions I added:
https://1drv.ms/x/s!Au0UogL2uddbgTFJJ4TzSKLhPFPE?e=r02sPX
You may want to try this for total:
=IF(SUM(B3+C2)>24000;B3;IF(SUM(B3+C2)<=23500;SUM(B3+C2);B3))
edit:
I just saw you pasted the proposal into your sample file. You may need to replace the ; with , due to regional format settings.
The limitation remains:
first priority is <24k and second priority is >=23.5k
If the next row’s value makes the “jump” above 24k you may end up remaining below 23.5k and switching to the next group
edit2:
You may want to look up some optimization models and algorithms for your combination problem before trying to implement it in Excel.
Or try simple rules, e.g. categorising your rows by weight (over 20k, 16k, 12k, 8k, 4k, 2k, 1k, 500, etc.) and trying to group/combine them accordingly.
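Since row-by-row Excel formulas can't look ahead, this is really a small combinatorial problem (a bin-covering variant). A rough greedy heuristic in Python, using the question's 23500-24000 bounds; it is only a sketch and is not guaranteed to group every row:

```python
def greedy_groups(weights, low=23500, high=24000):
    """First-fit on descending weights: add each weight to the first
    open group it fits under `high`; close a group once its sum
    reaches `low`. Leftover open groups may remain below `low`."""
    closed, open_groups = [], []
    for w in sorted(weights, reverse=True):
        for g in open_groups:
            if sum(g) + w <= high:
                g.append(w)
                break
        else:                       # no open group can take it
            g = [w]
            open_groups.append(g)
        if sum(g) >= low:           # group is full enough: close it
            open_groups.remove(g)
            closed.append(g)
    return closed, open_groups

# greedy_groups([12000, 12000, 11800, 11600])
# -> closed [[12000, 12000]], leftovers [[11800, 11600]]
```

For thousands of rows a proper optimisation model (e.g. an integer program) will do better, but this shows the shape of an algorithmic approach.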

Average of ratios ("on the fly") in an Excel table

I have this Excel table used as a DB named "csv":

| Ticket | agent_wait | client_wait |
| ------ | ---------- | ----------- |
| 1 | 200 | 105 |
| 2 | 10 | 50 |
| 3 | 172 | 324 |
I'd like to calculate the average of the agent wait ratios, with ratio_agent calculated as agent_wait / (agent_wait + client_wait).
If the table were like this:

| Ticket | agent_wait | client_wait | ratio_agent |
| ------ | ---------- | ----------- | ----------- |
| 1 | 200 | 105 | 0.65 |
| 2 | 10 | 50 | 0.16 |
| 3 | 172 | 324 | 0.35 |
I'd just do the average of the ratio_agent column with =AVERAGE(csv[ratio_agent]).
The problem is that this last column does not exist and I don't want to create an additional column just for this calculation.
Is there a way to do this with only a formula ?
I already tried
=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait])) but it gives me the answer for only one line.
You can use the formula you have, but you need to enter it as an array formula. What this means is: after typing the formula, do not just press Enter, but hold Ctrl+Shift and then press Enter. The formula will turn into this after you do that:
{=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait]))}
and give you the value you are looking for. Swap the columns (the first csv[agent_wait] becomes csv[client_wait]) if you are looking for the average client_wait ratio instead.
It occurs to me that your question might be an XY problem. Please have a read of this answer; it might help you decide what you are actually looking for.
In brief if you want a measure of how much time:
agents spend waiting, out of all the waiting between agents and clients: calculate the totals first and take the ratio of those totals. Outliers (e.g. a special case where an agent spent far more time on a client than the client themselves) will heavily affect this measure. Use it if you want to know how much time agents spend waiting as opposed to how much clients wait.
=SUM(csv[agent_wait])/SUM(csv[agent_wait]+csv[client_wait])
agents each spend waiting on any particular call, calculate the ratios first then the average of these. Outliers will not affect this measure by much and give an expected ratio of time an agent might spend on any interaction with a client. Use this measure if you want to have a guideline as to how much an agent should spend waiting for each unit of time a client spends waiting.
=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait]))
It also wouldn't be correct to do the =AVERAGE(csv[ratio_agent]) calculation: an average of averages isn't the overall average. You need to sum the parts and then compute the overall ratio from those totals.
Ticket | agent_wait | client_wait | ratio_agent
------ | ---------- | ----------- | -----------
1 | 200 | 105 | 0.656
2 | 10 | 50 | 0.167
3 | 172 | 324 | 0.347
Total | 382 | 479 | ?????
The question is what goes in for the ?????.
If you take the average of the ratio_agent column (i.e. =AVERAGE(csv[ratio_agent])) then you get 0.390.
But if you compute the ratio again, but with the column totals, like =csv[[#Totals],[agent_wait]]/(csv[[#Totals],[agent_wait]]+csv[[#Totals],[client_wait]]), then you get the true answer: 0.444.
To see how this is true try this set of data:
Ticket | agent_wait | client_wait | ratio_agent
------ | ---------- | ----------- | -----------
1 | 2000 | 2000 | 0.500
2 | 10 | 1 | 0.909
Total | 2010 | 2001 |
The average of the two ratios is 0.705, but it should be clear that if the total agent wait was 2010 and the total client wait was 2001 then the true average ratio must be closer to 0.500.
Computing it using the correct calculation you get 0.501.
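Both measures can be reproduced side by side in Python with the question's numbers, which makes the gap concrete:

```python
agent_wait = [200, 10, 172]
client_wait = [105, 50, 324]

# per-ticket view: ratio for each row, then the mean of those ratios
ratios = [a / (a + c) for a, c in zip(agent_wait, client_wait)]
avg_of_ratios = sum(ratios) / len(ratios)          # ~0.390

# overall view: totals first, then one ratio
ratio_of_totals = sum(agent_wait) / (sum(agent_wait) + sum(client_wait))  # ~0.444
```

Which one is "correct" depends on the question being asked, as the answer above explains: the per-ticket mean weights every ticket equally, while the ratio of totals weights tickets by their total wait time.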

Spark: count events based on two columns

I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that increases by 1 for each event and keeps increasing when the next visit starts.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
| uid | visit_num | event_num | interaction_num |
| --- | --------- | --------- | --------------- |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more than one row has the same visit_num and event_num. I've tried using the window function below, but because it assigns the same rank to tied rows, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required but generally, this should work.
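If tied (visit_num, event_num) pairs must still receive distinct, consecutive numbers, fn.row_number() over the same window breaks ties arbitrarily but stays consecutive, unlike fn.rank(). The per-uid numbering it produces can be sketched in plain Python for illustration (a stand-in for the Spark job, not the job itself):

```python
from itertools import groupby

def interaction_nums(rows):
    """rows: (uid, visit_num, event_num) tuples.
    Tags each row with a consecutive per-uid counter starting at 1,
    mimicking row_number() over Window.partitionBy("uid")
    .orderBy("visit_num", "event_num")."""
    out = []
    for uid, grp in groupby(sorted(rows), key=lambda r: r[0]):
        for i, (u, v, e) in enumerate(grp, start=1):
            out.append((u, v, e, i))
    return out
```

In PySpark the one-line equivalent is df.withColumn("iid", fn.row_number().over(iid_window)); rank() keeps tied rows equal, which is why the question's Edit 2 ruled it out for duplicated keys.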

Scaling values with a known upper limit

I have a column of values in Excel that I need to modify by a scale factor. Original column example:
| Value |
|:-----:|
| 75 |
| 25 |
| 25 |
| 50 |
| 0 |
| 0 |
| 100 |
Scale factor: 1.5
| Value |
|:-----:|
| 112.5 |
| 37.5 |
| 37.5 |
| 75 |
| 0 |
| 0 |
| 150 |
The problem is I need them to be within a range of 0-100. My first thought was to take them as percentages of 100, but I quickly realized that this would be going in circles.
Is there some mathematical method or Excel formula I could use to handle this so that the changes stay meaningful: 150 becomes 100, but 37.5 might not simply map back to 25, so I'm not just canceling out my scale factor?
Assuming your data begin in cell A1, you can use this formula:
=MIN(100,A1*1.5)
Copy downward as needed.
You could do something like:
ScaledValue = (v - MIN(AllValues)) / (MAX(AllValues) - MIN(AllValues)) * (SCALE_MAX - SCALE_MIN) + SCALE_MIN
Say your raw data (a.k.a. AllValues) ranges from a MIN of 15 to a MAX of 83, and you want to scale it to a range of 0 to 100. To do that you would set SCALE_MIN = 0 and SCALE_MAX = 100. In the above equation, v is any single value in the data.
Hope that helps
Another option is:
ScaledValue = PERCENTRANK.INC(AllValues, v)
In contrast to my earlier suggestion, (linear --- preserves relative spacing of the data points), this preserves the order of the data but not spacing. Using PERCENTRANK.INC will have the effect that sparse data will get compressed closer together, and bunched data will get spread out.
You could also do a weighted combination of the two methods --- give the linear method a weight of say 0.5 so that relative spacing is partially preserved.
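The linear min-max method above, sketched in Python. The sample values 15, 40, 83 are hypothetical, chosen to match the answer's illustrative MIN of 15 and MAX of 83:

```python
def minmax_scale(values, lo=0.0, hi=100.0):
    """Linear rescale: min(values) -> lo, max(values) -> hi,
    preserving the relative spacing of the points in between."""
    vmin, vmax = min(values), max(values)
    return [(v - vmin) / (vmax - vmin) * (hi - lo) + lo for v in values]

# minmax_scale([15, 40, 83]) -> [0.0, 36.76..., 100.0]
```

Note one caveat for the question's own data: because its minimum is 0 and its maximum is exactly 100 * scale factor, min-max scaling would undo the scale factor again; the MIN(100, ...) clip or the PERCENTRANK.INC approach avoids that.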
