Optimising & Summarising Large Formulas - excel

I'm working on a spreadsheet which will forecast the changes to certain costs in our building business based on estimated inputs.
For example, we may speculate that the price for a carpenter to complete a fitout will increase by $8 per m2 in Brisbane in August. We would write this data as:
Area = Brisbane
Month = August
Cost Centre = Carpenter Fitout = 150
Increase = $8
Unit = m2
(We split each of the building costs into different cost centres, represented numerically, which is why "Carpenter Fitout" is recorded as centre 150.)
Based on this data, we can speculate how much each cost will increase in the coming months, and this is what I'm trying to do automatically.
The following are representations of the tables that I'm using in the spreadsheets.
Raw Data
An example of how the data looks when it comes in raw from the import worksheet.
Area | Month | Centre | Value | Unit
-------|-----------|--------|-------|------
Bris | August | 150 | 10 | %
Sydney | September | 350 | 15 | m2
Import Table
How the data will be imported into the data analysing worksheet. The area, month and cost centre are combined for the VLOOKUPs later.
Label | Value | Unit
-------------------|-------|------
BrisAugust150 | 10 | %
SydneySeptember350 | 15 | m2
Calculation Table
All of the units that can be used in the import, and which calculation they correspond to. m2, m2t, m3, and EACH all use the same calculation (calc 4).
Unit | Calc | Description
-----|------|------------
FLAT | 1 | = Increase_Value
% | 2 | = Month_Value * Increase_Value / 100
000 | 3 | = Standard_Value * Increase_Value / 1000
m2 | 4 | = Standard_Value * Increase_Value
m2t | 4 |
m3 | 4 |
EACH | 4 |
Centre Values
Examples of standard quantities/dimensions that correspond to each of the cost centres.
Centre | Value
-------|-------
50 | 6
100 | 12
150 | 17
200 |
250 | ...
300 |
350 |
400 | etc
Monthly Data Dumps (For each Area)
Raw data is pasted in here from the live database at the beginning of each month to represent the costs associated with each cost centre.
Centre | July | August | September | October
-------|------|--------|-----------|--------
50     | 7    | 16     | ...       | etc
100    | 68   |        |           |
150    |      |        |           |
200    |      |        |           |
250    | ...  |        |           |
300    |      |        |           |
350    |      |        |           |
400    | etc  |        |           |
Example Outputs
A summarised version of how the output will look: each cost centre is set against each month, and if something from the import corresponds to both, the appropriate calculation is performed in that cell.
Brisbane:
Centre | July | August | September | October
-------|------|--------|-----------|--------
50     |      |        |           |
100    |      |        |           |
150    |      | 10%    |           |
200    |      |        |           |
250    |      |        |           |
300    |      |        |           |
350    |      |        |           |
400    |      |        |           |
Sydney:
Centre | July | August | September | October
-------|------|--------|-----------|--------
50     |      |        |           |
100    |      |        |           |
150    |      |        |           |
200    |      |        |           |
250    |      |        |           |
300    |      |        |           |
350    |      |        | 15m2      |
400    |      |        |           |
Formula So Far
A pseudo-code version of the formula that will be featured in each cell so far. I thought it would be easier to decipher with labels instead of cell references, IFNA wrappers taken out, etc.
=CHOOSE(
VLOOKUP( // Determine whether to use calc 1, 2, 3, or 4.
VLOOKUP( // Unit of calculation (i.e. m2, EACH, etc).
Area&Month&Centre,
Import_Value_Table,
3,
FALSE
),
Calculation_Table,
2,
FALSE
),
VLOOKUP( // Calc 1: Flat increase will only look up the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
),
( // Calc 2: % increase.
VLOOKUP( // Lookup the value from the monthly data dump corresponding to the appropriate month & cost centre.
Centre, // Cost centre (for each row).
Monthly_Data_Dump,
Appropriate_Month_Column,
FALSE
) * VLOOKUP( // Lookup the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
) / 100
),
( // Calc 3: 000' increase
VLOOKUP( // Lookup the appropriate value from the cost centre values table.
Centre,
Centre_Values,
2,
FALSE
) * VLOOKUP( // Lookup the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
) / 1000
),
( // Calc 4: Linear increase.
VLOOKUP( // Lookup the appropriate value from the cost centre values table.
Centre,
Centre_Values,
2,
FALSE
) * VLOOKUP( // Lookup the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
)
)
)
Basically, the formula will look up a number from 1-4 and "choose" which formula will be used to determine a cell's value (if at all).
The spreadsheet has roughly 300,000 of these cells to update across all the different areas, and recalculating with the formula as-is takes an hour or more. I'm trying to cut out the bloat and reduce the time the sheet takes to compute.
I've been dabbling with using INDEX/MATCH instead of the VLOOKUPs, as well as trying some of the general optimisation tips that can be found online, but those changes only shave off 5-10 minutes.
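To illustrate the direction I've been experimenting with, here is a rough sketch in the same label style as above (Import_Labels, Import_Values and Import_Units stand for the three columns of the import table, and the helper names are placeholders): the combined key is matched once per output row into helper columns, and the CHOOSE then reuses those values instead of repeating the VLOOKUPs.
// Helper columns, calculated once per output row:
Import_Row = MATCH(Area&Month&Centre, Import_Labels, 0)                          // position of the key in the import table
Increase_Value = INDEX(Import_Values, Import_Row)                                // column 2 of the import table
Calc_Num = VLOOKUP(INDEX(Import_Units, Import_Row), Calculation_Table, 2, FALSE) // calc 1-4
// Each output cell then reduces to a single CHOOSE with no repeated key lookups:
=CHOOSE(Calc_Num,
    Increase_Value,                          // Calc 1: flat increase
    Month_Value * Increase_Value / 100,      // Calc 2: % increase
    Standard_Value * Increase_Value / 1000,  // Calc 3: 000' increase
    Standard_Value * Increase_Value          // Calc 4: linear increase
)
Month_Value and Standard_Value are still the monthly data dump and centre values lookups respectively.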
I'm after a more solid solution and am looking for advice on how to do that.

Looking at this from a data perspective you have 4 sets of information which can be represented as
RAW     | CALC  | DUMP    | CENTRE
--------|-------|---------|------------
Area*   | Unit* | Area*   | Centre*
Month*  | Calc  | Month*  | CentreValue
Centre* |       | Centre* |
Value   |       | Dump    |
Unit    |       |         |
RAW is your Raw Data Table, CALC is your Calculation Table, DUMP is equivalent to your Monthly Data Dumps and CENTRE is your Centre Values table.
We can conceive of these as the tables of a database with the labels in each column above representing the columns of the corresponding table. Columns with an asterisk represent the primary key(s) of the table. So, for example, table RAW has 5 columns and is keyed on the combination of columns Area, Month and Centre.
In a real database, these 4 tables could be joined to form a "view" which looks like
VIEW
--------
Area*
Centre*
Month*
Value
Dump
CentreValue
Calc
An additional column, say Result, can be added to this view and (assuming I have understood your pseudo-formula correctly) assigned as
Value if Calc = 1
Value * Dump / 100 if Calc = 2
Value * CentreValue / 1000 if Calc = 3
Value * CentreValue if Calc = 4
At the risk of not knowing all the subtleties of your data, in your position I would be giving consideration to implementing the above using a database approach.
3 of your inputs (RAW, CALC and CENTRE) already appear to be in the required table format, whilst the fourth (DUMP) is sourced from a database, so you may be able to get it in the required format from its source (if not, you'll just have to bash it into shape - not difficult).
The use of SQL for joining the tables into the required view replaces that complex nested set of VLOOKUPs and is likely to be considerably more efficient and faster. MS Access would be a good solution, but if it's not available to you, you could try using MS Query. The latter is accessed via the Data tab of the ribbon (From Other Sources / From Microsoft Query) and can access tables which are set up as named ranges in an Excel workbook. With MS Query you will need to put the input tables in a different workbook from the results view.
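To make the join concrete, here is a sketch in generic SQL (table and column names follow the layout above; Access/MS Query would express the CASE with Switch() or IIf(), and the exact syntax will vary by driver):
SELECT r.Area, r.Month, r.Centre,
       CASE c.Calc
           WHEN 1 THEN r.Value
           WHEN 2 THEN r.Value * d.Dump / 100
           WHEN 3 THEN r.Value * v.CentreValue / 1000
           WHEN 4 THEN r.Value * v.CentreValue
       END AS Result
FROM RAW AS r
     INNER JOIN CALC   AS c ON r.Unit = c.Unit
     INNER JOIN DUMP   AS d ON r.Area = d.Area AND r.Month = d.Month AND r.Centre = d.Centre
     INNER JOIN CENTRE AS v ON r.Centre = v.Centre;
Each row of the result is one Area/Month/Centre combination with its calculated increase, which a pivot table can then spread across the month columns.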
Both Access and Query employ a visual method for joining tables together, and there is plenty of tutorial material available on the web. Excel can import a "view" from Access (where views are known as Queries), and if using Query, closing the query pop-up window results in a prompt asking where in the workbook the data should be placed.
Once you have your results in a database table format in Excel a pivot table will quickly get it to your required output format.

Related

Cumulative prices based on dates that cover different pricing periods

I am trying to calculate how much someone needs to pay me, where the ticket prices are set for a certain period.
Guest Name | Arrival Date | Departure Date | adults | Child | Total |
J.Bloggs | 14/11/2019 | 18/11/2019 | 5 | 2 | 7 |
Price 01/11/2019 ~ 04/11/2019 = £3.40
Price 05/11/2019 ~ 15/11/2019 = £2.50
Price 16/11/2019 ~ 30/11/2019 = £1.90
I need to work out how to charge J.Bloggs £2.50 for 2 nights (14th & 15th) then £1.90 for 3 nights (16th, 17th and 18th) = £10.70 * 7 people = £74.90.
The blue columns are the raw data, the green columns are formula-based (I don't need the Price column, I was just using it to work out the fees).
This is the formula I have in the Price column at the moment, but I know it won't work when the tickets cover multiple periods.
{=INDEX(Pricing!$C:$C,MATCH(1,(Pricing!$A:$A<=IF(ISTEXT($A2),DATEVALUE($A2),$A2))*(Pricing!$B:$B>=IF(ISTEXT($A2),DATEVALUE($A2),$A2)),0))}
Use SUMPRODUCT with INDEX/MATCH:
=SUMPRODUCT(INDEX($C$8:$C$10,MATCH(ROW(INDEX($ZZ:$ZZ,A2):INDEX($ZZ:$ZZ,B2)),$A$8:$A$10)))*E2
This is an array formula and requires the use of Ctrl+Shift+Enter instead of Enter when exiting edit mode.
ROW(INDEX($ZZ:$ZZ,A2):INDEX($ZZ:$ZZ,B2)) creates an array of the dates.
MATCH(...,$A$8:$A$10) takes that array and finds where each date falls in the lookup table. Note: the lookup table must be sorted ascending.
INDEX($C$8:$C$10,...) takes each of those matches in turn and creates an array of the corresponding prices.
SUMPRODUCT(...) adds all the values together.
Then we simply multiply by the number of guests.
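To tie this to the question's numbers: the date array covers 14/11/2019 through 18/11/2019 (five entries), MATCH returns {2;2;3;3;3} against the three period start dates, INDEX turns that into {2.5;2.5;1.9;1.9;1.9}, SUMPRODUCT adds these to £10.70, and multiplying by the 7 people in E2 gives £74.90.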

Calculating the size of a table in Cassandra

In "Cassandra The Definitive Guide" (2nd edition) by Jeff Carpenter & Eben Hewitt, the following formula is used to calculate the size of a table on disk (apologies for the blurred part):
ck: primary key columns
cs: static columns
cr: regular columns
cc: clustering columns
Nr: number of rows
Nv: it's used for counting the total size of the timestamps (I don't get this part completely, but for now I'll ignore it).
There are two things I don't understand in this equation.
First: why does the size of the clustering columns get counted for every regular column? Shouldn't it simply be multiplied by the number of rows? It seems to me that by calculating it this way, we're saying that the data in each clustering column gets replicated for each regular column, which I suppose is not the case.
Second: why don't the primary key columns get multiplied by the number of partitions? From my understanding, if we have a node with two partitions, then we should multiply the size of the primary key columns by two, because we'll have two different primary keys on that node.
It's because of Cassandra's internal storage structure in versions prior to 3.0.
There is only one entry for each distinct partition key value.
For each distinct partition key value there is only one entry per static column.
There is an empty entry marking each clustering key combination.
For each regular column in a row there is a separate entry, and the entry's name repeats the clustering key values.
Let's take an example:
CREATE TABLE my_table (
pk1 int,
pk2 int,
ck1 int,
ck2 int,
d1 int,
d2 int,
s int static,
PRIMARY KEY ((pk1, pk2), ck1, ck2)
);
Insert some dummy data:
pk1 | pk2 | ck1 | ck2 | s | d1 | d2
-----+-----+-----+------+-------+--------+---------
1 | 10 | 100 | 1000 | 10000 | 100000 | 1000000
1 | 10 | 100 | 1001 | 10000 | 100001 | 1000001
2 | 20 | 200 | 2000 | 20000 | 200000 | 2000001
The internal structure will be:
|100:1000: |100:1000:d1|100:1000:d2|100:1001: |100:1001:d1|100:1001:d2|
-----+-------+-----------+-----------+-----------+-----------+-----------+-----------+
1:10 | 10000 | | 100000 | 1000000 | | 100001 | 1000001 |
|200:2000: |200:2000:d1|200:2000:d2|
-----+-------+-----------+-----------+-----------+
2:20 | 20000 | | 200000 | 2000001 |
So the size of the table will be:
Single Partition Size = (4 + 4 + 4 + 4) + 4 + 2 * ((4 + (4 + 4)) + (4 + (4 + 4))) bytes = 68 bytes
Estimated Table Size = Single Partition Size * Number Of Partitions
= 68 * 2 bytes
= 136 bytes
Here every field is of type int (4 bytes).
There are 4 primary key columns (2 of them clustering key columns), 1 static column and 2 regular columns.
More: http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/
As the author, I greatly appreciate the question and your engagement with the material!
With respect to the original questions - remember that this is not the formula to calculate the size of the table, it is the formula to calculate the size of a single partition. The intent is to use this formula with "worst case" number of rows to identify overly large partitions. You'd need to multiply the result of this equation by the number of partitions to get an estimate of total data size for the table. And of course this does not take replication into account.
Also thanks to those who responded to the original question. Based on your feedback I spent some time looking at the new (3.0) storage format to see whether that might impact the formula. I agree that Aaron Morton's article is a helpful resource (link provided above).
The basic approach of the formula remains sound for the 3.0 storage format. The way the formula works, you're basically adding the following (restated symbolically after the list):
the sizes of the partition key and static columns
the size of the clustering columns per row, times the number of rows
8 bytes of metadata for each cell
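Loosely restated with the variable names from the question (a paraphrase of the bullets above and of the worked example in the earlier answer, not the book's exact equation):
Partition Size ≈ Σ sizeOf(ck) + Σ sizeOf(cs) + Nr * (Σ sizeOf(cc) + Σ sizeOf(cr)) + Nv * 8 bytes
where Nv is the number of cells, each of which carries 8 bytes of timestamp metadata in the original formula.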
Updating the formula for the 3.0 storage format requires revisiting the constants. For example, the original equation assumes 8 bytes of metadata per cell to store a timestamp. The new format treats the timestamp on a cell as optional since it can be applied at the row level. For this reason, there is now a variable amount of metadata per cell, which could be as low as 1-2 bytes, depending on the data type.
After reading this feedback and rereading that section of the chapter, I plan to update the text to add some clarifications as well as stronger caveats about this formula being useful as an approximation rather than an exact value. There are factors it doesn't account for at all such as writes being spread over multiple SSTables, as well as tombstones. We're actually planning another printing this spring (2017) to correct a few errata, so look for those changes soon.
Here is the updated formula from Artem Chebotko:
The t_avg is the average amount of metadata per cell, which can vary depending on the complexity of the data, but 8 is a good worst case estimate.

How to calculate values dynamically from Excel table

I have a programming issue in Excel that I don't know how to solve. I want to create an automatic delivery cost program in Excel that will help me calculate the cost more easily.
The input variables are:
Quantity (values for 1, 2-9, 10-49, 50+)
Shipping method
Depending on the Quantity value and Shipping method, Excel should look up the table below and return the total shipping cost:
Per-shipment fee:

Delivery  | 1     | 2-9  | 10-49 | 50+
----------|-------|------|-------|------
Standard  | 2,99  | 1,89 | 1,5   | 1,1
Expedited | 5,99  | 2    | 1,75  | 1,25
Priority  | 10,99 | 3,39 | 2,25  | 1,35
Let me show you with some examples what I want to get:
Example 1:
- Quantity: 15
- Delivery: Expedited
- Total Cost = 15 * 1,75 = 26,25$
1,75$ is the value returned after looking up the table using the Quantity and Shipping Method variables.
I have tested using =IF statements, but I'm sure that there is an easier way to do it.
I'm not very good at Excel programming, so any help will be appreciated.
Best regards and have a great day!
Assuming that your table has the delivery types in column A in rows 4 through 6 and that the quantities are in row 3 (columns B through E) the following formula should do it for you:
=INDEX(B4:E6,MATCH(B9,A4:A6,0),MATCH(C9,B3:E3,1)) * Quantity
Note that the quantities in row 3 must be numbers. So, the numbers should be 1, 2, 10, and 50 and not 1, 2-9, 10-49, 50+. There are two possibilities to achieve that:
Create a helper row and hide it (while only showing the row with the "names" as you wish).
Change the number format on these cells: for the column containing the 2 the custom number format would be "2-9", for the number 10 it would be "10-49", and for the last column "50+". This way you see what you want to see while the cells still contain numbers only (so the formula above works correctly).
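To check against the example in the question (assuming, as the formula implies, that the shipping method sits in B9 and the quantity in C9): MATCH(B9,A4:A6,0) returns 2 for Expedited, MATCH(C9,B3:E3,1) returns 3 for a quantity of 15 (the 10-49 band), INDEX(B4:E6,2,3) returns 1,75, and multiplying by the quantity of 15 gives 26,25.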

Executing a function over a grouped set

I have two columns of data in Excel, one containing dates and the other values 1 or 2 which represent a specific status (like an enum).
Here is an example data set:
2014/07/04 | 1
2014/07/04 | 1
2014/07/04 | 2
2014/07/04 | 1
2014/07/05 | 2
2014/07/06 | 1
2014/07/06 | 1
2014/07/06 | 2
I need to get a graph of the percentage of 1s per day; in the above example:
July 4th: 75%
July 5th: 100%
July 6th: 66%.
I've tried pivot tables and charts with no luck because I can't write my own function for the values (COUNTIF of the 1s divided by COUNT); I can only use the predefined functions, which aren't of any use here.
Does anyone know how I would go about doing this?
You could do this with a pivot chart.
I made up my own data so it doesn't quite match but it is the same idea.
For the pivot table fields I used:
Legend Fields: Compare
Axis Fields: Date
Values: Count of Compare
For the Values, under Value Field Settings go to "Show Values As" and change this to "% of Column Total", and summarize the value by Count.
If you only want to see the number 1, just put a filter on the column labels.
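If you'd rather stay with plain formulas, the COUNTIF-divided-by-COUNT idea from the question also works once the unique dates are listed somewhere. A sketch, assuming the dates are in A2:A9, the status values in B2:B9, and a unique date in D2:
=COUNTIFS($A$2:$A$9,D2,$B$2:$B$9,1)/COUNTIF($A$2:$A$9,D2)
Formatted as a percentage and filled down, this reproduces the daily figures from the question and can be charted directly.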

Custom Formula for Grand Total column

I have a frequent problem where the formula I want to use in the Values area of my pivot table is different from the formula I want to use for the Grand Total column of that row. I typically want to sum the values, but I want to average the sums. Here is what I normally get if I pivot the dates on the Column Labels, Meat Type on the Row Labels, and Sum of Orders in the Values.
Row Labels | Day 1 | Day 2 | Day 3 | Grand Total
________________________________________________
Beef | 100 | 105 | 102 | 307
Chicken | 200 | 201 | 202 | 603
I get sums by day and a sum of all of the days in the Grand Total column. Here is what I want to have:
Row Labels | Day 1 | Day 2 | Day 3 | Grand Total (Avg of Day Totals)
________________________________________________
Beef | 100 | 105 | 102 | 102.3
Chicken | 200 | 201 | 202 | 201.0
In this case the orders are still summed by day, but the Grand Total is now an average of the sums. What I do now is copy and paste the pivot data onto a separate sheet and then calculate the averages. If there was a way to do this with a custom Grand Total column it would be incredible. This is one of the biggest shortcomings of pivot tables for me, but I'm hoping it is due to my ignorance, which it often is. Thanks for the help!
You can write a measure that checks the number of 'rows' in a particular filter context and nests that check in an IF() to determine which calculation to use.
If using PowerPivot V2 then it's:
=IF(HASONEVALUE(Calendar[Day]), SUM(AMOUNT), AVERAGE(AMOUNT))
If using PowerPivot V1 it's:
=IF(COUNTROWS(Calendar[Day])=1, SUM(AMOUNT), AVERAGE(AMOUNT))
Both do the same thing in that they assess the number of rows visible in the given filter context: in an ordinary data cell (e.g. Beef on Day 1) the temporarily filtered Calendar table holds a single day, so the SUM() is used; if more than one day is in context (as in the Grand Total column), it goes down the AVERAGE() path.
This assumes your column headers 'Days' are in a table called Calendar (if you aren't using a separate Calendar table then you are missing the most powerful functionality of PowerPivot IMO).
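One caveat: AVERAGE(AMOUNT) averages the individual rows, which matches the example only because there is a single order row per meat type per day. If a day can hold several rows, a sketch that averages the daily sums instead (Orders[Amount] is a placeholder name for the amount column) would be:
=IF(HASONEVALUE(Calendar[Day]), SUM(Orders[Amount]), AVERAGEX(VALUES(Calendar[Day]), CALCULATE(SUM(Orders[Amount]))))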
Jacob
I can't think of a "good" way, but here's one option. Add the Amount field to the data area a second time and change the operation to Average. Then use conditional formatting to hide the averages in the data area and hide the sums in the total area.
You might be better off just using some array formulas in a do-it-yourself pivot table. You lose the pivot table benefits, but get more flexibility with the data.
