Calculating the size of a table in Cassandra

In "Cassandra The Definitive Guide" (2nd edition) by Jeff Carpenter & Eben Hewitt, the following formula is used to calculate the size of a table on disk (apologies for the blurred part):
ck: primary key columns
cs: static columns
cr: regular columns
cc: clustering columns
Nr: number of rows
Nv: used for counting the total size of the timestamps (I don't completely get this part, but for now I'll ignore it).
There are two things I don't understand in this equation.
First: why does the size of the clustering columns get counted for every regular column? Shouldn't we multiply it by the number of rows instead? It seems to me that by calculating this way, we're saying that the data in each clustering column gets replicated for each regular column, which I suppose is not the case.
Second: why don't the primary key columns get multiplied by the number of partitions? From my understanding, if we have a node with two partitions, then we should multiply the size of the primary key columns by two, because we'll have two different primary keys on that node.

It's because of the internal storage structure of Cassandra versions before 3.0:
There is only one entry for each distinct partition key value.
For each distinct partition key value there is only one entry per static column.
There is one empty entry for each row's clustering key (the row marker).
For each regular column in a row there is a single entry, and that entry's cell name repeats the clustering key values.
Let's take an example:
CREATE TABLE my_table (
pk1 int,
pk2 int,
ck1 int,
ck2 int,
d1 int,
d2 int,
s int static,
PRIMARY KEY ((pk1, pk2), ck1, ck2)
);
Insert some dummy data:
pk1 | pk2 | ck1 | ck2  | s     | d1     | d2
----+-----+-----+------+-------+--------+--------
  1 |  10 | 100 | 1000 | 10000 | 100000 | 1000000
  1 |  10 | 100 | 1001 | 10000 | 100001 | 1000001
  2 |  20 | 200 | 2000 | 20000 | 200000 | 2000000
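For reference, the dummy data above corresponds to CQL inserts like these (the static column s can be set in the same statement as the regular columns):
INSERT INTO my_table (pk1, pk2, ck1, ck2, s, d1, d2) VALUES (1, 10, 100, 1000, 10000, 100000, 1000000);
INSERT INTO my_table (pk1, pk2, ck1, ck2, s, d1, d2) VALUES (1, 10, 100, 1001, 10000, 100001, 1000001);
INSERT INTO my_table (pk1, pk2, ck1, ck2, s, d1, d2) VALUES (2, 20, 200, 2000, 20000, 200000, 2000000);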
Internal structure will be:
     | s     | 100:1000: | 100:1000:d1 | 100:1000:d2 | 100:1001: | 100:1001:d1 | 100:1001:d2 |
-----+-------+-----------+-------------+-------------+-----------+-------------+-------------+
1:10 | 10000 |           | 100000      | 1000000     |           | 100001      | 1000001     |

     | s     | 200:2000: | 200:2000:d1 | 200:2000:d2 |
-----+-------+-----------+-------------+-------------+
2:20 | 20000 |           | 200000      | 2000000     |
Here the row key 1:10 is the composite partition key (pk1:pk2), the single s cell is the static column, the empty 100:1000: cells are the row markers, and each d1/d2 cell name repeats the clustering key values.
So the size of the table will be:
Single Partition Size = (4 + 4 + 4 + 4) + 4 + 2 * ((4 + (4 + 4)) + (4 + (4 + 4))) bytes = 68 bytes
(the four primary key columns, plus the static column, plus, for each of the 2 rows, a 4-byte value and a repeated 8-byte clustering key for each of the two regular columns)
Estimated Table Size = Single Partition Size * Number of Partitions
                     = 68 * 2 bytes
                     = 136 bytes
Here every field is of type int (4 bytes).
There are 4 primary key columns, 1 static column, 2 clustering key columns and 2 regular columns.
More: http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/

As the author, I greatly appreciate the question and your engagement with the material!
With respect to the original questions - remember that this is not the formula to calculate the size of the table, it is the formula to calculate the size of a single partition. The intent is to use this formula with "worst case" number of rows to identify overly large partitions. You'd need to multiply the result of this equation by the number of partitions to get an estimate of total data size for the table. And of course this does not take replication into account.
Also thanks to those who responded to the original question. Based on your feedback I spent some time looking at the new (3.0) storage format to see whether that might impact the formula. I agree that Aaron Morton's article is a helpful resource (link provided above).
The basic approach of the formula remains sound for the 3.0 storage format. The way the formula works, you're basically adding:
the sizes of the partition key and static columns
the size of the clustering columns and regular columns per row, times the number of rows
8 bytes of metadata for each cell
Updating the formula for the 3.0 storage format requires revisiting the constants. For example, the original equation assumes 8 bytes of metadata per cell to store a timestamp. The new format treats the timestamp on a cell as optional since it can be applied at the row level. For this reason, there is now a variable amount of metadata per cell, which could be as low as 1-2 bytes, depending on the data type.
After reading this feedback and rereading that section of the chapter, I plan to update the text to add some clarifications as well as stronger caveats about this formula being useful as an approximation rather than an exact value. There are factors it doesn't account for at all such as writes being spread over multiple SSTables, as well as tombstones. We're actually planning another printing this spring (2017) to correct a few errata, so look for those changes soon.

Here is the updated formula from Artem Chebotko:
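In the notation used in the question it reads (a reconstruction, since the answer shows it as an image):

S_t = \sum_i \text{sizeOf}(ck_i) + \sum_j \text{sizeOf}(cs_j) + N_r \times \left( \sum_k \text{sizeOf}(cr_k) + \sum_l \text{sizeOf}(cc_l) \right) + N_v \times \text{sizeOf}(t_{avg})

where the number of values (cells) is

N_v = N_r \times (N_c - N_{pk} - N_s) + N_s

with N_c the total number of columns, N_{pk} the number of primary key columns, and N_s the number of static columns.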
The t_avg is the average amount of metadata per cell, which can vary depending on the complexity of the data, but 8 bytes is a good worst-case estimate.

Related

Pivot Table considers duplicate elements for the average calculation

I have a pivot table (with Data Model) based on the table below:
Id | Time
---+-----
 1 |   10
 1 |   10
 1 |   10
 2 |    2
 3 |    5
 3 |    5
 4 |    4
 5 |    8
I need to calculate the average of time.
Average based on the pivot table calculation is 6.75 --> (10*3+2+5*2+4+8)/8
However, my expected result is 5.8 --> (10+2+5+4+8)/5
How can I set up the pivot table so that it does not take the duplicates into consideration?
Please note that I can't remove duplicate rows.
I tried to use AVERAGEX with DAX: =AVERAGEX(VALUES('Range'[Id]);'Range'[Time])
But I'm facing this issue: This formula is invalid or incomplete: 'Calculation error in measure 'Range'[475e7fe7-92b4-478c-bd5f-6e7c95df27d7]: A single value for column 'Time' in table 'Range' cannot be determined. This can happen when a measure formula refers to a column that contains many values without specifying an aggregation such as min, max, count, or sum to get a single result.'.
Thank you in advance!
Solution:
=AVERAGEX(VALUES('Range'[Id]);CALCULATE(AVERAGE('Range'[Time])))
VALUES('Range'[Id]) iterates over the distinct Ids, and wrapping the inner AVERAGE in CALCULATE triggers context transition, so each Id's rows collapse to a single Time value (10, 2, 5, 4 and 8), which AVERAGEX then averages to 5.8.

Check if a Cell Value is between Two Values using Vlookup

In Excel, I have a table as follows, which shows pricing based on volume.
If you buy up to 4 items per month, the unit price is $100, 5 to 8 is $90, 9 to 20 is $80, anything above 20 is $50.
 A |  B   | C
---+------+-----
 1 |    4 | 100
 5 |    8 |  90
 9 |   20 |  80
21 | 1000 |  50
I have my monthly purchase volumes in another column, say D:
D
--
3
6
2
4
3
10
7
7
10
2
I need to find the unit prices (column C values) based on this series falling between the values of columns A and B. I know I can use a compound IF statement like =IF(AND(D$1>=A1,B1>=D$1),C1,0), but since my pricing table is actually much larger than this example, that approach becomes convoluted. How can I do this with a VLOOKUP in an elegant way?
I'd go with the following in E1:
=INDEX(C$1:C$4,MATCH(D1,A$1:A$4))
which at worst should be just as fast as VLOOKUP, but at best is much faster. With its third argument omitted, MATCH defaults to an approximate match: it returns the position of the largest lower bound in A$1:A$4 that is less than or equal to D1, which INDEX then maps to the price in column C.
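Since the question asks about VLOOKUP specifically: an approximate-match VLOOKUP against the lower-bound column does the same job, assuming (as here) that column A is sorted ascending:
=VLOOKUP(D1,A$1:C$4,3,TRUE)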
This can be done by dragging the following formula down to cover the full column D:
=LOOKUP(2,1/($A$2:$A$5<=D2)/($B$2:$B$5>=D2),$C$2:$C$5)
This will take each D value, compare it with A and B, locate the bucket it falls into, and pull the C value; if no bucket matches, it returns #N/A. The 1/(...)/(...)  part builds an array that is 1 where both conditions hold and a #DIV/0! error elsewhere; since LOOKUP never finds the value 2, it falls back to the last numeric entry, i.e. the matching bucket. (Note that this answer assumes the pricing table occupies rows 2 to 5, under a header row.)
Here is an approach using SUMIFS; since exactly one price band can match the volume in D1, the "sum" is simply the matching price:
=SUMIFS($C$1:$C$4,$A$1:$A$4,"<="&D1,$B$1:$B$4,">="&D1)

How do I sum up a score based on 2 values?

I have this Excel table:
Status | Priority
-------+---------
Yes    | High
No     | Medium
N/A    | Medium
Yes    | Low
A bit  | Bonus
       |
       |
Each priority has a point value. The point values can change to anything and aren't in any particular order. Note that rows can also be blank; assume that if the priority is blank, then the status is also blank.
High = 3 points
Medium = 2 Points
Low = 1 Point
Bonus = 1 Point
Statuses can be blank or any value. However, the following statuses have conditions:
Yes = Full points (e.g. Yes with High priority gives 3 points; Yes with Bonus gives 1 point).
A bit = Half points (e.g. A bit with High priority gives 1.5 points; A bit with Medium gives 1 point). Essentially halving the point value.
I want each status to count the corresponding point value, so for the table above it should add up to 4.5 points:
3 points for row 2 (Yes, High)
1 point for row 5 (Yes, Low)
0.5 points for row 6 (A bit, Bonus)
I was wondering how I can do this?
I was going to do the following, but it only has one condition.
=COUNTIF(A2:A5, "Yes")
Using Tables and Named Ranges with structured references gives you a great deal of flexibility.
I first set up two tables
priorityTbl
statusTbl
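Based on the point values in the question, the two tables would look something like this (the Points and Multiplier column names are illustrative):

priorityTbl:
Priority | Points
---------+-------
High     | 3
Medium   | 2
Low      | 1
Bonus    | 1

statusTbl:
Status | Multiplier
-------+-----------
Yes    | 1
A bit  | 0.5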
For the input data, I named the two ranges Status and Priority.
The total is then given by the formula below: the first IFERROR/INDEX/MATCH pair looks up a multiplier for each Status (defaulting to 0 for blank or unlisted entries), the second looks up the points for each Priority, and SUMPRODUCT multiplies the two arrays element by element and sums the products:
=SUMPRODUCT(IFERROR(INDEX(statusTbl,MATCH(Status,statusTbl[Status],0),2),0),
IFERROR(INDEX(priorityTbl,MATCH(Priority,priorityTbl[Priority],0),2),0))
If you want to change the values you assign to the different Priority/Status items, you merely change them in the table.
You could also add new rows to the tables, if that is appropriate.
Note that I did not bother adding to the tables rows where the value might be zero, but you could if you wanted to.
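If the point values are truly fixed, a table-free alternative is to hardcode the mappings from the question directly into one SUMPRODUCT (shown here for data in rows 2 to 8; adjust the ranges to suit):
=SUMPRODUCT(((A2:A8="Yes")+0.5*(A2:A8="A bit"))*(3*(B2:B8="High")+2*(B2:B8="Medium")+(B2:B8="Low")+(B2:B8="Bonus")))
Each row resolves to a status multiplier (1, 0.5 or 0) times its priority's points, so the sample data gives 3 + 1 + 0.5 = 4.5. The tables approach above remains preferable when the values may change.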

How to sort a column based on exact matches with another column

I have an inventory table that looks like this (subset):
part number | price | quantity
10115 | 14.95 | 10
1050 | 5.95 | 12
1074 | 7.49 | 8
110-1353 | 13.99 | 22
and I also have another table in Sheet 2 that looks like this (subset):
part number | quantity
10023 | 1
110-1353 | 3
10115 | 2
20112 | 1
I want to basically subtract the quantities in the second table from those in the first table. What is the best way of doing this? I have looked into VLOOKUP and INDEX MATCH, but they are not quite right for this. Would this perhaps actually be better in, say, an Access DB?
I have added another two columns after the last column of Sheet 1. Let us assume that the second table's range is Sheet2!A1:B5.
Formulas:
Column D:
=IFNA(VLOOKUP(A2,Sheet2!$A$2:$B$5,2,FALSE),0)
Column E:
=C2-D2
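If you'd prefer a single helper column, the lookup and the subtraction can be combined into one formula (same assumed ranges as above):
=C2-IFNA(VLOOKUP(A2,Sheet2!$A$2:$B$5,2,FALSE),0)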
If you wanted to tackle this using MS Access, the SQL code might look like this:
select
    t1.[part number],
    t1.price,
    t1.quantity - nz(t2.quantity, 0) as qty
from
    inventory t1
    left join table2 t2 on t1.[part number] = t2.[part number]
Here, I assume that you have a table called inventory and a table called table2 (change these to suit your database).
A left join is used to ensure that all records from inventory are returned, regardless of whether a match is found in table2, and the Nz function is used to return 0 for records for which there is no part number match in table2.

Optimising & Summarising Large Formulas

I'm working on a spreadsheet which will forecast the changes to certain costs in our building business based on estimated inputs.
For example, we may speculate that the price for a carpenter to complete a fitout will increase by $8 per m2 in Brisbane in August. We would write this data as:
Area = Brisbane
Month = August
Cost Centre = Carpenter Fitout = 150
We split each of the costs for building into different cost centres, represented numerically.
Increase = $8
Unit = m2
Based on this data, we can speculate how much each cost will increase in the coming months, and this is what I'm trying to do automatically.
The following are representations of the tables that I'm using in the spreadsheets.
Raw Data
An example of how the data looks raw from the import worksheet.
Area | Month | Centre | Value | Unit
-------|-----------|--------|-------|------
Bris | August | 150 | 10 | %
Sydney | September | 350 | 15 | m2
Import Table
How the data will be imported into the data analysing worksheet. The area, month and cost centre are combined for the VLOOKUPs later.
Label | Value | Unit
-------------------|-------|------
BrisAugust150 | 10 | %
SydneySeptember350 | 15 | m2
Calculation Table
All of the units that can be used in the import, and which calculation they correspond to. m2, m2t, m3, and EACH all use the same calculation (calc 4).
Unit | Calc | Description
-----|------|------------
FLAT | 1 | = Increase_Value
% | 2 | = Month_Value * Increase_Value / 100
000 | 3 | = Standard_Value * Increase_Value / 1000
m2 | 4 | = Standard_Value * Increase_Value
m2t | 4 |
m3 | 4 |
EACH | 4 |
Centre Values
Examples of standard quantities/dimensions that correspond to each of the cost centres.
Centre | Value
-------|-------
50 | 6
100 | 12
150 | 17
200 |
250 | ...
300 |
350 |
400 | etc
Monthly Data Dumps (For each Area)
Raw data is pasted into here from the live database at the beginning of each month to represent the costs associated with them.
July August September October
Centre
50 7 16 ... etc
100 68
150
200
250 ...
300
350
400 etc
Example Outputs
A summarised version of how the output will look, where each of the cost centres are against each of the months, and if there is something from the import that corresponds to both of these the appropriate calculation will be done.
Brisbane:
July August September October
Centre
50
100
150 10%
200
250
300
350
400
Sydney:
July August September October
Centre
50
100
150
200
250
300
350 15m2
400
Formula So Far
A pseudo-code version of the formula that will be featured in each cell so far. I thought it would be easier to decipher with labels instead of cell references, IFNA formulas taken out, etc.
=CHOOSE(
VLOOKUP( // Determine whether to use calc 1, 2, 3, or 4.
VLOOKUP( // Unit of calculation (i.e. m2, EACH, etc).
Area&Month&Centre,
Import_Value_Table,
3,
FALSE
),
Calculation_Table,
2,
FALSE
),
VLOOKUP( // Calc 1: Flat increase will only look up the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
),
( // Calc 2: % increase.
VLOOKUP( // Lookup the value from the monthly data dump corresponding to the appropriate month & cost centre.
Centre, // Cost centre (for each row).
Monthly_Data_Dump,
Appropriate_Month_Column,
FALSE
) * VLOOKUP( // Lookup the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
) / 100
),
( // Calc 3: 000' increase
VLOOKUP( // Lookup the appropriate value from the cost centre values table.
Centre,
Centre_Values,
2,
FALSE
) * VLOOKUP( // Lookup the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
) / 1000
),
( // Calc 4: Linear increase.
VLOOKUP( // Lookup the appropriate value from the cost centre values table.
Centre,
Centre_Values,
2,
FALSE
) * VLOOKUP( // Lookup the increase value.
Area&Month&Centre,
Import_Value_Table,
2,
FALSE
)
)
)
Basically, the formula will lookup a number from 1-4 and "choose" which formula will be used to determine a cell's value (if at all).
The spreadsheet has approximately 300,000 cells to update across all the different areas, and running the formula as-is takes an hour or more. I'm trying to remove all the bloat and improve the time the sheet takes to compute.
I've been dabbling with using INDEX MATCH instead of the VLOOKUPs, as well as trying some of the general optimisation tips that can be found online, but the results only take off 5-10 minutes.
I'm after a more solid solution and am looking for advice on how to do that.
Looking at this from a data perspective, you have 4 sets of information which can be represented as:
RAW     | CALC  | DUMP    | CENTRE
--------+-------+---------+------------
Area*   | Unit* | Area*   | Centre*
Month*  | Calc  | Month*  | CentreValue
Centre* |       | Centre* |
Value   |       | Dump    |
Unit    |       |         |
RAW is your Raw Data Table, CALC is your Calculation Table, DUMP is equivalent to your Monthly Data Dumps and CENTRE is your Centre Values table.
We can conceive of these as the tables of a database with the labels in each column above representing the columns of the corresponding table. Columns with an asterisk represent the primary key(s) of the table. So, for example, table RAW has 5 columns and is keyed on the combination of columns Area, Month and Centre.
In a real database, these 4 tables could be joined to form a "view" which looks like
VIEW
--------
Area*
Centre*
Month*
Value
Dump
CentreValue
Calc
An additional column, say Result, can be added to this view and (assuming I have understood your pseudo-formula correctly) assigned as follows (see the SQL sketch after this list):
Value if Calc = 1
Value*Dump/100 if Calc = 2
Value*CentreValue/1000 if Calc = 3
Value*CentreValue if Calc = 4
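A sketch of the SQL for such a view, assuming the four tables and their columns are named exactly as in the layout above (Access itself would need Switch()/IIf() in place of CASE and its usual extra parentheses around the joins):

select
    r.Area, r.Month, r.Centre,
    case c.Calc
        when 1 then r.Value                          -- FLAT increase
        when 2 then r.Value * d.Dump / 100           -- % of the monthly dump value
        when 3 then r.Value * ct.CentreValue / 1000  -- 000: per-thousand of the centre value
        when 4 then r.Value * ct.CentreValue         -- linear (m2, m2t, m3, EACH)
    end as Result
from RAW r
    join CALC c on c.Unit = r.Unit
    join DUMP d on d.Area = r.Area and d.Month = r.Month and d.Centre = r.Centre
    join CENTRE ct on ct.Centre = r.Centre

A pivot over the Area, Centre and Month columns of this result then gives the per-area output grids shown earlier.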
At the risk of not knowing all the subtleties of your data, in your position I would be giving consideration to implementing the above using a database approach.
3 of your inputs (RAW, CALC and CENTRE) already appear to be in the required table format, whilst the fourth (DUMP) is sourced from a database, so you may be able to get it in the required format from its source (if not, you'll just have to bash it into shape - not difficult).
The use of SQL for joining the tables into the required view replaces that complex nested set of VLOOKUPs and is likely to be considerably more efficient and faster. MS Access would be a good solution, but if it is not available to you, you could try using MS Query. The latter is accessed via the Data tab of the ribbon (From Other Sources/From Microsoft Query) and can access tables which are set up as named ranges in an Excel workbook. With MS Query you will need to put the input tables in a different workbook from the results view.
Both Access and Query employ a visual method for joining tables together and there will be plenty of tutorial material available on the web. Excel can import a "view" from Access (where views are known as Queries) and if using Query, closing the query pop-up window results in a prompt about whereabouts in the workbook the data should be placed.
Once you have your results in a database table format in Excel a pivot table will quickly get it to your required output format.
