I am trying to estimate the amount of space required for each column in a Cassandra wide row, but the numbers that I get are wildly conflicting.
I have a pretty standard wide row table to store some time series data:
CREATE TABLE raw_data (
id uuid,
time timestamp,
data list<float>,
PRIMARY KEY (id, time)
);
In my case, I store 20 floats in the data list.
Datastax provides some formulas for estimating user data size.
regular_total_column_size = column_name_size + column_value_size + 15
row_size = key_size + 23
primary_key_index = number_of_rows * ( 32 + average_key_size )
For this table, we get the following values:
regular_total_column_size = 8 + 80 + 15 = 103 bytes
row_size = 16 + 23 = 39 bytes
primary_key_index = 276 * ( 32 + 16 ) = 13248 bytes
I'm mostly interested in how the row grows, so the 103 bytes per column is of interest. I counted all the samples in my database and ended up with 29,241,289 unique samples. Multiplying it out I get an estimated raw_data table size of 3GB.
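Laying that arithmetic out explicitly (a quick Python sketch; 1 GB is taken as 10^9 bytes to match the figures above):

# Per-column estimate from the DataStax formula quoted above.
column_name_size = 8           # per the question (the timestamp part of the cell name)
column_value_size = 20 * 4     # 20 four-byte floats
per_column = column_name_size + column_value_size + 15   # 103 bytes

samples = 29_241_289
print(samples * per_column / 1e9)   # ~3.0 GB of estimated user data

# Observed on disk after compaction: 4 GB compressed.
print(4e9 / samples)                # ~137 bytes per sample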
In reality, I have 4GB of compressed data as measured by nodetool cfstats right after compaction. It reports a compression ratio of 0.117. It averages out to 137 bytes per sample, on disk, after compression. That seems very high, considering:
only 88 bytes of that is user data
It's 34 bytes more per sample than the 103-byte estimate
This is after deflate compression.
So, my question is: how do I accurately forecast how much disk space Cassandra wide rows consume, and how can I minimize the total disk space?
I'm running a single node with no replication for these tests.
This may be due to compaction strategies. With size-tiered compaction, the SSTables can build up to double the required space during compaction. With leveled compaction, around 10% extra space is needed. Whichever compaction strategy is in use, you need to take that additional disk space into account.
I'm looking at a hash table implementation and it says that the table will grow by doubling in size when it's 75% full, which gives it an average fill rate of:
(75 + 75 / 2) / 2 = 56%.
How did the author arrive at this formula? If the table tripled in size, would the 2's become 3's?
(75 + 75 / 2) / 2 = 56%.
It's basically saying that when it comes time to resize, the table will be 75% full (the first 75), but the prior resize happened when the table was half as big, so at that point the number of elements was half of what triggers this resize, hence the 75 / 2. The trailing / 2 outside the parentheses takes the average of these two load factors.
If the table tripled in size, then we'd have:
(75 + 75 / 3) / 2 = 50%.
That reflects the load factor after a resize being only 25% now, but we still have a trailing / 2 to get an average over that initial 25% load factor and the 75% load factor at which it will resize again.
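To see the same pattern for an arbitrary growth factor, here is a small sketch (the function name and parameters are just illustrative):

def avg_fill(resize_at=0.75, growth=2):
    """Average load factor between resizes, assuming the table grows by
    `growth`x each time it reaches `resize_at` full and fills roughly
    linearly in between."""
    just_after_resize = resize_at / growth   # e.g. 75% / 2 = 37.5%
    just_before_resize = resize_at           # 75%
    return (just_after_resize + just_before_resize) / 2

print(avg_fill(growth=2))   # 0.5625 -> ~56%
print(avg_fill(growth=3))   # 0.5    -> 50%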
I am reading a dataframe from a JDBC source using partitioning as described here, i.e. with numPartitions, partitionColumn, upperBound, and lowerBound. I've been using this quite often, but this time I noticed something weird. With numPartitions = 32 and 124 distinct partition column values, this split the data into 30 small chunks and 2 large ones:
Task 1 - partitions 1 .. 17 (17 values!)
Task 2 - partitions 18 .. 20 (3 values)
Task 3 - partitions 21 .. 23 (3 values)
Task 4 - partitions 24 .. 26 (3 values)
...
Task 30 - partitions 102 .. 104 (3 values)
Task 31 - partitions 105 .. 107 (3 values)
Task 32 - partitions 108 .. 124 (17 values!)
I'm just wondering whether this actually worked as expected, and what I can do to make it split into even chunks, apart from maybe experimenting with different values of numPartitions (note that the number of values can vary and I'm not always able to predict it).
I looked through the source code of JDBCRelation.scala and found that this is exactly how it's implemented. It first calculates the stride as (upperBound - lowerBound) / numPartitions (integer division), which in my case is 124 / 32 = 3, and the remaining values are then allocated evenly to the first and last partitions.
I was a bit unlucky with the number of values, because with 4 more I'd have 128 / 32 = 4 and it would align nicely into 32 partitions of 4 values each.
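The resulting split can be reproduced with some quick arithmetic (a Python sketch of the effect described above, not the actual JDBCRelation code):

# Fixed-stride split: the middle partitions each get `stride` values,
# and whatever is left over ends up in the first and last partitions.
values = 124
num_partitions = 32

stride = values // num_partitions            # 3
middle = (num_partitions - 2) * stride       # 30 partitions * 3 values = 90
leftover = values - middle                   # 34
first = last = leftover // 2                 # 17 values each

sizes = [first] + [stride] * (num_partitions - 2) + [last]
print(sizes[0], sizes[1], "...", sizes[-1], "| total:", sum(sizes))
# 17 3 ... 17 | total: 124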
I ended up pre-querying the table for the exact range and then providing the predicates manually:
// requires import spark.implicits._ for .as[(Int, Int)]
val partRangeSql = "SELECT min(x), max(x) FROM table"
val (partMin, partMax) =
  spark.read.jdbc(jdbcUrl, s"($partRangeSql) _", props).as[(Int, Int)].head
// one predicate (and hence one partition) per distinct value of x
val predicates = (partMin to partMax).map(p => s"x = $p").toArray
spark.read.jdbc(jdbcUrl, "table", predicates, props)
That makes 124 partitions (one per value), so I need to be careful not to overload the database server (I'm limiting the number of executors so that no more than 32 sessions run concurrently).
I guess adjusting lowerBound/upperBound so that upperBound - lowerBound is a multiple of numPartitions would also do the trick.
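For example, something along these lines (a rough sketch; the variable names are just illustrative):

# Round the range up so that (upper - lower) is a multiple of num_partitions,
# which keeps the fixed-stride split even.
lower, upper, num_partitions = 1, 124, 32

span = upper - lower
padded_upper = lower + -(-span // num_partitions) * num_partitions  # ceil to a multiple
print(padded_upper)   # 129 -> upper - lower = 128 = 32 * 4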
I have a data set that I'm trying to process in PySpark. The data (on disk as Parquet) contains user IDs, session IDs, and metadata related to each session. I'm adding a number of columns to my dataframe that are the result of aggregating over a window. The issue I'm running into is that all but 4-6 executors will complete quickly and the rest run forever without completing. My code looks like this:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

empty_col_a_cond = ((f.col("col_A").isNull()) |
                    (f.col("col_A") == ""))

session_window = Window.partitionBy("user_id", "session_id") \
    .orderBy(f.col("step_id").asc())

output_df = (
    input_df
    .withColumn("col_A_val", f
                .when(empty_col_a_cond, f.lit("NA"))
                .otherwise(f.col("col_A")))
    # ... 10 more added columns replacing nulls/empty strings
    .repartition("user_id", "session_id")
    .withColumn("s_user_id", f.first("user_id", True).over(session_window))
    .withColumn("s_col_B", f.collect_list("col_B").over(session_window))
    .withColumn("s_col_C", f.min("col_C").over(session_window))
    .withColumn("s_col_D", f.max("col_D").over(session_window))
    # ... 16 more added columns aggregating over session_window
    .where(f.col("session_flag") == 1)
    .where(f.array_contains(f.col("s_col_B"), "some_val"))
)
In my logs, I see this over and over:
INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
INFO UnsafeExternalSorter: Thread 92 spilling sort data of 9.2 GB to disk (2 times so far)
INFO UnsafeExternalSorter: Thread 91 spilling sort data of 19.3 GB to disk (0 time so far)
Which suggests that Spark can't hold all the windowed data in memory. I tried increasing the internal settings spark.sql.windowExec.buffer.in.memory.threshold and spark.sql.windowExec.buffer.spill.threshold, which helped a little but there are still executors not completing.
I believe this is all caused by some skew in the data. Grouping by both user_id and session_id, there are 5 entries with a count >= 10,000, 100 records with a count between 1,000 and 10,000, and 150,000 entries with a count less than 1,000 (usually count = 1).
input_df \
    .groupBy(f.col("user_id"), f.col("session_id")) \
    .count() \
    .filter("count < 1000") \
    .count()
# >= 10k, 6
# < 10k and >= 1k, 108
# < 1k, 150k
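For reference, the three buckets can be computed in a single pass (a sketch; the cut-offs are the ones quoted above):

group_sizes = (
    input_df
    .groupBy("user_id", "session_id")
    .count()
    .withColumn(
        "bucket",
        f.when(f.col("count") >= 10000, ">= 10k")
         .when(f.col("count") >= 1000, "1k - 10k")
         .otherwise("< 1k"))
)
# One row per bucket with the number of (user_id, session_id) groups in it.
group_sizes.groupBy("bucket").count().show()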
(The resulting job DAG was attached as a screenshot, not reproduced here.)
I'm trying to analyze the parseable output from atop, but the man page is not clear to me: I count more fields than the manual page explains.
"The first part of each output-line consists of the following six fields: label (the name of the label), host (the name of this machine), epoch (the time of this interval as number of seconds since 1-1-1970), date (date of this interval in format YYYY/MM/DD), time (time of this interval in format HH:MM:SS), and interval (number of seconds elapsed for this interval).
The subsequent fields of each output-line depend on the label:
PRM:
Subsequent fields: PID, name (between brackets), state, page size for this machine (in bytes), virtual memory size (Kbytes), resident memory size (Kbytes), shared text memory size (Kbytes), virtual memory growth (Kbytes), resident memory growth (Kbytes), number of minor page faults, and number of major page faults."
https://linux.die.net/man/1/atop
So,
Standard fields + PRM fields
6 + 11 = 17
But I count 24 fields total
atop -r FILE -p PRM
sample output:
PRM hernan-Virtual-Machine 1591135517 2020/06/02 19:05:17 834 660 (cron) S 4096 38424 3288 44 38424 3288 247 1 3216 336 132 0 660 y 385
How should I read the output?
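A quick way to double-check the field count from the sample line (a naive whitespace split, which only works here because the process name "(cron)" contains no spaces):

sample = ("PRM hernan-Virtual-Machine 1591135517 2020/06/02 19:05:17 834 660 "
          "(cron) S 4096 38424 3288 44 38424 3288 247 1 3216 336 132 0 660 y 385")
print(len(sample.split()))  # 24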
In "Cassandra The Definitive Guide" (2nd edition) by Jeff Carpenter & Eben Hewitt, the following formula is used to calculate the size of a table on disk (apologies for the blurred part):
ck: primary key columns
cs: static columns
cr: regular columns
cc: clustering columns
Nr: number of rows
Nv: it's used for counting the total size of the timestamps (I don't get this part completely, but for now I'll ignore it).
There are two things I don't understand in this equation.
First: why does the size of the clustering columns get counted once per regular column? Shouldn't it be multiplied by the number of rows instead? It seems to me that calculating it this way says that the data in each clustering column gets replicated for every regular column, which I assumed was not the case.
Second: why don't the primary key columns get multiplied by the number of partitions? From my understanding, if we have a node with two partitions, we should multiply the size of the primary key columns by two, because that node will hold two different primary keys.
It's because of Cassandra's internal storage structure in versions prior to 3.0:
There is only one entry for each distinct partition key value.
For each distinct partition key value, there is only one entry per static column.
Each row has one empty entry (a row marker) for its clustering key.
Each regular column value in a row gets its own cell, whose name repeats the clustering key values.
Let's take an example:
CREATE TABLE my_table (
pk1 int,
pk2 int,
ck1 int,
ck2 int,
d1 int,
d2 int,
s int static,
PRIMARY KEY ((pk1, pk2), ck1, ck2)
);
Insert some dummy data:
pk1 | pk2 | ck1 | ck2 | s | d1 | d2
-----+-----+-----+------+-------+--------+---------
1 | 10 | 100 | 1000 | 10000 | 100000 | 1000000
1 | 10 | 100 | 1001 | 10000 | 100001 | 1000001
2 | 20 | 200 | 2000 | 20000 | 200000 | 2000000
Internal structure will be:
     | s (static) | 100:1000: | 100:1000:d1 | 100:1000:d2 | 100:1001: | 100:1001:d1 | 100:1001:d2 |
-----+------------+-----------+-------------+-------------+-----------+-------------+-------------+
1:10 | 10000      |           | 100000      | 1000000     |           | 100001      | 1000001     |

     | s (static) | 200:2000: | 200:2000:d1 | 200:2000:d2 |
-----+------------+-----------+-------------+-------------+
2:20 | 20000      |           | 200000      | 2000000     |
So the size of the table will be:
Single Partition Size = (4 + 4 + 4 + 4) + 4 + 2 * ((4 + (4 + 4)) + (4 + (4 + 4))) bytes = 68 bytes
Estimated Table Size = Single Partition Size * Number of Partitions
= 68 * 2 bytes
= 136 bytes
Here every column is of type int (4 bytes).
There are 4 primary key columns (2 partition key + 2 clustering key), 1 static column, and 2 regular columns.
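A quick check of that arithmetic (a Python sketch; the column sizes and counts come from the example above):

# Every column in the example is a 4-byte int.
int_size = 4

primary_key = 4 * int_size      # pk1, pk2, ck1, ck2 (counted once per partition)
static = 1 * int_size           # s (stored once per partition)
clustering = 2 * int_size       # ck1, ck2 (repeated in every cell name)
regular_columns = 2             # d1, d2
rows = 2                        # rows in partition (1, 10)

per_row = regular_columns * (int_size + clustering)      # 2 * (4 + 8) = 24
partition_size = primary_key + static + rows * per_row   # 16 + 4 + 48 = 68
print(partition_size)                                    # 68

# The answer then treats 68 bytes as the size of each of the 2 partitions.
print(partition_size * 2)                                # 136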
More: http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/
As the author, I greatly appreciate the question and your engagement with the material!
With respect to the original questions - remember that this is not the formula to calculate the size of the table, it is the formula to calculate the size of a single partition. The intent is to use this formula with "worst case" number of rows to identify overly large partitions. You'd need to multiply the result of this equation by the number of partitions to get an estimate of total data size for the table. And of course this does not take replication into account.
Also thanks to those who responded to the original question. Based on your feedback I spent some time looking at the new (3.0) storage format to see whether that might impact the formula. I agree that Aaron Morton's article is a helpful resource (link provided above).
The basic approach of the formula remains sound for the 3.0 storage format. The way the formula works, you're basically adding:
the sizes of the partition key and static columns
the size of the clustering columns per row, times the number of rows
8 bytes of metadata for each cell
Updating the formula for the 3.0 storage format requires revisiting the constants. For example, the original equation assumes 8 bytes of metadata per cell to store a timestamp. The new format treats the timestamp on a cell as optional since it can be applied at the row level. For this reason, there is now a variable amount of metadata per cell, which could be as low as 1-2 bytes, depending on the data type.
After reading this feedback and rereading that section of the chapter, I plan to update the text to add some clarifications as well as stronger caveats about this formula being useful as an approximation rather than an exact value. There are factors it doesn't account for at all such as writes being spread over multiple SSTables, as well as tombstones. We're actually planning another printing this spring (2017) to correct a few errata, so look for those changes soon.
Here is the updated formula from Artem Chebotko:
The t_avg is the average amount of metadata per cell, which can vary depending on the complexity of the data, but 8 is a good worst case estimate.