How do I calculate the size of a Cassandra Partition?

CREATE TABLE IF NOT EXISTS video (key int, value int, PRIMARY KEY (key, value));
Here Partition Key is key and Clustering Key is value. No regular columns.
Assume, there are 1000000 rows in this partition.
What is the size of the partition?

To calculate the partition size, you need the following data points:
size of the partition key columns
size of static columns
size of cells in the partition (clustering + regular columns)
size of metadata overhead per row
In your case:
the size of the partition key (a single int column) is 4 bytes
the size of static columns (there are none) is 0 bytes
the size of the cells (clustering int + 0 regular columns) is 4 + 0 bytes
the size of the metadata overhead is 8 bytes per row on average
So for 1M rows:
partition size = 4B + 0B + (4B x 1,000,000 cells) + (8B x 1,000,000 rows)
= 12,000,004 bytes
= 11.44 MB
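As a quick sanity check, here is a small Python sketch of the same arithmetic (my own throwaway snippet, not anything Cassandra ships; the real on-disk size will differ once compression and SSTable overhead are factored in):
# 4-byte int partition key, one 4-byte int clustering cell per row, ~8 bytes of row overhead
rows = 1_000_000
size = 4 + 0 + rows * 4 + rows * 8
print(size, "bytes =", round(size / 1024 / 1024, 2), "MB")   # 12000004 bytes = 11.44 MB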
Cheers!

Insert the desired number of records into your Cassandra table.
Wait for a flush to persist the records to disk, or invoke nodetool flush manually on your cluster node(s).
Navigate to the data directory. By default, data_file_directories persists data to /var/lib/cassandra/data. Switch to the directory named in the <your_table_name-timeuuid> format.
List the <sstable_version-Data.db> file to view its size. Note that this is just the size on a single node; if you have more than one node in your cluster, you'd have to repeat these steps to calculate the size on each node.
Alternatively, you could also run nodetool tablestats command on each node to understand statistics about a particular table.
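If you want to script those steps, a rough Python sketch could look like this (the keyspace, table name, and data directory are placeholders; adjust them to your own data_file_directories setting):
# hypothetical helper: flush, then sum the Data.db sizes for one table on this node
import glob, os, subprocess

keyspace, table = "my_keyspace", "video"   # placeholders
subprocess.run(["nodetool", "flush", keyspace, table], check=True)

pattern = f"/var/lib/cassandra/data/{keyspace}/{table}-*/*-Data.db"
total = sum(os.path.getsize(p) for p in glob.glob(pattern))
print(total, "bytes on this node")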

Related

Spark condition on partition column from another table (Performance)

I have a huge parquet table partitioned on the registration_ts column - named stored.
I'd like to filter this table based on data obtained from a small table - stream.
In sql world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In Dataframe world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but the performance is suffering, because the partition pruning is not applied. Spark full-scans stored table, which is too expensive.
For example, this runs for 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the second example the partition filter is propagated to the joined stream table.
I'd like to achieve partition filtering on a dynamic set of partitions.
The only solution I was able to come up with:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
This collects the partitions to the driver and issues a second query. It works fine only for a small number of partitions; when I tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...
Here's one way you can do it in PySpark; I've verified in Zeppelin that it uses the set of values to prune the partitions.
# collect_set returns a distinct list of values and collect() returns a list of rows;
# [0][0] takes the first column of the first row, i.e. the list of distinct values
from pyspark.sql.functions import col, collect_set

filter_list = (spark.read.orc(HDFS_PATH)
               .agg(collect_set(COLUMN_WITH_FILTER_VALUES))
               .collect()[0][0])

# use filter_list with isin() to prune the partitions
df = (spark.read.orc(HDFS_PATH)
      .filter(col(PARTITION_COLUMN).isin(filter_list)))
df.show(5)
# you may want to do some checks on your filter_list value to ensure that your first
# spark.read actually returned a valid list of values before doing the second
# spark.read and pruning your partitions
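For example, a minimal guard along those lines might be (just a sketch; the names are the same placeholders as above):
# hypothetical guard: only apply the partition filter if we actually got values back
if filter_list:
    df = spark.read.orc(HDFS_PATH).filter(col(PARTITION_COLUMN).isin(filter_list))
else:
    raise ValueError("filter_list is empty - check the first read before pruning")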

Performance of query with only partition key

Is the performance impacted if I provide only the partition key while querying a table containing both partition key and clustering key?
For example, for a table with partition key p1 and clustering key c1, would
SELECT * FROM table1 where p1 = 'abc';
be less efficient than
SELECT * FROM table1 where p1 = 'abc' and c1 >= 'some range start value' and c1 <= 'some range end value';
My goal is to fetch all rows with p1 = 'abc'.
The main cost of going to a particular row, versus just a particular partition, is the extra work of deserializing the clustering key index at the beginning of the partition. It's a bit old and Thrift-based, but the gist of the following still holds true:
http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
(note: the row-level bloom filter was removed)
When reading from the beginning of a partition you can save a little work, which will improve latency.
I wouldn't worry too much about it as long as your queries are not spanning multiple partitions. You will generally only have issues if partitions get to be hundreds of MB or GBs in size.

Difference (if there is one) between spark.sql.shuffle.partitions and df.repartition

I'm having a bit of difficulty reconciling the difference (if one exists) between sqlContext.sql("set spark.sql.shuffle.partitions=n") and re-partitioning a Spark DataFrame utilizing df.repartition(n).
The Spark documentation indicates that set spark.sql.shuffle.partitions=n configures the number of partitions used when shuffling data, while df.repartition seems to return a new DataFrame partitioned by the specified expressions (and, optionally, a number of partitions).
To make this question clearer, here is a toy example of how I believe df.repartition and spark.sql.shuffle.partitions work:
Let's say we have a DataFrame, like so:
ID | Val
--------
A | 1
A | 2
A | 5
A | 7
B | 9
B | 3
C | 2
Scenario 1: 3 Shuffle Partitions, Repartition DF by ID:
If I were to set sqlContext.sql("set spark.sql.shuffle.partitions=3") and then did df.repartition($"ID"), I would expect my data to be repartitioned into 3 partitions, with one partition holding all the rows with ID "A" (4 vals), another holding all the rows with ID "B" (2 vals), and the final partition holding the row with ID "C" (1 val).
Scenario 2: 5 shuffle partitions, Repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
Is my understanding off base here? In general, my questions are:
I am trying to optimize the partitioning of a DataFrame so as to avoid skew, while having each partition hold as much of the same key information as possible. How do I achieve that with set spark.sql.shuffle.partitions and df.repartition?
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Thanks!
I would expect my data to be repartitioned into 3 partitions, with one partition holding all the rows with ID "A", another holding all the rows with ID "B", and the final partition holding the row with ID "C".
No
5 shuffle partitions, Repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
and no.
This is not how partitioning works. Partitioners map values to partitions, but in general the mapping is not one-to-one: different keys can land in the same partition (you can check How does HashPartitioner work? for a detailed explanation).
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Indeed there is. If you call df.repartition without providing a number of partitions, spark.sql.shuffle.partitions is used.
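A quick way to see both points in PySpark (a small sketch, assuming a running SparkSession named spark; the exact key-to-partition mapping depends on the hash):
spark.conf.set("spark.sql.shuffle.partitions", "5")
df = spark.createDataFrame(
    [("A", 1), ("A", 2), ("A", 5), ("A", 7), ("B", 9), ("B", 3), ("C", 2)],
    ["ID", "Val"])
print(df.repartition("ID").rdd.getNumPartitions())     # 5 - taken from spark.sql.shuffle.partitions
print(df.repartition(3, "ID").rdd.getNumPartitions())  # 3 - an explicit count wins
# with only 3 distinct IDs and 5 partitions, some partitions end up empty,
# and two IDs can also hash into the same partition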

Cassandra UUID partition key and partition size

Given a table
CREATE TABLE sensors_by_id (
id uuid,
time timeuuid,
some_text text,
PRIMARY KEY (id, time)
)
Will this scale when there are a lot of entries? I'm not sure if a UUID field is sufficient as a good partition key, or if there is a need to create some artificial key like week_first_day or something similar.
It really depends on how you insert your data - if you generate the UUID randomly for every insert, then the chance of duplicates is very low and you'll get so-called "skinny rows" (a lot of partitions with one row inside). Even if you start to get duplicates, there won't be many rows per partition...
Partition size could be a problem, because Cassandra has a practical limit on the disk size of a single partition.
A good rule of thumb is to keep the maximum number of rows below 100,000 items and the disk size under 100 MB.
It is easy to calculate the partition size by using this formula.
You can read more about data modeling here.
So in your case, with the current schema, 1,000,000 rows per partition, and an average size of 100 bytes for the some_text column, it will be:
Number of Values: (1000000 * (3 - 2 - 0) + 0) = 1000000
Partition Size on Disk: (16 + 0 + (1000000 * 116) + (8 * 1000000))
= 124,000,016 bytes (118.26 MB)
So as you can see, at 118.26 MB you are over the limit for a single partition, and you need to optimize your partition key.
I calculated it using my open source project - cql-calculator.
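If you want to reproduce that number without the calculator, here is a rough Python sketch of the textbook formula (assuming a 16-byte uuid partition key, a 16-byte timeuuid clustering column, ~100 bytes for some_text, and 8 bytes of overhead per value; actual on-disk size also depends on compression and the SSTable format):
rows = 1_000_000
n_values = rows * (3 - 2 - 0) + 0                 # Nv = Nr * (Nc - Npk - Ns) + Ns
size = 16 + 0 + rows * (16 + 100) + 8 * n_values  # key + statics + cells + overhead
print(n_values, size, round(size / 1024 / 1024, 2))   # 1000000 124000016 118.26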

Spark dataframe partition count

I am confused about how Spark creates partitions for a DataFrame. Here is the list of steps and the partition counts:
i_df = sqlContext.read.json("json files") // num partitions returned is 4, total records 7000
p_df = sqlContext.read.format("csv").Other options // num partitions returned is 4 , total records: 120k
j_df = i_df.join(p_df, i_df.productId == p_df.product_id) // total records 7000, but num of partitions is 200
The first two DataFrames have 4 partitions, but as soon as I join them the result shows 200 partitions. I was expecting 4 partitions after the join, so why is it showing 200?
I am running it on local with
conf.setIfMissing("spark.master", "local[4]")
200 is the default number of shuffle partitions. You can change it by setting spark.sql.shuffle.partitions.
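For example (a small sketch using the same DataFrames as above; note that newer Spark versions with adaptive query execution enabled may coalesce the shuffle partitions further):
# hypothetical tweak: make the join produce 4 shuffle partitions instead of 200
sqlContext.sql("set spark.sql.shuffle.partitions=4")
j_df = i_df.join(p_df, i_df.productId == p_df.product_id)
print(j_df.rdd.getNumPartitions())   # 4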
