Cassandra: optimal partition size

I plan to have a simple table like this (simple key/value use case) :
CREATE TABLE my_data (
    id bigint,
    value blob,
    PRIMARY KEY (id)
);
With the following characteristics:
as you can see, one partition = one blob (value)
each value is always accessed by its corresponding key
each value is a blob of 1 MB max (the average is also 1 MB)
with 1 MB blobs, this gives 60 million partitions
What do you think about the 1 MB blobs? Is that OK for Cassandra?
I could divide my data further and work with 1 KB blobs, but that would lead to many more partitions in Cassandra (more than 600 million?), and to many more partitions to retrieve for a single client-side query.

The general recommendation is to keep partition sizes at or below roughly 100 MB, although this isn't a hard limit. There are some edge cases where partitions can get beyond 1 GB and still be acceptable for some workloads, as long as you're willing to accept the tradeoffs.
In your case, keeping blobs at 1 MB or below is a strong recommendation, but again not a hard limit. A reasonable load test will show a significant performance hit for larger blob sizes.
600 million partitions is not a problem at all. Cassandra is designed to handle billions, even trillions, of partitions and beyond. Cheers!
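If you do decide to split values into smaller pieces, one way to keep them addressable is to key each piece by (id, chunk index). Below is a minimal Python sketch of the client-side split and reassembly. The function names and the chunk size are illustrative assumptions, not part of any Cassandra driver, and whether the chunk index ends up in the partition key or as a clustering column is a schema choice the sketch leaves open.

```python
from typing import Iterator, List, Tuple

CHUNK_SIZE = 64 * 1024  # assumed chunk size in bytes; pick the real value from a load test


def split_blob(blob_id: int, value: bytes,
               chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int, bytes]]:
    """Yield one (blob_id, chunk_index, chunk) row per slice of the value."""
    for index, offset in enumerate(range(0, len(value), chunk_size)):
        yield blob_id, index, value[offset:offset + chunk_size]


def join_chunks(rows: List[Tuple[int, int, bytes]]) -> bytes:
    """Rebuild the original value from its rows, whatever order they arrive in."""
    ordered = sorted(rows, key=lambda row: row[1])
    return b"".join(chunk for _, _, chunk in ordered)
```

Each extra chunk is one more row to write and read, so this only pays off if a load test shows the single 1 MB value to be the bottleneck.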


CosmosDb search on the Index vs partition Key

By default in Cosmos DB, all properties in documents are indexed, so why should I bother searching on the partition key when searches on the index work just as well and cost almost nothing?
I have a Cosmos DB database with one million documents like this one, each containing an array; the partition key is "tankId", e.g.:
"id": "67acdb16-80dd-4a6c-a5b0-118d5f5fdb97",
"tankId": "67acdb16-80dd-4a6c-a5b0-118d5f5fdb97"
"UserIds": [
If I run a query on "UserIds" across these million documents, which is not the partition key but is an indexed property, it takes only 3.32 RU! Wow.
WHERE ARRAY_CONTAINS(c.UserIds, "905336a5-bf96-444f-bb11-3eedb65c3760")
Is it good practice to run that kind of query? I am a little worried about my design.
It starts to matter once your number of physical partitions grows. Using the partition key allows Cosmos DB to map the query to a logical partition that resides in a single physical partition. The query is then not a so-called 'cross-partition query', and it does not have to check the indexes of the other physical partitions (which would also consume RUs).
In your case you are talking about a million documents, which likely take up far less than 50 GB (the maximum size of a physical partition), so everything is stored in the same physical partition. You therefore won't see any noticeable effect on RU usage.
So, to answer your underlying question of whether you should make any changes: Is your database mostly read-heavy? Do you have a property that is often used for querying? Are you sure that your partitions will remain under the logical partition size limit (20 GB)? If yes, then you should probably consider it in your design. Even then, it will only matter once your data starts to split across physical partitions.

Why varying blob size gives different performance?

My cassandra table looks like this -
CREATE TABLE cs_readwrite.cs_rw_test (
    part_id bigint,
    s_id bigint,
    begin_ts bigint,
    end_ts bigint,
    blob_data blob,
    PRIMARY KEY (part_id, s_id, begin_ts, end_ts)
) WITH CLUSTERING ORDER BY (s_id ASC, begin_ts DESC, end_ts DESC);
When I insert 1 million rows per client with an 8 KB blob per row and test the insertion speed from different client hosts, the speed is almost constant at ~100 mbps. But with the same table definition and the same client hosts, if I insert rows with 16 bytes of blob data, my speed numbers are dramatically lower, ~4 to 5 mbps. Why is there such a speed difference? I am only measuring write speeds for now. My main concern is not speed itself (though some input would help) but scaling: when I add more clients, the speed stays almost constant for the bigger blob size, whereas for 16-byte blobs it increases by only 10-20% per added client before it levels off.
I have also looked at the bin/nodetool tablehistograms output and adjusted the number of partitions in my test data so that no partition is > 100 MB.
Any insights or links to documentation would be helpful. Thanks!
I think you are measuring throughput the wrong way. It should be measured in transactions per second, not in data written per second.
The amount of data written can play a role in the write throughput of a system, but throughput usually depends on many other factors:
Compaction strategy: STCS is write-optimized, whereas LCS is read-optimized
Connection speed and latency between the client and the cluster, and between machines in the cluster
CPU usage of the node that is processing data, sending it to other replicas, and waiting for their acknowledgment
Most writes go to memory first rather than directly to disk, which makes the impact of the amount of data on the final write throughput almost negligible, whereas fixed costs such as network delay and the CPU needed to coordinate the processing of data across nodes have a bigger impact.
The way to look at it is that with an 8 KB payload you get X transactions per second and with 16 bytes you get Y transactions per second. Y will always be better than X, but not in linear proportion to the size difference.
You can find how writes are handled in cassandra explained in detail here.
There's management overhead in Cassandra per row/partition: the more data (in bytes) you have in each row, the less that overhead impacts throughput in bytes/sec. The reverse is true if you look at rows per second as the throughput metric: the larger the payloads, the worse your rows/sec throughput gets.
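To make the two metrics concrete, here is a rough Python sketch that converts the figures from the question into rows per second. It rests on two assumptions that the question does not confirm: that "mbps" means megabytes per second, and that only the blob bytes are counted (the key columns are ignored).

```python
MB = 1_000_000  # assumption: the reported figures are megabytes per second


def rows_per_second(bytes_per_second: float, payload_bytes: int) -> float:
    """Convert a payload throughput into a row (write) rate."""
    return bytes_per_second / payload_bytes


large = rows_per_second(100 * MB, 8 * 1024)  # ~100 MB/s with 8 KB blobs
small = rows_per_second(4.5 * MB, 16)        # ~4.5 MB/s with 16-byte blobs

print(f"8 KB blobs: {large:,.0f} rows/s")
print(f"16 B blobs: {small:,.0f} rows/s")
```

Under those assumptions the small-blob test writes roughly 280,000 rows per second against roughly 12,000 for the large-blob test, so it is the faster one when measured in operations. If "mbps" actually means megabits per second, both numbers shrink by the same factor of 8 and the comparison is unchanged.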

Cassandra partitioning strategy for systems with skewed traffic

Please bear with me for a slightly longer problem description.
I am a newbie to the Cassandra world and I am trying to migrate my current product from an Oracle-based data layer to Cassandra.
In order to support range queries I have created an entity like below:
create table if not exists my_system.my_system_log_dated(
    id uuid,
    client_request_id text,
    tenant_id text,
    vertical_id text,
    channel text,
    event text,
    event_type text,
    created_date date,
    primary key((created_date, tenant_id, vertical_id, channel, event),
        event_type, client_request_id, id)
) with clustering order by (created_date desc);
Now, I have come across several docs, resources and blogs that mention I should keep my partition size below 100 MB for an optimally performing cluster. With the volume of traffic my system handles per day for certain combinations of the partition key, there is no way I can keep it below 100 MB with the above partition key.
To fix this I introduced a new factor called bucket_id and was thinking of assigning it the hour of the day, to further break partitions into smaller chunks and keep them below 100 MB (even though this means I have to do 24 reads to serve the traffic details for one day, I am fine with some inefficiency in reads). Here is the schema with the bucket id:
create table if not exists my_system.my_system_log_dated(
    id uuid,
    client_request_id text,
    tenant_id text,
    vertical_id text,
    channel text,
    event text,
    bucket_id int,
    event_type text,
    created_date date,
    primary key((created_date, tenant_id, vertical_id, channel, event,
        bucket_id), event_type, client_request_id, id)
) with clustering order by (created_date desc);
Even with this, a couple of partition key combinations go beyond 100 MB, while all other volume sits comfortably within the range.
With this situation in mind, I have the following questions:
Is it an absolute blunder to have a few of your partitions go beyond the 100 MB limit?
With an even smaller bucket, say a 15-minute window, I get all combinations of the partition key under 100 MB, but that creates heavily skewed partitions: high-volume combinations go up to 80 MB while the remaining ones are well under 15 MB. Is this something that will adversely impact the performance of my cluster?
Is there a better way to solve this problem?
Here is some more info that I thought may be useful:
Average row size for this entity is around 200 bytes
I am also applying a future-proofing load factor of 2, i.e. estimating for double the load
Peak load for a specific combination of the partition key is around 2.8 million records in a day
the same combination has a peak traffic hour of about 1.4 million records
and its peak 15-minute window is around 550,000 records
Thanks in advance for your inputs!!
Your approach with the bucket id looks good. Answering your questions:
No, it's not a hard limit, and it might actually be too low given the hardware improvements of the last few years. I have seen partitions of 2 GB and 5 GB (though they can give you a lot of headaches when doing repairs), but those are extreme cases. Don't go near those values. Bottom line: if you don't go WAY above those 100 MB, you will be fine. If you have at least 15 GB of RAM, use G1GC and you're golden.
A uniform distribution on the partition sizes is important to keep the data load balanced throughout the cluster, and it's also good so that you're confident that your queries will be close to an average latency (because they will be reading the approximate same sizes of data), but it's not something that will give performance issues on its own.
The approach looks good, but if that's a time series, which I think it is taking into account what you said, then I recommend that you use TWCS (Time Window Compaction Strategy) in my_system.my_system_log_dated. Check how to configure this compaction strategy, because the time window you set will be very important.
I was able to devise a bucketisation scheme that prevents any risk to cluster health from unexpected traffic spikes. It has been described here
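For reference, the time-bucket idea discussed in this thread can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes the bucket width divides a day evenly (60 minutes gives the 24 reads per day mentioned in the question, 15 minutes gives 96).

```python
from datetime import datetime, timezone
from typing import List

MINUTES_PER_DAY = 24 * 60


def bucket_id(ts: datetime, bucket_minutes: int = 60) -> int:
    """Return the index of the time bucket within the day that ts falls into."""
    minutes_into_day = ts.hour * 60 + ts.minute
    return minutes_into_day // bucket_minutes


def buckets_for_day(bucket_minutes: int = 60) -> List[int]:
    """All bucket ids a reader has to query to cover one full day."""
    if MINUTES_PER_DAY % bucket_minutes != 0:
        raise ValueError("bucket width must divide a day evenly")
    return list(range(MINUTES_PER_DAY // bucket_minutes))


event_time = datetime(2021, 3, 12, 13, 47, tzinfo=timezone.utc)
print(bucket_id(event_time))        # hourly bucket
print(bucket_id(event_time, 15))    # 15-minute bucket
print(len(buckets_for_day(15)))     # reads needed per day with 15-minute buckets
```

The writer computes bucket_id once per event, while a reader that wants a whole day loops over buckets_for_day and issues one query per bucket. Smaller buckets mean smaller partitions but more reads per day.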

Azure Cosmos DB - Understanding Partition Key

I'm setting up our first Azure Cosmos DB. I will be importing the data from a table in one of our SQL Server databases into the first collection. In setting up the collection, I'm having trouble understanding the meaning of, and the requirements around, the partition key, which I have to name while setting up this initial collection.
I've read the documentation here and am still unsure how to proceed with naming this partition key.
Can someone help me understand how I should think about naming this partition key? See the screenshot below for the field I'm trying to fill in.
In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URLs and several other secondary identifiers for that record's URL. I'm not sure whether any of that has a bearing on how I should name my partition key.
EDIT: I've added a screenshot of several records from the table I'm importing from, per request from @Porschiey.
Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.
But, in a nutshell:
The PartitionKey is a property that exists on every single object and is best used to group similar objects together.
Good examples include Location (like City), Customer Id, Team, and more. Naturally, it depends heavily on your solution, so if you were to post what your object looks like, we could recommend a good partition key.
EDIT: It should be noted that a PartitionKey isn't required for collections under 10GB. (thanks David Makogon)
* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.
Partition key acts as a logical partition.
Now, what is a logical partition, you may ask? It depends on your requirements. Suppose your data can be categorized by customer: the customer "Id" then acts as the logical partition, and the data for each user is placed according to its customer Id.
What effect does this have on the query?
When querying, you pass the partition key in the feed options instead of including it in your filter.
e.g. if your query was
SELECT * FROM T WHERE T.CustomerId = 'CustomerId';
it will now be
var options = new FeedOptions { PartitionKey = new PartitionKey(CustomerId) };
var query = _client.CreateDocumentQuery(CollectionUri, "SELECT * FROM T", options).AsDocumentQuery();
I've put together a detailed article here Azure Cosmos DB. Partitioning.
What's logical partition?
Cosmos DB is designed to scale horizontally by distributing data between physical partitions (PP; think of each as a separately deployable, self-sufficient underlying node). A logical partition (LP) is a bucket of documents with the same characteristic (the partition key), which is stored entirely on one PP. So an LP can't have part of its data on PP1 and another part on PP2.
There are two main limitations on physical partitions:
Max throughput: 10k RUs
Max data size (sum of the sizes of all LPs stored in this PP): 50GB
A logical partition has one limit: 20GB in size.
NOTE: Size limits have grown since the initial releases of Cosmos DB, and I won't be surprised if they increase again soon.
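As a rough illustration of how the two physical partition limits interact, the Python sketch below computes a lower bound on the number of PPs for a given provisioned throughput and data size. The function name is made up for this example and the real allocation is decided by the service, so treat the result as an estimate only.

```python
import math

MAX_RU_PER_PP = 10_000  # throughput limit per physical partition, as quoted above
MAX_GB_PER_PP = 50      # storage limit per physical partition, as quoted above


def min_physical_partitions(provisioned_ru: int, data_gb: float) -> int:
    """Lower bound on the number of physical partitions that satisfies both limits."""
    by_throughput = math.ceil(provisioned_ru / MAX_RU_PER_PP)
    by_storage = math.ceil(data_gb / MAX_GB_PER_PP)
    return max(1, by_throughput, by_storage)


print(min_physical_partitions(10_000, 40))  # fits in a single PP
print(min_physical_partitions(10_000, 55))  # storage forces a second PP
print(min_physical_partitions(30_000, 10))  # throughput forces three PPs
```

Whichever limit is hit first drives the partition count, which is why a container can split even when its provisioned throughput never changes.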
How to select the right partition key for my container?
Based on the Microsoft recommendation for maintainable data growth, you should select the partition key with the highest cardinality (like the Id of the document or a composite field), for this main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
It is critical to analyze your application's data consumption pattern when choosing the partition key. In very rare scenarios larger partitions might work, but such solutions should implement data archiving from the get-go to keep the DB size in check (see the example below explaining why). Otherwise you should be ready for increasing operational costs just to maintain the same DB performance, and for potential PP data skew, unexpected "splits" and "hot" partitions.
A very granular, small partitioning strategy leads to some RU overhead when consuming data distributed across a number of physical partitions (PPs): not a multiplication of RUs, but rather a couple of additional RUs per request. That is negligible compared to the issues that occur when data grows beyond 50, 100, 150 GB.
Why large partitions are a terrible choice in most cases, even though the documentation says "select whatever works best for you"
The main reason is that Cosmos DB is designed to scale horizontally, and the provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs because the 50GB size is exceeded, the max throughput for the existing PPs, as well as for the two newly created PPs, will be lower than it was before the split.
So imagine the following scenario (consider days as the measure of time between actions):
[Day 1] You create a container with 10k RUs provisioned and CustomerId as the partition key (which generates one underlying PP1). Maximum throughput per PP is 10k/1 = 10k RUs
[Day 2] Gradually adding data to the container, you end up with 3 big customers: C1[10GB], C2[20GB] and C3[10GB] of invoices
[Day 3] When another customer with C4[15GB] of data is onboarded, Cosmos DB has to split PP1's data into two newly created partitions, PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k/2 = 5k RUs
[Day 4] Two more customers, C5[10GB] and C6[15GB], are added to the system and both end up in PP2, which leads to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs
IMPORTANT: As a result, on [Day 2] C1 data could be queried with up to 10k RUs, but on [Day 4] with at most 3.333k RUs, which directly impacts the execution time of your query.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (12.03.21).
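The arithmetic of the scenario above can be replayed with a short Python sketch. It assumes, as the answer does, that provisioned throughput is divided evenly across physical partitions; the function name is illustrative only.

```python
def throughput_per_partition(provisioned_ru: int, physical_partitions: int) -> float:
    """Provisioned throughput divided evenly across physical partitions."""
    return provisioned_ru / physical_partitions


# (stage of the scenario, number of physical partitions at that stage)
timeline = [
    ("single PP", 1),
    ("after the first split", 2),
    ("after the second split", 3),
]

for stage, count in timeline:
    ru = throughput_per_partition(10_000, count)
    print(f"{stage}: {ru:,.0f} RUs per PP")
```

The container's total stays at 10k RUs throughout. Only the share available to any one partition, and therefore to any one customer's data, shrinks with each split.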
Cosmos DB can store any amount of data. In the back end it does this using the partition key. Is that the same as a primary key? No.
Primary key: uniquely identifies the data
Partition key: helps with sharding the data (for example, one partition for the city New York when city is the partition key).
Partitions have a limit of 10GB, and the better we spread the data across partitions, the more of it we can use. It will eventually need more connections to get data from all partitions, though. For example, a query that gets data from a single partition will always be faster than one that gets data from multiple partitions.
Partition Key is used for sharding, it acts as a logical partition for your data, and provides Cosmos DB with a natural boundary for distributing data across partitions.
You can read more about it here:
Each partition on a table can store up to 10GB (and a single table can store as many document schema types as you like). You have to choose your partition key such that all the documents stored against that key (and so falling into that partition) stay under that 10GB limit.
I'm thinking about this too right now - so should the partition key be a date range of some type? In that case, it would really depend on how much data is getting stored in a period of time.
You are defining a logical partition.
Underneath, physically the data is split into physical partitions by Azure.
Ideally the partitionKey should be a primary key, or a field with high cardinality, to ensure proper distribution. Setting the id field within that partition to the primary key as well makes fetching a document by id much faster.
You cannot change the partitionKey once the container is created.
Looking at the dataset, captureId is a good candidate for the partitionKey, with id set manually to this field rather than to an auto-generated Cosmos one.
There is documentation available from Microsoft about partition keys. In my opinion, you need to look at the queries or operations that you plan to perform with Cosmos DB. Are they read-heavy or write-heavy? If read-heavy, it is ideal to choose a partition key that will be used in the WHERE clause of your queries; if write-heavy, look for a key with high cardinality.
Point reads/writes are always better, since they consume far fewer RUs than running other queries.

Cassandra read performance degrade as we increase data on nodes

DB used: Datastax cassandra community 3.0.9
Cluster: 3 x (8core 15GB AWS c4.2xlarge) with 300GB io1 with 3000iops.
Write consistency: QUORUM, read consistency: ONE
Replication factor: 3
I loaded our servers with 50,000 users. Each user initially had 1000 records, and after some time 20 more records were added to each user. I wanted to fetch the 20 additional records that were added later (query: select * from table where userID='xyz' and timestamp > 123); here userID and timestamp are part of the primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20GB of dummy data, performance for the same query, i.e. fetching the 20 additional records for the 50,000 users, dropped significantly. Read performance degrades as data increases. As far as I have read, this should not happen, since keys get cached and additional data should not matter.
What could be the possible cause of this? CPU and RAM utilisation is negligible and I can't find out what is causing the query time to increase.
I have tried changing the compaction strategy to "LeveledCompaction" but that didn't work either.
Heap size is 8GB. The 20GB of data was added in the same way as the initial 4GB (the 50k userIDs), to simulate a real-world scenario. The "userID" and "timestamp" values for the 20GB of data are different and randomly generated. The scenario is that I have 50k userIDs with 1020 rows each, where 1000 rows were added first and 20 more were added after some timestamp, and I am fetching those 20 messages. It works fine if only the 50k userIDs are present, but once I have more userIDs (the additional 20GB) and try to fetch those same 20 messages (for the initial 50k userIDs), performance degrades.
Read performance is getting degraded with increase in data.
This should only happen when you add a lot of records to the same partition.
From what I can understand, your table may look like:
userID text,
timestamp timestamp,
PRIMARY KEY (userID, timestamp)
This model is good enough when the volume of data in a single partition is "bound" (e.g. you have at most 10k rows in a single partition). The reason is that the coordinator comes under a lot of pressure when dealing with "unbound" queries (that's why very large partitions are a big no-no).
That "rule" is easily overlooked, and the net result is an overall slowdown, which can be explained simply: C* needs to read more and more data (all of it from one node only) to satisfy your query, keeping the coordinator busy and slowing down the entire cluster. Data growth usually means slower query responses and, past a certain threshold, the infamous read timeout error.
That being said, it would be interesting to see whether your disk usage is "normal" or something is wrong. Give dstat -lrvn a shot to monitor your servers.
A final tip: depending on how many fields you are querying with SELECT * and on the amount of data retrieved, being served by an SSD may not be a big deal, because you won't exploit the IOPS of your SSDs. In such cases, an ordinary HDD could lower the cost of the solution without incurring any penalty.
