Cassandra read performance degrade as we increase data on nodes - cassandra

DB used: Datastax cassandra community 3.0.9
Cluster: 3 x (8core 15GB AWS c4.2xlarge) with 300GB io1 with 3000iops.
Write consistency: Quorum , read consistency: ONE Replication
factor: 3
Problem:
I loaded our servers with 50,000 users and each user had 1000 records initially and after sometime, 20 more records were added to each users. I wanted to fetch the 20 additional records that were added later(Query : select * from table where userID='xyz' and timestamp > 123) here user_id and timestamp are part of primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20GB of dummy data, the performance for same query i.e. fetch 20 additional records for 50,000 users dropped significantly. Read performance is getting degraded with increase in data. As far as I have read, this should not have happened as keys get cached and additional data should not matter.
what could be possible cause for this? CPU and RAM utilisation is negligible and I cant find out what is causing the query time to increase.
I have tried changing compaction strategy to "LeveledCompaction" but that didn't work either.
EDIT 1
EDIT 2
Heap size is 8GB. The 20GB data is added in a way similar to the way in which the initial 4GB data was added (the 50k userIDs) and this was done to simulate real world scenario. "userID" and "timestamp" for the 20GB data is different and is generated randomly. Scenario is that I have 50k userIDs with 1020 rows where 1000 rows were added first and then additional 20 rows were added after some timestamp, I am fetching these 20 messages. It works fine if only 50k userIDs are present but once I have more userIDs (additional 20GB) and I try to fetch those same 20 messages (for initial 50k userIDs), the performance degrades.
EDIT 3
cassandra.yaml

Read performance is getting degraded with increase in data.
This should only happen when your add a lot of records in the same partition.
From what I can understand your table may looks like:
CREATE TABLE tbl (
userID text,
timestamp timestamp,
....
PRIMARY KEY (userID, timestamp)
);
This model is good enough when the volume of the data in a single partition is "bound" (eg you have at most 10k rows in a single partition). The reason is that the coordinator gets a lot of pressure when dealing with "unbound" queries (that's why very large partitions are a big no-no).
That "rule" can be easily overlooked and the net result is an overall slowdown, and this could be simply explained as this: C* needs to read more and more data (and it will all be read from one node only) to satisfy your query, keeping busy the coordinator, and slowing down the entire cluster. Data grow usually means slow query response, and after a certain threshold the infamous read timeout error.
That being told, it would be interesting to see if your DISK usage is "normal" or something is wrong. Give it a shot with dstat -lrvn to monitor your servers.
A final tip: depending on how many fields you are querying with SELECT * and on the amount of retrieved data, being served by an SSD may be not a big deal because you won't exploit the IOPS of your SSDs. In such cases, preferring an ordinary HDD could lower the costs of the solution, and you wouldn't incur into any penalty.

Related

Timeout on CQL COUNT() restricted by partition key

I have a table in cassandra DB that is populated. It provably has around 10000 records. When I try to execute select count(*), my query times out. Surprisingly, it times out even when i restrict the query with the partition key. The table has a column that is filled with a lot of text. I can't understand how that would be a problem, but i thought, i'd mention it. Any suggestions?
Doing a COUNT() of the rows in a partition shouldn't timeout unless it contains thousands and thousands of rows. More importantly, the query is most likely timing out when the partition contains thousands of tombstones.
You haven't provided a lot of information in your question and ideally you should have included:
the table schema
a sample query
In any case if you are storing queue-like datasets and deleting rows after they've been processed (because it's a queue) then you are generating lots of tombstones within the partition. Once you've reached the maximum tombstone_failure_threshold (default is 100K tombstones) then Cassandra will stop reading any more rows.
Unfortunately, it's hard to say what's happening in your case without the necessary details. Cheers!
A SELECT COUNT(*) needs to scan the entire database and can potentially take an extremely long time - much longer than the typical timeout. Ideally, such a query would be paged - periodically returning empty pages until the final page contains the count - to avoid timing out. But currently in Cassandra - and also in Scylla - this isn't done.
As Erick noted in his reply, everything becomes worse if you also have a lot of tombstones: You said you only have 10,000 rows, but it's easy to imagine a use case where the data changes frequently, and you actually have for each row 100 deleted rows - so Cassandra needs to scan through 1 million rows (most of them already dead), not 10,000.
Another issue to consider is that when your cluster is very large, scanning usually contact nodes sequentially, and each node many times (depending on the number of vnodes), so the scan time on a very large cluster will be large even if there are just a few actual rows in the database. By the way, unlike a regular scan, an aggregation like COUNT(*) can actually be done internally in parallel. Scylla recently implemented this and it speeds up counts (and other aggregation), but if I understand correctly, this feature is not in Cassandra.
Finally, you said that "Surprisingly, it times out even when i restrict the query with the partition key.". The question is how you restricted the query with a partition key. If you restricted the partition key itself to a range, it will still be slow because Cassandra still needs to scan all the partitions and compare their keys to the range. What you should have done is to restrict the token of the partition key, e.g., something like
where token(p) >= -9223372036854775808 and token(p) < ....

Cassandra : optimal partition size

I plan to have a simple table like this (simple key/value use case) :
CREATE TABLE my_data (
id bigint,
value blob,
PRIMARY KEY (id)
)
With the following caracteristics :
as you can see, one partition = one blob (value)
each value is always accessed by the corresponding key
each value is a blob of 1MB max (average also 1 MB)
with 1MB blob, it give 60 millions partitions
What do you think about the 1MB blob ? Is that OK for Cassandra ?
Indeed, I can divide my data further, to work with 1ko blob, but in that case, it will lead to many more partitions on Cassandra (more than 600 millions ?), and many more partitions to retreive the data for a same client side query..
Thanks
The general recommendation is to stay as close to 100MB partition sizes although this isn't a hard limit. There are some edge cases were partitions can get beyond 1GB and still be acceptable for some workloads as long as you're willing to accept the tradeoffs.
However in your case, 1MB blobs is a strong recommendation but again not a hard limit. You will notice a significant performance hit for larger blob sizes if you were to do a reasonable load test.
600 million partitions is not a problem at all. Cassandra is designed to handle billions, trillions of partitions and beyond. Cheers!

Co-coordinator pressure using IN query on single partition key with 9000 records 4MB size per partiton size

I have 1000 partitions per table and cust_id is partition key and bucket_id and timestamp are the cluster keys.
Every hour one bucket_id and timestamp entry are recorded per cust_id.
Each day 24 * 1 = 24 rows will be recorded per partiton.
One year approx 9000 records per partion.
Partion size is 4MB approx.
---> 20 nodes Cassandra cluster single DC and RF=3
I want to select random five buckets for last 90 days data using IN query.
select cust_id,bucket_id,timestamp from customer_data where
cust_id='tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe' and
bucket_id IN (0,2,5,7,8)
and timestamp >='2020-03-01 00:00:00' and
timestamp <='2020-06-01 00:00:00';
Please confirm, does this approach cause any issues with coordinator pressure and query timeouts?
How much data can a coordinator bear and return data without any issue?
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
If I run same kind of query for 10 Mil customers, does this affect coordinator pressure? Does it increase the chances to get a read timeout error?
It's could be hard to get definitive yes/no answer to these questions - there are some unknowns in them. For example, what version of Cassandra, how much memory is allocated for instance, what disks are used for data, what compaction strategy is used for a table, what consistency level do you use for reading the data, etc.
Overall, on the recent versions of Cassandra and when using SSDs, I won't expect problems with that, until you have hundreds of items in the IN list, especially if you're using consistency level LOCAL_ONE and prepared queries - all drivers use token-aware load balancing policy by default, and will route request to the node that holds the data, so it will be both coordinator & data node. Use of other consistency levels would put more pressure to the coordinating node, but it still should work quite good. The problem with read timeouts could start if you use HDDs, and overall incorrectly size the cluster.
Regarding the 10Mil customers - in your query you're doing select by partition key, so query is usually sent to a replica directly (if you use prepared statements). To avoid problems you shouldn't do IN for partition key column (cust_id in your case) - if you do queries for individual customers, driver will spread queries over the whole cluster & you'll avoid increased pressure on coordinator nodes.
But as usual, you need to test your table schema & cluster setup to prove this. I would recommend to use NoSQLBench - benchmark/load testing tool that was recently open sourced by DataStax - it was built for quick load testing of cluster and checking data models, and incorporates a lot of knowledge in area of performance testing.
Please try to ask one question per question.
Regarding the how much a coordinator node can handle, Alex is correct in that there are several factors which contribute to that.
Size of the result set.
Heap/RAM available on the coordinator node.
Network consistency between nodes.
Storage config (spinning, SSD, NFS, etc).
Coordinator pressure will vary widely based on these parameters. My advice, is to leave all timeout threshold settings at their defaults. They are there to protect your nodes from becoming overwhelmed. Timeouts are Cassandra's way of helping you figure out how much it can handle.
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
Based on your description, the primary key definition should look like this:
PRIMARY KEY ((cust_id),bucket_id,timestamp)
The data will be stored on disk by partition, and sorted by the cluster keys, similar to this (assuming ascending order on bucket_id and descending order on timestamp:
cust_id bucket_id timestamp
'tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe' 0 2020-03-02 04:00:00
2020-03-01 22:00:00
1 2020-03-27 16:00:00
2 2020-04-22 05:00:00
2020-04-01 17:00:00
2020-03-05 22:00:00
3 2020-04-27 19:00:00
4 2020-03-27 17:00:00
5 2020-04-12 08:00:00
2020-04-01 12:00:00
Cassandra reads through the SSTable files in that order. It's important to remember that Cassandra reads sequentially off disk. When queries force it to perform random reads, that's when things may start to get a little slow. The read path has structures like partition offsets and bloom filters which help it figure out which files (and where inside them) have the data. But within a partition, it will need to scan clustering keys and figure out what to skip and what to return.
Depending on how many updates these rows have taken, it's important to remember that the requested data may stretch across multiple files. Reading one file is faster than reading more than one.
At the very least, you're forcing it to stay on one node by specifying the partition key. But you'll have to test how much a coordinator can return before causing problems. In general, I wouldn't specify double digits of items in an IN clause.
In terms of optimizing file access, Jon Haddad (now of Apple) has a great article on this: Apache Cassandra Performance Tuning - Compression with Mixed Workloads It focuses mainly on the table compression settings (namely chunk_length_in_kb) and has some great tips on how to improve data access performance. Specifically, the section "How Data is Read" is of particular interest:
We pull chunks out of SSTables, decompress them, and return them to the client....During the read path, the entire chunk must be read and decompressed. We’re not able to selectively read only the bytes we need. The impact of this is that if we are using 4K chunks, we can get away with only reading 4K off disk. If we use 256KB chunks, we have to read the entire 256K.
The point of this ^ relevant to your question, is that by skipping around (using IN) the coordinator will likely read data that it won't be returning.

Why varying blob size gives different performance?

My cassandra table looks like this -
CREATE TABLE cs_readwrite.cs_rw_test (
part_id bigint,
s_id bigint,
begin_ts bigint,
end_ts bigint,
blob_data blob,
PRIMARY KEY (part_id, s_id, begin_ts, end_ts)
) WITH CLUSTERING ORDER BY (s_id ASC, begin_ts DESC, end_ts DESC)
When I insert 1 million row per client, with 8 kb blob per row and test the speed of insertions from different client hosts the speed is almost constant at ~100 mbps. But with the same table definition, from same client hosts if I insert rows with 16 bytes of blob data then my speed numbers are dramatically low ~4 to 5 mbps. Why is there such a speed difference? I am only measuring write speeds for now. My main concern is not speed (though some inputs will help) when I add more clients I see speed is almost constant for bigger blob size but for 16 bytes blob the speed is increasing only by 10-20% per added client before it becomes constant.
I have also looked at bin/nodetool tablehistograms output and do adjust number of partitions in my test data so no partition is > 100 mb.
Any insights/ links for documentation would be helpful. Thanks!
I think you are measuring the throughput in the wrong way. The throughput should be measured in transactions per second and not in data written per second.
Even though the amount of data written can play a role in determining the write throughput of a system but usually it depends on many other factors.
Compaction Strategy like STCS is write-optimized whereas LOCS is
read-optimized.
Connection speed and latency between the client and the cluster, and
between machines in the cluster
CPU usage of the node which is processing data, sending data to other
replicas and waiting for their acknowledgment.
Most of the writes are immediately written in memory instead of being written directly in the disk which basically makes the impact of the amount of data being written on final write throughput almost negligible whereas other fixed things like Network delay, CPU to coordinate the processing of data across nodes, etc have a bigger impact.
The way you should see it is that with 8KB of payload you are getting X transactions per second and with 16 Bytes you are getting Y transactions per second. Y will always be better than X but it will not be linearly proportional to the size difference.
You can find how writes are handled in cassandra explained in detail here.
Theres management overhead in Cassandra per row/partition, the more data (in bytes) you have in each row the less that overhead impacts throughput in bytes/sec. The reverse is true if you look at rows per sec as a metric of throughput. The larger the payloads the worse your rows/sec throughput would get.

One bigger partition or few smaller but more distributed partitions for Range Queries in Cassandra?

We have a table that stores our data partitioned by files. One file is 200MB to 8GB in json - but theres a lot of overhead obviously. Compacting the raw data will lower this drastically. I ingested about 35 GB of json data and only one node got slightly more than 800 MB data. This is possibly due to "write hotspots" -- but we only write once and read only. We do not update data. Currently, we have one partition per file.
By using secondary indexes, we search for partitions in the database that contain a specific geolocation (= first query) and then take the result of this query to range query a time range of the found partitions (= second query). This might even be the whole file if needed but in 95% of the queries only chunks of a partition are queried.
We have a replication factor of 2 on a 6 node cluster. Data is fairly even distributed, every node owns 31,9% to 35,7% (effective) data according to nodetool status *tablename*.
Good read performance is key for us.
My questions:
How big is too big for a partition in terms of volume or row size? Is there a rule of thumb for this?
For Range Query performance: Is it better to split up our "big" partitions to have more smaller partitions? We built our schema with "big" partitions because we thought that when we do range queries on a partition, it would be good to have it all on one node so data can be fetched easily. Note that the data is also available on one replica due to RF 2.
C* supports very huge rows, but it doesn't mean it is a good idea to go to that level. The right limit depends on specific use cases, but a good ballpark value could be between 10k and 50k. Of course, everything is a compromise, so if you have "huge" (in terms of bytes) rows then heavily limit the numbers of rows in each partition. If you have "small" (in terms of bytes) rows them you can relax that limit a bit. This is because one partition means one node only due to your RF=1, so all your query for a specific partition will hit only one node.
Range queries should ideally go to one partition only. A range query means a sequential scan on your partition on the node getting the query. However, you will limit yourself to the throughput of that node. If you split your range queries between more nodes (that is you change the way you partition your data by adding something like a bucket) you need to get data from different nodes as well performing parallel queries, directly increasing the total throughput. Of course you'd lose the order of your records within different buckets, so if the order in your partition matters, then that could not be feasible.

Resources