Coordinator pressure using an IN query on a single partition key with 9000 records and ~4 MB partition size - cassandra

I have 1000 partitions per table; cust_id is the partition key, and bucket_id and timestamp are the clustering keys.
Every hour, one bucket_id and timestamp entry is recorded per cust_id.
Each day, 24 * 1 = 24 rows are recorded per partition.
Over one year that is approximately 9000 records per partition.
Partition size is approximately 4 MB.
The cluster is 20 Cassandra nodes in a single DC with RF=3.
I want to select five random buckets for the last 90 days of data using an IN query.
select cust_id, bucket_id, timestamp from customer_data
where cust_id = 'tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe'
  and bucket_id IN (0, 2, 5, 7, 8)
  and timestamp >= '2020-03-01 00:00:00'
  and timestamp <= '2020-06-01 00:00:00';
Please confirm: does this approach cause any issues with coordinator pressure or query timeouts?
How much data can a coordinator bear and return data without any issue?
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
If I run the same kind of query for 10 million customers, does this affect coordinator pressure? Does it increase the chances of getting a read timeout error?

It could be hard to give a definitive yes/no answer to these questions - there are some unknowns in them. For example, which version of Cassandra, how much memory is allocated per instance, what disks are used for data, what compaction strategy is used for the table, what consistency level you use for reading the data, etc.
Overall, on recent versions of Cassandra and when using SSDs, I wouldn't expect problems with that until you have hundreds of items in the IN list, especially if you're using consistency level LOCAL_ONE and prepared queries - all drivers use a token-aware load balancing policy by default and will route the request to a node that holds the data, so it will be both coordinator & data node. Use of other consistency levels would put more pressure on the coordinating node, but it should still work quite well. The problems with read timeouts could start if you use HDDs, or generally size the cluster incorrectly.
Regarding the 10 million customers - in your query you're selecting by partition key, so the query is usually sent to a replica directly (if you use prepared statements). To avoid problems you shouldn't use IN for the partition key column (cust_id in your case) - if you issue queries for individual customers instead, the driver will spread the queries over the whole cluster and you'll avoid increased pressure on the coordinator nodes.
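For illustration, here is the difference as plain CQL (the customer IDs are placeholders); the first form funnels several partitions through a single coordinator, while the second lets a token-aware driver route each request to a replica that owns that partition:

-- Avoid: IN on the partition key makes one coordinator fetch and merge
-- several partitions on behalf of the client
select cust_id, bucket_id, timestamp from customer_data
where cust_id IN ('cust-001', 'cust-002', 'cust-003');

-- Prefer: one (prepared) query per customer, so requests are spread
-- across the cluster
select cust_id, bucket_id, timestamp from customer_data
where cust_id = 'cust-001';

select cust_id, bucket_id, timestamp from customer_data
where cust_id = 'cust-002';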
But as usual, you need to test your table schema & cluster setup to prove this. I would recommend using NoSQLBench - a benchmark/load-testing tool that was recently open sourced by DataStax - it was built for quick load testing of a cluster and checking data models, and incorporates a lot of knowledge in the area of performance testing.

Please try to ask one question per question.
Regarding how much a coordinator node can handle, Alex is correct in that there are several factors which contribute to that:
Size of the result set.
Heap/RAM available on the coordinator node.
Network consistency between nodes.
Storage config (spinning, SSD, NFS, etc).
Coordinator pressure will vary widely based on these parameters. My advice is to leave all timeout threshold settings at their defaults. They are there to protect your nodes from becoming overwhelmed. Timeouts are Cassandra's way of helping you figure out how much it can handle.
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
Based on your description, the primary key definition should look like this:
PRIMARY KEY ((cust_id),bucket_id,timestamp)
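Expanded into a full table definition, that would be roughly the following (a sketch; the column types are assumptions based on your description, so adjust to your actual schema):

CREATE TABLE customer_data (
    cust_id   text,
    bucket_id int,
    timestamp timestamp,
    PRIMARY KEY ((cust_id), bucket_id, timestamp)
);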
The data will be stored on disk by partition, and sorted by the clustering keys, similar to this (assuming ascending order on bucket_id and descending order on timestamp):
cust_id                                  bucket_id  timestamp
'tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe'   0          2020-03-02 04:00:00
                                                    2020-03-01 22:00:00
                                         1          2020-03-27 16:00:00
                                         2          2020-04-22 05:00:00
                                                    2020-04-01 17:00:00
                                                    2020-03-05 22:00:00
                                         3          2020-04-27 19:00:00
                                         4          2020-03-27 17:00:00
                                         5          2020-04-12 08:00:00
                                                    2020-04-01 12:00:00
Cassandra reads through the SSTable files in that order. It's important to remember that Cassandra reads sequentially off disk. When queries force it to perform random reads, that's when things may start to get a little slow. The read path has structures like partition offsets and bloom filters which help it figure out which files (and where inside them) have the data. But within a partition, it will need to scan clustering keys and figure out what to skip and what to return.
Depending on how many updates these rows have taken, it's important to remember that the requested data may stretch across multiple files. Reading one file is faster than reading more than one.
At the very least, you're forcing it to stay on one node by specifying the partition key. But you'll have to test how much a coordinator can return before causing problems. In general, I wouldn't specify double digits of items in an IN clause.
In terms of optimizing file access, Jon Haddad (now of Apple) has a great article on this: Apache Cassandra Performance Tuning - Compression with Mixed Workloads. It focuses mainly on the table compression settings (namely chunk_length_in_kb) and has some great tips on how to improve data access performance. Specifically, the section "How Data is Read" is of particular interest:
We pull chunks out of SSTables, decompress them, and return them to the client....During the read path, the entire chunk must be read and decompressed. We’re not able to selectively read only the bytes we need. The impact of this is that if we are using 4K chunks, we can get away with only reading 4K off disk. If we use 256KB chunks, we have to read the entire 256K.
The point of this that is relevant to your question is that by skipping around (using IN), the coordinator will likely read data that it won't be returning.
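If you want to experiment with that, the chunk size is a per-table compression option. A sketch on recent Cassandra versions (the 16 KB value is only an example, not a recommendation for this workload):

-- Smaller chunks mean less data has to be read and decompressed for each
-- random read, at the cost of more compression metadata
ALTER TABLE customer_data
WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};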

Related

Provisioned write capacity in Cassandra

I need to capture time-series sensor data in Cassandra. The best practices for handling time-series data in DynamoDB is as follow:
Create one table per time period, provisioned with write capacity less than 1,000 write capacity units (WCUs).
Before the end of each time period, prebuild the table for the next period.
As soon as a table is no longer being written to, reduce its provisioned write capacity. Also reduce the provisioned read capacity of earlier tables as they age, and archive or delete the ones whose contents will rarely or never be needed.
Now I am wondering how I can implement the same concept in Cassandra! Is there any way to manually configure write/read capacity in Cassandra as well?
This really depends on your own requirements, which you need to discuss with your development team, etc.
There are several ways to handle time-series data in Cassandra:
Have one table for everything. As Chris mentioned, just include the time component in the partition key, like a day, and store data per sensor/day. If the data won't be updated, and you know in advance how long it will be kept, so you can set a TTL on the data, then you can use TimeWindowCompactionStrategy. The advantage of this approach is that you have only one table and don't need to maintain multiple tables - that makes development and maintenance easier.
The same approach as you described - create a separate table per period of time, like a month, and write data into them. In this case you can effectively drop the whole table when the data "expires". Using this approach you can update data if necessary, and you don't need to set a TTL on the data. But it requires more work for the development and ops teams, as you need to deal with multiple tables. Also, take into account that there are some limits on the number of tables in a cluster - it's recommended not to have more than 200 tables, as every table requires memory to keep metadata, etc. Although some things, like the bloom filter, can be tuned to use less memory for tables that are rarely read.
For Cassandra, just make a single table but include some time period in the partition key (so the partitions do not grow indefinitely and get too large). There is no table maintenance, and read/write capacity really depends more on the workload, schema, size of the cluster, etc., but shouldn't need to be worried about except when sizing the cluster.
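A minimal sketch combining both suggestions (table and column names are made up; pick the bucket size and TTL that match your retention): the time period is part of the partition key so partitions stay bounded, data is written with a TTL, and TimeWindowCompactionStrategy lets fully expired SSTables be dropped cheaply:

CREATE TABLE sensor_data (
    sensor_id text,
    day       date,        -- time component in the partition key
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1}
  AND default_time_to_live = 7776000;  -- 90 days, as an example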

Cassandra read performance degrades as we increase data on nodes

DB used: Datastax cassandra community 3.0.9
Cluster: 3 x (8core 15GB AWS c4.2xlarge) with 300GB io1 with 3000iops.
Write consistency: QUORUM, read consistency: ONE, replication factor: 3
Problem:
I loaded our servers with 50,000 users, each user initially had 1000 records, and after some time 20 more records were added to each user. I wanted to fetch the 20 additional records that were added later (query: select * from table where userID='xyz' and timestamp > 123), where userID and timestamp are part of the primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20GB of dummy data, the performance of that same query, i.e. fetching the 20 additional records for the 50,000 users, dropped significantly. Read performance is degrading as data on the nodes increases. As far as I have read, this should not happen, as keys get cached and additional data should not matter.
What could be the possible cause for this? CPU and RAM utilisation is negligible, and I can't find out what is causing the query time to increase.
I have tried changing the compaction strategy to "LeveledCompaction", but that didn't help either.
EDIT 1
EDIT 2
Heap size is 8GB. The 20GB of data was added in a way similar to the initial 4GB of data (the 50k userIDs), to simulate a real-world scenario; the "userID" and "timestamp" values for the 20GB of data are different and generated randomly. The scenario is that I have 50k userIDs with 1020 rows each, where 1000 rows were added first and 20 additional rows were added after some timestamp, and I am fetching these 20 messages. It works fine when only the 50k userIDs are present, but once I have more userIDs (the additional 20GB) and I try to fetch those same 20 messages (for the initial 50k userIDs), the performance degrades.
EDIT 3
cassandra.yaml
Read performance is getting degraded with increase in data.
This should only happen when you add a lot of records to the same partition.
From what I can understand, your table may look like:
CREATE TABLE tbl (
    userID text,
    timestamp timestamp,
    ....
    PRIMARY KEY (userID, timestamp)
);
This model is good enough when the volume of the data in a single partition is "bound" (eg you have at most 10k rows in a single partition). The reason is that the coordinator gets a lot of pressure when dealing with "unbound" queries (that's why very large partitions are a big no-no).
That "rule" can be easily overlooked and the net result is an overall slowdown, and this could be simply explained as this: C* needs to read more and more data (and it will all be read from one node only) to satisfy your query, keeping busy the coordinator, and slowing down the entire cluster. Data grow usually means slow query response, and after a certain threshold the infamous read timeout error.
That being told, it would be interesting to see if your DISK usage is "normal" or something is wrong. Give it a shot with dstat -lrvn to monitor your servers.
A final tip: depending on how many fields you are querying with SELECT * and on the amount of retrieved data, being served by an SSD may be not a big deal because you won't exploit the IOPS of your SSDs. In such cases, preferring an ordinary HDD could lower the costs of the solution, and you wouldn't incur into any penalty.

High CPU Usage in Cassandra 2.0

Running a 4-node cluster on Cassandra version 2.0.9. For the last month we have been seeing a huge spike in CPU usage on all the nodes. tpstats gives me high Native-transport-requests. (Screenshots of tpstats for Node 1, Node 2, and Node 3 were attached.)
From where should I start debugging?
Also, as you can see from the first picture, when the load becomes high the reads and writes become low. This is understandable, as the majority of the requests are dropped.
How to mitigate tombstones? I probably get that question from our dev teams a dozen times per month. The easiest way is to not do DELETEs, and I'm dead serious about that. Otherwise, you can model your tables in such a way as to mitigate tombstones better.
For example, let's say I have a simple table to keep track of order status. As an order can have several different statuses (pending, picking, shipped, received, returned, etc.), a lazy way is to have one row per order, and either DELETE or run an in-place update to change the status (depending on whether or not status is part of your key). A better way is to convert it to a time series and perform deletes via a TTL. The table would look something like this:
CREATE TABLE orderStatus (
    orderid UUID,
    updateTime TIMEUUID,
    status TEXT,
    PRIMARY KEY (orderid, updateTime)
) WITH CLUSTERING ORDER BY (updateTime DESC);
Let's say I know that I really only care about order status for a max of 30 days, so all status upserts have a TTL of 30 days...
INSERT INTO orderStatus (orderid,updateTime,status)
VALUES (UUID(),now(),'pending') USING TTL 2592000;
That table will support queries for order status by orderid, sorted by the update time descending. That way, I can SELECT from that table for an id with a LIMIT 1, and always get the most recent status. Additionally, those statuses will get deleted automatically after 30 days. Now, TTLing data still creates tombstones. But those tombstones are separate from the newer orders (the ones I probably care about more), so I typically don't have to worry about those tombstones interfering in my queries (because they're all grouped in partitions that I won't be querying often).
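The read for the most recent status would then look something like this (the orderid value is just a placeholder):

-- Rows are clustered by updateTime DESC, so the newest status comes first
-- and LIMIT 1 returns only that row
SELECT status, updateTime FROM orderStatus
WHERE orderid = a4a70900-24e1-11df-8924-001ff3591711
LIMIT 1;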
That's one example, but I hope the idea behind modeling for tombstone mitigation is clear. Mainly, the idea is to partition your table in such a way that the tombstones are kept separate from the data that you query most-often.
Is there a way by which we can monitor which queries are running slow on the server?
No, there really isn't a way to do that. But, you should be able to request all queries from your developers for problem keyspaces/tables. And that should be easy, because a table should really only be able to support one or two queries. If your developers built a table that supports 5 or 6 different queries, they're doing it wrong.
When you look at the queries, these are some red flags you should question:
Unbound queries (SELECTs without WHERE clauses).
Queries with ALLOW FILTERING.
Use of secondary indexes.
Use of IN.
Use of BATCH statements (I have seen a batch statement tip-over a node before).

Cassandra TTL VS. Rotating Keyspaces for data Queueing

I am using Cassandra 2.0
My write load is somewhat similar to the queueing antipattern mentioned here: datastax
I am looking at pushing 30 - 40GB of data into cassandra every 24 hours and expiring that data within 24 hours. My current approach is to set a TTL on everything that I insert.
I am experimenting with how I partition my data as seen here: cassandra wide vs skinny rows
I have two column families. The first family contains metadata and the second contains data. There are N metadata entries to 1 data entry, and a metadata entry may be rewritten M times throughout the day to point to a new data entry.
I suspect that the metadata churn is causing problems with reads in that finding the right metadata may require scanning all M items.
I suspect that the data churn is leading to excessive work compacting and garbage collecting.
It seems like creating a keyspace for each day and dropping the old keyspace after 24 hours would remove the need to do compaction entirely.
Aside from having to handle issues with what keyspace the user reads from on requests that overlap keyspaces, are there any other major flaws with this plan?
In my experience, using partitioning is a much better idea than using TTL (a rough sketch follows this list):
It reduces CPU pressure.
It partitions your data in the Oracle manner, so searches are faster.
You can change your mind and keep the old data; with TTL that is difficult (I see one option - migrating the data before deletion).
If your rows are wide, you can make them narrower.
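If you go with the rotation idea from the question, a rough sketch looks like this (all names are hypothetical); the expired period is removed by dropping its table, so no tombstones or deletion-driven compaction are involved:

-- Writers target the current day's table
CREATE TABLE IF NOT EXISTS metadata_20140602 (
    item_id  uuid,
    updated  timeuuid,
    data_ref text,
    PRIMARY KEY (item_id, updated)
) WITH CLUSTERING ORDER BY (updated DESC);

-- Once the previous day has fully expired, it is removed in one cheap operation
DROP TABLE IF EXISTS metadata_20140601;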

Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds. (Queue)

I have a table set up in Cassandra like this:
Primary key columns
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example of how this table is used:
shard  last_used          | value
--------------------------------------------------------------------------
457    5/16/2012 4:56pm   | NBJO3poisdjdsa4djmka8k   >-- Remove from front...
600    6/17/2013 5:58pm   | dndiapas09eidjs9dkakah       |
...    (1 million more rows)                             |
457    NOW                | NBJO3poisdjdsa4djmka8k   <-- ...and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow, on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it, or in the way we have it configured, that we need to change or adjust? How might we troubleshoot this?
More Info
We are using the MurMur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is because Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something, Cassandra writes a tombstone, which is a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting, Cassandra looks at the tombstones and determines which columns are dead and which are still live; the dead ones are thrown away (but then there is also GC grace, which means that in order to avoid spurious resurrections of columns, Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
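For a quick spot check of that, cqlsh's built-in tracing is enough (the table name below is hypothetical, following the schema described in the question):

TRACING ON;
-- The trace output reports how many live rows and how many tombstone cells
-- were read to satisfy the query
SELECT value, last_used FROM queue_table WHERE shard = 457 LIMIT 1;
TRACING OFF;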
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue such as RabbitMQ, or possibly Kafka, would be a much better fit. They are made to handle constant churn and FIFO semantics; Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
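Concretely, that suggestion looks something like this (the table name is hypothetical, and the timestamp is a rough lower bound you track on the client side):

-- Restricting the clustering column keeps the scan past most of the older,
-- already-deleted entries
SELECT value, last_used FROM queue_table
WHERE shard = 457
  AND last_used > '2012-05-16 00:00:00'
LIMIT 1;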
The system appears to be under stress (2GB of RAM may not be enough).
Please run nodetool tpstats and report back with its results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.
