Reduce cassandra tombstones - cassandra

I have a table to store messages which are failed to process and I am retrying to process messages every 5 minutes through scheduler.
When message gets processed successfully, respective row from table is deleted, so that same message should not get processed again.
To fetch rows from table query is SELECT * FROM <table_name> , due to which we are facing tombstone issues if large number of rows gets deleted.
Table have timestamp as partition key and message_name(TEXT) as clustering key, TTL of 7 days and gc_grace_second of 2 days
As per my requirement, I need to delete records otherwise duplicate record will get processed. Is there any solution to avoid tombstone issues?

So I see two problems here.
Cassandra is being used as a queuing mechanism, which is an established anti-pattern.
All partitions are being queried with SELECT * FROM <table_name>, because there isn't a WHERE clause.
So with Cassandra, some data models and use cases will generate tombstones. At that point, there's not a whole lot to be done, except to design the data model so as to not query them.
So my thought here, would be to partition the table differently.
CREATE TABLE messages (
day TEXT,
message_time TIMESTAMP,
message_text TEXT,
PRIMARY KEY ((day),message_time))
WITH CLUSTERING ORDER BY (message_time DESC);
With this model, you can query all messages for a particular day. You can also run a range query on day and message_time. Ex:
SELECT * FROM messages
WHERE day='20210827'
AND message_time > '2021-08-27 04:00';
This will build a result set of all messages since 2021-08-27 04:00. Any tombstones generated outside of the requested time range (in this case, before 04:00) will not be queried.
Note that (based on the delete pattern) you could still have tombstones within the given time range. But the idea here, is that the WHERE clause limits the "blast radius," so querying a smaller number of tombstones shouldn't be a problem.

Unfortunately, there isn't a quick fix to your problem.
The challenge for you is that you're using Cassandra as a queue and it isn't a good idea because you run exactly into that tombstone hell. I'm sure you've seen this blog post by now that talks queues and queue-like datasets being an anti-pattern for Cassandra.
It is possible to avoid generating lots of tombstones if you model your data differently in buckets with each bucket mapping to a table. When you're done processing all the items in the bucket, TRUNCATE the table. This idea came from Ryan Svihla in his blog post Understanding Deletes where he goes through the idea of "partitioning tables". Cheers!

Related

Timeout on CQL COUNT() restricted by partition key

I have a table in cassandra DB that is populated. It provably has around 10000 records. When I try to execute select count(*), my query times out. Surprisingly, it times out even when i restrict the query with the partition key. The table has a column that is filled with a lot of text. I can't understand how that would be a problem, but i thought, i'd mention it. Any suggestions?
Doing a COUNT() of the rows in a partition shouldn't timeout unless it contains thousands and thousands of rows. More importantly, the query is most likely timing out when the partition contains thousands of tombstones.
You haven't provided a lot of information in your question and ideally you should have included:
the table schema
a sample query
In any case if you are storing queue-like datasets and deleting rows after they've been processed (because it's a queue) then you are generating lots of tombstones within the partition. Once you've reached the maximum tombstone_failure_threshold (default is 100K tombstones) then Cassandra will stop reading any more rows.
Unfortunately, it's hard to say what's happening in your case without the necessary details. Cheers!
A SELECT COUNT(*) needs to scan the entire database and can potentially take an extremely long time - much longer than the typical timeout. Ideally, such a query would be paged - periodically returning empty pages until the final page contains the count - to avoid timing out. But currently in Cassandra - and also in Scylla - this isn't done.
As Erick noted in his reply, everything becomes worse if you also have a lot of tombstones: You said you only have 10,000 rows, but it's easy to imagine a use case where the data changes frequently, and you actually have for each row 100 deleted rows - so Cassandra needs to scan through 1 million rows (most of them already dead), not 10,000.
Another issue to consider is that when your cluster is very large, scanning usually contact nodes sequentially, and each node many times (depending on the number of vnodes), so the scan time on a very large cluster will be large even if there are just a few actual rows in the database. By the way, unlike a regular scan, an aggregation like COUNT(*) can actually be done internally in parallel. Scylla recently implemented this and it speeds up counts (and other aggregation), but if I understand correctly, this feature is not in Cassandra.
Finally, you said that "Surprisingly, it times out even when i restrict the query with the partition key.". The question is how you restricted the query with a partition key. If you restricted the partition key itself to a range, it will still be slow because Cassandra still needs to scan all the partitions and compare their keys to the range. What you should have done is to restrict the token of the partition key, e.g., something like
where token(p) >= -9223372036854775808 and token(p) < ....

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doning the extraction we iterate the table 10 times for every username to make sure we try all versions of entry_type.
We can no longer find any entries as we have depleted our list of usernames. But our nodetool tablestats report that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out what usernames remains in the table. If I could inspect it I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table as such:
SELECT * FROM ${keyspace}.entries LIMIT 1
as cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation from the Cassandra table, but the engine will have a delay before actually removing from disk the affected records; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry, for a tl dr, if the default value is still in place, Cassandra will need to pass at least 10 days (864,000 seconds) from the execution of the delete before the actual removal of the data.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table, in the example, it will set it to 1 minute, while the default is
ALTER TABLE .entries with GC_GRACE_SECONDS = 60;
Manually compact the table:
nodetool compact entries
Once that the process is completed, nodetool tablestats should be up to date
To answer your first question, I would like to put more light on gc_grace_seconds property.
In Cassandra, data isn’t deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput, and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A “tombstone” marker is written to indicate that the data is now (logically) deleted (also known as soft delete). Records marked tombstoned must be removed to claim back the storage space. Which is done by a process called Compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog to read more in detail : https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Now possibly you are looking into table size before gc_grace_seconds and data is still there.
Coming to your second issue where you want to fetch some samples from the table without providing partition keys. You can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow the articles / documentation to write a quick handy spark application to analyze Cassandra data.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
I would recommend not to delete records while you do the migration. Rather first complete the migration and post that do a quick validation / verification to ensure all records are migrated successfully (this use can easily do using Spark buy comparing dataframes from old and new tables). Post successful verification truncate the old table as truncate does not create tombstones and hence more efficient. Note that huge no of tombstone is not good for cluster health.

Selecting records in Cassandra based on Time range in frequent intervals

I have a table in Cassandra where i am storing events as they are coming in , different processing are done on the events at different stages. The events are entered into the table with the event occurrence time. I need to get all the events whose event time is less than a certain time and do some processing on them. As its a select range query and its invariably will use scatter gather. Can some one suggest best way to do this. This process is going to happen in every 5 secs and scatter gather happening in Cassandra happening frequently is not a good idea as its an overhead on Cassandra itself which will degrade my overall application Performance.
The table is as below:
PAS_REQ_STAGE (PartitionKey = EndpointID, category ; clusterkey= Automation_flag,alertID)
AlertID
BatchPickTime: Timestamp
Automation_Threshold
ResourceID
ConditionID
category
Automation_time: Timestamp
Automation_flag
FilterValue
Eventtime which i have referred above is the BatchPickTime..
A scheduler wakes up at regular interval and gets all the records whose BatchPickTime is Less than the current scheduler wakeup time and sweeps them off from the table to process them.
Because of this usecase i cannot provide any specific Partition key for the query as it will have to get all data which has expired and is less than the current scheduler wake-up time.
Hi and welcome to Stackoverflow.
Please post your schema and maybe some example code with your question - you can edit it :)
The Cassandra-way of doing this is to denormalize data if necessary and build your schema around your queries. In your case I would suggest putting your events in to a table together with a time bucket:
CREATE TABLE events (event_source int, bucket timestamp,
event_time timestamp, event_text text PRIMARY KEY ((event_source, bucket),event_time));
The reason for this is that it is very efficent in cassandra to select a row by its so called partition key (in this example (event_source, bucket)) as such a query hits only one node. The reminder of the primary key is called clustering columns and defines the order of data, here all events for a day inside the bucket are sorted by event_time.
Try to model your event table in a way that you do not need to make multiple queries. There is a good and free data modeling course from DataStax available: https://academy.datastax.com/resources/ds220-data-modeling
One note - be careful when using cassandra as queue - this is maybe an antipattern and you might be better of with a message queue as ActiveMQ or RabbitMQ or similar.

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
From the application user end - we query a select by account_id statement.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
account_id text,
time timeuuid,
value text,
PRIMARY KEY (account_id, time) WITH CLUSTERING ORDER BY (time DESC)
)
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
The question is HOW better, if at all, is the second data model over the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time all I need to do is a simple
SELECT * FROM metrics WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL rows that where overwritten the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all I encourage you to examine the official documentation about read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the Merge Cells by Timestamp step in the documentation (again check the flow diagram). Notice, that in each SSTable the number of rows will be one in your first case.
In both of your cases the main driving factor is that how many SSTables do you have to check during read. It's somewhat independent from how many records each SSTable contains.
But on the second case you have much bigger SSTabes which leads to longer SSTable compaction. Also TTL expiration performs additional writes. So first case is somewhat preferable.

Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds. (Queue)

Have a table set up in Cassandra that is set up like this:
Primary key columns
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example if how this table is used:
shard last_used | value
------------------------------------
457 5/16/2012 4:56pm NBJO3poisdjdsa4djmka8k >-- Remove from front...
600 6/17/2013 5:58pm dndiapas09eidjs9dkakah |
...(1 million more rows) |
457 NOW NBJO3poisdjdsa4djmka8k <-- ..and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it or the way that we have it configured that we need to change and/or adjust? How might be trouble-shoot this?
More Info
We are using the MurMur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is because Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something Cassandra writes a tombstone, it's a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting Cassandra looks at the tombstones and determines which columns are dead and which are still live, the dead ones are thrown away (but then there is also GC grace, which means that in order to avoid spurious resurrections of columns Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue like RabbitMQ, or possibly Kafka would be a much better solution. They are made to have a constant churn and FIFO semantics, Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
The system appears to be under stress (2GB or RAM may be not enough).
Please have nodetool tpstats run and report back on its results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.

Resources