I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doning the extraction we iterate the table 10 times for every username to make sure we try all versions of entry_type.
We can no longer find any entries as we have depleted our list of usernames. But our nodetool tablestats report that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out what usernames remains in the table. If I could inspect it I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table as such:
SELECT * FROM ${keyspace}.entries LIMIT 1
as cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation from the Cassandra table, but the engine will have a delay before actually removing from disk the affected records; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry, for a tl dr, if the default value is still in place, Cassandra will need to pass at least 10 days (864,000 seconds) from the execution of the delete before the actual removal of the data.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table, in the example, it will set it to 1 minute, while the default is
ALTER TABLE .entries with GC_GRACE_SECONDS = 60;
Manually compact the table:
nodetool compact entries
Once that the process is completed, nodetool tablestats should be up to date
To answer your first question, I would like to put more light on gc_grace_seconds property.
In Cassandra, data isn’t deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput, and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A “tombstone” marker is written to indicate that the data is now (logically) deleted (also known as soft delete). Records marked tombstoned must be removed to claim back the storage space. Which is done by a process called Compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog to read more in detail : https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Now possibly you are looking into table size before gc_grace_seconds and data is still there.
Coming to your second issue where you want to fetch some samples from the table without providing partition keys. You can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow the articles / documentation to write a quick handy spark application to analyze Cassandra data.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
I would recommend not to delete records while you do the migration. Rather first complete the migration and post that do a quick validation / verification to ensure all records are migrated successfully (this use can easily do using Spark buy comparing dataframes from old and new tables). Post successful verification truncate the old table as truncate does not create tombstones and hence more efficient. Note that huge no of tombstone is not good for cluster health.
Related
I have a table in cassandra DB that is populated. It provably has around 10000 records. When I try to execute select count(*), my query times out. Surprisingly, it times out even when i restrict the query with the partition key. The table has a column that is filled with a lot of text. I can't understand how that would be a problem, but i thought, i'd mention it. Any suggestions?
Doing a COUNT() of the rows in a partition shouldn't timeout unless it contains thousands and thousands of rows. More importantly, the query is most likely timing out when the partition contains thousands of tombstones.
You haven't provided a lot of information in your question and ideally you should have included:
the table schema
a sample query
In any case if you are storing queue-like datasets and deleting rows after they've been processed (because it's a queue) then you are generating lots of tombstones within the partition. Once you've reached the maximum tombstone_failure_threshold (default is 100K tombstones) then Cassandra will stop reading any more rows.
Unfortunately, it's hard to say what's happening in your case without the necessary details. Cheers!
A SELECT COUNT(*) needs to scan the entire database and can potentially take an extremely long time - much longer than the typical timeout. Ideally, such a query would be paged - periodically returning empty pages until the final page contains the count - to avoid timing out. But currently in Cassandra - and also in Scylla - this isn't done.
As Erick noted in his reply, everything becomes worse if you also have a lot of tombstones: You said you only have 10,000 rows, but it's easy to imagine a use case where the data changes frequently, and you actually have for each row 100 deleted rows - so Cassandra needs to scan through 1 million rows (most of them already dead), not 10,000.
Another issue to consider is that when your cluster is very large, scanning usually contact nodes sequentially, and each node many times (depending on the number of vnodes), so the scan time on a very large cluster will be large even if there are just a few actual rows in the database. By the way, unlike a regular scan, an aggregation like COUNT(*) can actually be done internally in parallel. Scylla recently implemented this and it speeds up counts (and other aggregation), but if I understand correctly, this feature is not in Cassandra.
Finally, you said that "Surprisingly, it times out even when i restrict the query with the partition key.". The question is how you restricted the query with a partition key. If you restricted the partition key itself to a range, it will still be slow because Cassandra still needs to scan all the partitions and compare their keys to the range. What you should have done is to restrict the token of the partition key, e.g., something like
where token(p) >= -9223372036854775808 and token(p) < ....
I have a table to store messages which are failed to process and I am retrying to process messages every 5 minutes through scheduler.
When message gets processed successfully, respective row from table is deleted, so that same message should not get processed again.
To fetch rows from table query is SELECT * FROM <table_name> , due to which we are facing tombstone issues if large number of rows gets deleted.
Table have timestamp as partition key and message_name(TEXT) as clustering key, TTL of 7 days and gc_grace_second of 2 days
As per my requirement, I need to delete records otherwise duplicate record will get processed. Is there any solution to avoid tombstone issues?
So I see two problems here.
Cassandra is being used as a queuing mechanism, which is an established anti-pattern.
All partitions are being queried with SELECT * FROM <table_name>, because there isn't a WHERE clause.
So with Cassandra, some data models and use cases will generate tombstones. At that point, there's not a whole lot to be done, except to design the data model so as to not query them.
So my thought here, would be to partition the table differently.
CREATE TABLE messages (
day TEXT,
message_time TIMESTAMP,
message_text TEXT,
PRIMARY KEY ((day),message_time))
WITH CLUSTERING ORDER BY (message_time DESC);
With this model, you can query all messages for a particular day. You can also run a range query on day and message_time. Ex:
SELECT * FROM messages
WHERE day='20210827'
AND message_time > '2021-08-27 04:00';
This will build a result set of all messages since 2021-08-27 04:00. Any tombstones generated outside of the requested time range (in this case, before 04:00) will not be queried.
Note that (based on the delete pattern) you could still have tombstones within the given time range. But the idea here, is that the WHERE clause limits the "blast radius," so querying a smaller number of tombstones shouldn't be a problem.
Unfortunately, there isn't a quick fix to your problem.
The challenge for you is that you're using Cassandra as a queue and it isn't a good idea because you run exactly into that tombstone hell. I'm sure you've seen this blog post by now that talks queues and queue-like datasets being an anti-pattern for Cassandra.
It is possible to avoid generating lots of tombstones if you model your data differently in buckets with each bucket mapping to a table. When you're done processing all the items in the bucket, TRUNCATE the table. This idea came from Ryan Svihla in his blog post Understanding Deletes where he goes through the idea of "partitioning tables". Cheers!
I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
From the application user end - we query a select by account_id statement.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
account_id text,
time timeuuid,
value text,
PRIMARY KEY (account_id, time) WITH CLUSTERING ORDER BY (time DESC)
)
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
The question is HOW better, if at all, is the second data model over the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time all I need to do is a simple
SELECT * FROM metrics WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL rows that where overwritten the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all I encourage you to examine the official documentation about read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the Merge Cells by Timestamp step in the documentation (again check the flow diagram). Notice, that in each SSTable the number of rows will be one in your first case.
In both of your cases the main driving factor is that how many SSTables do you have to check during read. It's somewhat independent from how many records each SSTable contains.
But on the second case you have much bigger SSTabes which leads to longer SSTable compaction. Also TTL expiration performs additional writes. So first case is somewhat preferable.
We are processing the oldest data as it comes into the time-series table. I am taking care to make sure that the oldest entries expire as soon as they are processed. Expectation is to have all the deletes at the bottom part of the clustering column of TimeUUID. So query will always read time slot without any deleted entries.
Will this scheme work? Are there any impacts of the expired columns that I should be aware of?
So keeping the timeuuid as part of clustering key guarantee the sort order to provide the most recent data.
If Cassandra 3.1 (DSE 5.x) and above :-
Now regarding the deletes, "avoid manual and use TWCS": Here is how
Let's say every X minutes your job process the data. Lets say X = 5min, (hopefully less than 24hours). Set the compaction to TWCS: Time Window Compaction Strategy and lets assume with TTL of 24hours.
WITH compaction= {
'compaction_window_unit': 'HOURS',
'compaction_window_size': '1',
};
Now there are 24buckets created in a day, each with one hour of data. These 24 buckets simply relates to 24 sstables (after compaction) in your Cassandra data directory. Now during the 25hour, the entire 1st bucket/sstable would automatically get dropped by TTL. Hence instead of coding for deletes, let Cassandra take care of the cleanup. The beauty of TWCS is to TTL the entire data within that sstable.
Now the READs from your application always goes to the recent bucket, 24th sstable in this case always. So the reads would never have to scan through the tombstones (caused by TTL).
If Cassandra 2.x or DSE 4.X, if TWCS isn't available yet :-
A way out till you upgrade to Cassandra 3.1 or above is to use artificial buckets. Say you introduce a time bucket variable as part of the partition key and keep the bucket value to be date and hour. This way each partition is different and you could adjust the bucket size to match the job processing interval.
So when you delete, only the processed partition gets deleted and will not come in the way while reading unprocessed ones. So scanning of tombstones could be avoided.
Its an additional effort on application side to start writing to the correct partition based on the current date/time bucket. But its worth it in production scenario to avoid Tombstone scan.
You can use TWCS to easily manage expired data, and perform filtering by some timestamp column on query time, to ensure that your query always getting the last results.
How do you "taking care" about oldest entries expiry? Cassandra will not show records with expired ttl, but they will persist in sstables until next compaction for this sstable. If you are deleting the rows by yourself, you can't make sure that your query will always read latest records, since Cassandra is eventually consistent, and theoretically there's can be some moments, when you will read stale data (or many such moments, based on your consistency settings).
I know this is not the best way to use Cassandra, but the type of my data requires reading all data from the last week. However when using Collection-types in CQL3, I ran into certain limitations which prevent me from doing normal date-range queries.
So I have set up Cassandra (currently single node, probably more in the future) with the following table
CREATE TABLE cache (tag text, id int, tags map<text,text>,
PRIMARY KEY (tag, id) );
ALTER TABLE cache WITH GC_GRACE_SECONDS = 0;
I am inserting with a TTL of one week to automatically remove the items from the Cache.
I tried to follow the suggestions mentioned in this article to avoid reading many tombstones by selecting by "minimum id", which I persist elsewhere to avoid reading old data:
SELECT * FROM cache WHERE tag = ? AND id >= ?
The id is basically some sort of timestamp which is constantly increasing, i.e. I only insert higher values over time and constantly remove older ids from the table.
But I still get warnings about thresholds being reached
WARN 08:59:06,286 Read 5001 live and 5702 tombstoned cells in cache (see tombstone_warn_threshold)
And if I do not run manual compaction/scrubbing regularly I get exceptions and queries fail.
However based on my understanding from the articles and documentation, I should be avoiding most if not all tombstones here as I query on equality for the tag, which allows Cassandra to only look for those areas and I use a minimum id which allows Cassandra to start reading only after most of the tombstones, so why are there still tombstone warnings/exceptions reported?
Map k/v pair is actually a column (name, value and timestamp): so, if you are issuing a lot of deletions of map elements (expiring by TTL is also the case) -- this is the source of this warning. Because you are still reading full maps (with lots of tombstones in them). Also, TTL setting on map is applied on per-element basis.
Second, this is multiplied by >= predicate in your select query.
If this is the case, you should remodel your data access pattern to use only EQ relations in SELECT query and bump id more often. Also, this access pattern will allow you to get rid of clustering part of your PRIMARY KEY.
So, if you do not issue lots of deletions on that map, you can try to use tag text, time timeuuid, name text, data text model and slice it precisely by time.