I am having a problem with Cassandra 2.1.17. I have a table with about 40k "rows" in it. One partition I am having a problem with has maybe about 5k entries in it.
Table is:
create table billing (
accountid uuid,
date timeuuid,
credit double,
debit double,
type text,
primary key (accountid,date)
) with clustering order by (date desc)
So there is a lot of inserting and deleting from this table.
My problem is that the partition seems to have become corrupt: I can no longer select data past a certain point in it.
From cqlsh I can run something like this:
SELECT accountid,date,credit,debit,type FROM billing WHERE accountid=XXXXX-xxxx-xxxx-xxxxx... AND date < 3d466d80-189c-11e7-8a57-f33cbced2fc5 limit 2;
First I ran a select with a limit of 10000. It works for around 5000 rows while paging through them, then towards the end it gives a timeout error.
I then took the second-to-last timeuuid and ran the select with limit 2: it fails, while limit 1 works.
If I use the last timeuuid with < and limit 1, it also fails.
I am not sure what is wrong or how to diagnose and fix it.
I have tried a repair and forced a compaction, but the issue persists.
Thank you for any help.
Try to start with running manual compaction on your table.
You can increase read_request_timeout_in_ms parameter in cassandra config.
Consider moving to leveled compaction strategy if you are having a lot of deletes and updates.
I think you got too many tombstones in this partition.
What is a tombstone?
To remember that a record has been deleted, Cassandra creates a special value called a "tombstone". Like any other value, a tombstone has a TTL, but it is not compacted away as easily. Cassandra keeps it longer to avoid inconsistencies such as deleted data reappearing.
How to watch tombstones?
nodetool cfstats gives you an idea of how many tombstones you have on average per slice.
How to fix the issue?
A tombstone is preserved for gc_grace_seconds. To fix the issue, reduce that value and then run a major compaction.
It looks to me like you are hitting a lot of tombstones when you do selects. The thing is, while they are there, Cassandra still has to go over them. Several factors can contribute: TTLs on insert statements, a lot of deletes, inserting nulls, and so on.
My bet is that you need to adjust gc_grace_seconds on the table and run repairs more often. But be careful not to set it too low (one round of repair has to finish within this time).
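To make the "one round of repair has to finish within this time" rule concrete, here is a minimal sketch of a sanity check. The function name and the 1.5x safety margin are my own assumptions, not anything Cassandra provides; pick a margin that matches how reliably your repairs complete.

```python
from datetime import timedelta


def gc_grace_is_safe(gc_grace_seconds: int,
                     repair_cycle: timedelta,
                     margin: float = 1.5) -> bool:
    """Return True if a full repair cycle, padded by a safety margin,
    fits inside gc_grace_seconds.

    `margin` is a hypothetical safety factor chosen for illustration.
    """
    return repair_cycle.total_seconds() * margin <= gc_grace_seconds


# Default gc_grace_seconds (10 days) with a repair cycle of 3 days.
print(gc_grace_is_safe(864000, timedelta(days=3)))
# A 3 hour gc_grace_seconds with a repair cycle of 1 day.
print(gc_grace_is_safe(3 * 3600, timedelta(days=1)))
```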
It's all nicely explained here:
https://opencredo.com/cassandra-tombstones-common-issues/
I have a Cassandra cluster whose use case involves few deletes. I found this in my system.log: "Read 10 live and 5645464 tombstones cells in keyspace.table". What does it mean? Please help me understand.
Thanks.
In Cassandra, all recorded data is immutable. This means that when you perform a delete (explicitly with a DELETE statement, or implicitly through a Time To Live [TTL] clause), the database adds another record with a special flag, called a tombstone. These records stay in the database until the gc_grace_seconds period has passed; the default is 10 days.
In your case, the engine found out that most of the records retrieved were deleted, but they are still waiting for the gc_grace_seconds to pass, to let compaction reclaim the space. One possible option to fix the issue is to decrease gc_grace_seconds for that table.
For more information, please refer to this article from the Last Pickle.
One more important thing to keep in mind when working with Cassandra is that tombstone cells do not directly correlate to deletes.
When you insert a null value for an attribute, Cassandra internally marks that attribute/cell as a tombstone. So even if you don't have a lot of deletes happening, you could end up with an enormous number of tombstones. The simple solution is to not include null values for an attribute when inserting.
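As a rough sketch of "don't insert nulls", a small helper can drop None-valued columns before the statement is built. The helper name `build_insert` is hypothetical, and the `%s` placeholder style is an assumption about the client library in use; adapt it to whatever your driver expects.

```python
from typing import Any, Dict, List, Tuple


def build_insert(table: str, row: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Build a parameterized INSERT that omits columns whose value is None,
    so no null (and therefore no tombstone cell) is written for them."""
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns")
    columns = ", ".join(present)
    placeholders = ", ".join("%s" for _ in present)
    query = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return query, list(present.values())


query, params = build_insert(
    "billing", {"accountid": "a", "credit": 1.0, "debit": None}
)
print(query)
print(params)
```

The `debit` column is simply left out of the statement instead of being written as null.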
As for the statement "Read 10 live and 5645464 tombstones cells in keyspace.table": my guess is that a query is scanning the table and reading 10 live cells and 5645464 tombstone cells (for example, cells written with a null value) along the way. We would need to know what kinds of queries are being executed to gain more insight.
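To get a feel for how lopsided that read was, the numbers can be pulled out of the message and turned into a fraction. This is only a sketch: the regular expression matches the message exactly as quoted in the question, and real log lines may be worded differently depending on the Cassandra version.

```python
import re

# Pattern for the message as quoted in the question (an assumption about
# the exact wording; adjust it to the lines in your own system.log).
LOG_PATTERN = re.compile(r"Read (\d+) live and (\d+) tombstones? cells in (\S+)")


def tombstone_stats(line: str):
    """Extract live/tombstone counts from a log line, or return None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    live, dead = int(match.group(1)), int(match.group(2))
    total = live + dead
    return {
        "table": match.group(3),
        "live": live,
        "tombstones": dead,
        "tombstone_fraction": dead / total if total else 0.0,
    }


stats = tombstone_stats(
    "Read 10 live and 5645464 tombstones cells in keyspace.table"
)
print(stats)
```

Here almost every cell the read touched was a tombstone, which is why the query is slow even though it returns only a handful of rows.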
I have a table named 'holder' which has a single partition, into which we write 60K entries every hour.
I have another table named 'holderhistory' which has the 'date' as its partition key, so every day's records from the 'holder' table are copied to 'holderhistory'.
A job runs in the application which:
i) collects all the older entries in the holder table and copies them to the holderhistory table
ii) deletes the older entries from the holder table
NOW the issue is that too many tombstones are created in the holder table.
By default the tombstones are cleared after gc_grace_seconds, which is 10 days (864000 seconds).
But I don't want to keep the tombstones for more than 3 hours.
1) So is it good to set gc_grace_seconds to 3 hours?
2) Or is it good to set default_time_to_live to 3 hours?
Which is the best solution for getting rid of the tombstones?
Also, what are the consequences of reducing gc_grace_seconds from 10 days to 3 hours? Where will we see an impact?
Any help is appreciated.
If you set gc_grace_seconds too low and a node takes longer than gc_grace_seconds to recover, then once that node comes back online it will mistakenly think that the nodes that received the delete had actually missed a write, and it will start repairing all of the other nodes. I would recommend giving default_time_to_live a try instead.
To answer your particular case : as the table 'holder' contains only one partition, you can delete the whole partition with a single "delete by partition key" statement, effectively creating a single tombstone.
If you delete the partition once a day, you'll end up with 1 tombstone per day... that's quite acceptable.
1) With gc_grace_seconds set to 3 hours and RF > 1, you are not guaranteed to recover consistently from a node failure that lasts longer than 3 hours.
2) With default_time_to_live set to 3 hours, each record will be deleted, by creating a tombstone, 3 hours after insertion.
So you could keep the default gc_grace_seconds of 10 days and take care to delete your daily records with something like DELETE FROM table WHERE PartitionKey = X
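The difference between deleting row by row and deleting by partition key can be illustrated with a toy model. `ToyPartition` is purely hypothetical and only counts deletion markers; it is not how Cassandra stores data internally.

```python
class ToyPartition:
    """Toy model of one partition that only tracks deletion markers."""

    def __init__(self, clustering_keys):
        self.live = set(clustering_keys)
        self.row_tombstones = set()
        self.partition_tombstone = False

    def delete_row(self, key):
        # One marker per deleted row.
        self.live.discard(key)
        self.row_tombstones.add(key)

    def delete_partition(self):
        # A single marker covers every row in the partition.
        self.live.clear()
        self.partition_tombstone = True

    def tombstone_count(self):
        return len(self.row_tombstones) + int(self.partition_tombstone)


# One hour of entries deleted one by one.
row_by_row = ToyPartition(range(60000))
for key in range(60000):
    row_by_row.delete_row(key)

# The same entries removed with a single delete by partition key.
whole = ToyPartition(range(60000))
whole.delete_partition()

print(row_by_row.tombstone_count(), whole.tombstone_count())
```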
EDIT: Answering to your comment about hinted handoff...
Let's say RF = 3, gc_grace_seconds = 3h and a node goes down. The 2 other replicas continue to receive mutations (insert, update, delete), but they can't replicate them to the offline node. In that case, hints will be stored on disk temporarily, to be sent later if the dead node comes back.
But a hint expires after gc_grace_seconds, after which it will never be sent.
Now if you delete a row, it will generate a tombstone in the sstables of the 2 replicas and a hint on the coordinator node. After 3 hours, the tombstones are removed from the online nodes by the compaction manager, and the hint expires.
Later, when your dead node comes back, it still has the row, and it can't know that this row has been deleted because no hint and no tombstone exist on the replicas any more... thus it's a zombie row.
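The scenario above can be replayed with a small simulation. This is a deliberately simplified sketch: the `Replica` class, the naive `repair` function and the assumption that the hint has already expired are all mine, and write timestamps are ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    data: dict = field(default_factory=dict)        # key -> value
    tombstones: dict = field(default_factory=dict)  # key -> deletion time (s)
    up: bool = True

    def write(self, key, value):
        if self.up:
            self.data[key] = value

    def delete(self, key, now):
        if self.up:
            self.data.pop(key, None)
            self.tombstones[key] = now

    def compact(self, now, gc_grace_seconds):
        # Tombstones older than gc_grace_seconds are purged.
        self.tombstones = {k: t for k, t in self.tombstones.items()
                           if now - t < gc_grace_seconds}


def repair(replicas):
    """Naive repair: a surviving tombstone wins, otherwise data is copied."""
    tombstones, data = {}, {}
    for r in replicas:
        tombstones.update(r.tombstones)
        data.update(r.data)
    for key in tombstones:
        data.pop(key, None)
    for r in replicas:
        r.data, r.tombstones = dict(data), dict(tombstones)


def row_comes_back(gc_grace_seconds, outage_seconds):
    replicas = [Replica(), Replica(), Replica()]
    for r in replicas:
        r.write("row", "v1")
    replicas[2].up = False                 # one node goes down
    for r in replicas:
        r.delete("row", now=0)             # only the 2 live replicas see it
    for r in replicas[:2]:
        r.compact(now=outage_seconds, gc_grace_seconds=gc_grace_seconds)
    replicas[2].up = True                  # node returns, hint already expired
    repair(replicas)
    return "row" in replicas[0].data


print(row_comes_back(gc_grace_seconds=3 * 3600, outage_seconds=4 * 3600))
print(row_comes_back(gc_grace_seconds=864000, outage_seconds=4 * 3600))
```

With a 3 hour gc_grace_seconds and a 4 hour outage the deleted row reappears on the healthy replicas; with the 10 day default the tombstone is still around to win the repair.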
You might also find this support blog article useful:
https://academy.datastax.com/support-blog/cleaning-tombstones-datastax-dse-and-apache-cassandra
We would like to create a Cassandra table with a simple primary key consisting of a single UUID column.
The table will look like:
CREATE TABLE simple_table(
id UUID PRIMARY KEY,
col1 text,
col2 text,
col3 UUID
);
This table will potentially store a few billion rows, and the rows should expire after some time (a few months) using the TTL feature.
I have a few questions regarding the efficiency of this table:
How efficient is a query against this table using the primary key? That is, how does Cassandra find a specific row after resolving which partition it resides in?
Considering that the rows will expire and create many tombstones, how will this affect reads and writes to this table? Let's say that we expire the data after 180 days; if I am not mistaken, the ratio of tombstones would be 10/180~=0.056 (where 10 is gc_grace_seconds expressed in days).
In your case, the primary key is equal to the partition key, so you have so-called "skinny" partitions, each consisting of one row. If you remove data, the partition will contain only a tombstone instead of data, and that is not a problem. If the data has expired, it will simply be removed during compaction; gc_grace_seconds isn't applied here. It is required only when you explicitly remove data, because the tombstone must be kept so that other nodes can "catch up" with the change if they weren't able to receive the delete operation. You can find more details about data deletion in the following document.
Problems with tombstones arise when you have many (thousands of) rows inside the same partition, for example if you use several clustering keys. When such data is deleted, a tombstone is generated, and it has to be skipped whenever we read data inside the partition.
P.S. Have you seen this blog post that explains how deletions happen?
After reading the blog (and the comments) that @Alex referred me to, I concluded that tombstones are created for expired rows because of the table's default_time_to_live.
Those tombstones will be cleaned up only after gc_grace_seconds has passed. See this stack overflow question.
Regarding my first question, this DataStax page describes it pretty well.
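For reference, the estimate from the question can be written out as a quick calculation. It assumes a steady write rate and that every expired row lingers as a tombstone for exactly gc_grace_seconds before compaction removes it, which is a simplification; the function name is made up for illustration.

```python
def tombstone_to_live_ratio(ttl_days: float, gc_grace_days: float) -> float:
    """Steady-state ratio of expired-but-not-yet-purged rows to live rows,
    assuming a constant write rate (a simplifying assumption)."""
    return gc_grace_days / ttl_days


ratio = tombstone_to_live_ratio(ttl_days=180, gc_grace_days=10)
print(round(ratio, 3))
```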
We have a redundant column that we'd like to delete from our Cassandra database (version 2.1.15). This text column represents the majority of the data on disk (15 nodes X 1.8 TB per node).
The easiest option seems to be an alter table to remove that column, and then let Cassandra compaction take care of things (we also run Cassandra Reaper to manage repairs). However, given the size of the dataset, I'm concerned I will knock over the cluster with a massive delete.
Another option I've considered is a process that runs through the keyspace setting the value to null. I think this would have the same effect as removing the column but would be more under our control (though it also requires writing something to do it).
Would anyone have any advice on how to approach this?
Thanks!
Dropping a column does mark the deleted values as tombstones. The column value becomes unavailable immediately and the column data is removed in the next compaction cycle.
If you want to expedite the removal of the column data before compaction occurs, you can run nodetool upgradesstables to remove the data after you use the ALTER TABLE command to change the metadata for the column.
See Documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/alter_table_r.html
If I remember correctly, dropping a column doesn't really mark the deleted values with tombstones. Instead it inserts a corresponding entry into the system.dropped_columns table, and code such as SerializationHelper & BTreeRow then filters the values out on the fly. The data will be deleted when compaction happens.
Explicitly setting the value to null won't make the situation better, because you'll be adding data to the table.
I would recommend testing the deletion on a small cluster and checking how it behaves.
I understand that count(*) from table where partitionId = 'test' will return the count of the rows. I can see that it takes the same time as select * from table where partitionId = 'test'.
Is there any other alternative in Cassandra to retrieve the count of the rows in an efficient way?
You can compare select * and select count(*) in cqlsh by enabling tracing with the tracing on command: it will print the time required to execute each command. The only difference between the two queries is the amount of data that has to be returned.
Either way, to find the number of rows Cassandra needs to hit the SSTable(s) and scan the entries. Performance can differ if the partition is spread across multiple SSTables, which may depend on the compaction strategy of the table, selected according to your read/write patterns.
As Alex Ott mentioned, the COUNT(*) needs to go through the entire partition to know that total.
The fact is that Cassandra wants to avoid locks, and as a result it does not maintain a row count in its sstables. Each time you do an INSERT, UPDATE, or DELETE, you may actually be overwriting another entry, which is just marked as a tombstone (i.e. it's not an in-place overwrite; instead the new data is saved at the end of the sstable and the old data is marked as dead).
The COUNT(*) will go through the sstables and count all the entries not marked as a tombstone. That's very costly. We're used to SQL having the total number of rows in a table or an index so COUNT(*) on those is instantaneous... not here.
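A toy model shows why the count cannot be answered without scanning. The sstable layout here (plain dicts of key to (timestamp, value)) and the `TOMBSTONE` sentinel are made up for illustration and have nothing to do with Cassandra's real on-disk format.

```python
# Sentinel standing in for a deletion marker (hypothetical, for illustration).
TOMBSTONE = object()


def count_live_rows(sstables):
    """Merge entries from all sstables, newest timestamp wins,
    then count the keys whose newest entry is not a tombstone."""
    newest = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    return sum(1 for _, value in newest.values() if value is not TOMBSTONE)


sstables = [
    {"a": (1, "x"), "b": (1, "y")},          # older sstable
    {"a": (2, TOMBSTONE), "c": (2, "z")},    # newer sstable deletes "a"
]
print(count_live_rows(sstables))
```

Every entry in every sstable has to be visited before the total is known, which is the cost described above.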
One solution I've used is to install Elasticsearch alongside the Cassandra cluster. One of the values Elasticsearch keeps in its stats is the number of rows in a table. I don't remember the exact query, but more or less you can just send a count request and get a result in about 100ms, always, whatever the number is, even with tens of millions of rows. Just like with SELECT COUNT(*) ..., the result will always be an approximation if you have many writes happening in parallel. It will stabilize if the writes stop for long enough (possibly about 1 or 2 seconds).