I do a select with tracing ON and see:
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-<N>]
So is it working and ignoring the tombstones? The trace line:
Read 0 live rows and 2 tombstone cells
is clear: it is still reading tombstones.
Let's say there was a Column A.
You added value x to Column A.
Then you deleted Column A.
Instead of immediately deleting value x, Cassandra will add a marker for Column A called a tombstone. The tombstone is itself an individual record, just like the original value x.
Let's say the two updates were written in different sstables (Cassandra storage).
Now when you read the value, Cassandra will get both the value x and the tombstone for Column A. It will see that the tombstone was written after the value x, so it will not return any value.
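As a rough CQL illustration (the keyspace, table, and column names here are hypothetical), the sequence above looks like this:

-- hypothetical schema for illustration
CREATE TABLE ks.demo (pk int, ck int, a text, PRIMARY KEY (pk, ck));
INSERT INTO ks.demo (pk, ck, a) VALUES (1, 1, 'x');  -- value x for "Column A"
-- (suppose a memtable flush lands 'x' in sstable 1 here)
DELETE a FROM ks.demo WHERE pk = 1 AND ck = 1;       -- writes a cell tombstone
-- (suppose another flush lands the tombstone in sstable 2)
SELECT a FROM ks.demo WHERE pk = 1 AND ck = 1;       -- merges both sstables; the
-- tombstone carries the newer timestamp, so a comes back as null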
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones
This is basically confirming the same.
Based on talking to some Cassandra admins:
- Skipping sstables is Cassandra telling us it eliminated the tombstones efficiently; this is OK.
- Deleting everything in a partition (rather than cell by cell) generally helps ensure Cassandra is not bogged down with tombstones, as the example below shows.
Related
During minor compaction, to reclaim a row tombstone, how does Cassandra check whether the row exists in other SSTables? Does it just check the partition key via the bloom filter, or does it check the row key?
For example, there are 3 sstables: s1, s2 and s3. Assume s1 has the row key 'p.c1', where p is the partition key and c1 is the clustering key, s2 has the row key 'p.c2', and s3 has the tombstone for the row key 'p.c2'. In this case, when minor compaction is triggered on s2 and s3, will the row 'p.c2' be reclaimed after compaction?
Thanks a lot.
Cassandra combines all the fragments of a partition from the active memtable and SSTables to determine whether a tombstone can be dropped from the SSTable(s) being compacted.
Similar to read requests, Cassandra checks the memtable, bloom filter, partition key cache or partition summary, and the partition index to locate the fragments of the data/partition/row on disk.
For reference, have a look at How Cassandra reads data. Cheers!
I have a Cassandra cluster with a low-delete use case. I found this in my system.log: "Read 10 live and 5645464 tombstones cells in keyspace.table". What does it mean? Please help me understand.
Thanks.
In Cassandra, all recorded information is immutable. This means that when you have a delete operation (either an explicit delete statement or an expired Time To Live [TTL]), the database will add another record with a special flag marking it as a tombstone. All these records will stay in the database until the gc_grace_seconds period has passed; the default is 10 days.
In your case, the engine found out that most of the records retrieved were deleted, but they are still waiting for the gc_grace_seconds to pass, to let compaction reclaim the space. One possible option to fix the issue is to decrease gc_grace_seconds for that table.
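For example (the table name here is a placeholder), you could lower it like this; note that repairs then need to run more often than the new window, or deleted data can resurrect from a node that missed the delete:

-- 3600 seconds (1 hour) instead of the 864000-second (10-day) default
ALTER TABLE ks.tbl WITH gc_grace_seconds = 3600;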
For more information, please refer to this article from the Last Pickle.
One more important thing to keep in mind when working with Cassandra is that tombstone cells do not correlate directly to deletes.
When you insert a null value for an attribute, Cassandra internally marks that attribute/cell as a tombstone. So even if you don't have a lot of deletes happening, you can end up with an enormous number of tombstones. The easy and simple solution is to not insert null values for attributes (see the sketch below).
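A minimal sketch of the difference, with hypothetical table and column names:

-- binding null explicitly writes a tombstone cell for email
INSERT INTO ks.users (id, name, email) VALUES (1, 'alice', null);
-- omitting the column entirely writes no tombstone
INSERT INTO ks.users (id, name) VALUES (1, 'alice');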
Going by the statement Read 10 live and 5645464 tombstones cells in keyspace.table, my guess is that some query is performing a scan that reads 10 live cells and 5,645,464 tombstones (cells with null values) along the way. You would need to look at what type of queries are being executed to gain more insight.
As per Question regarding Tombstone, why don't upserts create tombstones?
As per the DataStax documentation How is data updated?, Cassandra treats every upsert as a delete followed by an insert, since the new timestamp of the insert overwrites the old timestamp. The old-timestamp data has to be marked as deleted, which would seem to imply a tombstone.
Why do we have contradicting statements? Or am I missing something here?
Use case:
Data is inserted with a unique key (uuid) in Cassandra, and some of the columns in this data keep updating frequently. Which approach do you recommend?
1. Inserting the same data with new column values in the insert query.
2. Updating the existing record based on the given uuid with new column values in the update query.
Which approach does or doesn't create tombstones, and how does Cassandra handle each query?
As Russ pointed out, you may want to read other similar questions on this topic. However,
An upsert/overwrite is just-another-cell, with a name, a timestamp and a value.
A tombstone is just like an overwrite, except it gets one extra field indicating that it's been deleted, so that it isn't returned as valid output. The reason tombstones are often harmful is that they can accumulate in bad data models, even when people think the data is gone - and skipping them to get to live data actually requires memory.
When you update/upsert as you describe, the cell you create SHADOWS (obsoletes) the previous cell, which will be removed upon compaction. That previous cell is NOT a tombstone, even though it's no longer live/active - it will be compacted away and completely replaced by the new, live, highest-timestamp value as soon as compaction allows.
The biggest thing to keep in mind is this: tombstones aren't necessarily removed by compaction - they're kept around (persisted/rewritten) for at least gc_grace_seconds, and potentially even longer if they need to shadow/cover other cells in sstables not yet compacted. Because of this, tombstones stay around for a long time, but shadowed/overwritten cells are gc'd as soon as the sstable they're in is compacted.
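To make the distinction concrete, here is a hedged sketch (all names hypothetical): the second insert merely shadows the first, while the delete writes an actual tombstone:

CREATE TABLE ks.t (id int PRIMARY KEY, v text);
INSERT INTO ks.t (id, v) VALUES (1, 'a');  -- a plain cell
INSERT INTO ks.t (id, v) VALUES (1, 'b');  -- shadows 'a'; no tombstone, and 'a'
                                           -- is dropped at the next compaction
DELETE v FROM ks.t WHERE id = 1;           -- writes a tombstone that is kept for
                                           -- at least gc_grace_seconds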
I have a single-node Cassandra cluster. I use the current minute as the partition key and insert rows with a TTL of 12 hours.
I see a couple of issues I can't explain:
The /var/lib/cassandra/data/<key_space>/<table_name> directory contains multiple files, lots of them really old (way older than 12 hours, something like 2 days)
When I try to perform a query in cqlsh, I get a lot of logs that seem to indicate that my table contains lots of tombstones
log:
WARN [SharedPool-Worker-2] 2015-01-26 10:51:39,376 SliceQueryFilter.java:236 - Read 0 live and 1571042 tombstoned cells in <table_name>_name (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:40,472 SliceQueryFilter.java:236 - Read 0 live and 1557919 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:41,630 SliceQueryFilter.java:236 - Read 0 live and 1589764 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:42,877 SliceQueryFilter.java:236 - Read 0 live and 1582163 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,081 SliceQueryFilter.java:236 - Read 0 live and 1550989 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,869 SliceQueryFilter.java:236 - Read 0 live and 1566246 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:45,582 SliceQueryFilter.java:236 - Read 0 live and 1577906 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:46,443 SliceQueryFilter.java:236 - Read 0 live and 1571493 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:47,701 SliceQueryFilter.java:236 - Read 0 live and 1559448 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:49,255 SliceQueryFilter.java:236 - Read 0 live and 1574936 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
I've tried multiple compaction strategies and multithreaded compaction; I've tried running compaction manually with nodetool, and I've also tried forcing garbage collection via JMX.
One of my guesses is that compaction isn't deleting the files containing tombstones.
Any ideas how to keep disk usage from growing too large? My biggest concern is running out of space; I'd rather store less (by making the TTL smaller, but currently that doesn't help).
Tombstones will be preserved for 10 days using the default configuration. The reason for this is to make sure that offline nodes will be able to catch up with deletes when they join the cluster again. You can configure this value by setting the gc_grace_seconds setting.
I'm assuming you are using the timestamp as a clustering column within each partition when you say you are using the minute as the partition key, along with a TTL of 12 hours when you do the insert. This will build up tombstones in each partition since you are never deleting the entire row (i.e. a whole minute partition).
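The write pattern being described would look something like this (the schema is my assumption based on the question):

-- minute-bucketed partition key, high-resolution clustering timestamp,
-- and a 12-hour TTL (43200 seconds) on every row
INSERT INTO ks.tbl (minute, ts, v)
VALUES ('2015-01-26T10:51', '2015-01-26 10:51:39+0000', 'data')
USING TTL 43200;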
Suppose your keyspace is called k1 and your table is called t2, then you can run:
nodetool flush k1 t2
nodetool compact k1 t2
sstable2json /var/lib/cassandra/data/k1/t2/k1-t2-jb-<last version>-Data.db
then you'll see all the tombstones like this (marked with a "d" for deleted):
{"key": "00000003","columns": [["4:","54c7b514",1422374164512000,"d"], ["5:","54c7b518",1422374168501000,"d"], ["6:","54c7b51b",1422374171987000,"d"]]}
Now if you go and delete that row (i.e. delete from k1.t2 where key=3;), then do the flush, compact, and sstable2json again, you'll see it change to:
{"key": "00000003","metadata": {"deletionInfo": {"markedForDeleteAt":1422374340312000,"localDeletionTime":1422374340}},"columns": []}
So you see all the tombstones are gone and Cassandra only has to remember that the whole row was deleted at a certain time instead of little bits and pieces of the row being deleted at certain times.
Another way to get rid of the tombstones is to truncate the whole table. When you do that, Cassandra only needs to remember that the whole table was truncated at a certain time, and so no longer needs to keep tombstones prior to that time (since tombstones are used to tell other nodes that certain data was deleted, and if you can say the whole table was emptied at time x, then the details prior to that no longer matter).
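Using the example table from above, that is simply:

TRUNCATE k1.t2;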
So how could you apply this in your situation you ask. Well, you could use the hour and minute as your partition key, and then once an hour run a cron job to delete all the rows from 13 hours ago. Then on the next compaction, all the tombstones for that partition would be removed.
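As a rough sketch (using a hypothetical variant of the table with hour and minute as a composite partition key), the cron job would issue one partition-level delete per minute bucket of the expired hour:

-- removes the whole partition for one minute bucket from 13 hours ago;
-- a single partition tombstone replaces millions of per-cell tombstones
DELETE FROM k1.t3 WHERE hour = 21 AND minute = 15;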
Or keep a separate table for each hour, and then truncate the table from 13 hours ago each hour using a cron job.
Another strategy that is sometimes useful is to "re-use" clustering keys. For example, if you were inserting data once per second, instead of having a high resolution timestamp as a clustering key, you could use the time modulo 60 seconds as the clustering key and keep the more unique timestamp as just a data field. So within each minute partition you would be changing tombstones (or outdated information) from yesterday back into live rows today, and then you wouldn't accumulate tombstones over many days.
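A hedged sketch of that idea (all names hypothetical): second-of-minute becomes the clustering key, and the full timestamp is demoted to a plain data column:

CREATE TABLE k1.samples (
  minute text,   -- partition key, e.g. '10:51'
  sec int,       -- clustering key: epoch seconds modulo 60
  ts timestamp,  -- the full, unique timestamp kept as data
  v double,
  PRIMARY KEY (minute, sec)
);
-- tomorrow's write for the same (minute, sec) overwrites today's cell
-- instead of piling up new rows and tombstones
INSERT INTO k1.samples (minute, sec, ts, v)
VALUES ('10:51', 32, '2015-01-27 10:51:32+0000', 1.5);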
So hopefully that gives you some ideas for things to try. Usually when you run into a tombstone problem, it's a sign that you need to re-think your schema a little bit.
I have a similar issue, only in my case there was just a single table that refused to shrink (old files were not deleted and their storage space kept growing). I used nodetool compactionstats and saw there were a lot of pending compaction tasks.
Another interesting thing: nodetool compactionstats always showed compactions of type Compaction for the problematic table, but never of type Tombstone Compaction, as opposed to the tables that behaved well.
Could it be the problem?
I am actually getting confused by some concepts regarding Cassandra.
What do we actually mean by updating a Cassandra row? Does it mean adding more columns, updating the value of an existing column, or both?
When we add more columns to a row, does the previous row in the SSTable get invalidated and a new row entry get inserted into the SSTable with the newly added columns?
Since SSTables are immutable, does each update to column data, addition of a column, or deletion of column data result in invalidating the previous row and inserting a new row with all the previous columns plus the new column?
Please help.
What do we actually mean by updating a Cassandra row? Does it mean
adding more columns, updating the value of an existing column, or both?
In Cassandra, updating a row and inserting a row are the same operation; both lead to adding data to a memtable (an in-memory sstable), which is later flushed to disk and becomes an sstable (a log line is also written to the commit log if persistent writes are enabled). If you insert a column (btw, in Cassandra terms a column is the same as a cell, and a row is known as a partition; you might find this useful if you do any further reading) which already exists, e.g.:
INSERT INTO db.tbl (id, value) VALUES ('text_id1', 'some text as a value');
INSERT INTO db.tbl (id, value) VALUES ('text_id1', 'some text as a value');
You'll end up with 1 partition, since the first insert is overwritten by the second. This means that inserting partitions with duplicate keys leads to the previous one being overwritten (and the overwrite is based on the timestamp at the time of insert: last write wins).
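You can observe the last-write-wins behaviour with the WRITETIME function, which returns the timestamp Cassandra recorded for the surviving cell:

SELECT value, WRITETIME(value) FROM db.tbl WHERE id = 'text_id1';
-- returns the value from the second insert, with its (later) write timestamp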
When we add more columns (cells) to a row (partition), does the
previous row in the SSTable get invalidated and a new row entry get
inserted into the SSTable with the newly added columns?
In CQL, the previous rows will just contain a null value for the new column. No invalidation will happen; you can alter schemas as you please. If you delete a column, its data will be removed during the next compaction, with the aim of reclaiming disk space.
Since SSTables are immutable, does each update to column data, addition
of a column, or deletion of column data result in invalidating the
previous row and inserting a new row with all the previous columns plus
the new column?
Kind of. SSTables are merged into larger sstables when necessary; how this is done depends on the compaction strategy being used. There are two main flavours, size-tiered and leveled compaction. Covering how they work is a whole separate question that has been answered by people who are smarter than me, so have a read here.
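For reference, the compaction strategy is set per table; for example, switching the earlier table to leveled compaction looks like this:

ALTER TABLE db.tbl
WITH compaction = {'class': 'LeveledCompactionStrategy'};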
Updating is covered here:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_write_update_c.html
As you note, SSTables are immutable, so you're probably wondering what happens when a later write supersedes data already in an SSTable. The storage engine reads from all SSTables that might have data for a requested row (as determined by the bloom filter for each SSTable). Understanding the read path might clarify this for you:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_about_reads_c.html
Specifically:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_about_read_path_c.html