I do a select with tracing ON and see:
Skipped 0/1 non-slice-intersecting sstables
included 0 due to tombstones [ReadStage-<N>]
So is it actually managing to ignore tombstones? The trace line
Read 0 live rows and 2 tombstone cells
is clear: it is reading tombstones.
Let's say there was a Column A.
You added value x to Column A.
Then you deleted Column A.
Instead of immediately deleting value x, Cassandra will add a marker for Column A, which is called a tombstone. The tombstone is an individual record in itself, just like the original value x.
Let's say the two updates were written to different sstables (Cassandra's on-disk storage files).
Now when you read the value, Cassandra will get both the value x and the tombstone for Column A. It will see that the tombstone was written after the value x, so it will not return any value.
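To make the read path above concrete, here is a minimal sketch in Python. It is only a toy model: the `Cell` type and `read_column` function are hypothetical names invented for illustration, and real Cassandra reconciliation handles far more cases.

```python
from typing import NamedTuple, Optional

class Cell(NamedTuple):
    column: str
    value: Optional[str]        # None when the cell is a tombstone
    timestamp: int              # write time; the newest write wins
    is_tombstone: bool = False

def read_column(column, sstables):
    """Return the live value of `column`, or None if the newest cell is a tombstone."""
    candidates = [c for table in sstables for c in table if c.column == column]
    if not candidates:
        return None
    newest = max(candidates, key=lambda c: c.timestamp)
    return None if newest.is_tombstone else newest.value

# Value x and the later delete landed in two different sstables.
sstable_1 = [Cell("A", "x", timestamp=100)]
sstable_2 = [Cell("A", None, timestamp=200, is_tombstone=True)]

print(read_column("A", [sstable_1, sstable_2]))  # None
```

Both cells still have to be read before the tombstone can win, which is why the trace counts tombstone cells even though no row is returned.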
Skipped 0/1 non-slice-intersecting sstables
included 0 due to tombstones
This is basically confirming the same.
Based on talking to some Cassandra admins:
- Skipping sstables is Cassandra telling us it eliminated the tombstones efficiently, this is ok
- Deleting everything in a partition in general helps ensure Cassandra is not bogged down with tombstones
New to operating Cassandra clusters. We have a production cluster with 1 DC and 14 nodes running DCE v.2.1.15.
The system log shows many tombstone (TS) warnings like the ones below. We are wondering whether this is okay, whether it is due to the application's nature or to a warn level that is too low (default TS warn level=5000), and whether we ought to run manual compactions between our nightly repairs (every node gets repaired once a week), raise the TS warn level...
Hints appreciated!
WARN [SharedPool-Worker-1] 2016-08-18 11:45:02,536 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_RecentIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-3] 2016-08-18 11:45:02,548 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_MessageFlagsIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 1 columns were requested, slices=[1-1:!]
WARN [SharedPool-Worker-2] 2016-08-18 11:45:04,566 SliceQueryFilter.java:320 - Read 1 live and 1123 tombstone cells in KeyspaceMetadata.CF_UIDIndex for key: 3230303230315f299f8d3ae0c011e5933d7775140b84c3 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-2] 2016-08-18 11:45:11,853 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_RecentIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-2] 2016-08-18 11:45:11,864 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_MessageFlagsIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 1 columns were requested, slices=[1-1:!]
WARN [SharedPool-Worker-1] 2016-08-18 11:46:09,624 SliceQueryFilter.java:320 - Read 2 live and 2537 tombstone cells in KeyspaceMetadata.CF_TimeIndex for key: 3230303030385ffebcbd200d9411e6b9750c94c36d1038 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-3] 2016-08-18 11:47:31,434 SliceQueryFilter.java:320 - Read 2 live and 2544 tombstone cells in KeyspaceMetadata.CF_TimeIndex for key: 3230303030345f6b87b24afbe111e5bf7f828e02f15dd6 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-1] 2016-08-18 11:49:13,870 SliceQueryFilter.java:320 - Read 3 live and 2540 tombstone cells in KeyspaceMetadata.CF_TimeIndex for key: 3230303030355f533d997cfbdf11e5985390948f56b8a7 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
Hi Steffen, please go through OpsCenter and see how many tables have more than 5000 tombstones (TS). If the number of tables is small, running manual compactions on those particular tables is a good solution. Since you mentioned you are repairing once a week, I would also suggest checking the keyspace's data model to see why it is producing so many tombstones.
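Without OpsCenter, the same per-table overview can be pulled out of system.log directly. The following is a rough sketch, assuming the warning format shown in this thread; the function name and the 5000 default are illustrative choices, not anything Cassandra provides.

```python
import re
from collections import defaultdict

# Matches both "tombstone cells" and "tombstoned cells"; \s+ tolerates wrapped lines.
WARN_RE = re.compile(
    r"Read\s+(\d+)\s+live\s+and\s+(\d+)\s+tombstoned?\s+cells\s+in\s+([\w.]+)"
)

def worst_tables(log_text, threshold=5000):
    """Map each table to its highest tombstone count, keeping only those above threshold."""
    worst = defaultdict(int)
    for _live, dead, table in WARN_RE.findall(log_text):
        worst[table] = max(worst[table], int(dead))
    return {table: n for table, n in worst.items() if n > threshold}

sample = """
WARN [SharedPool-Worker-1] 2016-08-18 11:45:02,536 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_RecentIndex for key: ...
WARN [SharedPool-Worker-2] 2016-08-18 11:45:04,566 SliceQueryFilter.java:320 - Read 1 live and 1123 tombstone cells in KeyspaceMetadata.CF_UIDIndex for key: ...
"""
print(worst_tables(sample))  # {'KeyspaceMetadata.CF_RecentIndex': 6251}
```

Tables that show up here repeatedly are the ones worth a closer look at the data model.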
If you are not using OpsCenter, you can check the number of tombstones in an sstable with:
sstable2json full_path | grep \"t\"| wc -l
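If grep feels too blunt, the sstable2json output can also be parsed. This sketch assumes the output is a JSON array of row objects shaped like the sample quoted further down in this thread, where each deleted cell carries a "d" flag as its fourth element; the function name is made up for illustration.

```python
import json

def count_deleted_cells(sstable_json_text):
    """Count cells whose fourth element is the "d" (deleted) flag."""
    rows = json.loads(sstable_json_text)
    return sum(
        1
        for row in rows
        for col in row.get("columns", [])
        if len(col) > 3 and col[3] == "d"
    )

sample = (
    '[{"key": "00000003","columns": ['
    '["4:","54c7b514",1422374164512000,"d"], '
    '["5:","54c7b518",1422374168501000,"d"], '
    '["6:","54c7b51b",1422374171987000,"d"]]}]'
)
print(count_deleted_cells(sample))  # 3
```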
When I do a table repair, I see a lot of warnings like the following:
WARN [CompactionExecutor:112958] 2016-04-07 15:39:33,160 SliceQueryFilter.java:236 - Read 10002 live and 2857 tombstoned cells
But I do not delete anything and do not set TTLs, so nothing should have been deleted. Why are there so many tombstones? The data size is about 200 GB. I have, however, inserted some cells with NULL.
I had the same issue recently. And the reason was that I was inserting NULL values. I just did not know that I did.
If you use a prepared statement and you do not set some of the parameters, or you set them to NULL, or if you insert a JSON object that does not contain a key for every column of the table, then you end up with a tombstone for each of those columns.
In case of the prepared statement you can avoid the tombstone by explicitly setting unused parameters as unset. See CASSANDRA-7304.
Unfortunately there is no such syntax/API for JSON inserts yet. Have a look at CASSANDRA-11424 to check the progress.
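A toy model of the difference between NULL and unset, following the description above. Nothing here talks to Cassandra: `UNSET` and `bind_row` are hypothetical names used only for illustration. As far as I know, the DataStax Python driver exposes a real marker for this as `cassandra.query.UNSET_VALUE` (protocol v4 and later), but check your driver's documentation.

```python
UNSET = object()  # hypothetical stand-in for a driver's "unset" marker

def bind_row(columns, params, timestamp):
    """Model what a bound INSERT writes for each column of the table."""
    written = {}
    for col in columns:
        value = params.get(col)          # a forgotten parameter behaves like NULL
        if value is UNSET:
            continue                     # column untouched: nothing written
        if value is None:
            written[col] = ("tombstone", timestamp)
        else:
            written[col] = (value, timestamp)
    return written

columns = ["id", "a", "b", "c"]
row = bind_row(columns, {"id": 1, "a": None, "b": UNSET}, timestamp=10)
print(row)  # {'id': (1, 10), 'a': ('tombstone', 10), 'c': ('tombstone', 10)}
```

Column b is skipped entirely, while a (explicit NULL) and c (never bound) each leave a tombstone behind.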
When I do a table scan like "select * from ..", I see a lot of warnings like the following. However, I did not delete anything. It is also weird that the delInfo does not make sense to me: the deletedAt timestamp is even earlier than when I created the table.
WARN [SharedPool-Worker-17] 2015-04-21 14:46:23,473 SliceQueryFilter.java:242 - Read 1002 live and 2657 tombstoned cells in yp_rtb.bidresponses (see tombstone_warn_threshold). 1001 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
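One observation on the odd delInfo values: they are exactly the smallest signed 64-bit integer and the largest signed 32-bit integer. That suggests placeholder values meaning "no partition-level deletion recorded" rather than a real deletion time, though that reading is an assumption on my part and not something the log line states. A quick check:

```python
LONG_MIN = -(2 ** 63)     # smallest signed 64-bit integer
INT_MAX = 2 ** 31 - 1     # largest signed 32-bit integer

del_info = {"deletedAt": -9223372036854775808, "localDeletion": 2147483647}

looks_like_placeholder = (
    del_info["deletedAt"] == LONG_MIN and del_info["localDeletion"] == INT_MAX
)
print(looks_like_placeholder)  # True
```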
I have a single-node Cassandra cluster. I use the current minute as the partition key and insert rows with a TTL of 12 hours.
I see a couple of issues I can't explain:
- The /var/lib/cassandra/data/<key_space>/<table_name> directory contains multiple files, and lots of them are really old (way older than 12 hours, something like 2 days).
- When I try to perform a query in cqlsh, I get a lot of log entries that seem to indicate that my table contains lots of tombstones.
log:
WARN [SharedPool-Worker-2] 2015-01-26 10:51:39,376 SliceQueryFilter.java:236 - Read 0 live and 1571042 tombstoned cells in <table_name>_name (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:40,472 SliceQueryFilter.java:236 - Read 0 live and 1557919 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:41,630 SliceQueryFilter.java:236 - Read 0 live and 1589764 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:42,877 SliceQueryFilter.java:236 - Read 0 live and 1582163 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,081 SliceQueryFilter.java:236 - Read 0 live and 1550989 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,869 SliceQueryFilter.java:236 - Read 0 live and 1566246 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:45,582 SliceQueryFilter.java:236 - Read 0 live and 1577906 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:46,443 SliceQueryFilter.java:236 - Read 0 live and 1571493 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:47,701 SliceQueryFilter.java:236 - Read 0 live and 1559448 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:49,255 SliceQueryFilter.java:236 - Read 0 live and 1574936 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
I've tried multiple compaction strategies and multithreaded compaction, I've tried running compaction manually with nodetool, and I've also tried forcing garbage collection with JMX.
One of my guesses is that compaction doesn't delete the tombstone files.
Any ideas how to keep disk usage from growing too big? My biggest concern is running out of space. I'd rather store less (by making the TTL smaller, but currently that doesn't help).
Tombstones will be preserved for 10 days using the default configuration. The reason for this is to make sure that offline nodes will be able to catch up with deletes when they join the cluster again. You can configure this value by setting the gc_grace_seconds setting.
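For reference, the 10 days correspond to 864000 seconds. A small sketch of the arithmetic and of building the CQL to change the setting; the helper name is made up, and k1 and t2 are placeholder keyspace and table names. Keep in mind the reason given above: a grace period shorter than the time a node may stay offline or unrepaired risks deleted data coming back.

```python
from datetime import timedelta

DEFAULT_GC_GRACE_SECONDS = 864000  # the 10-day default mentioned above

def gc_grace_statement(keyspace, table, grace):
    """Build an ALTER TABLE statement that sets gc_grace_seconds from a timedelta."""
    seconds = int(grace.total_seconds())
    return f"ALTER TABLE {keyspace}.{table} WITH gc_grace_seconds = {seconds};"

print(timedelta(seconds=DEFAULT_GC_GRACE_SECONDS).days)   # 10
print(gc_grace_statement("k1", "t2", timedelta(days=1)))
# ALTER TABLE k1.t2 WITH gc_grace_seconds = 86400;
```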
I'm assuming you are using the timestamp as a clustering column within each partition when you say you are using the minute as the partition key, along with a TTL of 12 hours when you do the insert. This will build up tombstones in each partition since you are never deleting the entire row (i.e. a whole minute partition).
Suppose your keyspace is called k1 and your table is called t2, then you can run:
nodetool flush k1 t2
nodetool compact k1 t2
sstable2json /var/lib/cassandra/data/k1/t2/k1-t2-jb-<last version>-Data.db
then you'll see all the tombstones like this (marked with a "d" for deleted):
{"key": "00000003","columns": [["4:","54c7b514",1422374164512000,"d"], ["5:","54c7b518",1422374168501000,"d"], ["6:","54c7b51b",1422374171987000,"d"]]}
Now if you go and delete that row (i.e. delete from k1.t2 where key=3;), then do the flush, compact, and sstable2json again, you'll see it change to:
{"key": "00000003","metadata": {"deletionInfo": {"markedForDeleteAt":1422374340312000,"localDeletionTime":1422374340}},"columns": []}
So you see all the tombstones are gone and Cassandra only has to remember that the whole row was deleted at a certain time instead of little bits and pieces of the row being deleted at certain times.
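The effect can be mimicked in a few lines of Python. This is only a simplified model of what the two sstable2json outputs above show, using the same timestamps; `compact_partition` is a hypothetical name, and real compaction has many more rules.

```python
def compact_partition(cells, marked_for_delete_at=None):
    """Keep only cells written after the row-level deletion, if there is one."""
    if marked_for_delete_at is None:
        return list(cells)
    # Each cell is [name, value, timestamp, flag]; index 2 is the write timestamp.
    return [c for c in cells if c[2] > marked_for_delete_at]

cells = [
    ["4:", "54c7b514", 1422374164512000, "d"],
    ["5:", "54c7b518", 1422374168501000, "d"],
    ["6:", "54c7b51b", 1422374171987000, "d"],
]

print(len(compact_partition(cells)))                                    # 3
print(compact_partition(cells, marked_for_delete_at=1422374340312000))  # []
```

One row-level marker replaces any number of older cell tombstones, which is why deleting whole rows keeps the tombstone count down.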
Another way to get rid of the tombstones is to truncate the whole table. When you do that, Cassandra only needs to remember that the whole table was truncated at a certain time, and so no longer needs to keep tombstones prior to that time (since tombstones are used to tell other nodes that certain data was deleted, and if you can say the whole table was emptied at time x, then the details prior to that no longer matter).
So how could you apply this in your situation? Well, you could use the hour and minute as your partition key, and then once an hour run a cron job to delete all the rows from 13 hours ago. Then on the next compaction, all the tombstones for those partitions would be removed.
Or keep a separate table for each hour, and then truncate the table from 13 hours ago each hour using a cron job.
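The first option (an hourly cron job that deletes whole partitions) might look roughly like this. Everything schema-related is an assumption: the bucket format, the `minute_bucket` column name, and the k1.t2 table are placeholders, and the script only builds the statements rather than executing them.

```python
from datetime import datetime, timedelta

def partitions_to_delete(now, hours_back=13):
    """Return the 60 minute-bucket keys of the hour that lies hours_back before now."""
    hour_start = (now - timedelta(hours=hours_back)).replace(
        minute=0, second=0, microsecond=0
    )
    return [
        (hour_start + timedelta(minutes=m)).strftime("%Y-%m-%d %H:%M")
        for m in range(60)
    ]

def delete_statements(now, keyspace="k1", table="t2", key_column="minute_bucket"):
    """Build one whole-partition DELETE per minute bucket."""
    return [
        f"DELETE FROM {keyspace}.{table} WHERE {key_column} = '{bucket}';"
        for bucket in partitions_to_delete(now)
    ]

stmts = delete_statements(datetime(2015, 1, 26, 23, 30))
print(len(stmts))  # 60
print(stmts[0])    # DELETE FROM k1.t2 WHERE minute_bucket = '2015-01-26 10:00';
```

Each statement removes an entire partition, so Cassandra only has to keep one deletion marker per minute bucket.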
Another strategy that is sometimes useful is to "re-use" clustering keys. For example, if you were inserting data once per second, instead of having a high resolution timestamp as a clustering key, you could use the time modulo 60 seconds as the clustering key and keep the more unique timestamp as just a data field. So within each minute partition you would be changing tombstones (or outdated information) from yesterday back into live rows today, and then you wouldn't accumulate tombstones over many days.
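A small sketch of the key re-use idea, assuming one write per second, UTC time, and a partition key that is the minute of the day. The function name and key layout are illustrative only.

```python
def keys_for(epoch_seconds):
    """Return (minute_of_day, slot): the partition key and the re-usable clustering key."""
    minute_of_day = (epoch_seconds // 60) % 1440   # 0..1439, repeats every day
    slot = epoch_seconds % 60                      # 0..59, second within the minute
    return minute_of_day, slot

today = 1422269499                  # 10:51:39 UTC
tomorrow = today + 24 * 60 * 60     # same time of day, one day later

print(keys_for(today))                         # (651, 39)
print(keys_for(today) == keys_for(tomorrow))   # True
```

Because tomorrow's write lands on the same partition and clustering key, it overwrites the expired cell from today instead of adding a new one next to it; the full timestamp would be stored as an ordinary data column.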
So hopefully that gives you some ideas for things to try. Usually when you run into a tombstone problem, it's a sign that you need to re-think your schema a little bit.
I have a similar issue, only in my case there is just a single table that refuses to shrink (old files are not deleted and their storage space keeps growing). I ran nodetool compactionstats and saw that there are a lot of pending compaction tasks.
Another interesting thing: for the problematic table, nodetool compactionstats always showed compactions of type Compaction, never of type Tombstone Compaction, as opposed to the tables that behaved well.
Could that be the problem?