I have a single-node Cassandra cluster. I use the current minute as the partition key and insert rows with a TTL of 12 hours.
I see a couple of issues I can't explain:
The /var/lib/cassandra/data/<key_space>/<table_name> directory contains multiple files, and many of them are really old (way older than 12 hours, something like 2 days).
When I try to perform a query in cqlsh, I get a lot of log messages that seem to indicate that my table contains lots of tombstones.
log:
WARN [SharedPool-Worker-2] 2015-01-26 10:51:39,376 SliceQueryFilter.java:236 - Read 0 live and 1571042 tombstoned cells in <table_name>_name (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:40,472 SliceQueryFilter.java:236 - Read 0 live and 1557919 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:41,630 SliceQueryFilter.java:236 - Read 0 live and 1589764 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:42,877 SliceQueryFilter.java:236 - Read 0 live and 1582163 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,081 SliceQueryFilter.java:236 - Read 0 live and 1550989 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:44,869 SliceQueryFilter.java:236 - Read 0 live and 1566246 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:45,582 SliceQueryFilter.java:236 - Read 0 live and 1577906 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:46,443 SliceQueryFilter.java:236 - Read 0 live and 1571493 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:47,701 SliceQueryFilter.java:236 - Read 0 live and 1559448 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-2] 2015-01-26 10:51:49,255 SliceQueryFilter.java:236 - Read 0 live and 1574936 tombstoned cells in <table_name> (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
I've tried multiple compaction strategies and multithreaded compaction, I've tried running compaction manually with nodetool, and I've also tried forcing garbage collection with JMX.
One of my guesses is that compaction doesn't delete the tombstone files.
Any ideas how to keep disk usage from growing too large? My biggest concern is running out of space. I'd rather store less (by making the TTL smaller, but currently that doesn't help).
Tombstones are preserved for 10 days under the default configuration. This ensures that nodes that were offline can catch up with deletes when they rejoin the cluster. You can change this period with the table's gc_grace_seconds setting.
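As a rough illustration of why files can be much older than the TTL, the sketch below computes an upper bound on when a TTL'd cell becomes eligible for removal. It assumes the default gc_grace_seconds of 864000 (10 days); the function name is made up for this example, and actual removal still needs a compaction that includes the SSTable holding the data.

```python
from datetime import datetime, timedelta

DEFAULT_GC_GRACE_SECONDS = 864_000  # assumed default: 10 days


def purge_eligible_by(written_at, ttl_seconds,
                      gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Upper bound on when an expired cell may be dropped by compaction.

    The cell is live until its TTL runs out, and the resulting tombstone is
    kept for at most gc_grace_seconds after that.
    """
    expires_at = written_at + timedelta(seconds=ttl_seconds)
    return expires_at + timedelta(seconds=gc_grace_seconds)


written = datetime(2015, 1, 26, 10, 0, 0)
eligible = purge_eligible_by(written, ttl_seconds=12 * 3600)
print(eligible - written)  # 10 days, 12:00:00
```

So with a 12 hour TTL and the default grace period, seeing files that are 2 days old is expected rather than a sign that compaction is broken.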
I'm assuming you are using the timestamp as a clustering column within each partition when you say you are using the minute as the partition key, along with a TTL of 12 hours when you do the insert. This will build up tombstones in each partition since you are never deleting the entire row (i.e. a whole minute partition).
Suppose your keyspace is called k1 and your table is called t2, then you can run:
nodetool flush k1 t2
nodetool compact k1 t2
sstable2json /var/lib/cassandra/data/k1/t2/k1-t2-jb-<last version>-Data.db
then you'll see all the tombstones like this (marked with a "d" for deleted):
{"key": "00000003","columns": [["4:","54c7b514",1422374164512000,"d"], ["5:","54c7b518",1422374168501000,"d"], ["6:","54c7b51b",1422374171987000,"d"]]}
Now if you go and delete that row (i.e. delete from k1.t2 where key=3;), then do the flush, compact, and sstable2json again, you'll see it change to:
{"key": "00000003","metadata": {"deletionInfo": {"markedForDeleteAt":1422374340312000,"localDeletionTime":1422374340}},"columns": []}
So you see all the tombstones are gone and Cassandra only has to remember that the whole row was deleted at a certain time instead of little bits and pieces of the row being deleted at certain times.
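If you want a number instead of eyeballing the dump, a small script can count the "d" markers. This is only a sketch: it assumes the sstable2json output has been read in as a JSON list of rows shaped like the two samples above, and count_deleted_cells is a name made up for this example.

```python
import json


def count_deleted_cells(rows):
    """Count columns whose fourth field is the "d" (deleted) flag."""
    return sum(
        1
        for row in rows
        for col in row.get("columns", [])
        if len(col) > 3 and col[3] == "d"
    )


# The two sample rows from above, wrapped in a list.
before = json.loads(
    '[{"key": "00000003","columns": [["4:","54c7b514",1422374164512000,"d"], '
    '["5:","54c7b518",1422374168501000,"d"], '
    '["6:","54c7b51b",1422374171987000,"d"]]}]'
)
after = json.loads(
    '[{"key": "00000003","metadata": {"deletionInfo": '
    '{"markedForDeleteAt":1422374340312000,"localDeletionTime":1422374340}},'
    '"columns": []}]'
)

print(count_deleted_cells(before), count_deleted_cells(after))  # 3 0
```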
Another way to get rid of the tombstones is to truncate the whole table. When you do that, Cassandra only needs to remember that the whole table was truncated at a certain time, and so no longer needs to keep tombstones prior to that time (since tombstones are used to tell other nodes that certain data was deleted, and if you can say the whole table was emptied at time x, then the details prior to that no longer matter).
So how could you apply this in your situation? Well, you could use the hour and minute as your partition key, and then once an hour run a cron job to delete all the rows from 13 hours ago. Then on the next compaction, all the tombstones for that partition would be removed.
Or keep a separate table for each hour, and then truncate the table from 13 hours ago each hour using a cron job.
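For the first option (deleting whole partitions from a cron job), the script mostly has to work out which partition keys belong to the hour that is now 13 hours old. The sketch below only builds the CQL strings; the "HHMM" text encoding of the key, the key column name and the k1.t2 table name are assumptions carried over from the example above, not something your schema necessarily uses.

```python
from datetime import datetime, timedelta


def partitions_to_delete(now, hours_back=13):
    """Return the 60 minute-partition keys ("HHMM") of the hour hours_back ago."""
    hour_start = (now - timedelta(hours=hours_back)).replace(
        minute=0, second=0, microsecond=0
    )
    return [
        (hour_start + timedelta(minutes=m)).strftime("%H%M") for m in range(60)
    ]


def delete_statements(now, table="k1.t2"):
    """Build one whole-partition DELETE per minute partition (hypothetical schema)."""
    return [
        f"DELETE FROM {table} WHERE key = '{key}';"
        for key in partitions_to_delete(now)
    ]


statements = delete_statements(datetime(2015, 1, 27, 16, 30))
print(statements[0])   # DELETE FROM k1.t2 WHERE key = '0300';
print(len(statements)) # 60
```

Each statement removes an entire partition, so Cassandra only has to keep one partition-level deletion marker instead of one tombstone per cell.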
Another strategy that is sometimes useful is to "re-use" clustering keys. For example, if you were inserting data once per second, instead of having a high resolution timestamp as a clustering key, you could use the time modulo 60 seconds as the clustering key and keep the more unique timestamp as just a data field. So within each minute partition you would be changing tombstones (or outdated information) from yesterday back into live rows today, and then you wouldn't accumulate tombstones over many days.
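To make the key re-use idea concrete, here is a small simulation with a plain dict standing in for the table. It assumes the partition key is the minute of the day (so the same partition comes around again every 24 hours) and the clustering key is the second within that minute; both choices and the names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def primary_key(epoch_seconds):
    """Map a timestamp to (partition key, clustering key)."""
    minute_of_day = (epoch_seconds % SECONDS_PER_DAY) // 60  # partition key
    second_slot = epoch_seconds % 60                         # clustering key
    return minute_of_day, second_slot


# A dict stands in for the table: a write to an existing key overwrites it.
table = {}
start = 1_422_316_800  # midnight UTC, 2015-01-27
for ts in range(start, start + 3 * SECONDS_PER_DAY):  # one write per second, 3 days
    table[primary_key(ts)] = {"written_at": ts}  # full timestamp kept as data

print(len(table))  # 86400
```

After three days of writes the number of distinct cells is still one day's worth, because each write lands on a slot that yesterday's (by now expired) cell occupied.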
So hopefully that gives you some ideas for things to try. Usually when you run into a tombstone problem, it's a sign that you need to re-think your schema a little bit.
I have a similar issue, only in my case there was just a single table that refused to shrink (old files are not deleted and their storage space keeps growing). I used nodetool compactionstats and saw that there are a lot of pending compaction tasks.
Another interesting thing is that nodetool compactionstats always showed compactions of type Compaction for the problematic table, but never of type Tombstone Compaction, as opposed to the tables that behaved well.
Could it be the problem?
I do a select with tracing ON and see:
Skipped 0/1 non-slice-intersecting sstables
included 0 due to tombstones [ReadStage-<N>]
So is it working to ignore tombstones? The trace:
Read 0 live rows and 2 tombstone cells
is clear: it is reading tombstones
Let's say there was a Column A.
You added value x to Column A.
Then you deleted Column A.
Instead of immediately deleting value x, Cassandra will add a marker for Column A which is called tombstone. The tombstone is also an individual record in itself just like the original value x.
Let's say the two updates were written in different sstables (Cassandra storage).
Now when you are reading the value, Cassandra will get the value x and the tombstone for Column A. It will see that tombstone was written after the value x so it will not return any value.
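The read-time merge described above can be sketched in a few lines. This is a simplified model, not Cassandra's actual code: each SSTable contributes at most one version of the column, the version with the newest write timestamp wins, and a tombstone is modelled as a value of None.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    timestamp: int         # write time
    value: Optional[str]   # None stands for a tombstone


def reconcile(versions):
    """Return the visible value of a column, or None if it is deleted.

    The newest timestamp wins; on a tie the tombstone wins.
    """
    newest = max(versions, key=lambda c: (c.timestamp, c.value is None))
    return newest.value


from_sstable_1 = Cell(timestamp=100, value="x")   # the original write
from_sstable_2 = Cell(timestamp=200, value=None)  # the later delete

print(reconcile([from_sstable_1, from_sstable_2]))  # None
```

Both versions still had to be read from disk to reach that answer, which is why tombstones show up in traces and warnings even though they never appear in the result.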
Skipped 0/1 non-slice-intersecting sstables
included 0 due to tombstones
This is basically confirming the same.
Based on talking to some Cassandra admins:
- Skipping sstables is Cassandra telling us it eliminated the tombstones efficiently; this is OK.
- Deleting everything in a partition in general helps ensure Cassandra is not bogged down with tombstones.
My idea is to add rows to Cassandra with TTL = 15 minutes so I'll be able to load realtime data (now - 15 minutes, now) w/o storing timestamps etc. My concern is that the rows with expiring TTL will be marked as tombstone (not actually deleted). I.e., will they count when I run select count(*) from realtime_table?
No, tombstoned rows won't be returned as a result - they will be skipped when reading the data.
But if you are actively expiring data, you may need to tune gc_grace_seconds; otherwise you can accumulate too many unpurged tombstones, and in some cases you will start to get warnings or errors during reads that have to skip many tombstones (controlled by the tombstone_warn_threshold and tombstone_failure_threshold options in cassandra.yaml).
Here is a very good blog post that describes how data is deleted and cleaned up.
But select count(*) from table is a real antipattern in Cassandra; you need to consider correct modelling of your data with partitions, etc.
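To show how the two thresholds relate, here is a toy classifier for a single read. The default values (1000 and 100000) are what I believe cassandra.yaml ships with, and the strict "greater than" comparison is an assumption; check your own cassandra.yaml before relying on either. The function name is made up for this example.

```python
TOMBSTONE_WARN_THRESHOLD = 1_000       # assumed cassandra.yaml default
TOMBSTONE_FAILURE_THRESHOLD = 100_000  # assumed cassandra.yaml default


def classify_read(tombstones_scanned,
                  warn=TOMBSTONE_WARN_THRESHOLD,
                  fail=TOMBSTONE_FAILURE_THRESHOLD):
    """Return "ok", "warn" or "fail" for one read, by tombstones scanned."""
    if tombstones_scanned > fail:
        return "fail"   # the read is aborted with an error
    if tombstones_scanned > warn:
        return "warn"   # the read succeeds but a WARN line is logged
    return "ok"


for count in (500, 6_251, 150_000):
    print(count, classify_read(count))
```

With a 15 minute TTL and a steady write rate, the number of expired-but-unpurged cells a query has to skip depends on how long they stay around, which is why the grace period matters here.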
I use Cassandra 3.0.12.
And I have a cassandra Column Family, or CQL table with the following schema:
CREATE TABLE win30 (
    cust_id text,
    tid timeuuid,
    info text,
    PRIMARY KEY (cust_id, tid)
) WITH CLUSTERING ORDER BY (tid DESC)
  AND compaction = {'class': 'DateTieredCompactionStrategy', 'max_sstable_age_days': 31};
alter table win30 with default_time_to_live = '2592000';
I have set the default_time_to_live property for the entire table, but when I query the table,
select * from win30 order by tid desc limit 9999
Cassandra warns that
Read xx live rows and xxxx tombstone for query xxxxxx (see tombstone_warn_threshold).
According to this doc How is data deleted,
Cassandra allows you to set a default_time_to_live property for an
entire table. Columns and rows marked with regular TTLs are processed
as described above; but when a record exceeds the table-level TTL,
Cassandra deletes it immediately, without tombstoning or compaction.
"but when a record exceeds the table-level TTL,Cassandra deletes it immediately, without tombstoning or compaction."
Why does Cassandra still warn about tombstones when I have set default_time_to_live?
I insert data with CQL like the following, without specifying a TTL:
insert into win30 (cust_id, tid, info ) values ('123', now(), 'sometext');
There is a similar question, but it does not use default_time_to_live.
And it seems that I could set the unchecked_tombstone_compaction to true?
Another question: I select data with the same ordering as the CLUSTERING ORDER,
so why does Cassandra hit so many tombstones?
Why does Cassandra still warn about tombstones when I have set default_time_to_live?
The way TTL works in Cassandra is that once a record has expired, it is marked as a tombstone (the same process as deleting a record). So instead of running a manual purge job as in the RDBMS world, Cassandra lets you clean up old records based on their TTL. But expiry still follows the same process as DELETE, hence the tombstones. Since your TTL value is '2592000' (30 days), anything older than 30 days in the table expires (is marked as a tombstone, i.e. deleted).
Now the reason for the warning is that your SELECT statement is looking for records that are alive (non-deleted), and the warning message reports how many tombstoned (expired or deleted) records were encountered in the process. So while trying to serve 9999 live records, the read hit X tombstones along the way.
Since the TTL is set at table level, any record inserted into this table will have a default TTL of 30 days.
Here is the documentation reference, in case you want to read more.
After the number of seconds since the column's creation exceeds the TTL value, TTL data is considered expired and is not included in results. Expired data is marked with a tombstone on the next read on the read path, but it remains for a maximum of gc_grace_seconds.
Above reference is from this link
And it seems that I could set the unchecked_tombstone_compaction to true?
It is not related to the warning that you are getting. You could think about reducing the gc_grace_seconds value (default 10 days) to get rid of tombstones more quickly. But there is a reason this value is 10 days.
Note that DateTieredCompactionStrategy is deprecated, and once you upgrade to Apache Cassandra 3.11 or DSE 5.1.2 there is TimeWindowCompactionStrategy, which does a better job of handling tombstones.
I'm new to operating Cassandra clusters. We have a 1-DC, 14-node production cluster running on DCE v.2.1.15.
The system log shows many tombstone (TS) warnings like the ones below, and we are wondering whether this is okay, whether it is due to the application's nature or to a too-low threshold (default TS warn level=5000), and whether we ought to run manual compactions between our nightly repairs (every node gets repaired once a week), raise the TS warn level, ...
Hints appreciated!
WARN [SharedPool-Worker-1] 2016-08-18 11:45:02,536 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_RecentIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-3] 2016-08-18 11:45:02,548 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_MessageFlagsIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 1 columns were requested, slices=[1-1:!]
WARN [SharedPool-Worker-2] 2016-08-18 11:45:04,566 SliceQueryFilter.java:320 - Read 1 live and 1123 tombstone cells in KeyspaceMetadata.CF_UIDIndex for key: 3230303230315f299f8d3ae0c011e5933d7775140b84c3 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-2] 2016-08-18 11:45:11,853 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_RecentIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-2] 2016-08-18 11:45:11,864 SliceQueryFilter.java:320 - Read 0 live and 6251 tombstone cells in KeyspaceMetadata.CF_MessageFlagsIndex for key: 3230303230305febd8fc98e0bf11e5b870502699f4d249 (see tombstone_warn_threshold). 1 columns were requested, slices=[1-1:!]
WARN [SharedPool-Worker-1] 2016-08-18 11:46:09,624 SliceQueryFilter.java:320 - Read 2 live and 2537 tombstone cells in KeyspaceMetadata.CF_TimeIndex for key: 3230303030385ffebcbd200d9411e6b9750c94c36d1038 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-3] 2016-08-18 11:47:31,434 SliceQueryFilter.java:320 - Read 2 live and 2544 tombstone cells in KeyspaceMetadata.CF_TimeIndex for key: 3230303030345f6b87b24afbe111e5bf7f828e02f15dd6 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
WARN [SharedPool-Worker-1] 2016-08-18 11:49:13,870 SliceQueryFilter.java:320 - Read 3 live and 2540 tombstone cells in KeyspaceMetadata.CF_TimeIndex for key: 3230303030355f533d997cfbdf11e5985390948f56b8a7 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
Hi Steffen, please go through your OpsCenter and see how many tables have more than 5000 tombstones (TS). If the number of tables is small, running manual compactions on those particular tables will be a good solution. As you mentioned that you are repairing once a week, I would also suggest checking the keyspace's data model to see why it is causing such a high number of tombstones.
If you are not using OpsCenter, you can check the number of tombstones with:
sstable2json full_path | grep \"t\"| wc -l
I have a use case where I need to constantly listen to a Kafka topic and write to 2000 column families (15 columns each, time series data) based on a column value from a Spark Streaming app. I have a local Cassandra installation set up. Creating these column families takes around 1.5 hours on a CentOS VM using 3 cores and 12 GB of RAM. In my Spark Streaming app I'm doing some preprocessing before storing these stream events in Cassandra. I'm running into issues with the amount of time it takes for my streaming app to complete this.
I was trying to save 300 events to multiple column families (roughly 200-250) based on key, and my app takes around 10 minutes to save them. This seems strange, as printing these events to the screen grouped by key takes less than a minute; it is only slow when I save them to Cassandra.
I have had no issues saving on the order of 3 million records to Cassandra. It took less than 3 minutes (but this was to a single column family).
My requirement is to be as real-time as possible, and this seems to be nowhere close. The production environment would have roughly 400 events every 3 seconds.
Is there any tuning that I need to do in the Cassandra YAML file, or any changes to the cassandra-connector itself?
INFO 05:25:14 system_traces.events 0,0
WARN 05:25:14 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:14 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN 05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN 05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO 05:25:16 ParNew GC in 340ms. CMS Old Gen: 1308020680 -> 1454559048; Par Eden Space: 251658240 -> 0;
WARN 05:25:16 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:16 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN 05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN 05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO 05:25:17 ParNew GC in 370ms. CMS Old Gen: 1498825040 -> 1669094840; Par Eden Space: 251658240 -> 0;
WARN 05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:18 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN 05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN 05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN 05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO 05:25:19 ParNew GC in 382ms. CMS Old Gen: 1714792864 -> 1875460032; Par Eden Space: 251658240 -> 0;
I suspect you're hitting edge cases in Cassandra related to the large number of CFs/columns defined in the schema. Typically when you see tombstone warnings, it's because you've messed up the data model. However, these are in system tables, so evidently you've done something to the tables that the authors didn't expect (lots and lots of tables, and probably dropping and recreating them a lot).
Those warnings were added because scanning past tombstones looking for live columns causes memory pressure, which causes GC, which causes pauses, which causes slowness.
Can you squish the data into significantly fewer column families? You may also want to try clearing out the tombstones (drop gc_grace_seconds for that table to zero, run a major compaction on system if that's allowed, then raise it back to the default).
You can refer to this blog for Spark-Cassandra connector tuning. It will give you an idea of the performance numbers you can expect. You can also try another open source product, SnappyData, which is a Spark database and may give you very high performance for your use case.