We have a three-node Cassandra cluster with RF 3. There is a table using SizeTieredCompactionStrategy. In some cases, performing a major compaction with nodetool compact --split-output -- <keyspace> <table> on this table doesn't free up disk space, but performing nodetool garbagecollect -- <keyspace> <table> does. gc_grace_seconds is set to 1 hour and default_time_to_live is set to 3 hours:
CREATE TABLE keyspace.table (
id text PRIMARY KEY,
json text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 10800
AND gc_grace_seconds = 3600
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Does anyone know the reason?
Thanks in advance!
Nodetool garbagecollect performs single-sstable compaction, so it can shrink the size of individual files on disk. Garbagecollect has been available since Cassandra 3.10 and removes deleted partitions and rows by default. If you specify -g CELL, it will also remove overwritten or deleted cells.
Nodetool compact combines several (typically four) smaller sstables together while also cleaning up overwritten and expired data. Size-tiered compaction requires min_threshold similar-sized sstables before it will combine them.
Compaction may also look at an estimate of the number of droppable tombstones in an sstable, and compact a single sstable if that ratio is above tombstone_threshold (0.2, i.e. 20%, by default).
The documentation on compact states:
...triggers a minor compaction when there are a number of similar sized SSTables on disk as configured by the table subproperty, min_threshold.
DSE 6.7 nodetool compact
Thus garbagecollect will always run, but compact will ignore sstables when min_threshold (default 4) isn't satisfied and the droppable-tombstone ratio is not very high. Also, garbagecollect requires less free disk space to run.
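For reference, a minimal sketch of the commands involved (keyspace, table, and sstable path are placeholders, not from the question):

# Single-sstable compaction; -g CELL also removes overwritten/deleted cells
nodetool garbagecollect -g CELL -- my_keyspace my_table

# Major compaction; only combines sstables when the strategy's conditions are met
nodetool compact --split-output -- my_keyspace my_table

# Inspect the droppable-tombstone estimate of one sstable
# (exact output wording can vary by version)
sstablemetadata /path/to/my_table-Data.db | grep -i droppable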
Related
When doing a repair on a Cassandra node, I sometimes see a lot of tombstone logs. The error looks like this:
org.apache.cassandra.db.filter.TombstoneOverwhelmingException: Scanned over 100001 tombstone rows during query 'SELECT * FROM my_keyspace.table_foo WHERE token(<my params>) >= token(<my params>) AND token(<my params>) <= 2988334221698479200 LIMIT 2147385647' (last scanned row partition key was ((<my params>), 7c650d21-797e-4476-93d5-b1248e187f22)); query aborted
I have read here that tombstones are inserted as a way to mark a record as deleted. However, I don't see any code in this project that runs a delete on this table - just a read and an insert. What am I missing - how can I prevent these TombstoneOverwhelmingExceptions?
Here is the table definition:
CREATE TABLE my_keyspace.table_foo (
foo1 text,
year int,
month int,
foo2 text,
PRIMARY KEY ((foo1, year, month), foo2)
) WITH CLUSTERING ORDER BY (foo2 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 6912000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99PERCENTILE';
However, I don't see any code in this project that runs a delete on this table - just a read and an insert.
The code might not be running DELETEs, but the table definition tells Cassandra to delete anything >= 80 days old. TTLs create tombstones.
AND default_time_to_live = 6912000
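In other words, with that setting every write behaves as if it carried an explicit TTL; as an illustration (the row values here are made up):

INSERT INTO my_keyspace.table_foo (foo1, year, month, foo2)
VALUES ('abc', 2024, 1, 'xyz')
USING TTL 6912000;  -- expires (and tombstones) after 80 days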
The thought behind TTLs in a time-series model is that rows are typically ordered by timestamp in descending order. Most use cases tend to care only about recent data, and the descending order by timestamp causes the tombstones to end up at the "bottom" of the partition, where they are rarely (if ever) queried.
To create that effect, you'd need to create a new table with a definition something like this:
PRIMARY KEY ((foo1, year, month), created_time, foo2)
) WITH CLUSTERING ORDER BY (created_time DESC, foo2 ASC)
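Spelled out as a complete definition, a minimal sketch (the new table name and the created_time column are illustrative, not from the original schema):

CREATE TABLE my_keyspace.table_foo_by_time (
    foo1 text,
    year int,
    month int,
    created_time timestamp,
    foo2 text,
    PRIMARY KEY ((foo1, year, month), created_time, foo2)
) WITH CLUSTERING ORDER BY (created_time DESC, foo2 ASC)
    AND default_time_to_live = 6912000;

With this layout, queries for recent data read from the "top" of the partition and rarely touch the expired rows at the bottom.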
@anthony, here is my point of view.
As a first step, don't let tombstones be inserted into the table.
Use the full primary key during the read path so we skip having to read the tombstones. Data modeling is key: design the tables around the access patterns required on the read side.
We could adjust min_threshold, setting it to 2, to do more aggressive tombstone eviction.
Similarly, we could tweak common compaction options (e.g. setting unchecked_tombstone_compaction to true, or other properties) to evict them faster; see the sketch below.
I would encourage you to view a similar question and the answers that are documented here
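As a hedged sketch of those last two options combined (the table name is taken from the question; the values are starting points, not recommendations):

ALTER TABLE my_keyspace.table_foo
WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': '2',
    'unchecked_tombstone_compaction': 'true'
};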
I am using Cassandra 3.6, and this is the table definition:
CREATE TABLE sg.products (
date_updated text,
time_added int,
id text,
best_seller text,
company text,
PRIMARY KEY (date_updated, time_added, id)
) WITH CLUSTERING ORDER BY (time_added ASC, id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
The table has millions of rows.
I dropped the best_seller column from the products table, and the drop succeeded.
But when I checked the disk space, it had not decreased.
So I searched on Google and found the term "tombstone":
Cassandra was not deleting the data right away; it was keeping it behind tombstones.
Now my question is: how do I delete the tombstoned data so I can reclaim the disk space?
Or is there any other way to reclaim the space?
Thanks in advance.
Tombstones drop
Cassandra will fully drop those tombstones when a compaction triggers, but only after local_delete_time + gc_grace_seconds (as defined on the table the data belongs to) has passed. Remember that all the nodes are supposed to have been repaired within gc_grace_seconds to ensure correct distribution of the tombstones and prevent deleted data from reappearing.
See this line from your table definition:
AND gc_grace_seconds = 864000
That is the period for which tombstones live: 864000 seconds == 10 days. Tombstones exist for that duration to allow adequate time for them to be distributed to the other nodes in your cluster, so that all of the other nodes are aware of the delete(s) and do not return the obsoleted values.
Once that 10 day period has passed, and the next time this table triggers compaction (after that 10 days), the tombstones will be removed.
Note that you can shorten that period by modifying that property on your table definition. Just make sure that you're running repairs within that timeframe.
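For example, to shorten the window to one day (a sketch; only safe if repairs complete within that day):

ALTER TABLE sg.products WITH gc_grace_seconds = 86400;  -- 1 day instead of 10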
I have 2 Cassandra nodes. I have a table with 3 text fields (all keys) and 2 counters. RF is 2.
I added another counter column to the table. Mistakenly, I issued a DROP on that column. I reverted the application back to the old version so it would not use the column.
I then added another counter column with a different name to replace the dropped one, and changed the application to use that new column.
Now, all my queries that have where clause fail with this error:
ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures" info={'failures': 2, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
And I see this in debug.log:
java.lang.IllegalStateException: [payable, revenue, to_pay, value] is not a subset of [revenue to_pay value]
payable is the old column that was dropped, and to_pay is the new column.
What is happening?
Cassandra version is 3.11.
PS. I tried repairing, and it is running. Will it help?
EDIT:
Table schema:
CREATE TABLE backend_platform_prod.stats_counters (
date text,
key text,
revenue counter,
to_pay counter,
value counter,
PRIMARY KEY (date, key)
) WITH CLUSTERING ORDER BY (key ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
There was a payable counter field, that is dropped.
I tried backing up the data with COPY TO, dropping the table and recreating it, and restoring with COPY FROM. It is working now, although some data seems to be missing (not very important).
I see payable column in system_schema.dropped_columns, but not in system_schema.columns.
Please check in the system_schema keyspace. If the column is still present there, delete that row.
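A quick way to check (keyspace and table names taken from the question):

SELECT * FROM system_schema.dropped_columns
WHERE keyspace_name = 'backend_platform_prod'
    AND table_name = 'stats_counters';

If a row for payable shows up there, that is the stale dropped-column record this answer suggests removing.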
We have the below table with a TTL of 24 hours (1 day). We have a four-node Cassandra 3.0 cluster, and Spark processes this table. Once processing is done, it truncates all the data in the table and a new batch of data is inserted. This is a continuous process.
The problem I am seeing is that we are getting more tombstones because data is truncated frequently, every day, after Spark finishes processing.
If I leave gc_grace_seconds at the default, there will be more tombstones. If I reduce gc_grace_seconds to 1 day, will it be an issue? Even if I run repair on that table every day, will that be enough?
How should I approach this problem? I know frequent deletes are an antipattern in Cassandra; is there any other way to solve this issue?
CREATE TABLE b.stag (
xxxid bigint PRIMARY KEY,
xxxx smallint,
xx smallint,
xxr int,
xxx text,
xxx smallint,
exxxxx smallint,
xxxxxx tinyint,
xxxx text,
xxxx int,
xxxx text,
xxxxx text,
xxxxx timestamp
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 86400
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Thank you.
A truncate of a table should not create tombstones, so when you say "truncating" I assume you mean deleting. You can, as you have already mentioned, lower the gc_grace_seconds value; however, this means you have a smaller window for repairs to run to reconcile the data and make sure each node has the right tombstone for a given key, or old data could reappear. It's a trade-off.
However, to be fair, if you are clearing out the table each time, why not use the TRUNCATE command? This way you'll flush the table with no tombstones.
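For example (note that with auto_snapshot enabled, which is the default, TRUNCATE takes a snapshot that still holds disk space until it is cleared; the clearsnapshot syntax varies slightly by version):

TRUNCATE TABLE b.stag;

nodetool clearsnapshot b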
We have a Cassandra setup in production. There are a couple of tables with around 20M records each. To reduce the number of records, we deleted the unwanted ones and also set a TTL to remove data after some time. We have now set the grace period to 1 day. We have also run nodetool repair on each Cassandra node (one at a time). We have 5 nodes in total in the cluster, with a replication_factor of 3. The Cassandra version is 2.1.14.
In Cassandra log I constantly see the below error:
WARN [SharedPool-Worker-33] 2017-02-23 06:09:02,617 SliceQueryFilter.java:320 - Read 207 live and 3059 tombstone cells in event for key: 101:10001Njh:22017 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
I ran the command nodetool cfhistograms mykeyspace event; below is the output of the same.
I am not able to fully analyze the above output, but I do know the sstable count is too high.
Any idea as to what we can do to fix this or optimize our Cassandra setup?
The Java heap size is set to 8 GB and we are using CMS garbage collection.
Output of nodetool cfstats mykeyspace.event
Table Structure
@chris-lohfink - Updated the question with the cfstats details and the table structure.
CREATE TABLE vcs.events (
v_id text,
c_id text,
e_month int,
sid text,
e_id timeuuid,
cr_p_id text,
e_bucket text,
e_media map<text, text>,
e_meta map<text, text>,
e_met map<text, double>,
tag set<text>,
etime timestamp,
etype text,
isfin boolean,
r_mode text,
state text,
PRIMARY KEY ((v_id, c_id, e_month), sid, e_id)
) WITH CLUSTERING ORDER BY (sid ASC, e_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX events_id_idx ON mykeyspace.event (e_id);
CREATE INDEX events_type_idx ON mykeyspace.event (etype);
CREATE INDEX events_finalized_idx ON mykeyspace.event (isfin);
CREATE INDEX idx_state ON mykeyspace.event (state);
When you delete data in Cassandra, the data is not removed immediately; instead, Cassandra creates tombstones indicating the row/column is deleted. Tombstones are stored until gc_grace_seconds has elapsed.
In your case you have 300K records deleted daily, which indicates that many tombstones are being created and affecting your performance. You should rework your data model to avoid these errors.
See slides 34 to 42, about deletes and TTLs, in http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252
Also see the impact of data models on tombstones in the DataStax blog post below:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets