Getting Cassandra tombstone_warn_threshold error - cassandra

We run Cassandra in production. A couple of tables hold around 20M records each. To reduce the number of records, we deleted the unwanted ones and also set up a TTL so that data is removed after some time. We have now set the grace period to 1 day. We also ran nodetool repair on each Cassandra node (one at a time). The cluster has 5 nodes in total with a replication_factor of 3. The Cassandra version is 2.1.14.
In the Cassandra log I constantly see the warning below:
WARN [SharedPool-Worker-33] 2017-02-23 06:09:02,617 SliceQueryFilter.java:320 - Read 207 live and 3059 tombstone cells in event for key: 101:10001Njh:22017 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
I ran the command nodetool cfhistograms mykeyspace event; below is its output.
I am not able to analyze the above output fully, but I do know the sstable count is too high.
Any idea what we can do to fix this or otherwise optimize our Cassandra setup?
The Java heap size is set to 8 GB and we are using CMS garbage collection.
Output of nodetool cfstats mykeyspace.event
Table Structure
#chris-lohfink - Updated the question with the cfstats details and
CREATE TABLE vcs.events (
v_id text,
c_id text,
e_month int,
sid text,
e_id timeuuid,
cr_p_id text,
e_bucket text,
e_media map<text, text>,
e_meta map<text, text>,
e_met map<text, double>,
tag set<text>,
etime timestamp,
etype text,
isfin boolean,
r_mode text,
state text,
PRIMARY KEY ((v_id, c_id, e_month), sid, e_id)
) WITH CLUSTERING ORDER BY (sid ASC, e_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX events_id_idx ON mykeyspace.event (e_id);
CREATE INDEX events_type_idx ON mykeyspace.event (etype);
CREATE INDEX events_finalized_idx ON mykeyspace.event (isfin);
CREATE INDEX idx_state ON mykeyspace.event (state);

When you delete data in Cassandra, it is not removed immediately. Instead, Cassandra writes tombstones marking the row or column as deleted. Tombstones are kept at least until gc_grace_seconds has elapsed and are only removed by a later compaction.
In your case you have 300K records deleted daily, which means many tombstones are created and are affecting your read performance. You should rework your data model to avoid these warnings.
See the slides from 34 to 42 about deletes and TTL in http://www.slideshare.net/planetcassandra/8-axel-liljencrantz-23204252
Also see the impact of data models on tombstones in the DataStax blog post below:
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
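To make the warning from the question concrete, here is a minimal Python sketch that pulls the live and tombstone counts out of such a log line and compares them with the thresholds. The values 1000 and 100000 are the defaults for tombstone_warn_threshold and tombstone_failure_threshold in cassandra.yaml; the function name classify_read and the exact comparison are assumptions made for illustration, not Cassandra's own code.

```python
import re

# Default values from cassandra.yaml; your cluster may override them.
TOMBSTONE_WARN_THRESHOLD = 1000
TOMBSTONE_FAILURE_THRESHOLD = 100000

# Matches the "Read N live and M tombstone cells" part of the warning.
LOG_PATTERN = re.compile(r"Read (\d+) live and (\d+) tombstone cells")


def classify_read(log_line):
    """Return (live, tombstones, level) for a tombstone log line, or None."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    live = int(match.group(1))
    tombstones = int(match.group(2))
    if tombstones > TOMBSTONE_FAILURE_THRESHOLD:
        level = "fail"   # the query would be aborted
    elif tombstones > TOMBSTONE_WARN_THRESHOLD:
        level = "warn"   # the query succeeds but is logged
    else:
        level = "ok"
    return live, tombstones, level


line = ("Read 207 live and 3059 tombstone cells in event for key: "
        "101:10001Njh:22017 (see tombstone_warn_threshold).")
print(classify_read(line))
```

For the line in the question, the read touched roughly 15 tombstones for every live cell, which is why it is logged even though the query still succeeds.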

Related

Why do I sometimes have 10,000+ tombstones when I don't do DELETEs?

When doing a repair on a Cassandra node, I sometimes see a lot of tombstone logs. The error looks like this:
org.apache.cassandra.db.filter.TombstoneOverwhelmingException: Scanned over 100001 tombstone rows during query 'SELECT * FROM my_keyspace.table_foo WHERE token(<my params>) >= token(<my params>) AND token(<my params>) <= 2988334221698479200 LIMIT 2147385647' (last scanned row partition key was ((<my params>), 7c650d21-797e-4476-93d5-b1248e187f22)); query aborted
I have read here that tombstones are inserted as a way to mark a record as deleted. However, I don't see any code in this project that runs a delete on this table - just a read and an insert. What am I missing - how can I prevent these TombStoneOverwhelmingExceptions?
Here is the table definition:
CREATE TABLE my_keyspace.table_foo(
foo1 text,
year int,
month int,
foo2 text,
PRIMARY KEY ((foo1, year, month), foo2)
) WITH CLUSTERING ORDER BY (foo2 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 6912000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99PERCENTILE';
However, I don't see any code in this project that runs a delete on this table - just a read and an insert.
The code might not be running DELETEs, but the table definition tells Cassandra to delete anything >= 80 days old. TTLs create tombstones.
AND default_time_to_live = 6912000
The idea behind TTLs in a time series model is that rows are typically clustered by timestamp in descending order. Most use cases care only about recent data, so the descending order leaves the tombstones at the "bottom" of the partition, where they are rarely (if ever) queried.
To create that effect, you'd need to create a new table with a definition something like this:
PRIMARY KEY ((foo1, year, month), created_time, foo2)
) WITH CLUSTERING ORDER BY (created_time DESC, foo2 ASC)
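The effect of the clustering order can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: a partition is modelled as a list of booleans (True for a live row, False for an expired one), and a read walks the list in clustering order until it has collected LIMIT live rows. The function name cells_scanned and the 80/20 split are hypothetical.

```python
def cells_scanned(rows_in_clustering_order, limit):
    """Count live rows returned and tombstones walked over before LIMIT is met."""
    live = 0
    tombstones = 0
    for is_live in rows_in_clustering_order:
        if live == limit:
            break
        if is_live:
            live += 1
        else:
            tombstones += 1
    return live, tombstones


# Hypothetical partition: 80 expired (oldest) rows followed by 20 live rows.
oldest_first = [False] * 80 + [True] * 20       # created_time ASC
newest_first = list(reversed(oldest_first))     # created_time DESC

print(cells_scanned(oldest_first, limit=10))    # walks the expired rows first
print(cells_scanned(newest_first, limit=10))    # stops before reaching them
```

With ascending order the read has to step over every expired row before it finds live data; with descending order a "latest N" query never reaches them.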
@anthony, here is my point of view.
As a first step, avoid writing tombstones into the table in the first place.
Use the full primary key on the read path so that reads can skip the tombstones. Data modeling is key: design the tables around the access patterns required on the read side.
You could lower the compaction min_threshold to 2 for more aggressive tombstone eviction.
Similarly, you could tweak other compaction options (e.g. setting unchecked_tombstone_compaction to true) to evict them faster.
I would encourage you to look at a similar question and the answers that are documented here

how do I delete the data from cassandra

I am using Cassandra 3.6, and this is the table definition:
CREATE TABLE sg.products (
date_updated text,
time_added int,
id text,
best_seller text,
company text,
PRIMARY KEY (date_updated, time_added, id)
) WITH CLUSTERING ORDER BY (time_added ASC, id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
The table holds millions of rows.
I dropped the column best_seller from the "products" table, and the drop succeeded,
but when I checked the disk usage, it had not decreased.
So I searched on Google and came across the term "tombstone":
apparently Cassandra does not delete the data right away but marks it with tombstones.
Now my question is: how do I delete the tombstoned data so that I can reclaim the space?
Or is there any other way to reclaim the space?
Thanks in advance.
Tombstones drop
Cassandra will fully drop those tombstones when a compaction triggers, only after local_delete_time + gc_grace_seconds as defined on the table the data belongs to. Remember that all the nodes are supposed to have been repaired within gc_grace_seconds to ensure a correct distribution of the tombstones and prevent deleted data from reappearing.
See this line from your table definition:
AND gc_grace_seconds = 864000
That is the period for which tombstones live. 864000 seconds == 10 days. Tombstones are kept for that long to give them adequate time to be distributed to the other nodes in your cluster. That way all of the other nodes are aware of the delete(s) and do not return the obsolete values.
Once that 10 day period has passed, and the next time this table triggers compaction (after that 10 days), the tombstones will be removed.
Note that you can shorten that period by modifying that property on your table definition. Just make sure that you're running repairs within that timeframe.
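As a small worked example of the arithmetic above, the Python sketch below converts gc_grace_seconds to days and computes the earliest moment a tombstone becomes eligible for removal. The deletion date is a made-up value and the helper name earliest_purge_time is hypothetical; the actual removal still waits for a compaction that includes the tombstone.

```python
from datetime import datetime, timedelta, timezone


def earliest_purge_time(deleted_at, gc_grace_seconds):
    """Earliest time a tombstone may be dropped by compaction."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


gc_grace_seconds = 864000  # value from the table definition
grace_days = timedelta(seconds=gc_grace_seconds).days

# Hypothetical deletion time, for illustration only.
deleted_at = datetime(2017, 1, 1, tzinfo=timezone.utc)
purge_at = earliest_purge_time(deleted_at, gc_grace_seconds)

print(grace_days)            # 10
print(purge_at.isoformat())  # 2017-01-11T00:00:00+00:00
```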

Insert query replaces rows having same data field in Cassandra clustering column

I'm learning Cassandra, started off with v3.8. My sample keyspace/table looks like this
CREATE TABLE digital.usage (
provider decimal,
deviceid text,
date text,
hours varint,
app text,
flat text,
usage decimal,
PRIMARY KEY ((provider, deviceid), date, hours)
) WITH CLUSTERING ORDER BY (date ASC, hours ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I am using a composite PRIMARY KEY with provider and deviceid as the partition key, so that uniqueness and distribution are handled across the cluster nodes. The clustering keys are date and hours.
I have a few observations:
1) For a PRIMARY KEY((provider, deviceid), date, hours), when inserting multiple entries with the same hours value, only the latest is stored and the previous ones disappear.
2) For a PRIMARY KEY((provider, deviceid), date), when inserting multiple entries with the same date value, only the latest is stored and the previous ones disappear.
Though I'm happy with the behaviour in point 1, I want to know what is happening in the background. Do I need to understand more about clustering keys and their ordering?
A PRIMARY KEY is meant to be unique.
Most RDBMSs throw an error if you insert a duplicate value into a PRIMARY KEY.
Cassandra does not do a read before write. It creates a new version of the record with the latest timestamp. When you insert data with the same values for all primary key columns, a new version is written with the latest timestamp, and when querying (SELECT) only the version with the latest timestamp is returned.
Example:
PRIMARY KEY((provider, deviceid), date, hours)
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test', 'test');
-- This creates a new record with, let's say, timestamp 1
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test1', 'test1');
-- This creates a new version of the same record with, let's say, timestamp 2
SELECT app, flat FROM digital.usage WHERE provider=1.0 AND deviceid='a' AND date='2017-07-27' AND hours=1;
Will give
------------
| app | flat |
|-----|------|
|test1|test1 |
------------
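The "latest timestamp wins" behaviour can be mimicked with a toy in-memory model in Python. This is a sketch for intuition only, not how Cassandra stores data: the class ToyTable is hypothetical, rows are keyed by the full primary key tuple, and each column keeps the value with the highest write timestamp (timestamp ties, which Cassandra resolves by other rules, are ignored here).

```python
class ToyTable:
    """Toy upsert model: one entry per primary key, newest write wins per column."""

    def __init__(self):
        self._rows = {}  # primary key tuple -> {column: (timestamp, value)}

    def insert(self, primary_key, values, timestamp):
        row = self._rows.setdefault(primary_key, {})
        for column, value in values.items():
            current = row.get(column)
            # No read-before-write check for duplicates: a newer write simply wins.
            if current is None or timestamp > current[0]:
                row[column] = (timestamp, value)

    def select(self, primary_key):
        row = self._rows.get(primary_key)
        if row is None:
            return None
        return {column: value for column, (_, value) in row.items()}


table = ToyTable()
key = (1.0, "a", "2017-07-27", 1)  # (provider, deviceid, date, hours)
table.insert(key, {"app": "test", "flat": "test"}, timestamp=1)
table.insert(key, {"app": "test1", "flat": "test1"}, timestamp=2)

# A different clustering value (hours=2) is a different row.
other_key = (1.0, "a", "2017-07-27", 2)
table.insert(other_key, {"app": "other", "flat": "other"}, timestamp=3)

print(table.select(key))
print(table.select(other_key))
```

Because the second insert uses the same full primary key, it overwrites the first; changing any primary key column, such as hours, produces a separate row instead.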

issue with frequent truncates in Cassandra and 24 hour ttl create large tombstones

We have the table below with a TTL of 24 hours (1 day). We have a 4-node Cassandra 3.0 cluster, and a Spark job processes this table. Once processing is done, it truncates all the data in the table and a new batch of data is inserted. This is a continuous process.
The problem I am seeing is that we are getting more and more tombstones, because the data is truncated every day after Spark finishes processing.
If I leave gc_grace_seconds at its default, there will be more tombstones. If I reduce gc_grace_seconds to 1 day, will that be an issue? Will running repair on that table every day be enough?
How should I approach this problem? I know frequent deletes are an antipattern in Cassandra; is there any other way to solve this issue?
CREATE TABLE b.stag (
xxxid bigint PRIMARY KEY,
xxxx smallint,
xx smallint,
xxr int,
xxx text,
xxx smallint,
exxxxx smallint,
xxxxxx tinyint,
xxxx text,
xxxx int,
xxxx text,
xxxxx text,
xxxxx timestamp
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 86400
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
thank you
Truncating a table should not create tombstones, so when you say "truncating" I assume you mean deleting. As you have already mentioned, you can lower the gc_grace_seconds value; however, this means you have a smaller window for repairs to run and reconcile the data (making sure each node has the right tombstone for a given key, etc.), or old data could reappear. It's a trade-off.
To be fair, though, if you are clearing out the table each time, why not use the TRUNCATE command? That way you flush the table with no tombstones.
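The trade-off around gc_grace_seconds can be expressed as a simple check in Python. This is a rough sketch under a stated assumption: a repair cycle is treated as safe only if the time between repair starts plus the time a repair takes to finish stays below gc_grace_seconds. The function name and the one-hour repair duration are hypothetical values chosen for illustration.

```python
def repair_fits_gc_grace(gc_grace_seconds, repair_interval_seconds,
                         repair_duration_seconds):
    """True if every tombstone is repaired before it can be purged (assumed rule)."""
    return repair_interval_seconds + repair_duration_seconds < gc_grace_seconds


ONE_DAY = 86400
ONE_HOUR = 3600  # hypothetical repair duration

# Default gc_grace_seconds (10 days) with a daily repair: comfortable margin.
default_ok = repair_fits_gc_grace(864000, ONE_DAY, ONE_HOUR)

# gc_grace_seconds reduced to 1 day with a daily repair: no margin left.
reduced_ok = repair_fits_gc_grace(ONE_DAY, ONE_DAY, ONE_HOUR)

print(default_ok, reduced_ok)
```

Under this assumption, a daily repair combined with a one-day gc_grace_seconds leaves no slack for a slow or failed repair run, which is the risk described above.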

Slow range queries in Cassandra

I am working on a single node. I have the following table to store a list of documents:
CREATE TABLE my_keyspace.document (
status text,
date timestamp,
doc_id text,
raw_content text,
title text,
PRIMARY KEY (status, date, doc_id)
) WITH CLUSTERING ORDER BY (date ASC, doc_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX doc_id_idx ON my_keyspace.document (doc_id);
I am doing a lot of queries like:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND date>=start_date AND date<=end_date;
For some reason these queries are very slow. At first I got the following warnings:
[2016-07-26 18:10:46] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
[2016-07-26 18:10:52] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
Thinking the issue was linked to having too many tombstones, I did:
ALTER TABLE my_keyspace.document WITH gc_grace_seconds = '0';
and then:
nodetool compact my_keyspace document
Now I don't get any warnings, but the queries are still very slow and often time out. No message concerning the timeouts appears in any log. I have roughly 200k documents. They are distributed over a 20-day period, and about 4500 documents per day have status='PROCESSED'. The query response time varies with the date range: about 3 seconds for a one-day range, 15 seconds for 4 days, and a timeout for 2 weeks. Also, I disabled swap. The version of Cassandra I am using is 3.5.
Recently I noticed that listing the precise columns to extract instead of * improves the response time a bit, but the system is still too slow.
EDIT: Computing partition size as proposed by Reveka
So, following the formula:
Number of rows = 20 * 4500 = 90,000
Number of columns = 19
Number of primary keys = 3
Number of static column = 0
So the number of values is 90000*(19-3)=1,440,000
For the size of the partition, I got to an estimate of about 1.2GB.
This might be a bit big. But how can I modify my partition key to still be able to do the same range queries while having smaller partitions? I could have a composite partition key containing the status and the day extracted from date, but wouldn't I have to then specify the day before being able to query by range:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND day='someday' AND date>='start_date' AND date<='end_date';
Which forces me to do one query per day.
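The value count from the EDIT above can be reproduced with a few lines of Python. This is a sketch of the commonly cited estimate (values = static columns + rows × (columns − primary key columns − static columns)); the function name is hypothetical, and the static-column term is an assumption that drops out here because the table has none.

```python
def partition_value_count(rows, columns, primary_key_columns, static_columns=0):
    """Estimate the number of stored values (cells) in one partition."""
    regular_columns = columns - primary_key_columns - static_columns
    return static_columns + rows * regular_columns


rows = 20 * 4500  # 20 days, about 4500 PROCESSED documents per day
values = partition_value_count(rows, columns=19, primary_key_columns=3)

print(rows)    # 90000
print(values)  # 1440000
```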
I see that your primary key consists of status, date and doc_id, and that you use only status as your partition key. That means that all the documents with the same status, regardless of date, are put in the same partition. I guess that is a lot of data for one partition. Cassandra works well with partitions of up to about 100 MB (or a couple of hundred MB in later versions), see here. The DataStax DS220 course (it is free; you just need to create an account) has a video that shows you how to calculate your partition size. You can post the results of your analysis so we can help you further. :)
EDIT: After the size analysis
You will have to partition by date in order to have smaller partitions. That means you will no longer be able to run a range query across days in a single statement. A workaround is to run multiple queries covering the range you want. For example, for the range 12 August to 14 August you split by day and run three queries: one for 12 August, one for 13 August and one for 14 August. Still, if your range is big you will end up retrieving gigabytes of data. I do not know your use case, but my guess is that you don't need gigabytes of documents every time you run a date range query. Can you give me more info on your use case (i.e. what do you want to do)?
P.S. I can't write comments yet, so I can only advise you through this answer.
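The "one query per day" workaround can be sketched in Python as follows. The code only builds the list of per-day statements and their parameters; it does not talk to a cluster. It assumes a hypothetical table layout with a composite partition key (status, day), where day is stored as an ISO date string; the function names and the %s placeholder style are illustrative choices.

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """Return every calendar day from start to end, inclusive."""
    span = (end - start).days
    return [start + timedelta(days=offset) for offset in range(span + 1)]


def per_day_queries(status, start, end):
    """Build one (statement, parameters) pair per day partition."""
    # Assumes a hypothetical 'day' column in the partition key.
    statement = ("SELECT * FROM my_keyspace.document "
                 "WHERE status = %s AND day = %s")
    return [(statement, (status, day.isoformat()))
            for day in day_buckets(start, end)]


queries = per_day_queries("PROCESSED", date(2016, 8, 12), date(2016, 8, 14))
for statement, params in queries:
    print(statement, params)
```

Each pair targets exactly one partition, so the statements can be executed independently (for example concurrently) and their results merged on the client side.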

Resources