Tombstone vs nodetool and repair - Cassandra

I inserted 10K entries into a Cassandra table with a TTL of 1 minute, all under a single partition.
After the inserts succeeded, I tried to read all the data from that partition, but the read fails with the errors below:
WARN [ReadStage-2] 2018-04-04 11:39:44,833 ReadCommand.java:533 - Read 0 live rows and 100001 tombstone cells for query SELECT * FROM qcs.job LIMIT 100 (see tombstone_warn_threshold)
DEBUG [Native-Transport-Requests-1] 2018-04-04 11:39:44,834 ReadCallback.java:132 - Failed; received 0 of 1 responses
ERROR [ReadStage-2] 2018-04-04 11:39:44,836 StorageProxy.java:1906 - Scanned over 100001 tombstones during query 'SELECT * FROM qcs.job LIMIT 100' (last scanned row partion key was ((job), 2018-04-04 11:19+0530, 1, jobType1522820944168, jobId1522820944168)); query aborted
I understand that a tombstone is a marker in the SSTable, not an actual delete.
So I performed compaction and repair using nodetool.
Even after that, when I read the data from the table, the same error appears in the log file.
1) How should I handle this scenario?
2) Could someone explain why this happened, and why compaction and repair didn't solve the issue?

Tombstones are really deleted only after the period specified by the table's gc_grace_seconds setting (10 days by default). This is done to make sure that any node that was down at the time of the deletion will pick up these changes after it recovers. The thelastpickle blog post on deletes and tombstones (recommended), as well as the DSE and Cassandra documentation, discuss this in great detail.
You can set the gc_grace_seconds option on an individual table to a lower value to remove deleted data faster, but this should be done only for tables with TTLed data. You may also need to tweak the tombstone_threshold and tombstone_compaction_interval table options so compactions purge tombstones sooner. See the documentation for a description of these options.
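As an illustrative sketch only (the values are assumptions, not recommendations; the compaction class must match whatever strategy the table already uses, and tombstone_threshold / tombstone_compaction_interval are sub-properties of the compaction map):
-- lower grace period: safe only because the table holds TTLed data and takes no explicit deletes
ALTER TABLE qcs.job
  WITH gc_grace_seconds = 3600
  AND compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.1',
    'tombstone_compaction_interval': '3600'
  };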

Newer Cassandra versions also support:
$ ./nodetool garbagecollect
After this command, flush the in-memory data to disk before restarting:
$ ./nodetool drain # after this, connections are closed and clients can no longer access the node
Shut down Cassandra and restart it again (you should restart after the drain).
Note: you do not strictly need to drain; it depends on the situation. This is just extra information.

Related

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on 18 nodes of a Cassandra cluster, I ran a full repair on each node to fix the failed repairs. After the full repair, Reaper executed successfully, but after a few days it started failing again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name                      Active   Pending
ReadStage                           0         0
Repair#18                           3        90
ValidationExecutor                  3         3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair? And what is the root cause of the pending repairs?
PS: The Reaper version is 2.2.3; I am not sure if it is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rules (a sketch of the corresponding config change follows the list):
- If less than 20% of segments are failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml.
- If more than 20% of segments are failing, double the number of segments.
- Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments so that each lasts at most 15 minutes.
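For reference, a minimal sketch of what that tuning might look like in cassandra-reaper.yaml (the values are illustrative, and segmentCountPerNode is an assumption about where the segment count is configured; it only affects repair runs created after the change):
hangingRepairTimeoutMins: 60   # doubled from the default of 30
segmentCountPerNode: 32        # double it while more than 20% of segments keep failing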
Assuming you're not running Cassandra 4.0 yet, now that you've run repair through nodetool, you have sstables which are marked as repaired, just like incremental repair would do. This creates a problem, as Reaper's repairs don't mark sstables as repaired, and you now have two different sstable pools (repaired and unrepaired) which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
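A rough sketch of that procedure, assuming default data paths and a placeholder keyspace/table (the node must be stopped while the tool runs, and this has to be repeated on every node):
# stop Cassandra on the node first (e.g. after nodetool drain)
$ find /var/lib/cassandra/data/my_keyspace/my_table-*/ -name '*-Data.db' > /tmp/sstables.txt
$ sstablerepairedset --really-set --is-unrepaired -f /tmp/sstables.txt
# restart Cassandra, then move on to the next node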
There could be a number of things taking place, such as Reaper not being able to connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

JanusGraph query failure due to Cassandra backend tombstone exception

I have raised a GitHub issue regarding this as well. Pasting the same below.
JanusGraph version - janusgraph-0.3.1
Cassandra - cassandra:3.11.4
When we run JanusGraph with the Cassandra backend, after a period of time JanusGraph starts throwing the errors below and goes into an unusable state.
JanusGraph Logs:
466489 [gremlin-server-exec-6] INFO  org.janusgraph.diskstorage.util.BackendOperation - Temporary exception during backend operation [EdgeStoreKeys]. Attempting backoff retry.
org.janusgraph.diskstorage.TemporaryBackendException: Temporary failure in storage backend
    at io.vavr.API$Match$Case0.apply(API.java:3174)
    at io.vavr.API$Match.of(API.java:3137)
    at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.lambda$static$0(CQLKeyColumnValueStore.java:123)
    at io.vavr.control.Try.getOrElseThrow(Try.java:671)
    at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.getKeys(CQLKeyColumnValueStore.java:405)
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency QUORUM (1 responses were required but only 0 replica responded, 1 failed)
    at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:130)
    at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:30)
Cassandra Logs:
WARN [ReadStage-2] 2019-07-19 11:40:02,980 ReadCommand.java:569 - Read 74 live rows and 100001 tombstone cells for query SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100 (see tombstone_warn_threshold)
ERROR [ReadStage-2] 2019-07-19 11:40:02,980 StorageProxy.java:1896 - Scanned over 100001 tombstones during query 'SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100' (last scanned row partion key was ((00000000002b9d88), 02)); query aborted
Related Question:
Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
Questions:
1) Are edge updates stored as new items, causing tombstones? (since JanusGraph is a fork of Titan)
How to increment Number of Visit count in Titan graph database Edge Label?
https://github.com/JanusGraph/janusgraph/issues/934
2) What is the right approach to this?
Any solutions/indications would be really helpful.
[Update]
1) Updates to edges did not cause tombstones in JanusGraph.
2) Solutions:
- As per the answer, reduce gc_grace_seconds to a lower value based on how often edges/vertices are deleted.
- Also consider tuning tombstone_failure_threshold in cassandra.yaml based on your needs.
In Cassandra, a tombstone is a flag that indicates that a record should be deleted; it can occur after a delete operation was explicitly requested, or once the Time To Live (TTL) period has expired. A record with a tombstone persists for the time defined by gc_grace_seconds after the delete operation was executed, which is 10 days by default.
Usually, running nodetool repair janusgraph edgestore (based on the error log provided) should fix the issue. If you are still getting the error, you may need to decrease the gc_grace_seconds value of the table, as explained here.
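A sketch of both steps, with a purely illustrative gc_grace_seconds value (pick one that matches your delete/TTL pattern and repair schedule):
$ nodetool repair janusgraph edgestore
-- illustrative value; the default is 864000 (10 days)
ALTER TABLE janusgraph.edgestore WITH gc_grace_seconds = 3600;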
For more information regarding tombstones:
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Tombstone vs nodetool and repair

Cassandra 'bad state', cannot run compaction?

We are using 72% of the hard drive and deleted about half of the rows (using cqlsh), but Cassandra (3.9.0) cannot complete compaction and throws: java.lang.RuntimeException: Not enough space for compaction, estimated sstables = 1, expected write size = 799429448428
Compaction triggers every 24 hours and fails.
Note that this is a single-node setup and gc_grace_seconds=0.
Is there any other way to force removal of deleted data?
Thanks
You can try splitting the large SSTables (with sstablesplit) into smaller ones, so compaction will require less space (this requires stopping the node).
http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableSplit.html
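A rough sketch of how that might look, with a placeholder data path and the default 50 MB target size (sstablesplit must be run while the node is down):
# stop Cassandra on the node first
$ sstablesplit --no-snapshot -s 50 /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db
# restart the node; compaction now needs less free space per sstable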

DateTieredCompactionStrategy sub-properties in Cassandra

I have some questions on DateTieredCompactionStrategy sub-properties in Cassandra.
The blog http://www.datastax.com/dev/blog/datetieredcompactionstrategy, says:
Base_time_seconds: “This is the size of the first window, defaults to 3600 seconds (1 hour). The rest of the windows will be min_threshold (default 4) times the size of the previous window.”
With the default value of 3600, i.e. 1 hour, for base_time_seconds, does it mean the first compaction triggers at the 1st hour, the next at 4, 16, 64 hours and so on?
max_window_size_seconds: default 1 day. Does it mean my compaction runs at least once a day?
tombstone_compaction_interval: default 10 days.
If my SSTable is, say, 7 days old but full of expired data due to a TTL of 1 day and gc_grace_seconds of 1 day, does that mean my SSTables are still not removed?
Does tombstone_compaction_interval take priority over TTL and gc_grace_seconds?
min_threshold: when compaction runs and the number of SSTables is < min_threshold, is my compaction not run?
No - DTCS finds the sstables within one of those windows (1h, 4h, ..) and if it thinks it needs to compact them together (iirc for the first window it has to be more than min_threshold, for the rest 2 or more), it will.
No. The number of compactions depends only on the number of flushed/streamed sstables. The max window size is just to make sure we don't get huge older windows, which hurt when bootstrapping/streaming etc.
No, with DTCS you should not touch the tombstone_compaction_interval - the whole idea is that once the whole sstable is expired, the entire thing will get dropped automatically without compaction.
Correct, but it is per window, so you could have 100 sstables in separate windows with DTCS.
Note that DTCS is deprecated and you should really be using TWCS instead. If you are on Cassandra < 3.0, you can just build the jar file and drop it into the lib directory to use it. https://github.com/jeffjirsa/twcs https://issues.apache.org/jira/browse/CASSANDRA-9666
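For instance, switching a time-series table to TWCS could look like this (keyspace, table, and window settings are placeholders; pick a window size that roughly matches your TTL so whole sstables expire together):
ALTER TABLE my_keyspace.my_table
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
  };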

How to force delete tombstone created with USING TIMESTAMP in delete CQL statement?

We have a table from which we deleted a bunch of rows using the Java max long (9223372036854775807) as the timestamp. For example:
DELETE r_id FROM orderbook USING TIMESTAMP 9223372036854775807 WHERE o_id='' AND p_id='' AND e_id='' AND a_id='a1' AND ord_id = 645e7d3c-aef7-4e3c-b834-24b792cf2e55;
These tombstones are created in the sstable with markedForDeleteAt = 9223372036854775807.
Sample output from sstable2json
[
{"key": ":::a1",
"cells": [["645e7d3c-aef7-4e3c-b834-24b792cf2e51:_","645e7d3c-aef7-4e3c-b834-24b792cf2e51:!",9223372036854775807,"t",1476520163],
["645e7d3c-aef7-4e3c-b834-24b792cf2e52:","",1],
["645e7d3c-aef7-4e3c-b834-24b792cf2e55:","",1],
["645e7d3c-aef7-4e3c-b834-24b792cf2e55:r_id",1476520867,9223372036854775807,"d"]]}
]
Tombstones (range ("t") or otherwise ("d")) created with such a high timestamp aren't getting collected by minor or major compaction. We even tried setting gc_grace_seconds to 0 and running a major compaction, but no luck. I was thinking the 'markedForDeleteAt + gc_grace_seconds > compaction time' equation was playing out and that's why the tombstones are not collected, but then I read the Cassandra code and it seems localDeletionTime is considered in the equation, not markedForDeleteAt.
/**
 * The local server timestamp, in seconds since the unix epoch, at which this tombstone was created. This is
 * only used for purposes of purging the tombstone after gc_grace_seconds have elapsed.
 */
public final int localDeletionTime;
With all that, how can I force remove all tombstones from sstable?
CASSANDRA-12792 - Due to the Cassandra bug filed yesterday, it isn't possible to remove tombstones written with Long.MAX_VALUE via compaction. I had to do an ETL and a table truncate to get rid of the tombstones.
In db/compaction/LazilyCompactedRow.java we only check for < getMaxPurgeableTimestamp(), e.g.:
(this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp())
This should probably be <=.
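As for the ETL-and-truncate workaround mentioned above, a rough sketch using cqlsh (the CSV path and the idea of a straight COPY round-trip are assumptions; validate the export before truncating):
-- export the surviving live rows
COPY orderbook TO '/tmp/orderbook.csv';
-- TRUNCATE drops the sstables themselves, including the Long.MAX_VALUE tombstones
TRUNCATE orderbook;
-- reload the data, which now gets fresh write timestamps
COPY orderbook FROM '/tmp/orderbook.csv';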
