I have raised a GitHub issue regarding this as well. Pasting the same below.
JanusGraph version - janusgraph-0.3.1
Cassandra - cassandra:3.11.4
When we run JanusGraph with the Cassandra backend, after a period of time JanusGraph starts throwing the errors below and goes into an unusable state.
JanusGraph Logs:
466489 [gremlin-server-exec-6] INFO org.janusgraph.diskstorage.util.BackendOperation - Temporary exception during backend operation [EdgeStoreKeys]. Attempting backoff retry.
org.janusgraph.diskstorage.TemporaryBackendException: Temporary failure in storage backend
at io.vavr.API$Match$Case0.apply(API.java:3174)
at io.vavr.API$Match.of(API.java:3137)
at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.lambda$static$0(CQLKeyColumnValueStore.java:123)
at io.vavr.control.Try.getOrElseThrow(Try.java:671)
at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.getKeys(CQLKeyColumnValueStore.java:405)
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency QUORUM (1 responses were required but only 0 replica responded, 1 failed)
at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:130)
at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:30)
Cassandra Logs:
WARN [ReadStage-2] 2019-07-19 11:40:02,980 ReadCommand.java:569 - Read 74 live rows and 100001 tombstone cells for query SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100 (see tombstone_warn_threshold)
ERROR [ReadStage-2] 2019-07-19 11:40:02,980 StorageProxy.java:1896 - Scanned over 100001 tombstones during query 'SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100' (last scanned row partion key was ((00000000002b9d88), 02)); query aborted
Related Question:
Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
Questions:
1) Are edge updates stored as new items, causing the tombstones? (Since JanusGraph is a fork of Titan.)
How to increment Number of Visit count in Titan graph database Edge Label?
https://github.com/JanusGraph/janusgraph/issues/934
2) What is the right approach to this?
Any solutions/indications would be really helpful.
[Update]
1) Updates to edges did not cause the tombstones in JanusGraph.
2) Solutions:
- As per the answer below, reduce gc_grace_seconds to a lower value based on how often edges/vertices are deleted.
- Also consider tuning tombstone_failure_threshold in cassandra.yaml based on your needs.
For Cassandra, a tombstone is a flag that indicates that a record should be deleted; this can occur after a delete operation was explicitly requested, or once the Time To Live (TTL) period has expired. A record with a tombstone will persist for the time defined by gc_grace_seconds after the delete operation was executed, which is 10 days by default.
Usually running nodetool repair janusgraph edgestore (based on the error log provided) should be able to fix the issue. If you are still getting the error, you may need to decrease the gc_grace_seconds value of your table, as explained here.
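As a rough sketch, lowering gc_grace_seconds is a per-table CQL change; the value below is only an illustrative assumption, and it should stay longer than the interval at which you run repairs so that deleted data cannot be resurrected by replicas that missed the delete:

-- Illustrative value only: lower gc_grace_seconds on the JanusGraph edgestore table.
ALTER TABLE janusgraph.edgestore WITH gc_grace_seconds = 3600;

-- Check the current setting:
SELECT table_name, gc_grace_seconds FROM system_schema.tables WHERE keyspace_name = 'janusgraph';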
For more information regarding tombstones:
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Tombstone vs nodetool and repair
Related
After Reaper failed to run a repair on 18 nodes of the Cassandra cluster, I ran a full repair of each node to fix the failed repair issue. After the full repair, Reaper executed successfully, but after a few days Reaper failed to run again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name            Active   Pending
ReadStage            0        0
Repair#18            3        90
ValidationExecutor   3        3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair? And what is the root cause of the pending repairs?
PS: the version of Reaper is 2.2.3; not sure if it is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rules:
If you have less than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml.
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
Assuming you're not running Cassandra 4.0 yet, now that you ran repair through nodetool, you have sstables which are marked as repaired like incremental repair would. This will create a problem as Reaper's repairs don't mark sstables as repaired and you now have two different sstables pools (repaired and unrepaired), which cannot be compacted together.
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired to put all sstables back in the same pool. Please read the documentation to learn how to achieve this.
There could be a number of things taking place, such as Reaper being unable to connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!
I inserted 10K entries into a table in Cassandra which has a TTL of 1 minute, all under a single partition.
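Roughly, the setup looks like one of the following (a sketch only; the column names, and whether the TTL is a table default or set per insert, are assumptions):

-- Sketch: a 60-second TTL can be a table-level default ...
ALTER TABLE qcs.job WITH default_time_to_live = 60;
-- ... or applied per insert (column names here are placeholders):
INSERT INTO qcs.job (job_type, job_id, created_at)
VALUES ('jobType1522820944168', 'jobId1522820944168', toTimestamp(now()))
USING TTL 60;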
After the successful insert, I tried to read all the data from that single partition, but it throws an error like the one below:
WARN [ReadStage-2] 2018-04-04 11:39:44,833 ReadCommand.java:533 - Read 0 live rows and 100001 tombstone cells for query SELECT * FROM qcs.job LIMIT 100 (see tombstone_warn_threshold)
DEBUG [Native-Transport-Requests-1] 2018-04-04 11:39:44,834 ReadCallback.java:132 - Failed; received 0 of 1 responses
ERROR [ReadStage-2] 2018-04-04 11:39:44,836 StorageProxy.java:1906 - Scanned over 100001 tombstones during query 'SELECT * FROM qcs.job LIMIT 100' (last scanned row partion key was ((job), 2018-04-04 11:19+0530, 1, jobType1522820944168, jobId1522820944168)); query aborted
I understand a tombstone is a marker in the SSTable, not an actual delete.
So I performed compaction and repair using nodetool.
Even after that, when I read the data from the table, it throws the same error in the log file.
1) How should this scenario be handled?
2) Could someone explain why this scenario happened and why compaction and repair didn't solve the issue?
Tombstones are really deleted only after the period specified by the gc_grace_seconds setting of the table (it's 10 days by default). This is done to make sure that any node that was down at the time of deletion will pick up these changes after it recovers. Here are the blog posts that discuss this in great detail: from thelastpickle (recommended), 1, 2, and the DSE documentation or Cassandra documentation.
You can set the gc_grace_seconds option on the individual table to a lower value to remove deleted data faster, but this should be done only for tables with TTLed data. You may also need to tweak the tombstone_threshold & tombstone_compaction_interval table options to perform compactions faster. See this document or this document for a description of these options.
Newer Cassandra versions support:
$ ./nodetool garbagecollect
After this command, flush memory to disk before the restart:
$ ./nodetool drain # this closes connections; after that, clients cannot access the node
Shut down Cassandra and restart it again. You should restart after the drain.
** You do not necessarily need to drain; it depends on the situation. This is extra information.
We have a single node Cassandra Cluster (Apache) with 2 vCPUs and around 16 GB RAM on AWS. We have around 28 GB of data uploaded into Cassandra.
Cassandra works fine for select and group-by queries using primary keys; however, when we use User Defined Functions to run aggregate functions on a non-primary-key column, it gives a timeout.
To elaborate: we partition on Year, Month and Date for 3 years of data. For example, if two of the columns are Bill_ID and Bill_Amount, we want to get the sum of Bill_Amount by Bill_ID using a UDF.
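For illustration, the aggregation we are attempting looks roughly like the following user-defined aggregate (all table, column and function names here are placeholders, not the real schema):

-- Placeholder sketch of a "sum per key" aggregate (UDFs must be enabled via
-- enable_user_defined_functions: true in cassandra.yaml).
CREATE OR REPLACE FUNCTION state_sum_by_bill(state map<text, double>, bill_id text, bill_amount double)
    CALLED ON NULL INPUT
    RETURNS map<text, double>
    LANGUAGE java
    AS $$
        if (bill_id != null && bill_amount != null) {
            Double current = (Double) state.get(bill_id);
            state.put(bill_id, current == null ? bill_amount : current + bill_amount);
        }
        return state;
    $$;

CREATE OR REPLACE AGGREGATE sum_by_bill(text, double)
    SFUNC state_sum_by_bill
    STYPE map<text, double>
    INITCOND {};

-- Used like:
-- SELECT sum_by_bill(bill_id, bill_amount) FROM bills WHERE year = 2019 AND month = 7 AND date = 1;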
I am kind of confused here: if the info says it received 1 response, why does it give a timeout message? And why are we getting a timeout only when using User Defined Functions?
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 1 responses." info={'received_responses': 1, 'required_responses': 1, 'consistency': 'ONE'}
We have increased the read timeouts in the yaml file to as high as 10 minutes.
Edit: adding the screenshot of the query, showing the results before setting --request-timeout and after that when using the UDF. The table has 150 million rows spread over only 1095 partitions for 3 years of data, with the primary key being year, day and month.
Try to increase the timeouts on the client side as well, for example in cqlsh:
cqlsh --request-timeout=3600
We have a table from which we deleted a bunch of rows using the Java max long (9223372036854775807) as the timestamp. For example:
DELETE r_id FROM orderbook USING TIMESTAMP 9223372036854775807 WHERE o_id='' AND p_id='' AND e_id='' AND a_id='a1' AND ord_id = 645e7d3c-aef7-4e3c-b834-24b792cf2e55;
These tombstones are created in the SSTable with markedForDeleteAt = 9223372036854775807.
Sample output from sstable2json:
[
{"key": ":::a1",
"cells": [["645e7d3c-aef7-4e3c-b834-24b792cf2e51:_","645e7d3c-aef7-4e3c-b834-24b792cf2e51:!",9223372036854775807,"t",1476520163],
["645e7d3c-aef7-4e3c-b834-24b792cf2e52:","",1],
["645e7d3c-aef7-4e3c-b834-24b792cf2e55:","",1],
["645e7d3c-aef7-4e3c-b834-24b792cf2e55:r_id",1476520867,9223372036854775807,"d"]]}
]
Tombstones (range ("t") or otherwise ("d")) created with such a high timestamp aren't getting collected by minor or major compaction. We even tried setting gc_grace_seconds to 0 and running a major compaction, but no luck. I am thinking that the 'markedForDeleteAt + gc_grace_seconds > compaction time' equation is playing out and that's why the tombstones are not collected. But then I read the Cassandra code, and it seems like localDeletionTime is considered in the equation, not markedForDeleteAt.
/**
 * The local server timestamp, in seconds since the unix epoch, at which this tombstone was created. This is
 * only used for purposes of purging the tombstone after gc_grace_seconds have elapsed.
 */
public final int localDeletionTime;
With all that, how can I force-remove all tombstones from the SSTable?
CASSANDRA-12792 - due to the Cassandra bug filed yesterday, it isn't possible to remove tombstones written with Long.MAX_VALUE via compaction. I had to do an ETL and a table truncate to get rid of the tombstones.
In db/compaction/LazilyCompactedRow.java we only check for < maxPurgeableTimestamp, e.g.:
(this.maxRowTombstone.markedForDeleteAt < getMaxPurgeableTimestamp())
This should probably be <=.
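For completeness, the truncate part of that workaround is plain CQL; export/ETL whatever must be kept first, since truncation drops all of the table's SSTables, tombstones included:

-- Back up / export the rows to keep first, then:
TRUNCATE orderbook;
-- ...and reload the kept rows (the ETL step).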
I've started working with Cassandra, so I downloaded Cassandra (1.1.1) to my Windows PC and started it. Everything works fine.
I then began to reimplement an old application (in Java, using Hector 1.1) which imports about 200,000,000 records for 4 tables that should be inserted into 4 column families. After importing about 2,000,000 records I get a timeout exception and Cassandra doesn't respond to requests:
2012-07-03 15:35:43,299 WARN - Could not fullfill request on this host CassandraClient<localhost:9160-16>
2012-07-03 15:35:43,300 WARN - Exception: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
....
Caused by: TimedOutException()
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20269)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
The last entries inside the logfile are:
INFO 15:35:31,678 Writing Memtable-cf2#678837311(7447722/53551072 serialized/live bytes, 262236 ops)
INFO 15:35:32,810 Completed flushing \var\lib\cassandra\data\keySpaceName\cf2\keySpaceName-cf2-hd-205-Data.db (3292685 bytes) for commitlog position ReplayPosition(segmentId=109596147695328, position=131717208)
INFO 15:35:33,282 Compacted to [\var\lib\cassandra\data\keySpaceName\cf3\keySpaceName-cf3-hd-29-Data.db,]. 33.992.615 to 30.224.481 (~88% of original) bytes for 282.032 keys at 1,378099MB/s. Time: 20.916ms.
INFO 15:35:33,286 Compacting [SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-8-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-6-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-7-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-5-Data.db')]
INFO 15:35:34,871 Compacted to [\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-9-Data.db,]. 4.249.270 to 2.471.543 (~58% of original) bytes for 30.270 keys at 1,489916MB/s. Time: 1.582ms.
INFO 15:35:41,858 Compacted to [\var\lib\cassandra\data\keySpaceName\cf2\keySpaceName-cf2-hd-204-Data.db,]. 48.868.818 to 24.033.164 (~49% of original) bytes for 135.367 keys at 2,019011MB/s. Time: 11.352ms.
I created the 4 column families like the following:
ColumnFamilyDefinition cf1 = HFactory.createColumnFamilyDefinition(
    "keyspacename",
    "cf1",
    ComparatorType.ASCIITYPE);
The column families have the following column counts:
16 columns
14 columns
7 columns
5 columns
The keyspace is created with replication factor 1 and the default strategy (SimpleStrategy).
I insert the records (rows) with Mutator#addInsertion.
Any advice on avoiding this exception?
Regards
WM
That exception is basically Cassandra saying that it's far enough behind on mutations that it won't complete your requests before they time out. Assuming your PC isn't a beast, you should probably throttle your requests. I suggest sleeping for a while after catching that exception and then retrying; there's no harm in accidentally writing the same row twice, and Cassandra should catch up on writes pretty quickly.
If you were in a production environment, I would look more closely at other reasons why the node might be performing poorly.