Cassandra deleted rows coming back (reappearing) when nodes have an NTP sync issue

I have a 3-node Cassandra setup, and it seems some nodes had time sync issues: some were 10 minutes ahead of others.
CT-Cass2:/root>nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 172.94.1.22 14.15 GB 256 ? db37ca57-c7c9-4c36-bac3-f0cbd8516143 RAC1
UN 172.94.1.23 14.64 GB 256 ? b6927b2b-37b2-4a7d-af44-21c9f548c533 RAC1
UN 172.94.1.21 14.42 GB 256 ? e482b781-7e9f-43e2-82f8-92901be48eed RAC1
I have the following table:
CREATE TABLE test_users (
userid text PRIMARY KEY,
omavvmon int,
vvmon int
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 48000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
In the customer setup I can see some deleted records coming back, and writetime(omavvmon) shows a write time 10 minutes later than the row's delete time. I am almost certain the records come back because of the time sync issue (after correcting the time it no longer happens), but when I tried to reproduce the issue locally, it never happens.
I set the Cassandra system time 10 minutes ahead and created a row, and writetime shows 10 minutes ahead:
update test_users set omavvmon=1 where userid='4444';
Then I set the system time back to normal, i.e. 10 minutes slower, and deleted userid 4444.
As I understand it, this delete has a write time 10 minutes lower than the original insert, so I should see the record come back again, but it doesn't. Can anyone explain why deleted records come back in the production setup but not in my local setup? Also, why does Cassandra not show the record locally even though the delete has a lower timestamp than the insert? Isn't that equivalent to a delete followed by an insert?
In production I check after a few hours, but in my local setup I check immediately after the delete.
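For intuition, Cassandra reconciles writes and deletes purely by timestamp: a tombstone shadows only data written at or before the tombstone's timestamp. Here is a minimal sketch (illustrative Python, not Cassandra's actual code) of why an insert stamped by a fast clock can survive a delete issued later by a correct clock:

```python
# Illustrative last-write-wins reconciliation. A cell survives if its
# write timestamp is strictly greater than the tombstone's timestamp.
def cell_survives(cell_writetime: int, tombstone_time: int) -> bool:
    """Return True if the cell remains visible after reconciliation."""
    return cell_writetime > tombstone_time

# Insert written while the node clock was 10 minutes (600s) ahead:
insert_ts = 1_700_000_600_000_000   # microseconds (hypothetical values)
# Delete issued later in wall-clock terms, but stamped by a correct clock:
delete_ts = 1_700_000_000_000_000

# The tombstone cannot shadow the "newer" insert, so the row reappears.
assert cell_survives(insert_ts, delete_ts)
```

This is only the reconciliation rule, not a full explanation of the local vs. production difference; note that gc_grace_seconds = 48000 here, so after roughly 13 hours compaction may drop the tombstone entirely, which matters when checking hours later.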


Cassandra data compacted until deletion with no TTL set up?

I was testing node repair on my Cassandra cluster (v3.11.5) while simultaneously stress-testing it with cassandra-stress (v3.11.4). The disk space ran out and the repair failed. As a result, gossip got disabled on the nodes. Sstables that were being anticompacted got cleaned up (effectively deleted), which dropped the disk usage by about half (to ~1.5TB per node) within a minute. This much I understand.
What I do not understand is what happened next. The sstables started getting continuously compacted into smaller ones and eventually deleted. As a result, the disk usage continued to drop (this time slowly); after a day or so it went from ~1.5TB per node to ~50GB per node. The data residing in the cluster was randomly generated by cassandra-stress, so I see no way to confirm whether it's intact; however, I find it highly unlikely, given that the disk usage dropped that much. Also, I have no TTL set up (at least that I know of; I might be missing something), so I would not expect the data to be deleted. But I believe this is the case.
Anyway, can anyone point me to what is happening?
Table schema:
> desc test-table1;
CREATE TABLE test-keyspace1.test-table1 (
event_uuid uuid,
create_date timestamp,
action text,
business_profile_id int,
client_uuid uuid,
label text,
params text,
unique_id int,
PRIMARY KEY (event_uuid, create_date)
) WITH CLUSTERING ORDER BY (create_date DESC)
AND bloom_filter_fp_chance = 0.1
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.DeflateCompressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Logs:
DEBUG [CompactionExecutor:7] 2019-11-23 20:17:19,828 CompactionTask.java:255 - Compacted (59ddec80-0e20-11ea-9612-67e94033cb24) 4 sstables to [/data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3259-big,] to level=0. 93.264GiB to 25.190GiB (~27% of original) in 5,970,059ms. Read Throughput = 15.997MiB/s, Write Throughput = 4.321MiB/s, Row Throughput = ~909/s. 1,256,595 total partitions merged to 339,390. Partition merge counts were {2:27340, 3:46285, 4:265765, }
(...)
DEBUG [CompactionExecutor:7] 2019-11-24 03:50:14,820 CompactionTask.java:255 - Compacted (e1bd7f50-0e4b-11ea-9612-67e94033cb24) 32 sstables to [/data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3301-big,] to level=0. 114.787GiB to 25.150GiB (~21% of original) in 14,448,734ms. Read Throughput = 8.135MiB/s, Write Throughput = 1.782MiB/s, Row Throughput = ~375/s. 1,546,722 total partitions merged to 338,859. Partition merge counts were {1:12732, 2:42441, 3:78598, 4:50454, 5:36032, 6:52989, 7:21216, 8:34681, 9:9716, }
DEBUG [CompactionExecutor:15] 2019-11-24 03:50:14,852 LeveledManifest.java:423 - L0 is too far behind, performing size-tiering there first
DEBUG [CompactionExecutor:15] 2019-11-24 03:50:14,852 CompactionTask.java:155 - Compacting (85e06040-0e6d-11ea-9612-67e94033cb24) [/data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3259-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3299-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3298-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3300-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3301-big-Data.db:level=0,]
(...)
DEBUG [NonPeriodicTasks:1] 2019-11-24 06:02:50,117 SSTable.java:105 - Deleting sstable: /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3259-big
edit:
I performed some additional testing. To the best of my knowledge there is no TTL set up; see the query result straight after cassandra-stress started inserting data:
> SELECT event_uuid, create_date, ttl(action), ttl(business_profile_id), ttl(client_uuid), ttl(label), ttl(params), ttl(unique_id) FROM test-table1 LIMIT 1;
event_uuid | create_date | ttl(action) | ttl(business_profile_id) | ttl(client_uuid) | ttl(label) | ttl(params) | ttl(unique_id)
--------------------------------------+---------------------------------+-------------+--------------------------+------------------+------------+-------------+----------------
00000000-001b-adf7-0000-0000001badf7 | 2018-01-10 10:08:45.476000+0000 | null | null | null | null | null | null
So neither TTL nor tombstone deletion should be related to the issue. It's likely that there are no duplicates, as the data is highly randomized. No replication factor changes were made either.
What I found out is that the data volume starts to decrease every time cassandra-stress is stopped. Sadly, I still don't know the exact reason.
I guess, when you think of it from a Cassandra perspective, there really are only a few options for why your data shrinks:
1) TTL expired past GC Grace
2) Deletes past GC grace
3) The same records exists in multiple sstables (i.e. "updates")
4) Change in RF to a lower number (essentially a "cleanup" - token reassignment)
In any of the above cases, when compaction runs it will either remove or reconcile records, which can shrink space consumption. Without the sstables around any more, it's hard to determine which of these, or which combination, occurred.
-Jim
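Option 3 in the list above is worth illustrating: when the same partitions exist in multiple sstables, compaction keeps a single merged copy per cell (last write wins), so total volume shrinks even though no data is lost. A toy Python sketch, with made-up keys and timestamps:

```python
# Toy model of compaction merging overlapping sstables. Each "sstable" maps
# a partition key to (value, write_timestamp); merging keeps the newest.
def compact(sstables):
    merged = {}
    for sstable in sstables:
        for key, (value, ts) in sstable.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged

# Three sstables holding largely the same partitions (e.g. a workload
# repeatedly overwriting the same keys):
s1 = {"a": ("v1", 1), "b": ("v1", 1)}
s2 = {"a": ("v2", 2), "c": ("v1", 2)}
s3 = {"b": ("v2", 3), "c": ("v2", 3)}

result = compact([s1, s2, s3])
assert len(result) == 3            # 6 input rows merge down to 3
assert result["a"] == ("v2", 2)    # newest write wins
```

The compaction log above shows exactly this shape: "1,256,595 total partitions merged to 339,390", i.e. most partitions existed in several input sstables.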

Cassandra table did not get compacted in 2 years?

I have the following table definition:
CREATE TABLE snap_websites.backend (
key blob,
column1 blob,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.100000001
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = 'backend'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy', 'max_threshold': '10', 'min_threshold': '4', 'tombstone_threshold': '0.02'}
AND compression = {'enabled': 'false'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 3600
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Looking at the compaction setup, it seems it should get compacted once in a while... However, after about 2 years the table was really slow on a SELECT, and I could see 12,281 files in the corresponding data folder! I only checked one node; I would imagine all the nodes had similar piles of files.
Why does that happen? Could it be that I never give Cassandra a break, so it is never given time to run compaction? (i.e. I pretty much always have some process running against that table, but I did not expect things to get this bad! Wow!)
The command line worked well:
nodetool compact snap_websites backend
and the number of files went all the way down to 9 (after all, I have just 2 rows of data in that table at the moment!)
What I really need to know is: what is preventing Cassandra from running the compaction process?
I don't remember much about DTCS, but if you can, I'd consider replacing it with TWCS. It works well for time series data (DTCS was mentioned to be going away in the near future).

How to debug why hints don't get processed after all nodes are up again

I did some extended maintenance today on node d1r1n3 in a 14-node DSC 2.1.15 cluster, but finished well within the cluster's max hint window.
After bringing the node back up, most other nodes' hints disappeared within minutes, except on two nodes (d1r1n4 and d1r1n7), where only part of the hints went away.
After a few hours of still showing 1 active HintedHandoff task, I restarted node d1r1n7, and then d1r1n4 quickly emptied its hint table.
How can I see which node the hints stored on d1r1n7 are destined for?
And possibly, how can I get the hints processed?
Update:
Later, around the end of the max hint window (counted from taking node d1r1n3 offline for maintenance), I found that d1r1n7's hints had vanished, leaving us confused about whether this was okay or not: had the hints been processed, or had they simply expired at the end of the max hint window?
If the latter, would we need to run a repair on node d1r1n3 after its maintenance (this takes quite some time and IO... :/)? If we now applied read [LOCAL]QUORUM instead of the current read ONE with one DC and RF=3, could that trigger read-path repairs on an as-needed basis and maybe spare us a full repair in this case?
Answer: it turned out hinted_handoff_throttle_in_kb was at the default 1024 on these two nodes, while the rest of the cluster had 65536 :)
In Cassandra 2.1.15, hints are stored in the system.hints table:
cqlsh> describe table system.hints;
CREATE TABLE system.hints (
target_id uuid,
hint_id timeuuid,
message_version int,
mutation blob,
PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (hint_id ASC, message_version ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = 'hints awaiting delivery'
AND compaction = {'enabled': 'false', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The target_id correlates with the node ID. For example, in my sample 2-node cluster with RF=2:
nodetool status
Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.47 KB 256 100.0% d00c4b10-2997-4411-9fc9-f6d9f6077916 rack1
DN 127.0.0.2 75.4 KB 256 100.0% 1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa rack1
I executed the following while node2 was down:
cqlsh> insert into ks.cf (key,val) values (1,1);
cqlsh> select * from system.hints;
target_id | hint_id | message_version | mutation
--------------------------------------+--------------------------------------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa | e80a6230-ec8c-11e6-a1fd-d743d945c76e | 8 | 0x0004000000010000000101cfb4fba0ec8c11e6a1fdd743d945c76e7fffffff80000000000000000000000000000002000300000000000547df7ba68692000000000006000376616c0000000547df7ba686920000000400000001
(1 rows)
As can be seen, system.hints.target_id correlates with the Host ID in nodetool status (1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa).
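To answer the original question programmatically, one can join system.hints.target_id against the Host ID column of nodetool status. A small illustrative Python helper (the parsing approach and sample output are my own, not part of Cassandra's tooling):

```python
import re

# Sample `nodetool status` output, as in the answer above.
STATUS = """\
UN  127.0.0.1  71.47 KB  256  100.0%  d00c4b10-2997-4411-9fc9-f6d9f6077916  rack1
DN  127.0.0.2  75.4 KB   256  100.0%  1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa  rack1
"""

def host_id_to_address(status_text: str) -> dict:
    """Map each Host ID (UUID) in nodetool status output to its address."""
    mapping = {}
    for line in status_text.splitlines():
        m = re.match(r"(\w\w)\s+(\S+)\s+.*?([0-9a-f-]{36})", line)
        if m:
            mapping[m.group(3)] = m.group(2)
    return mapping

hosts = host_id_to_address(STATUS)
# A hint with this target_id is destined for the down node 127.0.0.2:
assert hosts["1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa"] == "127.0.0.2"
```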

Select distinct gives incorrect values even if performed on primary key Cassandra

I'm running Cassandra version 2.1.2 and cqlsh 5.0.1.
Here is the table weather.log; weather is the keyspace, and queries use consistency level ONE.
I have 2 nodes configured.
CREATE KEYSPACE weather WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': '1'} AND durable_writes = true;
CREATE TABLE weather.log (
ip inet,
ts timestamp,
city text,
country text,
PRIMARY KEY (ip, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
When we run the query:
select distinct ip from weather.log
We get inconsistent, wrong responses: once we get 99, the next time 1600, etc. (where the actual number should be > 2000).
I have also tried this query with the consistency level set to ALL. It didn't work.
Why is this happening? I need to get all the keys. How do I get all the primary keys?
It looks like you might be affected by CASSANDRA-8940. I'd suggest updating to the latest 2.1.x release and verifying whether the issue is fixed for you.
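For clarity on what a correct result looks like: SELECT DISTINCT ip should return exactly one entry per partition key, regardless of how many clustering rows (ts values) each partition holds, and the count should be stable across runs. A trivial Python model of that semantics, with made-up rows:

```python
# Rows of weather.log as (partition key, clustering key) pairs.
rows = [
    ("10.0.0.1", 1), ("10.0.0.1", 2),
    ("10.0.0.2", 1),
    ("10.0.0.3", 1), ("10.0.0.3", 2), ("10.0.0.3", 3),
]

# SELECT DISTINCT ip == the set of partition keys.
distinct_ips = {ip for ip, ts in rows}
assert len(distinct_ips) == 3  # deterministic, unlike the buggy behavior
```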

OperationTimedOut during cqlsh alter table

I am receiving an OperationTimedOut error while running an ALTER TABLE command in cqlsh. How is that possible? Since this is just a table metadata update, shouldn't the operation run almost instantaneously?
Specifically, this is an excerpt from my cqlsh session:
cqlsh:metric> alter table metric with gc_grace_seconds = 86400;
OperationTimedOut: errors={}, last_host=sandbox73vm230
The metric table currently has gc_grace_seconds of 864000. I am seeing this behavior in a 2-node cluster and in a 6-node, 2-datacenter cluster. My nodes seem to be communicating fine in general (e.g. I can insert on one and read from the other). Here is the full table definition (a cyanite 0.1.3 schema with DateTieredCompactionStrategy, clustering, and caching changes):
CREATE TABLE metric.metric (
tenant text,
period int,
rollup int,
path text,
time bigint,
data list<double>,
PRIMARY KEY ((tenant, period, rollup, path), time)
) WITH CLUSTERING ORDER BY (time ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'timestamp_resolution': 'SECONDS', 'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = 'NONE';
I realize at this point the question is pretty old, and you may have either figured out the answer or otherwise moved on, but I wanted to post this in case others stumble upon it.
The default cqlsh request timeout is 10 seconds. You can adjust it by starting cqlsh with the --request-timeout option set to a value that allows your ALTER TABLE to run to completion, e.g.:
cqlsh --request-timeout=1000000
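If you'd rather not pass the flag on every invocation, the timeout can also be persisted in ~/.cassandra/cqlshrc. Note that the option name has varied across versions (2.1-era cqlsh used client_timeout; later versions use request_timeout), so treat this snippet as a sketch to verify against your cqlsh version:

```ini
[connection]
; newer cqlsh; 2.1-era versions use: client_timeout = 1000000
request_timeout = 1000000
```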
