Cassandra table tombstone count is not 0

I have a problem with Cassandra.
If I run nodetool -h 10.169.20.8 cfstats name.name -H
I get stats like this:
Read Count: 0
Read Latency: NaN ms.
Write Count: 739812
Write Latency: 0.038670616318740435 ms.
Pending Flushes: 0
Table: name
SSTable count: 10
Space used (live): 1.48 GB
Space used (total): 1.48 GB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 3.04 MB
SSTable Compression Ratio: 0.5047407001982581
Number of keys (estimate): 701190
Memtable cell count: 22562
Memtable data size: 14.12 MB
Memtable off heap memory used: 0 bytes
Memtable switch count: 7
Local read count: 0
Local read latency: NaN ms
Local write count: 739812
Local write latency: 0.043 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 2.39 MB
Bloom filter off heap memory used: 2.39 MB
Index summary off heap memory used: 302.03 KB
Compression metadata off heap memory used: 366.3 KB
Compacted partition minimum bytes: 87 bytes
Compacted partition maximum bytes: 3.22 MB
Compacted partition mean bytes: 2.99 KB
Average live cells per slice (last five minutes): 1101.2357892212697
Maximum live cells per slice (last five minutes): 1109
Average tombstones per slice (last five minutes): 271.6848030693603
Maximum tombstones per slice (last five minutes): 1109
Dropped Mutations: 0 bytes
Why are the tombstone stats not 0? We only write to Cassandra here; no one deletes records. We don't use TTLs; they are left at the default settings.
Second problem (probably connected to this issue): the number of rows in the tables changes randomly, and we don't understand what is going on.

I am not sure there is a way to explain the tombstones if you are not doing any deletes.
I can suggest two methods for analyzing this; they may help you understand what is happening and how.
There is a tool named sstable2json that dumps an sstable to JSON.
For example, for the following schema
cqlsh> describe schema;
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
CREATE TABLE test.t1 (
key text PRIMARY KEY,
value text
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Running sstable2json on an sstable file with a tombstone for a complete partition produces the following:
[
{"key": "key",
"metadata": {"deletionInfo": {"markedForDeleteAt":1475270192779047,"localDeletionTime":1475270192}},
"cells": []}
]
In this case the marker is for the partition with the key "key".
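To check a larger dump for partition-level tombstones, the JSON can be scanned for deletionInfo entries. The following is a minimal sketch, assuming the dump has the same shape as the sample above; the function name partition_tombstones is made up for illustration, and cell-level or range tombstones are not handled.

```python
import json
from datetime import datetime, timezone

# Sample sstable2json output, copied from the dump above.
SAMPLE = '''
[
{"key": "key",
 "metadata": {"deletionInfo": {"markedForDeleteAt":1475270192779047,"localDeletionTime":1475270192}},
 "cells": []}
]
'''

def partition_tombstones(dump_text):
    """Return (key, deletion time in UTC) for each partition carrying deletionInfo.

    Hypothetical helper: it only looks at partition-level deletion markers.
    """
    found = []
    for partition in json.loads(dump_text):
        info = partition.get("metadata", {}).get("deletionInfo")
        if info is None:
            continue  # no partition-level tombstone on this entry
        # localDeletionTime is expressed in seconds since the epoch.
        deleted_at = datetime.fromtimestamp(info["localDeletionTime"], tz=timezone.utc)
        found.append((partition["key"], deleted_at))
    return found

if __name__ == "__main__":
    for key, deleted_at in partition_tombstones(SAMPLE):
        print(key, deleted_at.isoformat())
```

The deletion time can then be compared with application logs to find out which client issued the delete.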
Another method (given that the tombstone count is increasing) is to capture traffic with tcpdump and analyze it with Wireshark. Benoit Canet from ScyllaDB contributed a CQL dissector to Wireshark, which is included in stable release 2.2.0 (https://www.wireshark.org/docs/relnotes/wireshark-2.2.0.html)
Please note that CQL deletes can be found in two message types, QUERY and PREPARED (if deletes are done using prepared statements).
If they are done via prepared statements, you may need to drop the CQL connections to make sure you capture the specific packets that contain the prepared statements.
Here is a sample from wireshark capturing the delete statement from above

N.B.: tombstones can sometimes be created by null bindings in prepared statements - http://thelastpickle.com/blog/2016/09/15/Null-bindings-on-prepared-statements-and-undesired-tombstone-creation.html

Writing a value of null to a column is the same as a deletion and causes a tombstone. Wait... Say what?
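One way to avoid binding nulls is to leave the null columns out of the statement altogether. The sketch below is only an illustration: build_insert is a hypothetical helper, the %s placeholder style is an assumption about the client driver, and the table test.t1 is taken from the schema shown earlier. Table and column names are interpolated directly, so they must come from trusted code, not from user input.

```python
def build_insert(table, row):
    """Build a parameterized INSERT that skips columns whose value is None.

    Hypothetical helper: omitting a column leaves it untouched, whereas
    binding None to it would write a tombstone.
    """
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("no non-null columns to insert")
    columns = ", ".join(present)
    placeholders = ", ".join("%s" for _ in present)
    query = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return query, tuple(present.values())

# The "value" column is None, so it is dropped from the statement.
query, params = build_insert("test.t1", {"key": "k1", "value": None})
print(query)
print(params)
```

The trade-off is that each distinct set of columns yields a different statement, so fewer statements can be prepared once and reused.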

I know this question dates back some years, but in case someone has the same issue with newer Cassandra versions (3+) and wants to remove deleted data, they can run nodetool garbagecollect
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsGarbageCollect.html

Related

Repair status not 100% after repair

I have noticed that some tables show less than 100% "Percent repaired" in the nodetool tablestats output. I have manually executed repairs on all nodes (3 node cluster, RF=3) but the value doesn't seem to change.
Example output:
Table: users
SSTable count: 3
Space used (live): 66636
Space used (total): 66636
Space used by snapshots (total): 0
Off heap memory used (total): 688
SSTable Compression Ratio: 0.5731829674519404
Number of partitions (estimate): 162
Memtable cell count: 11
Memtable data size: 483
Memtable off heap memory used: 0
Memtable switch count: 27
Local read count: 120833
Local read latency: NaN ms
Local write count: 12094
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 91.54
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 568
Bloom filter off heap memory used: 544
Index summary off heap memory used: 112
Compression metadata off heap memory used: 32
Compacted partition minimum bytes: 30
Compacted partition maximum bytes: 1916
Compacted partition mean bytes: 420
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Repair was done with nodetool repair -pr
What is going on?
Percent repaired seems to be a misleading metric: it refers to the percentage of SSTables repaired, but it is only computed under some conditions:
- the tables should not be from system keyspaces
- the tables should have a replication factor greater than 1
- the repair should be incremental or full (non-subrange)
When you use nodetool repair -pr, that will invoke a full repair that won't be able to update this value.
For more information regarding incremental repairs, I would recommend this article from the Last Pickle. Since they adopted the maintenance of the reaper tool, they have become an authority regarding repairs.
Executing nodetool repair -pr will repair the primary range owned by the node that command is executed on.
What does this mean? The node this command is executed on has data that it "owns", i.e., its primary range, but the node also contains data/replicas "owned" by other nodes. With -pr you are not repairing the replicas "owned" by other nodes.
Now, if you execute that command on every single node in the cluster (not data center), it will cover all the token ranges.
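The idea of primary ranges can be pictured with a toy ring. This is a simplified sketch, not Cassandra's implementation: it assumes one token per node and SimpleStrategy-style placement (a range is stored on its owner plus the next nodes clockwise), and the token values are made up.

```python
def primary_ranges(tokens):
    """Map each node's token to its primary range (previous_token, token]."""
    ring = sorted(tokens)
    # ring[i - 1] wraps around to the last token when i == 0.
    return {tok: (ring[i - 1], tok) for i, tok in enumerate(ring)}

def replicas(tokens, owner, rf):
    """Nodes storing the range owned by `owner`: the owner plus the next rf - 1 nodes clockwise."""
    ring = sorted(tokens)
    start = ring.index(owner)
    return [ring[(start + k) % len(ring)] for k in range(rf)]

tokens = [0, 100, 200]  # hypothetical single-token nodes
ranges = primary_ranges(tokens)
for tok, rng in ranges.items():
    print(f"node@{tok}: primary range {rng}, stored on nodes {replicas(tokens, tok, rf=3)}")
```

Each range appears exactly once as a primary range, which is why running repair -pr on every node covers the whole ring without repairing any range twice.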
EDIT / NOTE:
My answer did not properly address the question. Although what I wrote is accurate, the answer to the question is stated in the answer above mine; basically, the percentage repaired is a value that is for incremental repair usage and is not affected by a full repair. (Incremental repair marks the repaired ranges as it works so it does not spend time re-repairing later.)

Does nodetool for cassandra only gather data for a single node or for the entire cluster?

I have a 19-node Cassandra cluster for our internal service. If I log into a node using nodetool and run commands like tablestats, etc, does that gather stats just for that particular node or for the entire cluster?
The nodetool utility for Cassandra gathers data for the entire cluster, not a single node.
For example, if you run a command like this:
command:
nodetool tablestats musicdb.artist
result:
Keyspace: musicdb
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: artist
SSTable count: 1
Space used (live): 62073
Space used (total): 62073
Space used by snapshots (total): 0
Off heap memory used (total): 1400
SSTable Compression Ratio: 0.27975344141453456
Number of keys (estimate): 1000
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1264
Bloom filter off heap memory used: 1256
Index summary off heap memory used: 128
Compression metadata off heap memory used: 16
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 149
Compacted partition mean bytes: 149
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
The status above for the table artist, which belongs to keyspace musicdb, is from the entire cluster.
Most nodetool commands operate on a single node in the cluster if -h
is not used to identify one or more other nodes. If the node from
which you issue the command is the intended target, you do not need
the -h option to identify the target; otherwise, for remote
invocation, identify the target node, or nodes, using -h.
Nodetool Utility
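Going by the documentation excerpt above, each invocation reports on one node, so a cluster-wide figure has to be assembled from the output of every node. The sketch below assumes the output was already collected per node (for example with nodetool -h <host> tablestats) and covers a single table; the node names and counter values are invented for illustration.

```python
import re

def parse_stats(text):
    """Parse 'Name: value' lines of tablestats/cfstats output into a dict.

    Assumes output for a single table; with several tables, later values
    would overwrite earlier ones.
    """
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.*)$", line)
        if match:
            stats[match.group(1).strip()] = match.group(2).strip()
    return stats

# Hypothetical per-node output, trimmed to the lines of interest.
per_node_output = {
    "node1": "Table: artist\nLocal read count: 120\nLocal write count: 10\n",
    "node2": "Table: artist\nLocal read count: 80\nLocal write count: 12\n",
}

total_reads = sum(
    int(parse_stats(out)["Local read count"]) for out in per_node_output.values()
)
print(total_reads)
```

Note that a summed local count includes the work done by every replica, so it is not the same as the number of client requests.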

Cassandra: read/s write/s

I'm trying to figure out the throughput of my Cassandra cluster, and can't figure out how to use nodetool to accomplish that. Below is a sample output:
Starting NodeTool
Keyspace: realtimetrader
Read Count: 0
Read Latency: NaN ms.
Write Count: 402
Write Latency: 0.09648756218905473 ms.
Pending Flushes: 0
Table: currencies
SSTable count: 1
Space used (live): 5254
Space used (total): 5254
Space used by snapshots (total): 0
Off heap memory used (total): 40
SSTable Compression Ratio: 0.0
Number of keys (estimate): 14
Memtable cell count: 1608
Memtable data size: 567
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 402
Local write latency: 0.106 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0,00000
Bloom filter space used: 24
Bloom filter off heap memory used: 16
Index summary off heap memory used: 16
Compression metadata off heap memory used: 8
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 149
Compacted partition mean bytes: 149
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
I run the command:
nodetool cfstats
to get this, and then subtract the earlier "Local read count:" value from the later one.
But I'm not sure what "Local" means here.
Does it mean it is local to that node, so that in a ring of 5 nodes I should multiply the value by 5? Or will the simple subtraction give me the correct result?
Also, which JMX bean should I be looking at to get these numbers?
Have a look at this nodetool cfstats.
I think what you are looking for is 'Read Latency' and 'Write Latency'.
These fields indicate how fast your reads/writes are in your cluster.
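For throughput rather than latency, the subtraction described in the question can be turned into a rate. The sketch below assumes the "Local" counters are cumulative per-node values sampled twice on each node, a fixed interval apart; the node names and sample values are hypothetical.

```python
def rate_per_second(count_before, count_after, interval_seconds):
    """Average operations per second between two samples of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if count_after < count_before:
        # The counter went backwards, e.g. the node restarted between samples.
        raise ValueError("counter was reset between samples")
    return (count_after - count_before) / interval_seconds

# Hypothetical samples of "Local write count" taken 60 seconds apart on each node.
samples = {
    "node1": (402, 1002),
    "node2": (380, 920),
}
per_node = {
    node: rate_per_second(before, after, 60)
    for node, (before, after) in samples.items()
}
cluster_total = sum(per_node.values())
print(per_node)
print(cluster_total)
```

Summing the measured per-node rates is safer than multiplying one node's rate by the node count, because load is rarely spread perfectly evenly. With a replication factor above 1, every replica counts a write locally, so the summed write rate is higher than the client-side write rate.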

Cassandra Cluster with 2 Nodes got Read TimeOut/NoHostAvailable Exception

I am implementing a recommendation engine in .NET C# and using Cassandra to store the data. I am still new to C*, having started using it 2 months ago. At the moment I have only 2 nodes in my cluster (single DC), deployed on Azure DS2 VMs (each has 7 GB RAM, 2 cores). I set RF=2 and CL=1 for both reads and writes. I set the timeouts in the YAML config file as below
read_request_timeout_in_ms: 60000
write_request_timeout_in_ms: 120000
counter_write_request_timeout_in_ms: 120000
request_timeout_in_ms: 120000
I set a lower read query timeout on the client side (30 secs each).
The data stored in Cassandra is user history, item counters, and recommended-item data. I created an API (hosted in an Equinix DC) for my recommendation engine. Its job is very simple: it reads all recommended item IDs from the recommended_items table in C* every time a user opens the website page. That means the query for each user is very simple:
select * from recommended_items where username = <username>
When I did load testing with up to 500 users/threads, it was fine and very fast. But when the online site calls the API to read from the C* table, I get read timeouts very often, even though there are usually fewer than 20 users at the same time.
I monitor the Cassandra nodes' activity using DataDog and found that only node #2 keeps getting timeouts (the seed node is node #1, though as I understand it the seed doesn't really matter except during bootstrapping). However, every time the timeout happens and I try to query using cqlsh on both nodes, node #1 is the one that returns OperationTimeOut Exception.
I have been trying to find the root cause of this issue. Does it have anything to do with the coordinator node being down (I read this article)? Or is it because I have only 2 nodes?
When the timeout happens (the webpage shows nothing) and I refresh the page that calls the API, it loads for a long time before showing nothing again (because of the timeout). Surprisingly, a few minutes later the log shows that all those requests were actually successful, even though the web page has been closed. It's as if the read request was still running after the page was closed.
The exceptions look like these (they didn't happen together):
None of the hosts tried for query are available (tried: 13.73.193.140:9042,13.75.154.140:9042)
OR
Cassandra timeout during read query at consistency LocalOne (0 replica(s) responded over 1 required)
Does anyone have any suggestions about my problem? Thank you.
output of cfstats .recommended_items
NODE #1
Read Count: 683
Read Latency: 2.970781844802343 ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: recommendedvideos
Space used (live): 96034775
Space used (total): 96034775
Space used by snapshots (total): 40345163
Off heap memory used (total): 192269
SSTable Compression Ratio: 0.4405242717559795
Number of keys (estimate): 101493
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 376
Local read latency: 1.647 ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 126928
Bloom filter off heap memory used: 126896
Index summary off heap memory used: 40085
Compression metadata off heap memory used: 25288
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 454826
Compacted partition mean bytes: 2201
Average live cells per slice (last five minutes): 160.28657799274487
Maximum live cells per slice (last five minutes): 2759
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
NODE #2
Read Count: 733
Read Latency: 3.0032783083219647 ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: recommendedvideos
Space used (live): 99145806
Space used (total): 99145806
Space used by snapshots (total): 15101127
Off heap memory used (total): 196008
SSTable Compression Ratio: 0.44063804831658704
Number of keys (estimate): 103863
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 453
Local read latency: 1.344 ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 129056
Bloom filter off heap memory used: 129040
Index summary off heap memory used: 40856
Compression metadata off heap memory used: 26112
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 454826
Compacted partition mean bytes: 2264
Average live cells per slice (last five minutes): 170.7715877437326
Maximum live cells per slice (last five minutes): 2759
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0

Why is space usage 0 although I have already inserted >40k rows

Currently, I have 3 Cassandra nodes.
I created a table named events.
After inserting >40k rows, I ran the following command on each node.
nodetool -h localhost cfstats
This is the output from one of the nodes
Table: events
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 43516
Off heap memory used (total): 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 1
Memtable cell count: 102675
Memtable data size: 4224801
Memtable off heap memory used: 0
Memtable switch count: 1
Local read count: 0
Local read latency: NaN ms
Local write count: 4223
Local write latency: 0.085 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
To my surprise, Space used (live) and Space used (total) are 0. The other nodes also show 0 for Space used (live) and Space used (total).
However, when I run a SELECT, I get multiple rows that were inserted previously.
Why are Space used (live) and Space used (total) 0 on all nodes?
Your memtables have not yet flushed to disk. A flush is generally triggered by a few things:
- the memtable reaching the max threshold size
- a commit log segment responsible for data in that memtable expiring
- the user calling nodetool flush
If you insert 40k rows and then do nothing, as long as they fit comfortably in memory, they will stay in memory. You will see no permanent disk usage for those rows since there is no on-disk sstable holding their values.
The persistence of those rows is guaranteed by the commit log, which stores mutations on disk in the order in which they occurred and can be replayed in case of node failure. The commit log is a rolling log, so when a commit log segment is about to expire, Cassandra will flush the memtable holding the data in that segment to an on-disk sstable.
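The situation described above can be recognized from the stats themselves: no SSTables on disk, but data in the memtable. A minimal sketch, assuming raw byte values (output produced without the -H flag) and a single table in the text; unflushed_only is a hypothetical helper name, and the sample values are taken from the question.

```python
def unflushed_only(stats_text):
    """True when the table has memtable data but no SSTables on disk yet."""
    values = {}
    for line in stats_text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # keep only 'Name: value' lines
            values[name.strip()] = value.strip()
    return int(values["SSTable count"]) == 0 and int(values["Memtable data size"]) > 0

# Lines copied from the cfstats output in the question.
SAMPLE = """Table: events
SSTable count: 0
Space used (live): 0
Memtable cell count: 102675
Memtable data size: 4224801
"""
print(unflushed_only(SAMPLE))
```

After running nodetool flush on the node, the same check should return False, and Space used (live) should become non-zero.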
