Cassandra table looks empty in cqlsh, but nodetool cfstats thinks otherwise

Using nodetool cfstats I can see that a particular table (table1) is using 59 MB and has 545,597 keys. Another related table (table2) is using 568 MB and has 2,506,141 keys.
Using cqlsh, when I do select count(*) from table1, it pauses for about 7 seconds and then returns a count of 0. However, if I do select count(*) from table2, it pauses for much longer and then returns a count of 2,481,669.
I also tried select * from table1 and select * from table2. The first takes 7 seconds and then returns nothing. The second instantly starts paging through results.
I'm well aware these are expensive operations; however, this is a single dev server running only this one Cassandra instance. It's a cluster of 1 and not meant for production. I just want to figure out why the values in table1 are invisible.
Is it possible that table1 actually has no values in it? That shouldn't be possible, given that I just ran a job to add a bunch of values to it. I also ran nodetool compact, so that should have eliminated all the tombstones, and cfstats should show what's actually there, right? Here are the cfstats for table1 after I ran nodetool compact:
SSTable count: 1
Space used (live): 59424392
Space used (total): 59424392
Space used by snapshots (total): 73951087
Off heap memory used (total): 806762
SSTable Compression Ratio: 0.28514022725059224
Number of keys (estimate): 545597
Memtable cell count: 393204
Memtable data size: 17877650
Memtable off heap memory used: 0
Memtable switch count: 3
Local read count: 5
Local read latency: 0.252 ms
Local write count: 545804
Local write latency: 0.013 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 611792
Bloom filter off heap memory used: 611784
Index summary off heap memory used: 180202
Compression metadata off heap memory used: 14776
Compacted partition minimum bytes: 216
Compacted partition maximum bytes: 310
Compacted partition mean bytes: 264
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 6.0
Maximum tombstones per slice (last five minutes): 7
If it helps, I'm using Apache Cassandra 2.2.0 on a Linux server.

Cassandra saves all the data in files (SSTables). For speed, writes append data at the end of the files (the indexes certainly work differently, but how they function isn't described in detail...)
Deleting data (or expiring it, in your case) does not remove the data from the files, because that would otherwise mean a lot of large moves and tons of I/O. So instead the entry is just marked as "dead" (hence the name "tombstone").
Once in a while, the compaction system comes in (assuming you did not turn it off for that table) and compacts SSTables. That means it reads from the start of the file and moves live entries over dead ones. More or less, something like this, assuming B gets deleted at some point (the columns, left to right, represent different points in time):
Creation      Deletion         Compaction
A             A                A
B             B-tombstone      C
C             C
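On a single dev node you can watch this cycle with something along these lines; a minimal sketch where the keyspace name and data path are placeholders, and note that tombstones newer than gc_grace_seconds will survive the compaction:
nodetool flush my_keyspace table1
sstablemetadata /var/lib/cassandra/data/my_keyspace/table1-*/*-Data.db | grep -i droppable
nodetool compact my_keyspace table1
sstablemetadata /var/lib/cassandra/data/my_keyspace/table1-*/*-Data.db | grep -i droppable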
If your table has too many tombstones, the compaction may fail (I do not understand exactly why it can fail, but that's what I have read). A table that fails compaction is marked as "do not ever compact", which is a big problem, if you ask me. And a table with half a million keys could very well be failing.
While the table is in the "Deletion" state (i.e. it includes tombstones), a SELECT that scans over a tombstone still creates a tombstone object in memory (do not ask me why, I have no idea; it looks like Cassandra would not work right otherwise...). Hence the 7 seconds: reading all those tombstones and creating a Java object for each one of them.
cqlsh includes a tracing feature that can be used to see the number of tombstones a query touches. It prints out a lot of details you'll want to know about:
TRACING ON;
SELECT COUNT(*) FROM table1;
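In the trace output, the line to watch is the read summary, which looks roughly like this (the exact wording differs between versions; the numbers here are purely illustrative):
Read 0 live rows and 123456 tombstone cells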

Related

Dealing with uncompactable/overlapping sstables in Cassandra

We have a new cluster running Cassandra 2.2.14, and have left compactions to "sort themselves out". This is in our UAT environment, so load is low. We run STCS.
We are seeing forever-growing tombstones. I understand that compactions will take care of the data eventually, once the sstable is eligible for compaction.
This is not occurring often enough for us, so I enabled some settings as a test (I am aware they are aggressive; this is purely for testing):
'tombstone_compaction_interval': '120',
'unchecked_tombstone_compaction': 'true',
'tombstone_threshold': '0.2',
'min_threshold': '2'
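(For reference, sub-options like these are applied with an ALTER TABLE along the following lines; the keyspace and table names are placeholders:)
ALTER TABLE my_keyspace.my_table WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_compaction_interval': '120',
    'unchecked_tombstone_compaction': 'true',
    'tombstone_threshold': '0.2',
    'min_threshold': '2'
};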
These settings did result in some compactions occurring; however, the number of dropped tombstones was low, and the droppable ratio did not go below the threshold (0.2).
After these settings were applied, this is what I can see from sstablemetadata:
Estimated droppable tombstones: 0.3514636277302944
Estimated droppable tombstones: 0.0
Estimated droppable tombstones: 6.007563159628437E-5
Note that this is only one CF, and there are much worse CFs out there (90% tombstones, etc). I am using this one as an example, but all CFs are suffering the same symptoms.
tablestats:
SSTable count: 3
Space used (live): 3170892738
Space used (total): 3170892738
Space used by snapshots (total): 3170892750
Off heap memory used (total): 1298648
SSTable Compression Ratio: 0.8020960426857765
Number of keys (estimate): 506775
Memtable cell count: 4
Memtable data size: 104
Memtable off heap memory used: 0
Memtable switch count: 2
Local read count: 2161
Local read latency: 14.531 ms
Local write count: 212
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 645872
Bloom filter off heap memory used: 645848
Index summary off heap memory used: 192512
Compression metadata off heap memory used: 460288
Compacted partition minimum bytes: 61
Compacted partition maximum bytes: 5839588
Compacted partition mean bytes: 8075
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 124.0
Maximum tombstones per slice (last five minutes): 124
The obvious answer here is that the tombstones were not eligible for removal.
gc_grace_seconds is set to 10 days, and has not been moved.
I dumped one of the sstables to json, and I can see tombstones dating back to April 2019:
{"key": "353633393435353430313436373737353036315f657370a6215211e68263740a8cc4fdec",
"cells": [["d62cf4f420fb11e6a92baabbb43c0a93",1566793260,1566793260977489,"d"],
["d727faf220fb11e6a67702e5d23e41ec",1566793260,1566793260977489,"d"],
["d7f082ba20fb11e6ac99efca1d29dc3f",1566793260,1566793260977489,"d"],
["d928644a20fb11e696696e95ac5b1fdd",1566793260,1566793260977489,"d"],
["d9ff10bc20fb11e69d2e7d79077d0b5f",1566793260,1566793260977489,"d"],
["da935d4420fb11e6a960171790617986",1566793260,1566793260977489,"d"],
["db6617c020fb11e6925271580ce42b57",1566793260,1566793260977489,"d"],
["dc6c40ae20fb11e6b1163ce2bad9d115",1566793260,1566793260977489,"d"],
["dd32495c20fb11e68f7979c545ad06e0",1566793260,1566793260977489,"d"],
["ddd7d9d020fb11e6837dd479bf59486e",1566793260,1566793260977489,"d"]]},
So I do not believe gc_grace_seconds is the issue here.
I have run a manual user-defined compaction over every Data.db file within the column family folder (a single Data.db file at a time). Compactions ran, but there was very little change to the tombstone values. The old data still remains.
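(For anyone following along, a user-defined compaction is triggered through the CompactionManager MBean; a rough jmxterm sketch, where the jar name, JMX port and sstable file name are placeholders and the operation signature can differ slightly between versions:)
echo "run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction lb-42-big-Data.db" \
  | java -jar jmxterm-1.0.2-uber.jar -l localhost:7199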
I can confirm repairs have occurred, yesterday actually. I can also confirm repairs have been running regularly, with no issues showing in the logs.
So repairs are fine. Compactions are fine.
All I can think of is overlapping SSTables.
The final test is to run a full compaction on the column family. I performed a user-defined compaction (not nodetool compact) on the 3 SSTables using JMXterm.
This resulted in a singular SSTable file, with the following:
Estimated droppable tombstones: 9.89886650537452E-6
If I look for the example epoch above (1566793260), it is not visible, nor is the key. So it was compacted out, or Cassandra did something with it.
The total number of lines containing a tombstone ("d") flag is 1,317, out of the 120-million-line dump, and the epoch values are all within the last 10 days. Good.
So I assume the E-6 value is a very small percentage and sstablemetadata is simply having trouble displaying it.
So, success, right?
But it took a full compaction to remove the old tombstones. As far as I am aware, a full compaction is only a last-ditch maneuver.
My questions are -
How can I determine if overlapping sstables is my issue here? I can't see any other reason why the data would not compact out, unless it is overlap-related.
How can I resolve overlapping sstables without performing a full compaction? I am afraid this is simply going to recur in a few weeks' time. I don't want to get stuck performing full compactions regularly to keep tombstones at bay.
What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?
Cheers.
To answer your questions:
How can I determine if overlapping sstables is my issue here? I can't see any other reason why the data would not compact out, unless it is overlap-related.
If the tombstones weren't generated by TTL, most of the time the tombstones and the shadowed data end up in different sstables. When using STCS and there is a low volume of writes into the cluster, few compactions will be triggered, which lets the tombstones stay around for an extended time. If you have the partition key of a tombstone, running nodetool getsstables -- <keyspace> <table> <key> on a node will return all sstables that contain that key on the local node. You can dump the sstable content to confirm, as sketched below.
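(A rough sketch of that check; the keyspace, table, key and file names are placeholders, and on 2.2 the dump tool is the deprecated sstable2json, replaced by sstabledump in 3.x:)
nodetool getsstables -- my_keyspace my_table some_partition_key
sstable2json /var/lib/cassandra/data/my_keyspace/my_table-*/lb-42-big-Data.db > dump.json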
How can I resolve overlapping sstables without performing a full compaction? I am afraid this is simply going to recur in a few weeks' time. I don't want to get stuck performing full compactions regularly to keep tombstones at bay.
There is a newer option, nodetool compact -s, which does a major compaction but splits the output into 4 sstables of different sizes. This solves the old problem of a major compaction producing one single large sstable. If the droppable tombstone ratio is as high as 80-90%, the resulting sstables will be even smaller, as the majority of the tombstones will have been purged.
In newer versions of Cassandra (3.10+), there is a new tool, nodetool garbagecollect, to clean up tombstones. However, there are limitations to this tool: not all kinds of tombstones can be removed by it.
All that being said, for your situation with overlapping sstables and a low volume of activity/less frequent compactions, you either have to find all the related sstables and use a user-defined compaction, or do a major compaction with -s. https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsCompact.html
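(In nodetool terms, the two options look roughly like this; keyspace and table names are placeholders:)
nodetool compact -s my_keyspace my_table          # major compaction, output split into several sstables
nodetool garbagecollect my_keyspace my_table      # 3.10+, rewrites sstables to drop deleted data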
What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?
Fast-growing tombstones usually indicate a data modeling problem: the application may be inserting nulls, periodically deleting data, or using collections and doing updates instead of appends. If your data is a time series, check whether it makes sense to use TTL and TWCS.
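(For the time-series case, a minimal sketch of a table that combines a default TTL with TWCS; all names and window settings here are illustrative:)
CREATE TABLE my_keyspace.sensor_readings (
    sensor_id text,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1}
  AND default_time_to_live = 864000;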

Cassandra tombstones with TTL

I have worked with Cassandra (DSE) for quite some time and am trying to understand something that isn't quite clear. We're running DSE 5.1.9 for this illustration. It's a single-node cluster (if you have a multi-node cluster, ensure RF = node count to make things easier).
It's a very simple example:
Create the following simple table:
CREATE TABLE mytable (
status text,
process_on_date_time int,
PRIMARY KEY (status, process_on_date_time)
) WITH CLUSTERING ORDER BY (process_on_date_time ASC)
AND gc_grace_seconds = 60;
I have a piece of code that inserts 5k records at a time, up to 200k total records, with a TTL of 300 seconds. The status is ALWAYS "pending" and process_on_date_time is a counter that increments by 1, starting at 1 (all unique records, 1 through 200k basically).
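(Each insert is roughly of this form; the values shown are illustrative:)
INSERT INTO mytable (status, process_on_date_time) VALUES ('pending', 1) USING TTL 300;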
I run the code and then once it completes, I flush the memtable to disk. There's only a single sstable created. After this, no compaction, no repair, nothing else runs that would create or change the sstable configuration.
After the sstable dump, I go into cqlsh, turn on tracing, set consistency to LOCAL_ONE and paging off. I then run this repetitively:
SELECT * from mytable where status = 'pending' and process_on_date_time <= 300000;
What is interesting is I see things like this (cutting out some text for readability):
Run X) Read 31433 live rows and 85384 tombstone cells (31k rows returned to my screen)
Run X+1) Read 0 live rows and 76376 tombstone cells (0 rows returned to my screen - all rows expired at this point)
Run X+2) Read 0 live rows and 60429 tombstone cells
Run X+3) Read 0 live rows and 55894 tombstone cells
...
Run X+X) Read 0 live rows and 0 tombstone cells
What is going on? The sstable isn't changing (obviously as it's immutable), nothing else inserted, flushed, etc. Why is the tombstone count decreasing until it's at 0? What causes this behavior?
I would expect every run to show 100k tombstones read and the query aborting, as all TTLs have expired in the single sstable.
For anyone else who may be curious about this, I opened a ticket with Datastax, and here is what they mentioned:
After the tombstones pass gc_grace_seconds they will be ignored in result sets because they are filtered out after they have passed that point. So you are correct in the assumption that the only way for the tombstone warning to post would be for the data to be past their TTL but still within gc_grace.
And since they are ignored/filtered out they won't have any harmful effect on the system since, like you said, they are skipped.
So what this means is that if TTLs expire but are still within gc_grace_seconds, they will be counted as tombstones when queried against. If the TTLs expire AND gc_grace_seconds has also elapsed, they will NOT be counted as tombstones (they are skipped). The system still has to weed through the expired TTL records, but other than the processing time they are not "harmful" to the query. I found this very interesting, as I don't see it documented anywhere.
Thought others may be interested in this information and could add to it if their experiences differ.

Check table size in cassandra historically

I have a Cassandra table (Cassandra version 2.0) with terabytes of data; here is what the schema looks like:
CREATE TABLE "my_table" (
key ascii,
timestamp bigint,
value blob,
PRIMARY KEY ((key), timestamp)
);
I'd like to delete some data, but first I want to estimate how much disk space it will reclaim.
Unfortunately stats from JMX metrics are only available for the last two weeks, so that's not very useful.
Is there any way to check how much space is used by a certain set of data (for example where timestamp < 1000)?
I was also wondering if there is a way to check the query result set size, so that I could do something like select * from my_table where timestamp < 1000 and see how many bytes the result occupies.
There is no mechanism to see the on-disk size of a given slice of data; it can be pretty far removed from the coordinator of the request, and there are layers that affect it, like compression and multiple sstables, which would make it difficult.
Also be aware that issuing a delete will not immediately reduce disk space. C* does not delete data in place; the sstables are immutable and cannot be changed. Instead it writes a tombstone entry that will disappear after gc_grace_seconds. When sstables are being merged, the tombstone + data combine into just the tombstone. After it is past gc_grace_seconds, the tombstone will no longer be copied during compaction.
The gc_grace is there to prevent losing deletes in a distributed system, since until there's a repair (which should be scheduled ~weekly) there is no absolute guarantee that the delete has been seen by all replicas. If a replica has not seen the delete and you remove the tombstone, the data can come back.
No, not really.
Using sstablemetadata you can find tombstone drop times, the minimum timestamp and the maximum timestamp in the mc-####-big-Data.db files.
Additionally, if you're low on HDD space, consider nodetool cleanup, nodetool clearsnapshot and then finally nodetool repair.
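(A rough sketch of that workflow; the data directory path and the sstable generation number are placeholders:)
sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/mc-42-big-Data.db | grep -i -E 'timestamp|droppable'
nodetool cleanup
nodetool clearsnapshot
nodetool repair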

What delays a tombstone purge when using LCS in Cassandra

In a C* 1.2.x cluster we have 7 keyspaces, and each keyspace contains a column family that uses wide rows. The CF uses LCS. I am periodically doing deletes in the rows. Initially each row may contain at most 1 entry per day. Entries older than 3 months are deleted, and at most 1 entry per week is kept. I have been running this for a few months, but disk space isn't really being reclaimed, and I need to investigate why. To me it looks like the tombstones are not being purged. Each keyspace has around 1300 sstable files (*-Data.db) and each file is around 130 MB in size (sstable_size_in_mb is 128). GC grace seconds is 864000 in each CF. tombstone_threshold is not specified, so it should default to 0.2. What should I look at to find out why disk space isn't being reclaimed?
I've answered a similar question before on the cassandra mailing list here
To elaborate a bit further, it's crucial you understand the Levelled Compaction Strategy and leveldb in general (given normal write behavior)
To summarize the above:
The data store is organized as "levels". Each level is 10 times larger than the level under it. Files in level 0 have overlapping ranges. Files in higher levels do not have overlapping ranges within each level.
New writes are stored as new sstables entering level 0. Every once in a while all sstables in level0 are "compacted" upwards to level 1 sstables, and these are then compacted upwards to level 2 sstables etc..
Reads for a given key will perform ~N reads, N being the number of levels in your tree (which is a function of the total data set size). Level 0 sstables are all scanned (since there is no constraint that each has a non-overlapping range with its siblings). Level 1 and higher sstables, however, have non-overlapping ranges, so the DB knows which one exact sstable in level 1 covers the range of the key you're asking for, and the same for level 2, etc.
The layout of your LCS tree in cassandra is stored in a json file that you can easily check - you can find it in the same directory as the sstables for the keyspace+ColumnFamily. Here's an example of one of my nodes (coupled with the jq tool + awk to summarize):
$ cat users.json | jq ".generations[].members|length" | awk '{print "Level", NR-1, ":", $0, "sstables"}'
Level 0 : 1 sstables
Level 1 : 10 sstables
Level 2 : 109 sstables
Level 3 : 1065 sstables
Level 4 : 2717 sstables
Level 5 : 0 sstables
Level 6 : 0 sstables
Level 7 : 0 sstables
As you've noted, the sstables are usually of equal size, so you can see that each level is roughly 10x the size of the previous one. I would expect the node above to satisfy the majority of read operations in ~5 sstable reads. Once I add enough data for level 4 to reach 10000 sstables and level 5 starts getting populated, my read latency will increase slightly, as each read will incur 1 more sstable read to satisfy. (On a tangent, Cassandra provides bucketed histograms for you to check all these stats.)
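(Those histograms are exposed through nodetool; on the 1.2.x line the command is roughly the following, with keyspace and column family names as placeholders. The SSTables column in its output shows how many sstables recent reads had to touch:)
nodetool cfhistograms my_keyspace my_columnfamily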
With the above out of the way, let's walk through some operations:
We issue a write ["bob"]["age"] = 30. This will enter level0. Usually soon after it'll be compacted to level1. Slowly then it'll spend time in each level but as more writes enter the system, it'll migrate upwards to the highest level N
We issue a read for ["bob"]["age"]. The DB can then check each level from lowest to highest - as soon as it finds the data it can return it. If it reaches the highest level and it hasn't found it, the data doesn't exist on this node. If at any level it finds a tombstone, it can return "not found" as the data has been deleted
We issue a delete of ["bob"]["age"]. This will enter level 0 as a normal write with a special value, a "column tombstone". Usually soon after, it'll be compacted to level 1. It'll then slowly spend time in each level, but as more writes enter the system it'll migrate upwards towards the highest level N. During each compaction, if the sstables being compacted together contain a tombstone (say in L1) and an actual value (say "30" in L2), the tombstone "swallows up" the value and effects the logical deletion at that level. The tombstone itself, however, cannot be discarded yet, and must persist until it has had the chance to compact against every level up to the highest one; this is the only way to ensure that if L2 has age=30, L3 has an older age=29, and L4 has an even older age=28, all of them will have the chance to be destroyed by the tombstone. Only when the tombstone reaches the highest level can it actually be discarded entirely.
We issue a read for ["bob"]["age"]. The DB can then check each level from lowest to highest - as soon as it finds the data it can return it. If it reaches the highest level and it hasn't found it, the data doesn't exist on this node. If at any level it finds a tombstone, it can return "not found" as the data has been deleted
We issue a delete of ["bob"]. This will enter level 0 as a normal write with a special value, a "row tombstone". It will follow the same logic as the column-level tombstone above, except that whenever it collides with existing data for any column under row "bob", it discards that data.
We issue a read for ["bob"]["age"]. The DB can then check each level from lowest to highest - as soon as it finds the data it can return it. If it reaches the highest level and it hasn't found it, the data doesn't exist on this node. If at any level it finds a tombstone, it can return "not found" as the data has been deleted
I hope this answers your questions regarding why deletes in Cassandra, especially with LCS, actually consume space instead of freeing up space (at least initially). The rows and columns the tombstones are attached to have a size themselves (which might actually be larger than the size of the value you're trying to delete, if you have simple values).
The key point here is that they must traverse all the levels up to the highest level L before cassandra will actually discard them, and the primary driver of that bubbling up is the total write volume.
I was hoping for magic sauce here.
We are going to do a JMX-triggered LCS -> STCS -> LCS switch in a rolling fashion through the cluster. Switching the compaction strategy forces the LCS-structured sstables to restructure and apply the tombstones (in our version of Cassandra we can't force an LCS compaction).
There are nodetool commands to force compactions between tables, but that might screw up LCS. There are also nodetool commands to reassign the level of sstables, but again, that might foobar LCS if you muck with its structure.
What really should probably happen is that row tombstones should be placed in a separate sstable type that can be independently processed against "data" sstables to get the purge to occur. The tombstone sstable <-> data sstable processing doesn't remove the tombstone sstable, just removes tombstones from the tombstone sstable that are no longer needed after the data sstable was processed/pared/pruned. Perhaps these can be classified as "PURGE" tombstones for large scale data removals as opposed to more ad-hoc "DELETE" tombstones that would be intermingled with data. But who knows when that would be added to Cassandra.
Thanks for the great explanation of LCS, @minaguib. I think the statement from Datastax is misleading, at least to me:
at most 10% of space will be wasted by obsolete rows.
It depends on how we define "obsolete rows". If "obsolete rows" means ALL the rows which are supposed to be compacted away, then in your example these "obsolete rows" would be age=30, age=29, age=28. We can end up wasting (N-1)/N of the space, as these "age" values can sit in different levels.

Cassandra cfstats: differences between Live and Total used space values

For about a month I have been seeing the following values of used space for 3 of the nodes (I have replication factor = 3) in my Cassandra cluster in the nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can see a 25% difference (~254 GB) between the Live and Total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the Total value has stayed like this for a month on all three of these nodes. I had relied on Cassandra to normalize the data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to clean up the garbage and free up space?
OK, I have a solution. It looks like a Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noticed that Cassandra performs some re-analysis of SSTables during node startup. It removes the SSTables marked as compacted, recalculates the used space, and does some other stuff.
So what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, then the compaction process started and the used space is now shrinking.
Leveled compaction creates sstables of a fixed, relatively small size, in your case 100 MB, that are grouped into "levels". Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
So basically, from this statement in the Cassandra docs, we can conclude that maybe in your case the ten-times-larger next level has not formed yet, resulting in no compaction.
Coming to the second point: since you have set the replication factor to 3, the data has 3 duplicate copies, which contributes to this anomaly.
And finally, the 25% difference between Live and Total space is, as you know, due to delete operations.
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100 MB is going to cause a lot of needless disk I/O, and it will take a long time for data to propagate to the higher levels, making deleted data stick around for a long time.
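(With the Thrift-era CLI used above, that change would look roughly like this; adjust the size to taste:)
update column family BinaryData
  with compaction_strategy=LeveledCompactionStrategy
  and compaction_strategy_options={sstable_size_in_mb: 15};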
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2. Especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with installing new versions, or even running off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.
