Data present despite SSTable deletion - cassandra

I'm working on a single Cassandra 3.11.2 node (RHEL 6.5). In a keyspace named 'test', I have a table named 'test'. I inserted some rows via cqlsh and then ran nodetool flush. I checked the data directory to confirm that an SSTable had been created. Then I deleted all the .db files (from the test.test data directory, using rm *.db).
Strangely, I can still see all the rows in cqlsh! I don't understand how this is happening, since I manually deleted the SSTable.
Given below is my keyspace:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Given below is the table:
CREATE TABLE test.test (
aadhar_number int PRIMARY KEY,
address text,
name text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Given below is the output of the nodetool tablestats command (after I had deleted the SSTable):
Keyspace : test
Read Count: 0
Read Latency: NaN ms
Write Count: 13
Write Latency: 0.11269230769230769 ms
Pending Flushes: 0
Table: test
SSTable count: 1
Space used (live): 5220
Space used (total): 5220
Space used by snapshots (total): 0
Off heap memory used (total): 48
SSTable Compression Ratio: 0.7974683544303798
Number of partitions (estimate): 255
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 4
Local read count: 0
Local read latency: NaN ms
Local write count: 10
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 24
Bloom filter off heap memory used: 16
Index summary off heap memory used: 16
Compression metadata off heap memory used: 16
Compacted partition minimum bytes: 18
Compacted partition maximum bytes: 50
Compacted partition mean bytes: 36
Average live cells per slice (last five minutes): 5.0
Maximum live cells per slice (last five minutes): 5
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
I restarted Cassandra and only then the data stopped showing in cqlsh.
A very good article for understanding filesystem details in Linux.

On Linux, a filename is just a directory entry (a link) pointing to an inode, which in turn references the file's data blocks on disk. When Cassandra opens an SSTable file, it holds an open file descriptor on it. When you use rm, you only remove the link from the directory; the file is still referenced by a live process and is therefore not actually deleted. You can easily check this with lsof (list open files): lsof -p <pid> lists the files opened by a given process (find the Cassandra PID with something like ps aux | grep cassandra), and the removed SSTables show up marked as (deleted).
When you restart Cassandra, the process closes its descriptors, the last reference to each file goes away, and the files are finally deleted. That is why the rows disappeared only after the restart.

Related

Cassandra query 2nd index with pagination become slower when data grow

When I query a secondary index with pagination, the query becomes slower as the data grows.
I thought that with pagination, no matter how large the data grows, querying one page takes the same time. Is that true? Why does my query get slower?
My simplified table is:
CREATE TABLE closed_executions (
domain_id uuid,
workflow_id text,
start_time timestamp,
workflow_type_name text,
PRIMARY KEY ((domain_id), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC)
AND COMPACTION = {
'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
}
AND GC_GRACE_SECONDS = 172800;
And I create a secondary index as
CREATE INDEX closed_by_type ON closed_executions (workflow_type_name);
I query with the following CQL:
SELECT workflow_id, start_time, workflow_type_name
FROM closed_executions
WHERE domain_id = ?
AND start_time >= ?
AND start_time <= ?
AND workflow_type_name = ?
and this code:
query := v.session.Query(templateGetClosedWorkflowExecutionsByType,
request.DomainUUID,
common.UnixNanoToCQLTimestamp(request.EarliestStartTime),
common.UnixNanoToCQLTimestamp(request.LatestStartTime),
request.WorkflowTypeName).Consistency(gocql.One)
iter := query.PageSize(request.PageSize).PageState(request.NextPageToken).Iter()
// PageSize is 10, but could be thousands
Environment:
MacBook Pro
Cassandra: 3.11.0
GoCql: github.com/gocql/gocql master
Observation:
10K rows: under a second
100K rows: ~3 seconds
1M rows: ~17 seconds
Debug log:
INFO [ScheduledTasks:1] 2018-09-11 16:29:48,349 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2018-09-11 16:29:48,357 MonitoringTask.java:173 - 1 operations were slow in the last 5005 msecs:
<SELECT * FROM cadence_visibility.closed_executions WHERE workflow_type_name = code.uber.internal/devexp/cadence-bench/load/basic.stressWorkflowExecute AND token(domain_id, domain_partition) >= token(d3138e78-abe7-48a0-adb9-8c466a9bb3fa, 0) AND token(domain_id, domain_partition) <= token(d3138e78-abe7-48a0-adb9-8c466a9bb3fa, 0) AND start_time >= 2018-09-11 16:29-0700 AND start_time <= 1969-12-31 16:00-0800 LIMIT 10>, time 2747 msec - slow timeout 500 msec
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-11 16:31:47,774 AbstractCommitLogSegmentManager.java:107 - No segments in reserve; creating a fresh one
DEBUG [ScheduledTasks:1] 2018-09-11 16:40:22,922 ColumnFamilyStore.java:899 - Enqueuing flush of size_estimates: 23.997MiB (2%) on-heap, 0.000KiB (0%) off-heap
Related references (none of them answers my questions):
https://lists.apache.org/thread.html/%3CCAAiKoBidknHVOz8oQQmncZFZHdFiDfW6HTs63vxXCOhisQYZgg#mail.gmail.com%3E
https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
-- Edit
tablestats returns
Total number of tables: 105
----------------
Keyspace : cadence_visibility
Read Count: 19
Read Latency: 0.5125263157894736 ms.
Write Count: 3220964
Write Latency: 0.04900822269357869 ms.
Pending Flushes: 0
Table: closed_executions
SSTable count: 1
SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 20.3 MiB
Space used (total): 20.3 MiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 6.35 KiB
SSTable Compression Ratio: 0.40192660515179696
Number of keys (estimate): 3
Memtable cell count: 28667
Memtable data size: 7.35 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 9
Local read count: 9
Local read latency: NaN ms
Local write count: 327024
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 16 bytes
Bloom filter off heap memory used: 8 bytes
Index summary off heap memory used: 38 bytes
Compression metadata off heap memory used: 6.3 KiB
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 62479625
Compacted partition mean bytes: 31239902
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0 bytes
----------------
Why doesn't pagination scale the way it does on the main table?
The data matching your secondary index is dispersed. Pagination only applies its logic once enough matching rows have been found to fill the page, and since your data is not clustered by time, Cassandra still has to sift through lots and lots of rows before it can find, for example, your first 10.
Query tracing does show that pagination comes into play only at a very late phase.
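To make this concrete, here is a simplified Go model (an illustration, not Cassandra's actual read path): rows are visited in order, the workflow_type_name filter is applied to each one, and a page is returned only once pageSize matches have been found. The row type and helper names are hypothetical. When matches are sparse, each page costs many row reads; when fewer than pageSize rows match at all (for example, because no row falls in the requested time range), every row must be read, so the cost grows with the size of the data.

```go
package main

import "fmt"

// row is a hypothetical, simplified stand-in for a closed_executions row.
type row struct {
	startTime int64
	typeName  string
}

// buildRows creates n rows in descending start_time order, where only every
// `every`-th row has the wanted type: the matches are dispersed.
func buildRows(n, every int, want string) []row {
	rows := make([]row, n)
	for i := range rows {
		rows[i] = row{startTime: int64(n - i), typeName: "other"}
		if (i+1)%every == 0 {
			rows[i].typeName = want
		}
	}
	return rows
}

// scanForPage applies the filter row by row and stops only when the page is
// full. It returns the page and how many rows had to be examined.
func scanForPage(rows []row, want string, pageSize int) ([]row, int) {
	var page []row
	examined := 0
	for _, r := range rows {
		examined++
		if r.typeName != want {
			continue
		}
		page = append(page, r)
		if len(page) == pageSize {
			break
		}
	}
	return page, examined
}

func main() {
	const pageSize = 10
	for _, n := range []int{10_000, 100_000, 1_000_000} {
		rows := buildRows(n, 100, "stressWorkflow")
		// 1 row in 100 matches: a full page needs about pageSize*100 rows.
		_, sparse := scanForPage(rows, "stressWorkflow", pageSize)
		// Nothing matches: the page can never fill, so every row is read.
		_, none := scanForPage(rows, "noSuchType", pageSize)
		fmt.Printf("rows=%d: sparse match examined %d, no match examined %d\n", n, sparse, none)
	}
}
```

In this model the page size alone does not bound the work; what matters is how far apart the matching rows are, and in the worst case the whole dataset is scanned for a single page.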
Why is the secondary index slow?
First, Cassandra reads the index table to retrieve the primary keys of all matching rows, and then, for each of them, it reads the original table to fetch the data. Using an index on a low-cardinality column is a known anti-pattern. (Reference: https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive)

Cassandra high CPU

I have a cluster of 150 nodes running Cassandra 2.2.x, each with 8 CPUs and 64 GB of RAM.
From time to time a couple of my nodes hit 100% CPU, but the rest of the nodes look just fine.
When a node is at 100% CPU:
My tpstats
No compactions are pending.
INFO [Service Thread] 2018-05-12 18:03:51,636 GCInspector.java:284 - G1 Young Generation GC in 262ms. G1 Eden Space: 1291845632 -> 0; G1 Old Gen: 10237756144 -> 7451103472; G1 Survivor Space: 419430400 -> 134217728;
INFO [Service Thread] 2018-05-12 18:03:54,130 GCInspector.java:284 - G1 Young Generation GC in 286ms. G1 Eden Space: 1577058304 -> 0; G1 Old Gen: 7451103472 -> 4596956920; G1 Survivor Space: 134217728 -> 150994944;
INFO [Service Thread] 2018-05-12 18:03:57,652 GCInspector.java:284 - G1 Young Generation GC in 213ms. G1 Eden Space: 1560281088 -> 0; G1 Old Gen: 4596956920 -> 4135583472; G1 Survivor Space: 150994944 -> 134217728;
INFO [CompactionExecutor:172351] 2018-05-12 18:21:29,098 AutoSavingCache.java:399 - Saved CounterCache (258640 items) in 298 ms
INFO [CompactionExecutor:172352] 2018-05-12 18:21:29,979 AutoSavingCache.java:399 - Saved KeyCache (788252 items) in 1282 ms
It is not always the same node; on other days, different nodes hit 100% CPU. Any ideas?
[Heap size image]
Table: abc
SSTable count: 14
Space used (live): 3960499258
Space used (total): 3960499258
Space used by snapshots (total): 0
Off heap memory used (total): 8427312
SSTable Compression Ratio: 0.25806487155056185
Number of keys (estimate): 379760
Memtable cell count: 172086
Memtable data size: 1253025
Memtable off heap memory used: 4654441
Memtable switch count: 192
Local read count: 33923975
Local read latency: 15.081 ms
Local write count: 30595833
Local write latency: 0.144 ms
Pending flushes: 0
Bloom filter false positives: 18478
Bloom filter false ratio: 0.00000
Bloom filter space used: 1523216
Bloom filter off heap memory used: 1523104
Index summary off heap memory used: 459327
Compression metadata off heap memory used: 1790440
Compacted partition minimum bytes: 30
Compacted partition maximum bytes: 4055269
Compacted partition mean bytes: 11408
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 178.40981886703793
Maximum tombstones per slice (last five minutes): 258
Table: xyz
SSTable count: 8
Space used (live): 2034316302
Space used (total): 2034316302
Space used by snapshots (total): 0
Off heap memory used (total): 4289964
SSTable Compression Ratio: 0.24435625500089553
Number of keys (estimate): 260117
Memtable cell count: 133822
Memtable data size: 832669
Memtable off heap memory used: 2215708
Memtable switch count: 175
Local read count: 10034316
Local read latency: 11.653 ms
Local write count: 27984185
Local write latency: 0.075 ms
Pending flushes: 0
Bloom filter false positives: 261009
Bloom filter false ratio: 0.04762
Bloom filter space used: 1007024
Bloom filter off heap memory used: 1006960
Index summary off heap memory used: 200632
Compression metadata off heap memory used: 866664
Compacted partition minimum bytes: 61
Compacted partition maximum bytes: 5839588
Compacted partition mean bytes: 8883
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 28.292941014323677
Maximum tombstones per slice (last five minutes): 60
Table: zzz
SSTable count: 8
Space used (live): 1542111866
Space used (total): 1542111866
Space used by snapshots (total): 0
Off heap memory used (total): 14022908
SSTable Compression Ratio: 0.5677972573793397
Number of keys (estimate): 8670916
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 10623352
Bloom filter off heap memory used: 10623288
Index summary off heap memory used: 3157452
Compression metadata off heap memory used: 242168
Compacted partition minimum bytes: 36
Compacted partition maximum bytes: 268650950
Compacted partition mean bytes: 242
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Now the load has dropped. For comparison, here are the abc table stats after the load:
Table: abc
SSTable count: 14
Space used (live): 3960499258
Space used (total): 3960499258
Space used by snapshots (total): 0
Off heap memory used (total): 24792685
SSTable Compression Ratio: 0.25806487155056185
Number of keys (estimate): 380267
Memtable cell count: 780522
Memtable data size: 3746628
Memtable off heap memory used: 21019814
Memtable switch count: 192
Local read count: 33927339
Local read latency: 9.319 ms
Local write count: 30598848
Local write latency: 0.093 ms
Pending flushes: 0
Bloom filter false positives: 18478
Bloom filter false ratio: 0.00000
Bloom filter space used: 1523216
Bloom filter off heap memory used: 1523104
Index summary off heap memory used: 459327
Compression metadata off heap memory used: 1790440
Compacted partition minimum bytes: 30
Compacted partition maximum bytes: 4055269
Compacted partition mean bytes: 11408
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 179.0
Maximum tombstones per slice (last five minutes): 179

Cassandra compression ratio is 0 although LZ4Compressor is used

I have created a keyspace and, within it, a table for storing documents.
The code I used is:
CREATE KEYSPACE space WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
USE space;
CREATE TABLE documents (
doc_id text,
path text,
content text,
metadata_id text,
PRIMARY KEY (doc_id)
)
WITH compression = { 'sstable_compression' : 'LZ4Compressor' };
Then I pushed some data into it and used the command nodetool cfstats space.documents to check the compression ratio.
$ nodetool cfstats space.documents
Keyspace: space
Read Count: 0
Read Latency: NaN ms.
Write Count: 2005
Write Latency: 0.050547132169576056 ms.
Pending Flushes: 0
Table: documents
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 978
Memtable cell count: 8020
Memtable data size: 92999622
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 2005
Local write latency: 0.051 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
----------------
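However, I am confused because the ratio is 0.0, even though I use a compressor.
I am curious whether more data needs to be put into the DB in order to get the measurement, or whether I am doing something wrong.
All of your data is still in the memtable (SSTable count: 0, Memtable data size: 92999622), so nothing has been written to disk and compressed yet.
Run the following command to flush your memtable data to an SSTable: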
nodetool flush
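Why the ratio reads 0.0 can be illustrated with a simplified Go model of the metric. This is an assumption for illustration, not Cassandra's code, and the exact value reported when no SSTable exists depends on the Cassandra version. The ratio is compressed size divided by uncompressed size over the flushed SSTables, so data sitting only in the memtable contributes nothing until nodetool flush writes it out. The sstable type and the sizes used below are hypothetical.

```go
package main

import "fmt"

// sstable records the on-disk (compressed) and original (uncompressed) sizes
// of one flushed data file. Hypothetical type for illustration.
type sstable struct {
	compressedBytes   int64
	uncompressedBytes int64
}

// compressionRatio divides total compressed bytes by total uncompressed bytes
// across SSTables. Memtable contents are not SSTables, so they never appear
// here; with nothing flushed, this model reports 0, like the output above.
func compressionRatio(sstables []sstable) float64 {
	var compressed, uncompressed int64
	for _, s := range sstables {
		compressed += s.compressedBytes
		uncompressed += s.uncompressedBytes
	}
	if uncompressed == 0 {
		return 0
	}
	return float64(compressed) / float64(uncompressed)
}

func main() {
	var flushed []sstable
	fmt.Println("before flush:", compressionRatio(flushed))

	// After `nodetool flush`, the memtable becomes an SSTable (made-up sizes).
	flushed = append(flushed, sstable{compressedBytes: 40_000_000, uncompressedBytes: 92_999_622})
	fmt.Printf("after flush: %.2f\n", compressionRatio(flushed))
}
```

A value below 1.0 means the data shrank on disk; for example, a ratio of about 0.42 means the compressed files are roughly 42% of the original size.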

Cassandra query fails for some WHERE clauses

I have a large table that looks essentially like:
CREATE TABLE keyspace.table(
node bigint,
time bigint,
core bigint,
set bigint,
value1 bigint,
value2 bigint,
PRIMARY KEY (node, time, core)
);
There is also a secondary index (possibly irrelevant) on the column set.
When I do a stupidly simple query:
SELECT * FROM keyspace.table LIMIT 10;
It succeeds.
However, for some WHERE clauses, it fails, e.g.:
SELECT * FROM keyspace.table WHERE node = 12 LIMIT 10;
Traceback (most recent call last):
File "/usr/bin/cqlsh.py", line 1297, in perform_simple_statement
result = future.result()
File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.0.0-6af642d.zip/cassandra-driver-3.0.0-6af642d/cassandra/cluster.py", line 3122, in result
raise self._final_exception
Unavailable: code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}
And nothing shows up in the cassandra system log.
Details
The datacenter looks like:
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.64.8 234.84 GB 256 ? 478f59e1-4797-45df-81d1-77559e393e8a RAC1
DN 192.168.64.9 217.85 GB 256 ? 84d4bfd6-054a-433e-b6d4-3eead77609ac RAC1
DN 192.168.64.10 208.99 GB 256 ? c6ac565e-5897-4205-9439-779becf7fafe RAC1
UN 192.168.64.11 225.69 GB 256 ? 4a941e8f-db80-48f3-8eca-c4430b795694 RAC1
DN 192.168.64.12 208.57 GB 256 ? 34e57e18-e8b9-40d6-85e8-40a309e91b49 RAC1
DN 192.168.64.13 240.22 GB 256 ? 7a312c4f-01c0-4ed4-be7f-417b8f14f940 RAC1
DN 192.168.64.4 208.5 GB 256 ? 41a49d5c-e569-46f3-9f0e-97de43a22690 RAC1
UN 192.168.64.5 213.19 GB 256 ? b5bfb4e3-30a2-46b5-ba41-cf1a58a7355d RAC1
UN 192.168.64.6 212.21 GB 256 ? 9f6781ca-09b7-4923-8fa1-5b91079e2a18 RAC1
UN 192.168.64.7 235.97 GB 256 ? 5f429e7b-2e16-4796-b746-834500aeb884 RAC1
The keyspace:
CREATE KEYSPACE keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '2'} AND durable_writes = true;
This table is continuously ingesting 2000 rows per second using cassandra-loader and is fairly large. nodetool tablestats output (EDIT: tombstones cleaned up):
Keyspace: keyspace
Read Count: 0
Read Latency: NaN ms.
Write Count: 31945
Write Latency: 7.547574268273595 ms.
Pending Flushes: 0
Table: table
SSTable count: 9
Space used (live): 15414599520
Space used (total): 15414599520
Space used by snapshots (total): 0
Off heap memory used (total): 4506364
SSTable Compression Ratio: 0.41688805047558564
Number of keys (estimate): 251
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 3
Local read count: 0
Local read latency: NaN ms
Local write count: 31945
Local write latency: 8.261 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1528
Bloom filter off heap memory used: 1456
Index summary off heap memory used: 260
Compression metadata off heap memory used: 4504648
Compacted partition minimum bytes: 2300
Compacted partition maximum bytes: 74975550
Compacted partition mean bytes: 36544449
Average live cells per slice (last five minutes): 1.0222222222222221
Maximum live cells per slice (last five minutes): 2
Average tombstones per slice (last five minutes): 1.0222222222222221
Maximum tombstones per slice (last five minutes): 2
Dropped Mutations: 0
Output with tracing enabled
cqlsh> TRACING ON;
Now Tracing is enabled
cqlsh> SELECT * FROM keyspace.table WHERE node = 1223 LIMIT 10;
Traceback (most recent call last):
File "/usr/bin/cqlsh.py", line 1297, in perform_simple_statement
result = future.result()
File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.0.0-6af642d.zip/cassandra-driver-3.0.0-6af642d/cassandra/cluster.py", line 3122, in result
raise self._final_exception
Unavailable: code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}
Let me know any other details I can provide.

phpcassa get_range is too slow

I have a CF with 1280 rows.
Each row has 6 columns. I'm trying to run $cf->get_range('pq_questions','','',1200), and it gets all the rows, but too slowly (about 4-6 sec).
Column Family: pq_questions
SSTable count: 1
Space used (live): 668363
Space used (total): 668363
Number of Keys (estimate): 1280
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Key cache capacity: 200000
Key cache size: 1000
Key cache hit rate: 0.10998439937597504
Row cache capacity: 1000
Row cache size: 1000
Row cache hit rate: 0.0
Compacted row minimum size: 373
Compacted row maximum size: 1331
Compacted row mean size: 574
Strangely, the read latency in cfstats is NaN ms.
When I run htop on Debian, I see that most of the load comes from phpcassa.
I have only one node and use consistency level ONE.
What could cause such slow querying?
I'm guessing you don't have the C extension installed. Without it, a similar query takes 1-2 seconds for me. With it installed, the same query takes about 0.2 seconds.
Regarding the NaN read latency, latencies aren't captured for get_range_slices (get_range in phpcassa).
