We are currently testing Cassandra with the following table schema:
CREATE TABLE coreglead_v2.stats_by_site_user (
d_tally text, -- ex.: '2016-01', '2016-02', etc..
site_id int,
d_date timestamp,
site_user_id int,
accepted counter,
error counter,
impressions_negative counter,
impressions_positive counter,
rejected counter,
revenue counter,
reversals_rejected counter,
reversals_revenue counter,
PRIMARY KEY (d_tally, site_id, d_date, site_user_id)
) WITH CLUSTERING ORDER BY (site_id ASC, d_date ASC, site_user_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
For our test purposes, we have written a Python script that randomises data across the 2016 calendar year (12 months in total). We expect our partition key to be the d_tally column, so we also expect the number of keys to be 12 (from '2016-01' to '2016-12').
Running nodetool cfstats is showing us the following though:
Table: stats_by_site_user
SSTable count: 4
Space used (live): 131977793
Space used (total): 131977793
Space used by snapshots (total): 0
Off heap memory used (total): 89116
SSTable Compression Ratio: 0.18667406304929424
Number of keys (estimate): 24
Memtable cell count: 120353
Memtable data size: 23228804
Memtable off heap memory used: 0
Memtable switch count: 10
Local read count: 169
Local read latency: 1.938 ms
Local write count: 4912464
Local write latency: 0.066 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 128
Bloom filter off heap memory used: 96
Index summary off heap memory used: 76
Compression metadata off heap memory used: 88944
Compacted partition minimum bytes: 5839589
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 16102786
Average live cells per slice (last five minutes): 102.91627247589237
Maximum live cells per slice (last five minutes): 103
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
What is confusing us is the "Number of keys (estimate): 24" part. Looking at our schema and assuming our test data (over 5 million writes) is made up of just 2016 data, where does the 24 keys estimate come from?
Here is an example of our data:
d_tally | site_id | d_date | site_user_id | accepted | error | impressions_negative | impressions_positive | rejected | revenue | reversals_rejected | reversals_revenue
---------+---------+--------------------------+--------------+----------+-------+----------------------+----------------------+----------+---------+--------------------+-------------------
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 240054 | 1 | null | null | 1 | null | 553 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1263968 | 1 | null | null | 1 | null | 1093 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1267841 | 1 | null | null | 1 | null | 861 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1728725 | 1 | null | null | 1 | null | 425 | null | null
The number of keys is an estimate (although it should be very close). Cassandra takes a sketch of the data from each SSTable and merges the sketches to estimate the cardinality (HyperLogLog).
Unfortunately, no equivalent sketch exists for the memtable, so the cardinality of the memtable is simply added to the SSTable estimate. This means that keys present in both memtables and SSTables are counted twice, which is why you see 24 instead of 12.
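The double counting described above can be illustrated with a small toy model. This is only a sketch: exact Python sets stand in for the HyperLogLog sketches, and the contents of the SSTables and the memtable are assumptions made up for the illustration.

```python
# Toy model of the key estimate: exact sets stand in for HyperLogLog sketches.
months = {f"2016-{m:02d}" for m in range(1, 13)}

# Keys present in each flushed SSTable (assumed contents, for illustration).
sstables = [set(months), set(months), set(months), set(months)]
# Keys present in the current memtable (also assumed).
memtable = set(months)

# SSTable sketches can be merged, so keys shared between SSTables count once.
sstable_estimate = len(set().union(*sstables))

# The memtable count is simply added on top, so keys that live in both
# the memtable and the SSTables are counted twice.
reported = sstable_estimate + len(memtable)
print(sstable_estimate, len(memtable), reported)  # 12 12 24
```

Under these assumptions, the reported figure should drop back towards 12 once the memtable has been flushed and no longer holds any keys.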
When I query a secondary index with pagination, the query becomes slower as the data grows.
I thought that with pagination, no matter how large the data grows, it takes the same time to query one page. Is that true? Why does my query get slower?
My simplified table is
CREATE TABLE closed_executions (
domain_id uuid,
workflow_id text,
start_time timestamp,
workflow_type_name text,
PRIMARY KEY ((domain_id), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC)
AND COMPACTION = {
'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
}
AND GC_GRACE_SECONDS = 172800;
And I create a secondary index as
CREATE INDEX closed_by_type ON closed_executions (workflow_type_name);
I query with following CQL
SELECT workflow_id, start_time, workflow_type_name
FROM closed_executions
WHERE domain_id = ?
AND start_time >= ?
AND start_time <= ?
AND workflow_type_name = ?
and code
query := v.session.Query(templateGetClosedWorkflowExecutionsByType,
request.DomainUUID,
common.UnixNanoToCQLTimestamp(request.EarliestStartTime),
common.UnixNanoToCQLTimestamp(request.LatestStartTime),
request.WorkflowTypeName).Consistency(gocql.One)
iter := query.PageSize(request.PageSize).PageState(request.NextPageToken).Iter()
// PageSize is 10, but could be thousand
Environment:
MacBook Pro
Cassandra: 3.11.0
GoCql: github.com/gocql/gocql master
Observation:
10K rows, within a second
100K rows, ~3 seconds
1M rows, ~17 seconds
Debug log:
INFO [ScheduledTasks:1] 2018-09-11 16:29:48,349 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2018-09-11 16:29:48,357 MonitoringTask.java:173 - 1 operations were slow in the last 5005 msecs:
<SELECT * FROM cadence_visibility.closed_executions WHERE workflow_type_name = code.uber.internal/devexp/cadence-bench/load/basic.stressWorkflowExecute AND token(domain_id, domain_partition) >= token(d3138e78-abe7-48a0-adb9-8c466a9bb3fa, 0) AND token(domain_id, domain_partition) <= token(d3138e78-abe7-48a0-adb9-8c466a9bb3fa, 0) AND start_time >= 2018-09-11 16:29-0700 AND start_time <= 1969-12-31 16:00-0800 LIMIT 10>, time 2747 msec - slow timeout 500 msec
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-11 16:31:47,774 AbstractCommitLogSegmentManager.java:107 - No segments in reserve; creating a fresh one
DEBUG [ScheduledTasks:1] 2018-09-11 16:40:22,922 ColumnFamilyStore.java:899 - Enqueuing flush of size_estimates: 23.997MiB (2%) on-heap, 0.000KiB (0%) off-heap
Related ref (no answer for my questions):
https://lists.apache.org/thread.html/%3CCAAiKoBidknHVOz8oQQmncZFZHdFiDfW6HTs63vxXCOhisQYZgg#mail.gmail.com%3E
https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
-- Edit
tablestats returns
Total number of tables: 105
----------------
Keyspace : cadence_visibility
Read Count: 19
Read Latency: 0.5125263157894736 ms.
Write Count: 3220964
Write Latency: 0.04900822269357869 ms.
Pending Flushes: 0
Table: closed_executions
SSTable count: 1
SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 20.3 MiB
Space used (total): 20.3 MiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 6.35 KiB
SSTable Compression Ratio: 0.40192660515179696
Number of keys (estimate): 3
Memtable cell count: 28667
Memtable data size: 7.35 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 9
Local read count: 9
Local read latency: NaN ms
Local write count: 327024
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 16 bytes
Bloom filter off heap memory used: 8 bytes
Index summary off heap memory used: 38 bytes
Compression metadata off heap memory used: 6.3 KiB
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 62479625
Compacted partition mean bytes: 31239902
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0 bytes
----------------
Why doesn't pagination scale the way it does on the main table?
The data in your secondary index is dispersed. Pagination only applies its logic until it reaches the page size. Since the index data is not clustered by time, Cassandra still has to sift through lots and lots of rows before it finds your first 10, for example.
Query tracing does show that pagination comes into play at a very late phase.
Why is the secondary index slow?
First, Cassandra reads the index table to retrieve the primary keys of all matching rows, and for each of them it reads the original table to fetch the data. This is a known anti-pattern with a low-cardinality index. (reference: https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive)
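To make the reasoning concrete, here is a toy model of the read pattern described above. It is not Cassandra's actual implementation: the row layout, the function name first_page and the integer timestamps are all assumptions chosen for illustration. It only shows why the work for one page can grow with the data when the page limit is applied after the index lookup.

```python
def build_rows(n):
    """One partition of n rows; every second row has the type we search for."""
    return [
        {"start_time": i, "type": "wanted" if i % 2 == 0 else "other"}
        for i in range(n)
    ]

def first_page(rows, wanted_type, lo, hi, page_size):
    """Return (page, base_rows_read) for the first page of an indexed query."""
    # Step 1: the index yields the position of every row with the wanted type,
    # regardless of its start_time.
    matching = [i for i, r in enumerate(rows) if r["type"] == wanted_type]
    page, base_rows_read = [], 0
    for i in matching:
        # Step 2: each index hit costs one read of the base table.
        row = rows[i]
        base_rows_read += 1
        # Step 3: the time restriction is only checked on the fetched row.
        if lo <= row["start_time"] <= hi:
            page.append(row)
            # Step 4: the page limit stops the scan only once the page is full.
            if len(page) == page_size:
                break
    return page, base_rows_read

for n in (10_000, 100_000):
    rows = build_rows(n)
    # Ask for 10 rows from the most recent 100 time units.
    page, cost = first_page(rows, "wanted", n - 100, n - 1, page_size=10)
    print(n, len(page), cost)
```

In this model the page always holds 10 rows, but the number of base-table reads needed to fill it grows with n, which matches the observation that a fixed page size does not give a fixed query time.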
I'm working on a single Cassandra 3.11.2 node (RHEL 6.5). In a keyspace named 'test', I have a table named 'test'. I entered some rows via cqlsh and then ran nodetool flush. I checked the data directory to confirm that an SSTable had been created. I then deleted all the .db files from the test.test data directory using rm *.db.
Strangely, I can still see all the rows in cqlsh! I don't understand how this is happening, since I manually deleted the SSTable.
Given below is my keyspace:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Given below is the table:
CREATE TABLE test.test (
aadhar_number int PRIMARY KEY,
address text,
name text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Given below is the output of nodetool tablestats command(after I had deleted the SSTable):
Keyspace : test
Read Count: 0
Read Latency: NaN ms
Write Count: 13
Write Latency: 0.11269230769230769 ms
Pending Flushes: 0
Table: test
SSTable count: 1
Space used (live): 5220
Space used (total): 5220
Space used by snapshots (total): 0
Off heap memory used (total): 48
SSTable Compression Ratio: 0.7974683544303798
Number of partitions (estimate): 255
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 4
Local read count: 0
Local read latency: NaN ms
Local write count: 10
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 24
Bloom filter off heap memory used: 16
Index summary off heap memory used: 16
Compression metadata off heap memory used: 16
Compacted partition minimum bytes: 18
Compacted partition maximum bytes: 50
Compacted partition mean bytes: 36
Average live cells per slice (last five minutes): 5.0
Maximum live cells per slice (last five minutes): 5
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
I restarted Cassandra and only then the data stopped showing in cqlsh.
A very good article for understanding filesystem details in Linux.
On Linux, a filename is just a directory entry (a link) that points to an inode, which in turn describes where the file's data is stored on disk. When Cassandra opens a file, it holds a reference to that inode. When you use rm, you remove the link from the filesystem to the inode, but the file is still referenced by a live process, so its data is not actually deleted. You can easily check this with lsof (list open files); lsof -p <pid> lists the files held open by a given process (find the Cassandra pid with something like ps aux | grep cassandra).
When you restart Cassandra, the last reference is closed and the file is really deleted.
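The behaviour can be reproduced outside Cassandra. The following is a minimal sketch, assuming a POSIX system such as Linux (on Windows the removal of an open file would fail); the scratch file simply stands in for an SSTable component.

```python
import os
import tempfile

# Create a scratch file standing in for an SSTable component.
fd, path = tempfile.mkstemp(suffix=".db")
os.write(fd, b"row data")
os.close(fd)

# A long-running process (Cassandra in the question) keeps the file open.
handle = open(path, "rb")

# Like `rm`, this only removes the directory entry (the name).
os.remove(path)
name_exists = os.path.exists(path)

# The open handle still reaches the data until it is closed.
data = handle.read()
handle.close()
print(name_exists, data)  # False b'row data'
```

Only after handle.close(), which corresponds to stopping Cassandra, does the kernel release the underlying storage.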
I have a 3-node setup: Node1 (172.30.56.60), Node2 (172.30.56.61) and Node3 (172.30.56.62).
The table holds a single partition of 100K rows; the partition key is nodeip.
Here is the token (partition) value for the nodeip 172.30.56.60:
cqlsh:qnapstat> SELECT token(nodeip) FROM nodedata WHERE nodeip = '172.30.56.60' LIMIT 5;
system.token(nodeip)
----------------------
222567180698744628
222567180698744628
222567180698744628
222567180698744628
222567180698744628
As per the ./nodetool ring output below, only '172.30.56.60' will return the data to the coordinator, since the token range from 173960939250606057 to 239923324758894350 is handled by node 172.30.56.60. Note: this is my understanding.
172.30.56.60 rack1 Up Normal 32.72 MiB 100.00% 173960939250606057
172.30.56.62 rack1 Up Normal 32.88 MiB 100.00% 239923324758894351
172.30.56.61 rack1 Up Normal 32.84 MiB 100.00% 253117576269706963
172.30.56.60 rack1 Up Normal 32.72 MiB 100.00% 273249439554531014
172.30.56.61 rack1 Up Normal 32.84 MiB 100.00% 295635292275517104
172.30.56.62 rack1 Up Normal 32.88 MiB 100.00% 301162927966816823
I have two questions here:
1) When I execute the following query, does it mean that the coordinator (say 172.30.56.61) reads all the data from 172.30.56.60?
2) After receiving all 100K entries, does the coordinator perform the aggregation over all 100K? If so, does it keep all 100K entries in memory on 172.30.56.61?
SELECT Max(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
There is a nice tool called CQL tracing that can help you see the flow of events when a SELECT query is executed.
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 10);
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 20);
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 30);
cqlsh> select * from test.nodedata ;
nodeip | readiops
--------+-----------
1 | 10
1 | 20
1 | 30
(3 rows)
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;
system.max(readiops)
-----------------------
30
(1 rows)
Now let's set cqlsh> TRACING ON and run the same query again.
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;
system.max(readiops)
----------------------
30
(1 rows)
Tracing session: 4d7bf970-eada-11e7-a79d-000000000003
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------
Execute CQL3 query | 2017-12-27 07:48:44.404000 | 172.16.0.128 | 0
read_data: message received from /172.16.0.128 [shard 4] | 2017-12-27 07:48:44.385109 | 172.16.0.48 | 9
read_data handling is done, sending a response to /172.16.0.128 [shard 4] | 2017-12-27 07:48:44.385322 | 172.16.0.48 | 222
Parsing a statement [shard 1] | 2017-12-27 07:48:44.404821 | 172.16.0.128 | --
Processing a statement [shard 1] | 2017-12-27 07:48:44.404913 | 172.16.0.128 | 93
Creating read executor for token 6292367497774912474 with all: {172.16.0.128, 172.16.0.48, 172.16.0.115} targets: {172.16.0.48} repair decision: NONE [shard 1] | 2017-12-27 07:48:44.404966 | 172.16.0.128 | 146
read_data: sending a message to /172.16.0.48 [shard 1] | 2017-12-27 07:48:44.404972 | 172.16.0.128 | 152
read_data: got response from /172.16.0.48 [shard 1] | 2017-12-27 07:48:44.405497 | 172.16.0.128 | 676
Done processing - preparing a result [shard 1] | 2017-12-27 07:48:44.405535 | 172.16.0.128 | 715
Request complete | 2017-12-27 07:48:44.404722 | 172.16.0.128 | 722
As for your questions:
The coordinator passes the query to the replicas. If RF = 1, or RF > 1 and CL = ONE, it will receive the reply from one replica. If RF > 1 and CL > ONE, it needs to receive replies from multiple replicas and compare the answers, so there is also orchestration done on the coordinator side.
The way it is actually done is a data request to the fastest replica (as determined by the snitch) and a digest request to the other replicas needed to satisfy the CL.
The coordinator then needs to hash the response to the data request and compare it with the digest responses.
If the partition is hashed to a specific node, it will reside on that node (assuming RF = 1) and the information will be read only from that node.
The client sends the page size with the query, so the reply itself is returned in pages (default = 5000 rows), and the page size can be set from the client side.
I recommend watching this YouTube clip on the Cassandra read path for more details.
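The data request plus digest request pattern can be sketched as follows. This is a simplified illustration only: the function names digest and coordinator_read are hypothetical, SHA-256 is an arbitrary choice of hash, and real replicas exchange messages over the network rather than Python lists.

```python
import hashlib

def digest(rows):
    """Hash a replica's result so that only the hash has to be sent back."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def coordinator_read(data_rows, digest_responses):
    """Return (rows, consistent): full rows from one replica, hashes from the rest."""
    local = digest(data_rows)
    consistent = all(d == local for d in digest_responses)
    return data_rows, consistent

replica_a = [(1, 10), (1, 20), (1, 30)]   # answers the data request
replica_b = [(1, 10), (1, 20), (1, 30)]   # answers a digest request
replica_c = [(1, 10), (1, 20)]            # stale replica, missing one row

_, agree = coordinator_read(replica_a, [digest(replica_b)])
_, disagree = coordinator_read(replica_a, [digest(replica_c)])
print(agree, disagree)  # True False
```

When the digests disagree, the coordinator has more work to do to reconcile the replicas; that step is outside the scope of this sketch.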
I have created a keyspace, and a table within it, for a document store.
The code I used is
CREATE KEYSPACE space WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
USE space;
CREATE TABLE documents (
doc_id text,
path text,
content text,
metadata_id text,
PRIMARY KEY (doc_id)
)
WITH compression = { 'sstable_compression' : 'LZ4Compressor' };
Then I pushed some data into it and used the command nodetool cfstats space.documents to check the compression ratio.
$ nodetool cfstats space.documents
Keyspace: space
Read Count: 0
Read Latency: NaN ms.
Write Count: 2005
Write Latency: 0.050547132169576056 ms.
Pending Flushes: 0
Table: documents
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 978
Memtable cell count: 8020
Memtable data size: 92999622
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 2005
Local write latency: 0.051 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
----------------
However, I got confused because the ratio is 0.0, even though I use a compressor.
I am curious whether more data needs to be put into the DB in order to get the measurement, or whether I am doing something wrong.
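All of your data is still in the memtable (note SSTable count: 0), so there is no SSTable yet for which a compression ratio could be reported.
Run the command below to flush your memtable data to an SSTable, then check cfstats again: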
nodetool flush
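As a rough illustration of why the ratio reads 0.0 before a flush, here is a small sketch. It assumes the reported ratio is compressed size divided by uncompressed size of the on-disk data, and it uses zlib from the Python standard library as a stand-in for LZ4; the helper compression_ratio is hypothetical and is not how Cassandra computes the figure internally.

```python
import zlib

def compression_ratio(uncompressed: bytes) -> float:
    """Compressed size divided by uncompressed size; 0.0 when there is no data."""
    if not uncompressed:
        return 0.0
    return len(zlib.compress(uncompressed)) / len(uncompressed)

on_disk_before_flush = b""                            # SSTable count: 0
on_disk_after_flush = b"some document text " * 1000   # made-up flushed content

print(compression_ratio(on_disk_before_flush))        # 0.0
print(compression_ratio(on_disk_after_flush) < 1.0)   # True
```

With nothing on disk there is nothing to measure, so the figure only becomes meaningful once at least one SSTable exists.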
I am experiencing really poor performance with Cassandra 2.1.5. I am new to this so would appreciate any advice on how to debug. Here is what my table looks like:
Keyspace: nt_live_october
Read Count: 6
Read Latency: 20837.149166666666 ms.
Write Count: 39799
Write Latency: 0.45696595391844014 ms.
Pending Flushes: 0
Table: nt
SSTable count: 12
Space used (live): 15903191275
Space used (total): 15971044770
Space used by snapshots (total): 0
Off heap memory used (total): 14468424
SSTable Compression Ratio: 0.1308103413354315
Number of keys (estimate): 740
Memtable cell count: 43483
Memtable data size: 9272510
Memtable off heap memory used: 0
Memtable switch count: 17
Local read count: 6
Local read latency: 20837.150 ms
Local write count: 39801
Local write latency: 0.457 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 4832
Bloom filter off heap memory used: 4736
Index summary off heap memory used: 576
Compression metadata off heap memory used: 14463112
Compacted partition minimum bytes: 6867
Compacted partition maximum bytes: 30753941057
Compacted partition mean bytes: 44147544
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
I am issuing the following query via cqlsh:
cassandra#cqlsh> TRACING ON;
Tracing is already enabled. Use TRACING OFF to disable.
cassandra#cqlsh> CONSISTENCY;
Current consistency level is ONE.
cassandra#cqlsh> select * from nt_live_october.nt where group_id='254358' and epoch >=1444313898 and epoch<=1444348800 LIMIT 1;
OperationTimedOut: errors={}, last_host=XXX.203
Statement trace did not complete within 10 seconds
and here is what system_traces.events shows:
xxx.xxx.xxx.203 | 1281 | Parsing select * from nt_live_october.nt where group_id='254358'\nand epoch >=1443916800 and epoch<=1444348800\nLIMIT 30;
xxx.xxx.xxx.203 | 2604 | Preparing statement
xxx.xxx.xxx.203 | 8454 | Executing single-partition query on users
xxx.xxx.xxx.203 | 8474 | Acquiring sstable references
xxx.xxx.xxx.203 | 8547 | Merging memtable tombstones
xxx.xxx.xxx.203 | 8675 | Key cache hit for sstable 1
xxx.xxx.xxx.203 | 8685 | Seeking to partition beginning in data file
xxx.xxx.xxx.203 | 9040 | Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones
xxx.xxx.xxx.203 | 9056 | Merging data from memtables and 1 sstables
xxx.xxx.xxx.203 | 9120 | Read 1 live and 0 tombstone cells
xxx.xxx.xxx.203 | 9854 | Read-repair DC_LOCAL
xxx.xxx.xxx.203 | 10033 | Executing single-partition query on users
xxx.xxx.xxx.203 | 10046 | Acquiring sstable references
xxx.xxx.xxx.203 | 10105 | Merging memtable tombstones
xxx.xxx.xxx.203 | 10189 | Key cache hit for sstable 1
xxx.xxx.xxx.203 | 10198 | Seeking to partition beginning in data file
xxx.xxx.xxx.203 | 10248 | Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones
xxx.xxx.xxx.203 | 10261 | Merging data from memtables and 1 sstables
xxx.xxx.xxx.203 | 10296 | Read 1 live and 0 tombstone cells
xxx.xxx.xxx.203 | 12511 | Executing single-partition query on nt
xxx.xxx.xxx.203 | 12525 | Acquiring sstable references
xxx.xxx.xxx.203 | 12587 | Merging memtable tombstones
xxx.xxx.xxx.203 | 18067 | speculating read retry on /xxx.xxx.xxx.205
xxx.xxx.xxx.203 | 18577 | Sending READ message to xxx.xxx.xxx.205/xxx.xxx.xxx.205
xxx.xxx.xxx.203 | 25534 | Partition index with 6093 entries found for sstable 8885
xxx.xxx.xxx.203 | 25571 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 34989 | Partition index with 5327 entries found for sstable 8524
xxx.xxx.xxx.203 | 35022 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 36322 | Partition index with 333 entries found for sstable 8477
xxx.xxx.xxx.203 | 36336 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 714242 | Partition index with 299251 entries found for sstable 8541
xxx.xxx.xxx.203 | 714279 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 715717 | Partition index with 501 entries found for sstable 8217
xxx.xxx.xxx.203 | 715745 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 716232 | Partition index with 252 entries found for sstable 8888
xxx.xxx.xxx.203 | 716245 | Seeking to partition indexed section in data file
xxx.xxx.xxx.205 | 87 | READ message received from /xxx.xxx.xxx.203
xxx.xxx.xxx.205 | 50427 | Executing single-partition query on nt
xxx.xxx.xxx.205 | 50535 | Acquiring sstable references
xxx.xxx.xxx.205 | 50628 | Merging memtable tombstones
xxx.xxx.xxx.205 | 170441 | Partition index with 35650 entries found for sstable 6332
xxx.xxx.xxx.203 | 30718026 | Partition index with 199905 entries found for sstable 5958
xxx.xxx.xxx.203 | 30718077 | Seeking to partition indexed section in data file
xxx.xxx.xxx.205 | 170499 | Seeking to partition indexed section in data file
xxx.xxx.xxx.205 | 248898 | Partition index with 30958 entries found for sstable 6797
xxx.xxx.xxx.205 | 248962 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 67814573 | Read timeout: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
xxx.xxx.xxx.203 | 67814675 | Timed out; received 0 of 1 responses
I have 4 nodes, with a replication factor of 3 (one node is very light, but it's not .203). The data I'm trying to read isn't very much: even if LIMIT 1 is not being pushed to the remote node, the low end of the interval should be about 3 hours ago (I have no epochs past the current time).
Any tips on how to fix this, or on what might be going wrong? My Cassandra version is 2.1.9, running largely with defaults.
Table schema is as follows (I can't publish the whole schema for privacy reasons, but showing the keys which I hope is the main thing that matters)
PRIMARY KEY (group_id, epoch, group_name, auto_generated_uuid_field)
) WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 7776000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
___________EDIT_____________
to answer questions from below:
output of status:
-- Address Load Tokens Owns Host ID Rack
DN xxx.xxx.xxx.204 15.8 GB 1 ? 32ed196b-f6eb-4e93-b759 r1
UN xxx.xxx.xxx.205 20.38 GB 1 ? 446d71aa-e9cd-4ca9-a6ac r1
UN xxx.xxx.xxx.202 1.48 GB 1 ? 2a6670b2-63f2-43be-b672 r1
UN xxx.xxx.xxx.203 15.72 GB 1 ? dd26dfee-82da-454b-8db2 r1
The system.log is trickier as I have a lot of logging in there...one suspect thing I see is
WARN [CompactionExecutor:6] 2015-10-08 19:44:16,595 SSTableWriter.java (line 240) Compacting large partition nt_live_october/nt:254358 (230692316 bytes)
but it's just a warning...shortly after I see
INFO [CompactionExecutor:6] 2015-10-08 19:44:16,642 CompactionTask.java (line 274) Compacted 4 sstables to [/cassandra/data_dir_d/nt_live_october/nt-72813b106b9111e58f1ea1f0942ab78d/nt_live_october-nt-ka-9024,]. 35,733,701 bytes to 30,186,394 (~84% of original) in 34,907ms = 0.824705MB/s. 21 total partitions merged to 18. Partition merge counts were {1:17, 4:1, }
I see quite a few of these pairs in the log...but no ERROR level messages. Compaction seems to be going OK..it does say that this is the largest column family but all messages are INFO level....
First, the DN status of node 204 means it is down. Retrieve its system.log and look for:
Exceptions and ERROR level logs
Abnormal GC activity (collections longer than 200 ms)
StatusLogger
Second, the data is badly distributed across the cluster. The load of 202 is only 1.48 GB. I suspect you have some very large partitions replicated on the other nodes. What is the replication factor? What is the schema of your keyspace? You can answer these questions with the cqlsh command:
DESCRIBE KEYSPACE nt_live_october;
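The suspicion about very large partitions can also be checked against the cfstats output already posted in the question. The snippet below is a small sketch that extracts the "Compacted partition maximum bytes" figure; the helper name max_partition_bytes and the 100 MiB warning threshold are assumptions for illustration, not official limits.

```python
import re

WARN_BYTES = 100 * 1024 * 1024  # assumed threshold; tune to your own limits

def max_partition_bytes(cfstats_text):
    """Extract 'Compacted partition maximum bytes' from nodetool cfstats output."""
    match = re.search(r"Compacted partition maximum bytes:\s*(\d+)", cfstats_text)
    return int(match.group(1)) if match else None

# Lines copied from the cfstats output in the question.
sample = """
Table: nt
SSTable count: 12
Compacted partition minimum bytes: 6867
Compacted partition maximum bytes: 30753941057
Compacted partition mean bytes: 44147544
"""

largest = max_partition_bytes(sample)
print(largest, largest > WARN_BYTES)  # 30753941057 True
```

A largest partition of roughly 30 GB against a mean of roughly 44 MB is consistent with a few oversized partitions dominating the reads.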