I have a 3-node setup: Node1 (172.30.56.60), Node2 (172.30.56.61) and Node3 (172.30.56.62).
The table holds a single partition of 100K rows; the partition key is nodeip.
Here is the token / partition value for nodeip = 172.30.56.60:
cqlsh:qnapstat> SELECT token(nodeip) FROM nodedata WHERE nodeip = '172.30.56.60' LIMIT 5;
system.token(nodeip)
----------------------
222567180698744628
222567180698744628
222567180698744628
222567180698744628
222567180698744628
As per the ./nodetool ring output provided below, only 172.30.56.60 will return the data to the coordinator, since the token range from 173960939250606057 to 239923324758894350 is handled by the node 172.30.56.60. Note: this is my understanding.
172.30.56.60 rack1 Up Normal 32.72 MiB 100.00% 173960939250606057
172.30.56.62 rack1 Up Normal 32.88 MiB 100.00% 239923324758894351
172.30.56.61 rack1 Up Normal 32.84 MiB 100.00% 253117576269706963
172.30.56.60 rack1 Up Normal 32.72 MiB 100.00% 273249439554531014
172.30.56.61 rack1 Up Normal 32.84 MiB 100.00% 295635292275517104
172.30.56.62 rack1 Up Normal 32.88 MiB 100.00% 301162927966816823
I have two questions here:
1) When I execute the following query, does it mean that the coordinator (say 172.30.56.61) reads all the data from 172.30.56.60?
2) After receiving all 100K entries, does the coordinator perform the aggregation over the 100K rows? If so, does it keep all 100K entries in memory on 172.30.56.61?
SELECT Max(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
There is a nice tool called CQL TRACING that can help you understand and see the flow of events once a SELECT query is executed.
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 10);
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 20);
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 30);
cqlsh> select * from test.nodedata ;
nodeip | readiops
--------+-----------
1 | 10
1 | 20
1 | 30
(3 rows)
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;
system.max(readiops)
-----------------------
30
(1 rows)
Now let's set TRACING ON in cqlsh and run the same query again.
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;
system.max(readiops)
----------------------
30
(1 rows)
Tracing session: 4d7bf970-eada-11e7-a79d-000000000003
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------
Execute CQL3 query | 2017-12-27 07:48:44.404000 | 172.16.0.128 | 0
read_data: message received from /172.16.0.128 [shard 4] | 2017-12-27 07:48:44.385109 | 172.16.0.48 | 9
read_data handling is done, sending a response to /172.16.0.128 [shard 4] | 2017-12-27 07:48:44.385322 | 172.16.0.48 | 222
Parsing a statement [shard 1] | 2017-12-27 07:48:44.404821 | 172.16.0.128 | --
Processing a statement [shard 1] | 2017-12-27 07:48:44.404913 | 172.16.0.128 | 93
Creating read executor for token 6292367497774912474 with all: {172.16.0.128, 172.16.0.48, 172.16.0.115} targets: {172.16.0.48} repair decision: NONE [shard 1] | 2017-12-27 07:48:44.404966 | 172.16.0.128 | 146
read_data: sending a message to /172.16.0.48 [shard 1] | 2017-12-27 07:48:44.404972 | 172.16.0.128 | 152
read_data: got response from /172.16.0.48 [shard 1] | 2017-12-27 07:48:44.405497 | 172.16.0.128 | 676
Done processing - preparing a result [shard 1] | 2017-12-27 07:48:44.405535 | 172.16.0.128 | 715
Request complete | 2017-12-27 07:48:44.404722 | 172.16.0.128 | 722
As for your questions:
The Coordinator passes the query to the replica; if RF = 1, or RF > 1 and CL = ONE, it will receive the reply from a single replica, but if RF > 1 and CL > ONE, it needs to receive replies from multiple replicas and compare the answers, so there is also orchestration done on the Coordinator side.
The way it is actually done is a data request to the fastest replica (chosen using the snitch) and a digest request to the other replicas needed to satisfy the CL.
The coordinator then hashes the responses from the data and digest requests and compares them.
If the partition is hashed to a specific node, it will reside on that node (assuming RF = 1) and the data will be read only from that node.
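You can watch this orchestration change by raising the consistency level in cqlsh before re-running the traced query (a minimal sketch using the test table from above; with RF > 1, QUORUM makes the coordinator wait for responses from a majority of replicas instead of just one):
cqlsh> CONSISTENCY ONE;
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;
cqlsh> CONSISTENCY QUORUM;
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;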
The client sends the page size along with the query, so the reply itself is returned in pages (default = 5000 rows), and this can be set from the client side.
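In cqlsh, for example, you can change the page size with the PAGING command before running the query (a minimal sketch; 100 is an arbitrary value and the table is the test one from above):
cqlsh> PAGING 100;
cqlsh> SELECT * FROM test.nodedata WHERE nodeip = 1;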
I recommend watching this YouTube clip on the Cassandra read path for more details.
I have a timeout issue showing up with a few partition keys on the same table. All queries were working fine until yesterday on the same partition keys. All other queries on the same table are just fine, as expected.
I tried to query the same partition key in cqlsh, but it succeeds only a few times.
I have 3 nodes in GCP and the keyspace has a replication factor of 3. All nodes are showing up. I even tried to increase the read timeout in cassandra.yaml to 20 seconds.
What could be going on? The same query is returning different timeout errors.
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute_async'}, last_host=10.142.70.4
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures" info={'failures': 2, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures" info={'failures': 2, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures" info={'failures': 2, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
address | kind | id
----------------+---------+--------------------------------------
0x6854f5fffbbf | gateway | d5a807e0-b389-11e7-9c90-3bf17563ce39
(1 rows)
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
address | kind | id
----------------+---------+--------------------------------------
0x6854f5fffbbf | gateway | d5a807e0-b389-11e7-9c90-3bf17563ce39
(1 rows)
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute_async'}, last_host=10.142.70.4
cqlsh:us> tracing on;
Now Tracing is enabled
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
address | kind | id
----------------+---------+--------------------------------------
0x6854f5fffbbf | gateway | d5a807e0-b389-11e7-9c90-3bf17563ce39
(1 rows)
Tracing session: 2d7a7510-c056-11e7-b420-1f591a32c1ad
activity | timestamp | source | source_elapsed | client
---------------------------------------------------------------------------------------------------------------+----------------------------+-------------+----------------+-------------
Execute CQL3 query | 2017-11-03 05:16:42.082000 | 10.142.70.4 | 0 | 10.142.70.4
Parsing select address, kind, id from devices_mac where address=0x6854f5fffbbf; [Native-Transport-Requests-1] | 2017-11-03 05:16:42.082000 | 10.142.70.4 | 427 | 10.142.70.4
Preparing statement [Native-Transport-Requests-1] | 2017-11-03 05:16:42.082000 | 10.142.70.4 | 671 | 10.142.70.4
reading data from /10.142.70.3 [Native-Transport-Requests-1] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1135 | 10.142.70.4
Sending READ message to /10.142.70.3 [MessagingService-Outgoing-/10.142.70.3] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1390 | 10.142.70.4
speculating read retry on /10.142.70.2 [Native-Transport-Requests-1] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1619 | 10.142.70.4
Sending READ message to /10.142.70.2 [MessagingService-Outgoing-/10.142.70.2] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1863 | 10.142.70.4
INTERNAL_RESPONSE message received from /10.142.70.3 [MessagingService-Incoming-/10.142.70.3] | 2017-11-03 05:16:42.085000 | 10.142.70.4 | 3035 | 10.142.70.4
Processing response from /10.142.70.3 [InternalResponseStage:252] | 2017-11-03 05:16:42.085000 | 10.142.70.4 | 3688 | 10.142.70.4
READ message received from /10.142.70.4 [MessagingService-Incoming-/10.142.70.4] | 2017-11-03 05:16:42.088000 | 10.142.70.3 | 36 | 10.142.70.4
Executing single-partition query on devices_mac [ReadStage-4] | 2017-11-03 05:16:42.088000 | 10.142.70.3 | 292 | 10.142.70.4
Acquiring sstable references [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 478 | 10.142.70.4
Bloom filter allows skipping sstable 89 [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 551 | 10.142.70.4
Key cache hit for sstable 88 [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 580 | 10.142.70.4
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 651 | 10.142.70.4
Sending INTERNAL_RESPONSE message to /10.142.70.4 [MessagingService-Outgoing-/10.142.70.4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 1043 | 10.142.70.4
READ message received from /10.142.70.4 [MessagingService-Incoming-/10.142.70.4] | 2017-11-03 05:16:42.137000 | 10.142.70.2 | 16431 | 10.142.70.4
Executing single-partition query on devices_mac [ReadStage-2] | 2017-11-03 05:16:42.140000 | 10.142.70.2 | 19140 | 10.142.70.4
Acquiring sstable references [ReadStage-2] | 2017-11-03 05:16:42.140000 | 10.142.70.2 | 19641 | 10.142.70.4
Bloom filter allows skipping sstable 79 [ReadStage-2] | 2017-11-03 05:16:42.141000 | 10.142.70.2 | 20340 | 10.142.70.4
Key cache hit for sstable 78 [ReadStage-2] | 2017-11-03 05:16:42.141000 | 10.142.70.2 | 21034 | 10.142.70.4
Key cache hit for sstable 77 [ReadStage-2] | 2017-11-03 05:16:42.142000 | 10.142.70.2 | 21281 | 10.142.70.4
Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2] | 2017-11-03 05:16:42.142000 | 10.142.70.2 | 21759 | 10.142.70.4
REQUEST_RESPONSE message received from /10.142.70.2 [MessagingService-Incoming-/10.142.70.2] | 2017-11-03 05:16:42.151000 | 10.142.70.4 | 69022 | 10.142.70.4
Processing response from /10.142.70.2 [RequestResponseStage-2] | 2017-11-03 05:16:42.151000 | 10.142.70.4 | 69211 | 10.142.70.4
Merged data from memtables and 3 sstables [ReadStage-2] | 2017-11-03 05:16:42.153000 | 10.142.70.2 | 32443 | 10.142.70.4
Read 1 live and 1 tombstone cells [ReadStage-2] | 2017-11-03 05:16:42.153000 | 10.142.70.2 | 32792 | 10.142.70.4
Enqueuing response to /10.142.70.4 [ReadStage-2] | 2017-11-03 05:16:42.153000 | 10.142.70.2 | 33109 | 10.142.70.4
Sending REQUEST_RESPONSE message to /10.142.70.4 [MessagingService-Outgoing-/10.142.70.4] | 2017-11-03 05:16:42.154000 | 10.142.70.2 | 33905 | 10.142.70.4
Request complete | 2017-11-03 05:16:42.151744 | 10.142.70.4 | 69744 | 10.142.70.4
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.142.70.4
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.142.70.4
GC LOGS ----
2017-11-03T16:27:53.965+0000: 142177.812: Total time for which application threads were stopped: 0.0109614 seconds, Stopping threads took: 0.0000757 seconds
{Heap before GC invocations=4748 (full 33):
par new generation total 184320K, used 120338K [0x000000008b200000, 0x0000000097a00000, 0x0000000097a00000)
eden space 163840K, 73% used [0x000000008b200000, 0x0000000092764dc0, 0x0000000095200000)
from space 20480K, 0% used [0x0000000095200000, 0x000000009521fe20, 0x0000000096600000)
to space 20480K, 0% used [0x0000000096600000, 0x0000000096600000, 0x0000000097a00000)
concurrent mark-sweep generation total 1710080K, used 697432K [0x0000000097a00000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 46344K, capacity 47706K, committed 47892K, reserved 1091584K
class space used 5827K, capacity 6111K, committed 6164K, reserved 1048576K
2017-11-03T16:27:54.857+0000: 142178.704: [GC (Allocation Failure) 2017-11-03T16:27:54.857+0000: 142178.704: [ParNew
Desired survivor size 10485760 bytes, new threshold 1 (max 1)
- age 1: 51336 bytes, 51336 total
: 120338K->80K(184320K), 0.0105980 secs] 817771K->697549K(1894400K), 0.0107087 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
Heap after GC invocations=4749 (full 33):
par new generation total 184320K, used 80K [0x000000008b200000, 0x0000000097a00000, 0x0000000097a00000)
eden space 163840K, 0% used [0x000000008b200000, 0x000000008b200000, 0x0000000095200000)
from space 20480K, 0% used [0x0000000096600000, 0x0000000096614070, 0x0000000097a00000)
to space 20480K, 0% used [0x0000000095200000, 0x0000000095200000, 0x0000000096600000)
concurrent mark-sweep generation total 1710080K, used 697469K [0x0000000097a00000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 46344K, capacity 47706K, committed 47892K, reserved 1091584K
class space used 5827K, capacity 6111K, committed 6164K, reserved 1048576K
}
2017-11-03T16:27:54.868+0000: 142178.715: Total time for which application threads were stopped: 0.0115291 seconds, Stopping threads took: 0.0000898 seconds
{Heap before GC invocations=4749 (full 33):
par new generation total 184320K, used 163920K [0x000000008b200000, 0x0000000097a00000, 0x0000000097a00000)
eden space 163840K, 100% used [0x000000008b200000, 0x0000000095200000, 0x0000000095200000)
from space 20480K, 0% used [0x0000000096600000, 0x0000000096614070, 0x0000000097a00000)
to space 20480K, 0% used [0x0000000095200000, 0x0000000095200000, 0x0000000096600000)
concurrent mark-sweep generation total 1710080K, used 697469K [0x0000000097a00000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 46344K, capacity 47706K, committed 47892K, reserved 1091584K
class space used 5827K, capacity 6111K, committed 6164K, reserved 1048576K
2017-11-03T16:28:05.336+0000: 142189.183: [GC (Allocation Failure) 2017-11-03T16:28:05.336+0000: 142189.183: [ParNew
Desired survivor size 10485760 bytes, new threshold 1 (max 1)
- age 1: 94096 bytes, 94096 total
: 163920K->115K(184320K), 0.0389448 secs] 861389K->771393K(1894400K), 0.0390719 secs] [Times: user=0.08 sys=0.00, real=0.04 secs]
CFSTATS for my table ---
Keyspace : us
Read Count: 26141
Read Latency: 0.12058234191499943 ms.
Write Count: 50140
Write Latency: 0.052054906262465096 ms.
Pending Flushes: 0
Table: devices_mac
SSTable count: 2
Space used (live): 223080
Space used (total): 223080
Space used by snapshots (total): 0
Off heap memory used (total): 1068
SSTable Compression Ratio: 0.21666133520615916
Number of keys (estimate): 615
Memtable cell count: 2
Memtable data size: 8719
Memtable off heap memory used: 0
Memtable switch count: 3
Local read count: 186
Local read latency: NaN ms
Local write count: 978
Local write latency: 0.208 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 848
Bloom filter off heap memory used: 832
Index summary off heap memory used: 108
Compression metadata off heap memory used: 128
Compacted partition minimum bytes: 643
Compacted partition maximum bytes: 14237
Compacted partition mean bytes: 1372
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
CFHISTOGRAMS for the table
us/devices_mac histograms
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 0.00 0.00 1331 60
75% 0.00 0.00 0.00 1597 60
95% 0.00 0.00 0.00 1597 60
98% 0.00 0.00 0.00 2759 124
99% 0.00 0.00 0.00 3311 124
Min 0.00 0.00 0.00 643 21
Max 0.00 0.00 0.00 14237 149
Please help me understand what I missed.
I see strange behavior from one cluster node on a SELECT with LIMIT and ORDER BY DESC clauses:
SELECT cid FROM test_cf WHERE uid = 0x50236b6de695baa1140004bf ORDER BY tuuid DESC LIMIT 1000;
TRACING (only part):
…
Sending REQUEST_RESPONSE message to /10.0.25.56 [MessagingService-Outgoing-/10.0.25.56] | 2016-02-29 22:17:25.117000 | 10.0.23.15 | 7862
Sending REQUEST_RESPONSE message to /10.0.25.56 [MessagingService-Outgoing-/10.0.25.56] | 2016-02-29 22:17:25.136000 | 10.0.25.57 | 6283
Sending REQUEST_RESPONSE message to /10.0.25.56 [MessagingService-Outgoing-/10.0.25.56] | 2016-02-29 22:17:38.568000 | 10.0.24.51 | 457931
…
10.0.25.56 - coordinator node
10.0.23.15, 10.0.24.51, 10.0.25.57 - node with data
The coordinator gets the response from 10.0.24.51 13 seconds later than from the other nodes! Why is that? How can I fix it?
The number of rows for the partition key (uid = 0x50236b6de695baa1140004bf) is about 300.
Everything is fine if we use ORDER BY ASC (our clustering order) or a LIMIT value smaller than the number of rows for this partition key.
The Cassandra (v2.2.5) cluster contains 25 nodes.
Every node holds about 400 GB of data.
The cluster runs in AWS. Nodes are evenly distributed over 3 subnets in a VPC. The instance type is c3.4xlarge (16 CPU cores, 30 GB RAM). We use EBS-backed storage (1 TB GP SSD).
The keyspace RF equals 3.
Column family:
CREATE TABLE test_cf (
uid blob,
tuuid timeuuid,
cid text,
cuid blob,
PRIMARY KEY (uid, tuuid)
) WITH CLUSTERING ORDER BY (tuuid ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction ={'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression ={'sstable_compression':'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
nodetool gcstats (10.0.25.57):
Interval (ms)   Max GC Elapsed (ms)   Total GC Elapsed (ms)   Stdev GC Elapsed (ms)   GC Reclaimed (MB)   Collections   Direct Memory Bytes
1208504 368 4559 73 553798792712 58 305691840
nodetool gcstats (10.0.23.15):
Interval (ms)   Max GC Elapsed (ms)   Total GC Elapsed (ms)   Stdev GC Elapsed (ms)   GC Reclaimed (MB)   Collections   Direct Memory Bytes
1445602 369 3120 57 381929718000 38 277907601
nodetool gcstats (10.0.24.51):
Interval (ms)   Max GC Elapsed (ms)   Total GC Elapsed (ms)   Stdev GC Elapsed (ms)   GC Reclaimed (MB)   Collections   Direct Memory Bytes
1174966 397 4137 69 1900387479552 45 304448986
This could be due to a number of factors both related and not related to Cassandra.
Non-Cassandra Specific
How does the hardware (CPU / RAM / disk type, SSD vs. rotational) on this node compare to the other nodes?
How is the network configured? Is traffic to this node slower than other nodes? Do you have a routing issue between the nodes?
How does the load on this server compare to other nodes?
Cassandra Specific
Is the JVM properly configured? Is GC running significantly more frequently than on the other nodes? Check nodetool gcstats on this and the other nodes to compare.
Has compaction been run on this node recently? Check nodetool compactionhistory.
Are there any issues with corrupted files on disk?
Have you checked the system.log to see if it contains any relevant information?
Besides general Linux troubleshooting, I would suggest you compare some of the specific C* functionality using nodetool and look for differences (a sketch of useful commands follows the link below):
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsNodetool_r.html
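A quick way to spot differences is to run the same nodetool commands on the slow node and on a healthy node and compare the output (a sketch; replace the keyspace and table with the ones you are querying):
nodetool tpstats
nodetool compactionstats
nodetool compactionhistory
nodetool proxyhistograms
nodetool cfhistograms <keyspace> <table>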
We are currently testing Cassandra with the following table schema:
CREATE TABLE coreglead_v2.stats_by_site_user (
d_tally text, -- ex.: '2016-01', '2016-02', etc..
site_id int,
d_date timestamp,
site_user_id int,
accepted counter,
error counter,
impressions_negative counter,
impressions_positive counter,
rejected counter,
revenue counter,
reversals_rejected counter,
reversals_revenue counter,
PRIMARY KEY (d_tally, site_id, d_date, site_user_id)
) WITH CLUSTERING ORDER BY (site_id ASC, d_date ASC, site_user_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
For our test purposes, we have written a Python script that randomises data across the 2016 calendar (12 months in total). We expect our partition key to be the d_tally column, and therefore we expect the number of keys to be 12 (from '2016-01' to '2016-12').
Running nodetool cfstats is showing us the following though:
Table: stats_by_site_user
SSTable count: 4
Space used (live): 131977793
Space used (total): 131977793
Space used by snapshots (total): 0
Off heap memory used (total): 89116
SSTable Compression Ratio: 0.18667406304929424
Number of keys (estimate): 24
Memtable cell count: 120353
Memtable data size: 23228804
Memtable off heap memory used: 0
Memtable switch count: 10
Local read count: 169
Local read latency: 1.938 ms
Local write count: 4912464
Local write latency: 0.066 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 128
Bloom filter off heap memory used: 96
Index summary off heap memory used: 76
Compression metadata off heap memory used: 88944
Compacted partition minimum bytes: 5839589
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 16102786
Average live cells per slice (last five minutes): 102.91627247589237
Maximum live cells per slice (last five minutes): 103
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
What is confusing us is the "Number of keys (estimate): 24" part. Looking at our schema and assuming our test data (over 5 million writes) is made up of just 2016 data, where does the 24 keys estimate come from?
Here is an example of our data:
d_tally | site_id | d_date | site_user_id | accepted | error | impressions_negative | impressions_positive | rejected | revenue | reversals_rejected | reversals_revenue
---------+---------+--------------------------+--------------+----------+-------+----------------------+----------------------+----------+---------+--------------------+-------------------
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 240054 | 1 | null | null | 1 | null | 553 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1263968 | 1 | null | null | 1 | null | 1093 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1267841 | 1 | null | null | 1 | null | 861 | null | null
2016-01 | 1 | 2016-01-01 00:00:00+0000 | 1728725 | 1 | null | null | 1 | null | 425 | null | null
The number of keys is an estimate (although it should be very close). Cassandra takes a sketch of the data from each sstable and merges them together to estimate the cardinality (HyperLogLog).
Unfortunately, an equivalent sketch does not exist for the memtable, so Cassandra adds the cardinality of the memtable to the sstable estimate. This means partitions present in both memtables and sstables are double counted, which is why you see 24 instead of 12.
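One way to sanity-check this (assuming you can afford a flush on your test cluster) is to flush the memtable so the estimate is computed from sstables alone, then look at cfstats again; the estimate should then land much closer to the expected 12:
nodetool flush coreglead_v2 stats_by_site_user
nodetool cfstats coreglead_v2.stats_by_site_user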
I am experiencing really poor performance with Cassandra 2.1.5. I am new to this, so I would appreciate any advice on how to debug it. Here is what my table looks like:
Keyspace: nt_live_october
Read Count: 6
Read Latency: 20837.149166666666 ms.
Write Count: 39799
Write Latency: 0.45696595391844014 ms.
Pending Flushes: 0
Table: nt
SSTable count: 12
Space used (live): 15903191275
Space used (total): 15971044770
Space used by snapshots (total): 0
Off heap memory used (total): 14468424
SSTable Compression Ratio: 0.1308103413354315
Number of keys (estimate): 740
Memtable cell count: 43483
Memtable data size: 9272510
Memtable off heap memory used: 0
Memtable switch count: 17
Local read count: 6
Local read latency: 20837.150 ms
Local write count: 39801
Local write latency: 0.457 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 4832
Bloom filter off heap memory used: 4736
Index summary off heap memory used: 576
Compression metadata off heap memory used: 14463112
Compacted partition minimum bytes: 6867
Compacted partition maximum bytes: 30753941057
Compacted partition mean bytes: 44147544
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
I am issuing the following query via cqlsh:
cassandra#cqlsh> TRACING ON; Tracing is already enabled. Use TRACING OFF to disable.
cassandra#cqlsh> CONSISTENCY;
Current consistency level is ONE.
cassandra#cqlsh> select * from nt_live_october.nt where group_id='254358' and epoch >=1444313898 and epoch<=1444348800 LIMIT 1;
OperationTimedOut: errors={}, last_host=XXX.203
Statement trace did not complete within 10 seconds
and here is what system_traces.events shows:
xxx.xxx.xxx.203 | 1281 | Parsing select * from nt_live_october.nt where group_id='254358'\nand epoch >=1443916800 and epoch<=1444348800\nLIMIT 30;
xxx.xxx.xxx.203 | 2604 | Preparing statement
xxx.xxx.xxx.203 | 8454 | Executing single-partition query on users
xxx.xxx.xxx.203 | 8474 | Acquiring sstable references
xxx.xxx.xxx.203 | 8547 | Merging memtable tombstones
xxx.xxx.xxx.203 | 8675 | Key cache hit for sstable 1
xxx.xxx.xxx.203 | 8685 | Seeking to partition beginning in data file
xxx.xxx.xxx.203 | 9040 | Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones
xxx.xxx.xxx.203 | 9056 | Merging data from memtables and 1 sstables
xxx.xxx.xxx.203 | 9120 | Read 1 live and 0 tombstone cells
xxx.xxx.xxx.203 | 9854 | Read-repair DC_LOCAL
xxx.xxx.xxx.203 | 10033 | Executing single-partition query on users
xxx.xxx.xxx.203 | 10046 | Acquiring sstable references
xxx.xxx.xxx.203 | 10105 | Merging memtable tombstones
xxx.xxx.xxx.203 | 10189 | Key cache hit for sstable 1
xxx.xxx.xxx.203 | 10198 | Seeking to partition beginning in data file
xxx.xxx.xxx.203 | 10248 | Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones
xxx.xxx.xxx.203 | 10261 | Merging data from memtables and 1 sstables
xxx.xxx.xxx.203 | 10296 | Read 1 live and 0 tombstone cells
xxx.xxx.xxx.203 | 12511 | Executing single-partition query on nt
xxx.xxx.xxx.203 | 12525 | Acquiring sstable references
xxx.xxx.xxx.203 | 12587 | Merging memtable tombstones
xxx.xxx.xxx.203 | 18067 | speculating read retry on /xxx.xxx.xxx.205
xxx.xxx.xxx.203 | 18577 | Sending READ message to xxx.xxx.xxx.205/xxx.xxx.xxx.205
xxx.xxx.xxx.203 | 25534 | Partition index with 6093 entries found for sstable 8885
xxx.xxx.xxx.203 | 25571 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 34989 | Partition index with 5327 entries found for sstable 8524
xxx.xxx.xxx.203 | 35022 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 36322 | Partition index with 333 entries found for sstable 8477
xxx.xxx.xxx.203 | 36336 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 714242 | Partition index with 299251 entries found for sstable 8541
xxx.xxx.xxx.203 | 714279 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 715717 | Partition index with 501 entries found for sstable 8217
xxx.xxx.xxx.203 | 715745 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 716232 | Partition index with 252 entries found for sstable 8888
xxx.xxx.xxx.203 | 716245 | Seeking to partition indexed section in data file
xxx.xxx.xxx.205 | 87 | READ message received from /xxx.xxx.xxx.203
xxx.xxx.xxx.205 | 50427 | Executing single-partition query on nt
xxx.xxx.xxx.205 | 50535 | Acquiring sstable references
xxx.xxx.xxx.205 | 50628 | Merging memtable tombstones
xxx.xxx.xxx.205 | 170441 | Partition index with 35650 entries found for sstable 6332
xxx.xxx.xxx.203 | 30718026 | Partition index with 199905 entries found for sstable 5958
xxx.xxx.xxx.203 | 30718077 | Seeking to partition indexed section in data file
xxx.xxx.xxx.205 | 170499 | Seeking to partition indexed section in data file
xxx.xxx.xxx.205 | 248898 | Partition index with 30958 entries found for sstable 6797
xxx.xxx.xxx.205 | 248962 | Seeking to partition indexed section in data file
xxx.xxx.xxx.203 | 67814573 | Read timeout: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
xxx.xxx.xxx.203 | 67814675 | Timed out; received 0 of 1 responses
I have 4 nodes, with a replication factor of 3 (one node is very light, but it's not .203). The data I'm trying to read isn't very much -- even if LIMIT 1 is not being pushed to the remote node, the low end of the interval should be about 3 hours ago (I have no epochs past the current time).
Any tips on how to fix this / what might be going wrong? My Cassandra version is 2.1.9, running largely with defaults.
The table schema is as follows (I can't publish the whole schema for privacy reasons, but I'm showing the keys, which I hope is the main thing that matters):
PRIMARY KEY (group_id, epoch, group_name, auto_generated_uuid_field)
) WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 7776000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
___________EDIT_____________
To answer the questions from below:
Output of status:
-- Address Load Tokens Owns Host ID Rack
DN xxx.xxx.xxx.204 15.8 GB 1 ? 32ed196b-f6eb-4e93-b759 r1
UN xxx.xxx.xxx.205 20.38 GB 1 ? 446d71aa-e9cd-4ca9-a6ac r1
UN xxx.xxx.xxx.202 1.48 GB 1 ? 2a6670b2-63f2-43be-b672 r1
UN xxx.xxx.xxx.203 15.72 GB 1 ? dd26dfee-82da-454b-8db2 r1
The system.log is trickier as I have a lot of logging in there...one suspect thing I see is
WARN [CompactionExecutor:6] 2015-10-08 19:44:16,595 SSTableWriter.java (line 240) Compacting large partition nt_live_october/nt:254358 (230692316 bytes)
but it's just a warning...shortly after I see
INFO [CompactionExecutor:6] 2015-10-08 19:44:16,642 CompactionTask.java (line 274) Compacted 4 sstables to [/cassandra/data_dir_d/nt_live_october/nt-72813b106b9111e58f1ea1f0942ab78d/nt_live_october-nt-ka-9024,]. 35,733,701 bytes to 30,186,394 (~84% of original) in 34,907ms = 0.824705MB/s. 21 total partitions merged to 18. Partition merge counts were {1:17, 4:1, }
I see quite a few of these pairs in the log, but no ERROR level messages. Compaction seems to be going OK; it does say that this is the largest column family, but all the messages are INFO level.
First, the DN status of node 204 means it is down. Retrieve its system.log and look for:
Exceptions and ERROR level logs
Abnormal GC activity (collections longer than 200 ms)
StatusLogger entries
Second, the data is badly distributed among the cluster. The load of 202 is only 1.48 GB. I suspect you have some very large partitions replicated on the other nodes. What is the replication factor? What is the schema of your keyspace? You can answer these questions with the cqlsh command:
DESCRIBE KEYSPACE nt_live_october;
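It is also worth checking how big the partitions in that table actually are on each node with nodetool (a sketch, assuming the table from the cfstats output above, nt_live_october.nt):
nodetool cfstats nt_live_october.nt
nodetool cfhistograms nt_live_october nt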
A simple table join usually completes in 0.0XX seconds, but sometimes takes 2.0XX seconds (according to PL/SQL Developer SQL execution). It still happens when running from SQL*Plus.
If I run the SQL 10 times, 8 times it runs fine and 2 times it takes 2+ seconds.
It's a clean install of Oracle 11.2.0.4 for Linux x86_64 on CentOS 7.
I've installed Oracle recommended patches:
Patch 19769489 - Database Patch Set Update 11.2.0.4.5 (Includes CPUJan2015)
Patch 19877440 - Oracle JavaVM Component 11.2.0.4.2 Database PSU (Jan2015)
No change after patching.
The 2 tables have:
LNK_PACK_REP: 13 rows
PACKAGES: 6 rows
In SQL*Plus I've enabled all statistics and ran the SQL multiple times. Only the elapsed time changes, from 0.1 to 2.1 seconds from time to time. No other statistic changes when I compare a 0.1-second run with a 2.1-second run. The server has 16 GB of RAM and 8 CPU cores. Server load is under 0.1 (no user is using the server at the moment).
Output:
SQL> select PACKAGE_ID, id, package_name from LNK_PACK_REP LNKPR INNER JOIN PACKAGES P ON LNKPR.PACKAGE_ID = P.ID;
PACKAGE_ID ID PACKAGE_NAME
3 3 RAPOARTE
3 3 RAPOARTE
121 121 VANZARI
121 121 VANZARI
121 121 VANZARI
2 2 PACHETE
2 2 PACHETE
1 1 DEPARTAMENTE
1 1 DEPARTAMENTE
81 81 ROLURI
81 81 ROLURI
PACKAGE_ID ID PACKAGE_NAME
101 101 UTILIZATORI
101 101 UTILIZATORI
13 rows selected.
Elapsed: 00:00:02.01
Execution Plan
Plan hash value: 2671988802
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 13 | 351 | 3 (0)| 00:00:01 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 13 | 351 | 3 (0)| 00:00:01 | Q1,02 | P->S | QC (RAND) |
|* 3 | HASH JOIN | | 13 | 351 | 3 (0)| 00:00:01 | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 6 | 84 | 2 (0)| 00:00:01 | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | P->P | HASH |
| 6 | PX BLOCK ITERATOR | | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | PCWC | |
| 7 | TABLE ACCESS FULL| PACKAGES | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | PCWP | |
| 8 | BUFFER SORT | | | | | | Q1,02 | PCWC | |
| 9 | PX RECEIVE | | 13 | 169 | 1 (0)| 00:00:01 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10000 | 13 | 169 | 1 (0)| 00:00:01 | | S->P | HASH |
| 11 | INDEX FULL SCAN | UNQ_PACK_REP | 13 | 169 | 1 (0)| 00:00:01 | | | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
3 - access("LNKPR"."PACKAGE_ID"="P"."ID")
Note
dynamic sampling used for this statement (level=2)
Statistics
24 recursive calls
0 db block gets
10 consistent gets
0 physical reads
0 redo size
923 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
4 sorts (memory)
0 sorts (disk)
13 rows processed
Table 1 structure:
-- Create table
create table PACKAGES
(
id NUMBER(3) not null,
package_name VARCHAR2(150),
position NUMBER(3),
activ NUMBER(1)
)
tablespace UM
pctfree 10
initrans 1
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate primary, unique and foreign key constraints
alter table PACKAGES
add constraint PACKAGES_ID primary key (ID)
using index
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate indexes
create index PACKAGES_ACTIV on PACKAGES (ID, ACTIV)
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
Table 2 structure:
-- Create table
create table LNK_PACK_REP
(
package_id NUMBER(3) not null,
report_id NUMBER(3) not null
)
tablespace UM
pctfree 10
initrans 1
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate primary, unique and foreign key constraints
alter table LNK_PACK_REP
add constraint UNQ_PACK_REP primary key (PACKAGE_ID, REPORT_ID)
using index
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate indexes
create index LNK_PACK_REP_REPORT_ID on LNK_PACK_REP (REPORT_ID)
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
In Oracle Enterprise Manager, in SQL Monitor, I can see the SQL that is run multiple times. All runs have a "Database Time" of 0.0s (under 10 microseconds if I hover over the list) and a "Duration" of 0.0s for normal runs and 2.0s for those with the delay.
If I go to Monitored SQL Executions for that 2.0s run, I have:
Duration: 2.0s
Database Time: 0.0s
PL/SQL & Java: 0.0
Wait activity: % (no number here)
Buffer gets: 10
IO Requests: 0
IO Bytes: 0
Fetch calls: 2
Parallel: 4
These numbers are consistent with a fast run, except for Duration; in a fast run, Duration is even smaller than Database Time (10,163 microseconds Database Time and 3,748 microseconds Duration), both displayed as 0.0s if I don't hover the mouse.
I don't know what else to check.
Parallel queries cannot be meaningfully tuned to within a few seconds. They are designed for queries that process large amounts of data for a long time.
The best way to optimize parallel statements with small data sets is to temporarily disable it:
alter system set parallel_max_servers=0;
(This is a good example of the advantages of developing on workstations instead of servers. On a server, this change affects everyone and you probably don't even have the privilege to run the command.)
The query may be simple but parallelism adds a lot of complexity in the background.
It's hard to say exactly why it's slower. If you have the SQL Monitoring report the wait events may help. But even those numbers may just be generic waits like "CPU". Parallel queries have a lot of overhead, in expectation of a resource-intensive, long-running query. Here are some types of overhead that may explain where those 2 seconds come from:
Dynamic sampling - Parallelism may automatically cause dynamic sampling, which reads data from the tables. Although "dynamic sampling used for this statement (level=2)" may just imply missing optimizer statistics.
OS thread startup - The SQL statement probably needs to start up 8 additional OS threads and prepare a large amount of memory to hold all the intermediate data. Perhaps the parameter PARALLEL_MIN_SERVERS could help prevent some of the time spent creating those threads.
Additional monitoring - Parallel statements are automatically monitored, which requires recursive SELECTs and INSERTs.
Caching - Parallel queries often read directly from disk and skip reading and writing into the buffer cache. The rules for when it caches data are complicated and undocumented.
Downgrading - Finding the correct degree of parallelism is complicated. For example, I've compiled a list of 39 factors that influence the DOP. It's possible that one of those is causing downgrading, making some queries fast and others slow.
And there are probably dozens of other types of overhead I can't think of. Parallelism is great for massively improving the run-time of huge operations. But it doesn't work well for tiny queries.
The delay is due to parallelism, as suggested by David Aldridge and Jon Heller, but I don't agree with the solution proposed by Jon Heller to disable parallelism for all queries (at the system level). You can play with "alter session" to disable it and re-enable it before running big queries. The exact reason for the delay is still unknown, as the query finishes fast in 8 out of 10 runs and I would expect a 10/10 fast run.
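A minimal sketch of that session-level approach (standard Oracle statements; the join is the one from the question):
ALTER SESSION DISABLE PARALLEL QUERY;
-- run the small query serially
SELECT package_id, id, package_name
FROM lnk_pack_rep lnkpr INNER JOIN packages p ON lnkpr.package_id = p.id;
-- re-enable parallelism before running big queries
ALTER SESSION ENABLE PARALLEL QUERY;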