Can't trace hinted handoff in Cassandra

I'm trying to emulate hinted handoff using a Cassandra cluster in Docker.
Hinted handoff is active:
root@2f5aa8d649e2:/# nodetool statushandoff
Hinted handoff is running
The keyspace has a replication factor of 3:
cqlsh> DESCRIBE KEYSPACE imdb;
CREATE KEYSPACE imdb WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '1'} AND durable_writes = true;
Then I shut one node down, turn on tracing and insert a new row:
cqlsh:imdb> insert into movies_by_actor (actor, movie_id, character, movie_title, salary) values ('TomHanks', uuid(), 'Character', 'Title', 1000);
Tracing session: e4a2cc20-42ce-11e7-bd49-cf534e0135c6
activity | timestamp | source | source_elapsed | client
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+------------+----------------+-----------
Execute CQL3 query | 2017-05-27 11:23:22.466000 | 172.13.0.2 | 0 | 127.0.0.1
Parsing insert into movies_by_actor (actor, movie_id, character, movie_title, salary) values ('TomHanks', uuid(), 'Character', 'Title', 1000); [Native-Transport-Requests-1] | 2017-05-27 11:23:22.467000 | 172.13.0.2 | 364 | 127.0.0.1
Preparing statement [Native-Transport-Requests-1] | 2017-05-27 11:23:22.467000 | 172.13.0.2 | 727 | 127.0.0.1
Determining replicas for mutation [Native-Transport-Requests-1] | 2017-05-27 11:23:22.468000 | 172.13.0.2 | 1354 | 127.0.0.1
Sending MUTATION message to /172.13.0.3 [MessagingService-Outgoing-/172.13.0.3-Small] | 2017-05-27 11:23:22.468000 | 172.13.0.2 | 1722 | 127.0.0.1
Sending MUTATION message to /172.13.0.6 [MessagingService-Outgoing-/172.13.0.6-Small] | 2017-05-27 11:23:22.468000 | 172.13.0.2 | 1722 | 127.0.0.1
MUTATION message received from /172.13.0.2 [MessagingService-Incoming-/172.13.0.2] | 2017-05-27 11:23:22.469000 | 172.13.0.3 | 30 | 127.0.0.1
MUTATION message received from /172.13.0.2 [MessagingService-Incoming-/172.13.0.2] | 2017-05-27 11:23:22.469000 | 172.13.0.6 | 35 | 127.0.0.1
Appending to commitlog [MutationStage-1] | 2017-05-27 11:23:22.469000 | 172.13.0.3 | 294 | 127.0.0.1
Appending to commitlog [MutationStage-1] | 2017-05-27 11:23:22.469000 | 172.13.0.6 | 292 | 127.0.0.1
Adding to movies_by_actor memtable [MutationStage-1] | 2017-05-27 11:23:22.469000 | 172.13.0.6 | 486 | 127.0.0.1
Enqueuing response to /172.13.0.2 [MutationStage-1] | 2017-05-27 11:23:22.469000 | 172.13.0.6 | 660 | 127.0.0.1
REQUEST_RESPONSE message received from /172.13.0.3 [MessagingService-Incoming-/172.13.0.3] | 2017-05-27 11:23:22.470000 | 172.13.0.2 | 3659 | 127.0.0.1
Processing response from /172.13.0.3 [RequestResponseStage-2] | 2017-05-27 11:23:22.470000 | 172.13.0.2 | 3820 | 127.0.0.1
Sending REQUEST_RESPONSE message to /172.13.0.2 [MessagingService-Outgoing-/172.13.0.2-Small] | 2017-05-27 11:23:22.472000 | 172.13.0.6 | 3533 | 127.0.0.1
REQUEST_RESPONSE message received from /172.13.0.6 [MessagingService-Incoming-/172.13.0.6] | 2017-05-27 11:23:22.473000 | 172.13.0.2 | 34 | 127.0.0.1
Processing response from /172.13.0.6 [RequestResponseStage-3] | 2017-05-27 11:23:22.473000 | 172.13.0.2 | 523 | 127.0.0.1
Request complete | 2017-05-27 11:23:22.469919 | 172.13.0.2 | 3919 | 127.0.0.1
As the trace shows, the coordinator node 172.13.0.2 processes the request and contacts nodes 172.13.0.3 and 172.13.0.6. I would expect node 172.13.0.2 to store a hint, since the third replica is unavailable. But when I check the system.hints table, it is empty:
cqlsh:imdb> select * from system.hints;
target_id | hint_id | message_version | mutation
-----------+---------+-----------------+----------
(0 rows)
The consistency level is the default, ONE. Could someone explain where the hint is stored, if at all?

Recent versions of Cassandra do not store hints in the system.hints table.
Since Cassandra 3.0, hints are stored in flat files. If you are running Cassandra 3.0 or later, look in the hints directory configured in cassandra.yaml:
# Directory where Cassandra should store hints.
# If not set, the default directory is $CASSANDRA_HOME/data/hints.
hints_directory: "C:/Program Files/DataStax-DDC/data/hints"
# How often hints should be flushed from the internal buffers to disk.
# Will *not* trigger fsync.
hints_flush_period_in_ms: 10000
Check these two settings in your cassandra.yaml and look for hint files in the configured directory.
Hinted Handoff in cassandra 3.0
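To check whether any hint files exist, a small script can list the hints directory. The sketch below is only an illustration: it assumes hint files are named `<host-id>-<timestamp>-<version>.hints`, where `<host-id>` is the 36-character UUID of the target node, and the function name `list_hint_files` is made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def list_hint_files(hints_dir):
    """Group the *.hints files in hints_dir by target host ID.

    Assumption: file names look like <host-id>-<timestamp>-<version>.hints,
    with <host-id> being a 36-character UUID. Other files are ignored.
    """
    by_host = defaultdict(list)
    for path in sorted(Path(hints_dir).glob("*.hints")):
        host_id = path.name[:36]  # UUID prefix of the file name
        by_host[host_id].append(path.name)
    return dict(by_host)
```

Calling `list_hint_files()` with the `hints_directory` value from cassandra.yaml on the coordinator node returns an empty dict when no hint files are present. With `hints_flush_period_in_ms: 10000`, a freshly written hint may take several seconds to appear on disk, so it is worth waiting briefly before checking.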

Related

Cassandra select not stable using datastax driver

Versions:
com.datastax.oss
-java-driver-core:4.5.0
-java-driver-query-builder:4.5.0
-java-driver-mapper-runtime:4.5.0
cassandra:3.11.5 docker image
jdk 11.1
I'm running a deployment of Feast that I've modified to use Cassandra as a low-latency backend serving database for machine learning features. I'm successfully writing and reading rows, but the reads return inconsistent results: sometimes the payloads are empty and I don't know why. I have already tried updating to the latest DataStax driver and synchronizing clocks using NTP (time.google.com). I've also tried changing the write consistency to ALL and the read consistency to LOCAL_ONE/LOCAL_QUORUM, without success. I'm really struggling to figure out why the select isn't consistent. Any insight would be great! :) Here is the process:
I write the rows into cassandra using CassandraIO
@Override
public Future<Void> saveAsync(CassandraMutation entityClass) {
return mapper.saveAsync(
entityClass,
Option.timestamp(entityClass.getWriteTime()),
Option.ttl(entityClass.getTtl()),
Option.consistencyLevel(ConsistencyLevel.LOCAL_QUORUM),
Option.tracing(true));
}
This seems to successfully map rows into my Cassandra cluster, which I then query in my application as follows:
List<InetSocketAddress> contactPoints =
Arrays.stream(cassandraConfig.getBootstrapHosts().split(","))
.map(h -> new InetSocketAddress(h, cassandraConfig.getPort()))
.collect(Collectors.toList());
CqlSession session =
CqlSession.builder()
.addContactPoints(contactPoints)
.withLocalDatacenter(storeProperties.getCassandraDcName())
.build();
....
PreparedStatement query =
session.prepare(
String.format(
"SELECT entities, feature, value, WRITETIME(value) as writetime FROM %s.%s WHERE entities = ?",
keyspace, tableName));
session.execute(
query
.bind(key)
.setTracing(true)
.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));
My issue is that the selects do not return consistent results. I have been recording traces for a while. For example, here are two select queries with the same coordinator: the first succeeded, and the subsequent one failed to return results.
cqlsh> select * from system_traces.sessions where session_id=be023400-6a1e-11ea-97ca-6b8bbe3a2a36;
session_id | client | command | coordinator | duration | parameters | request | started_at
--------------------------------------+---------------+---------+---------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------+---------------------------------
be023400-6a1e-11ea-97ca-6b8bbe3a2a36 | xx.xx.xxx.189 | QUERY | xx.xx.xxx.158 | 41313 | {'bound_var_0_entities': '''ml_project/test_test_entity:1:entity2_uuid=TenderGreens_8755fff7|entity1_uuid=Zach_Yang_fe7fea92''', 'consistency_level': 'LOCAL_QUORUM', 'page_size': '5000', 'query': 'SELECT entities, feature, value, WRITETIME(value) as writetime FROM feast.feature_store WHERE entities = ?', 'serial_consistency_level': 'SERIAL'} | Execute CQL3 prepared query | 2020-03-19 20:18:05.760000+0000
cqlsh> select event_id, activity, source_elapsed, thread from system_traces.events where session_id=be023400-6a1e-11ea-97ca-6b8bbe3a2a36;
event_id | activity | source_elapsed | thread
--------------------------------------+---------------------------------------------------------------------------------------------------------------+----------------+-----------------------------------------------------------------------------------------------------------
be0652b0-6a1e-11ea-97ca-6b8bbe3a2a36 | Read-repair DC_LOCAL | 27087 | Native-Transport-Requests-1
be0679c0-6a1e-11ea-97ca-6b8bbe3a2a36 | reading data from /xx.xx.xxx.161 | 28034 | Native-Transport-Requests-1
be06a0d0-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to /xx.xx.xxx.161 | 28552 | MessagingService-Outgoing-/xx.xx.xxx.161-Small
be06a0d1-6a1e-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.162 | 28595 | Native-Transport-Requests-1
be06a0d2-6a1e-11ea-97ca-6b8bbe3a2a36 | Executing single-partition query on feature_store | 28598 | ReadStage-3
be06a0d3-6a1e-11ea-97ca-6b8bbe3a2a36 | Acquiring sstable references | 28689 | ReadStage-3
be06a0d4-6a1e-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xx.138 | 28852 | Native-Transport-Requests-1
be06a0d5-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to /xx.xx.xxx.162 | 28904 | MessagingService-Outgoing-/xx.xx.xxx.162-Small
be06a0d6-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 56 | 28937 | ReadStage-3
be06a0d7-6a1e-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.171 | 28983 | Native-Transport-Requests-1
be06a0d8-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 55 | 29020 | ReadStage-3
be06a0d9-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xx.138 | 29071 | MessagingService-Outgoing-cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xx.138-Small
be06a0da-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 54 | 29181 | ReadStage-3
be06a0db-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xxx.171 | 29201 | MessagingService-Outgoing-cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xx.171-Small
be06c7e0-6a1e-11ea-80ad-dffaf3fb56b4 | READ message received from /xx.xx.xxx.158 | 33 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-8693-577fec389856 | READ message received from /xx.xx.xxx.158 | 34 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-8b1a-e5aa876f7d0d | READ message received from /xx.xx.xxx.158 | 29 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-8d2e-c5837edad3d1 | READ message received from /xx.xx.xxx.158 | 44 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 41 | 29273 | ReadStage-3
be06c7e1-6a1e-11ea-80ad-dffaf3fb56b4 | Executing single-partition query on feature_store | 389 | ReadStage-1
be06c7e1-6a1e-11ea-8b1a-e5aa876f7d0d | Executing single-partition query on feature_store | 513 | ReadStage-1
be06c7e1-6a1e-11ea-97ca-6b8bbe3a2a36 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 29342 | ReadStage-3
be06c7e2-6a1e-11ea-80ad-dffaf3fb56b4 | Acquiring sstable references | 457 | ReadStage-1
be06c7e3-6a1e-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 55 | 620 | ReadStage-1
be06c7e4-6a1e-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 54 | 659 | ReadStage-1
be06c7e5-6a1e-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 41 | 677 | ReadStage-1
be06c7e6-6a1e-11ea-80ad-dffaf3fb56b4 | Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 695 | ReadStage-1
be06eef0-6a1e-11ea-80ad-dffaf3fb56b4 | Merged data from memtables and 0 sstables | 1039 | ReadStage-1
be06eef0-6a1e-11ea-8693-577fec389856 | Executing single-partition query on feature_store | 372 | ReadStage-1
be06eef0-6a1e-11ea-8b1a-e5aa876f7d0d | Acquiring sstable references | 583 | ReadStage-1
be06eef0-6a1e-11ea-8d2e-c5837edad3d1 | Executing single-partition query on feature_store | 454 | ReadStage-1
be06eef0-6a1e-11ea-97ca-6b8bbe3a2a36 | Merged data from memtables and 0 sstables | 30372 | ReadStage-3
be06eef1-6a1e-11ea-80ad-dffaf3fb56b4 | Read 16 live rows and 0 tombstone cells | 1125 | ReadStage-1
be06eef1-6a1e-11ea-8693-577fec389856 | Acquiring sstable references | 493 | ReadStage-1
be06eef1-6a1e-11ea-8b1a-e5aa876f7d0d | Bloom filter allows skipping sstable 54 | 703 | ReadStage-1
be06eef1-6a1e-11ea-8d2e-c5837edad3d1 | Acquiring sstable references | 530 | ReadStage-1
be06eef1-6a1e-11ea-97ca-6b8bbe3a2a36 | Read 16 live rows and 0 tombstone cells | 30484 | ReadStage-3
be06eef2-6a1e-11ea-80ad-dffaf3fb56b4 | Enqueuing response to /xx.xx.xxx.158 | 1155 | ReadStage-1
be06eef2-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 56 | 721 | ReadStage-1
be06eef2-6a1e-11ea-8b1a-e5aa876f7d0d | Bloom filter allows skipping sstable 41 | 740 | ReadStage-1
be06eef2-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 56 | 655 | ReadStage-1
be06eef3-6a1e-11ea-80ad-dffaf3fb56b4 | Sending REQUEST_RESPONSE message to cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158 | 1492 | MessagingService-Outgoing-cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158-Small
be06eef3-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 55 | 780 | ReadStage-1
be06eef3-6a1e-11ea-8b1a-e5aa876f7d0d | Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones | 761 | ReadStage-1
be06eef3-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 55 | 686 | ReadStage-1
be06eef4-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 54 | 815 | ReadStage-1
be06eef4-6a1e-11ea-8b1a-e5aa876f7d0d | Merged data from memtables and 0 sstables | 1320 | ReadStage-1
be06eef4-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 54 | 705 | ReadStage-1
be06eef5-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 41 | 839 | ReadStage-1
be06eef5-6a1e-11ea-8b1a-e5aa876f7d0d | Read 16 live rows and 0 tombstone cells | 1495 | ReadStage-1
be06eef5-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 41 | 720 | ReadStage-1
be06eef6-6a1e-11ea-8693-577fec389856 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 871 | ReadStage-1
be06eef6-6a1e-11ea-8b1a-e5aa876f7d0d | Enqueuing response to /xx.xx.xxx.158 | 1554 | ReadStage-1
be06eef6-6a1e-11ea-8d2e-c5837edad3d1 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 738 | ReadStage-1
be06eef7-6a1e-11ea-8d2e-c5837edad3d1 | Merged data from memtables and 0 sstables | 1157 | ReadStage-1
be06eef8-6a1e-11ea-8d2e-c5837edad3d1 | Read 16 live rows and 0 tombstone cells | 1296 | ReadStage-1
be06eef9-6a1e-11ea-8d2e-c5837edad3d1 | Enqueuing response to /xx.xx.xxx.158 | 1325 | ReadStage-1
be071600-6a1e-11ea-8693-577fec389856 | Merged data from memtables and 0 sstables | 1592 | ReadStage-1
be071600-6a1e-11ea-8b1a-e5aa876f7d0d | Sending REQUEST_RESPONSE message to cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158 | 1783 | MessagingService-Outgoing-cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158-Small
be071600-6a1e-11ea-8d2e-c5837edad3d1 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 1484 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
be071600-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.161 | 31525 | MessagingService-Incoming-/xx.xx.xxx.161
be071601-6a1e-11ea-8693-577fec389856 | Read 16 live rows and 0 tombstone cells | 1754 | ReadStage-1
be071601-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.161 | 31650 | RequestResponseStage-4
be071602-6a1e-11ea-8693-577fec389856 | Enqueuing response to /xx.xx.xxx.158 | 1796 | ReadStage-1
be071602-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xx.138 | 31795 | MessagingService-Incoming-/xx.xx.xx.138
be071603-6a1e-11ea-8693-577fec389856 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 1973 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
be071603-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xx.138 | 31872 | RequestResponseStage-4
be071604-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.162 | 31918 | MessagingService-Incoming-/xx.xx.xxx.162
be071605-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.162 | 32047 | RequestResponseStage-4
be073d10-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.171 | 32688 | MessagingService-Incoming-/xx.xx.xx.171
be073d11-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.171 | 32827 | RequestResponseStage-2
be073d12-6a1e-11ea-97ca-6b8bbe3a2a36 | Initiating read-repair | 32985 | RequestResponseStage-2
Failure:
cqlsh> select * from system_traces.sessions where session_id=472551e0-6a1f-11ea-97ca-6b8bbe3a2a36;
session_id | client | command | coordinator | duration | parameters | request | started_at
--------------------------------------+---------------+---------+---------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------+---------------------------------
472551e0-6a1f-11ea-97ca-6b8bbe3a2a36 | xx.xx.xxx.189 | QUERY | xx.xx.xxx.158 | 3044 | {'bound_var_0_entities': '''ml_project/test_test_entity:1:entity1_uuid=Zach_Yang_fe7fea92|entity2_uuid=TenderGreens_8755fff7''', 'consistency_level': 'LOCAL_QUORUM', 'page_size': '5000', 'query': 'SELECT entities, feature, value, WRITETIME(value) as writetime FROM feast.feature_store WHERE entities = ?', 'serial_consistency_level': 'SERIAL'} | Execute CQL3 prepared query | 2020-03-19 20:21:55.838000+0000
cqlsh> select event_id, activity, source_elapsed, thread from system_traces.events where session_id=472551e0-6a1f-11ea-97ca-6b8bbe3a2a36;
event_id | activity | source_elapsed | thread
--------------------------------------+---------------------------------------------------------------------------------------------------------------+----------------+-----------------------------------------------------------------------------------------------------------
472578f0-6a1f-11ea-80ad-dffaf3fb56b4 | READ message received from /xx.xx.xxx.158 | 18 | MessagingService-Incoming-/xx.xx.xxx.158
472578f0-6a1f-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.138 | 619 | Native-Transport-Requests-1
472578f1-6a1f-11ea-97ca-6b8bbe3a2a36 | Executing single-partition query on feature_store | 708 | ReadStage-2
472578f2-6a1f-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.161 | 755 | Native-Transport-Requests-1
472578f3-6a1f-11ea-97ca-6b8bbe3a2a36 | Acquiring sstable references | 768 | ReadStage-2
472578f4-6a1f-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xxx.138 | 836 | MessagingService-Outgoing-cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xxx.138-Small
472578f5-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 56 | 859 | ReadStage-2
472578f6-6a1f-11ea-97ca-6b8bbe3a2a36 | speculating read retry on /xx.xx.xx.171 | 862 | Native-Transport-Requests-1
472578f7-6a1f-11ea-97ca-6b8bbe3a2a36 | Sending READ message to /xx.xx.xxx.161 | 893 | MessagingService-Outgoing-/xx.xx.xxx.161-Small
472578f8-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 55 | 903 | ReadStage-2
472578f9-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 54 | 929 | ReadStage-2
472578fa-6a1f-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xx.171 | 982 | MessagingService-Outgoing-cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xx.171-Small
472578fb-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 41 | 996 | ReadStage-2
472578fc-6a1f-11ea-97ca-6b8bbe3a2a36 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 1039 | ReadStage-2
472578fd-6a1f-11ea-97ca-6b8bbe3a2a36 | Merged data from memtables and 0 sstables | 1227 | ReadStage-2
472578fe-6a1f-11ea-97ca-6b8bbe3a2a36 | Read 0 live rows and 0 tombstone cells | 1282 | ReadStage-2
4725a000-6a1f-11ea-80ad-dffaf3fb56b4 | Executing single-partition query on feature_store | 226 | ReadStage-2
4725a000-6a1f-11ea-8693-577fec389856 | READ message received from /xx.xx.xxx.158 | 12 | MessagingService-Incoming-/xx.xx.xxx.158
4725a000-6a1f-11ea-8d2e-c5837edad3d1 | READ message received from /xx.xx.xxx.158 | 15 | MessagingService-Incoming-/xx.xx.xxx.158
4725a001-6a1f-11ea-80ad-dffaf3fb56b4 | Acquiring sstable references | 297 | ReadStage-2
4725a001-6a1f-11ea-8693-577fec389856 | Executing single-partition query on feature_store | 258 | ReadStage-1
4725a001-6a1f-11ea-8d2e-c5837edad3d1 | Executing single-partition query on feature_store | 230 | ReadStage-1
4725a002-6a1f-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 55 | 397 | ReadStage-2
4725a002-6a1f-11ea-8693-577fec389856 | Acquiring sstable references | 327 | ReadStage-1
4725a002-6a1f-11ea-8d2e-c5837edad3d1 | Acquiring sstable references | 297 | ReadStage-1
4725a003-6a1f-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 54 | 433 | ReadStage-2
4725a003-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 56 | 451 | ReadStage-1
4725a003-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 56 | 439 | ReadStage-1
4725a004-6a1f-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 41 | 450 | ReadStage-2
4725a004-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 55 | 512 | ReadStage-1
4725a004-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 55 | 492 | ReadStage-1
4725a005-6a1f-11ea-80ad-dffaf3fb56b4 | Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 466 | ReadStage-2
4725a005-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 54 | 570 | ReadStage-1
4725a005-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 54 | 513 | ReadStage-1
4725a006-6a1f-11ea-80ad-dffaf3fb56b4 | Merged data from memtables and 0 sstables | 648 | ReadStage-2
4725a006-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 41 | 606 | ReadStage-1
4725a006-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 41 | 526 | ReadStage-1
4725a007-6a1f-11ea-80ad-dffaf3fb56b4 | Read 0 live rows and 0 tombstone cells | 708 | ReadStage-2
4725a007-6a1f-11ea-8693-577fec389856 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 631 | ReadStage-1
4725a007-6a1f-11ea-8d2e-c5837edad3d1 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 542 | ReadStage-1
4725a008-6a1f-11ea-80ad-dffaf3fb56b4 | Enqueuing response to /xx.xx.xxx.158 | 727 | ReadStage-2
4725a008-6a1f-11ea-8d2e-c5837edad3d1 | Merged data from memtables and 0 sstables | 700 | ReadStage-1
4725a009-6a1f-11ea-80ad-dffaf3fb56b4 | Sending REQUEST_RESPONSE message to cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158 | 838 | MessagingService-Outgoing-cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158-Small
4725a009-6a1f-11ea-8d2e-c5837edad3d1 | Read 0 live rows and 0 tombstone cells | 756 | ReadStage-1
4725a00a-6a1f-11ea-8d2e-c5837edad3d1 | Enqueuing response to /xx.xx.xxx.158 | 772 | ReadStage-1
4725a00b-6a1f-11ea-8d2e-c5837edad3d1 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 914 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
4725c710-6a1f-11ea-8693-577fec389856 | Merged data from memtables and 0 sstables | 845 | ReadStage-1
4725c710-6a1f-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.161 | 2327 | MessagingService-Incoming-/xx.xx.xxx.161
4725c711-6a1f-11ea-8693-577fec389856 | Read 0 live rows and 0 tombstone cells | 905 | ReadStage-1
4725c711-6a1f-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.161 | 2443 | RequestResponseStage-2
4725c712-6a1f-11ea-8693-577fec389856 | Enqueuing response to /xx.xx.xxx.158 | 929 | ReadStage-1
4725c712-6a1f-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.138 | 2571 | MessagingService-Incoming-/xx.xx.xxx.138
4725c713-6a1f-11ea-8693-577fec389856 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 1023 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
4725c713-6a1f-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.138 | 2712 | RequestResponseStage-2
4725c714-6a1f-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xx.171 | 2725 | MessagingService-Incoming-/xx.xx.xx.171
4725c715-6a1f-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xx.171 | 2797 | RequestResponseStage-2
4725c716-6a1f-11ea-97ca-6b8bbe3a2a36 | Initiating read-repair | 2855 | RequestResponseStage-2
Keyspace info
cqlsh> describe keyspace feast;
CREATE KEYSPACE feast WITH replication = {'class': 'NetworkTopologyStrategy', 'stage-us-west1': '5'} AND durable_writes = true;
CREATE TABLE feast.feature_store (
entities text,
feature text,
value blob,
PRIMARY KEY (entities, feature)
) WITH CLUSTERING ORDER BY (feature ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
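For reference, the replica counts behind the consistency levels mentioned above can be worked out from the replication factor of 5 in the keyspace definition. The sketch below only illustrates the standard quorum arithmetic; the helper names `quorum` and `overlaps` are made up for this example and are not part of any driver API.

```python
def quorum(replication_factor):
    """Number of replicas that make up a quorum."""
    return replication_factor // 2 + 1


def overlaps(write_replicas, read_replicas, replication_factor):
    """True when every read replica set must intersect every write replica set."""
    return write_replicas + read_replicas > replication_factor


rf = 5  # 'stage-us-west1': '5' in the keyspace definition
print(quorum(rf))                            # 3
print(overlaps(quorum(rf), quorum(rf), rf))  # True: 3 + 3 > 5
print(overlaps(rf, 1, rf))                   # True: write ALL, read LOCAL_ONE
```

Because the keyspace has a single datacenter, LOCAL_QUORUM and QUORUM involve the same number of replicas here, so the combinations tried in the question already satisfy the overlap condition.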

Cassandra replicates to a node different from the one in nodetool getendpoints

I have a table in Cassandra:
CREATE TABLE imdb.movies_by_actor (
actor text,
movie_id uuid,
character text,
movie_title text,
salary int,
PRIMARY KEY (actor, movie_id)
) WITH CLUSTERING ORDER BY (movie_id ASC)
actor | movie_id | character | movie_title | salary
-----------+--------------------------------------+-----------+-------------+--------
Tom Hanks | 767b7a89-868c-46ce-8fa6-f6184dfb6d69 | Dad | Seattle | 25000
Tom Hanks | a9a64b89-a19d-46e9-b5ee-991ac4939891 | Officer | Green mile | 20000
Then I find out which nodes are responsible for the 'Tom Hanks' partition:
select token(actor) from movies_by_actor ;
system.token(actor)
----------------------
-4258050846863339499
-4258050846863339499
root@2f5aa8d649e2:/# nodetool getendpoints imdb movies_by_actor -4258050846863339499
172.13.0.6
172.13.0.3
172.13.0.4
Then I shut the node corresponding to 172.13.0.6 down:
docker stop cassandra6
root@2f5aa8d649e2:/# ping 172.13.0.6
PING 172.13.0.6 (172.13.0.6): 56 data bytes
92 bytes from 2f5aa8d649e2 (172.13.0.2): Destination Host Unreachable
When I try to update the row and look at tracing info, it looks like data are sent to 172.13.0.2, 172.13.0.4, 172.13.0.5:
cqlsh:imdb> update movies_by_actor set salary = 26000 where actor = 'Tom Hanks' and movie_id = 767b7a89-868c-46ce-8fa6-f6184dfb6d69;
Tracing session: f44dbd70-4228-11e7-89c9-cf534e0135c6
activity | timestamp | source | source_elapsed | client
----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+------------+----------------+-----------
Execute CQL3 query | 2017-05-26 15:35:32.295000 | 172.13.0.2 | 0 | 127.0.0.1
Parsing update movies_by_actor set salary = 26000 where actor = 'Tom Hanks' and movie_id = 767b7a89-868c-46ce-8fa6-f6184dfb6d69; [Native-Transport-Requests-1] | 2017-05-26 15:35:32.295000 | 172.13.0.2 | 303 | 127.0.0.1
Preparing statement [Native-Transport-Requests-1] | 2017-05-26 15:35:32.295000 | 172.13.0.2 | 646 | 127.0.0.1
Determining replicas for mutation [Native-Transport-Requests-1] | 2017-05-26 15:35:32.296000 | 172.13.0.2 | 1181 | 127.0.0.1
Appending to commitlog [MutationStage-3] | 2017-05-26 15:35:32.296000 | 172.13.0.2 | 1420 | 127.0.0.1
Adding to movies_by_actor memtable [MutationStage-3] | 2017-05-26 15:35:32.296000 | 172.13.0.2 | 1557 | 127.0.0.1
Sending MUTATION message to /172.13.0.4 [MessagingService-Outgoing-/172.13.0.4-Small] | 2017-05-26 15:35:32.296000 | 172.13.0.2 | 1567 | 127.0.0.1
Sending MUTATION message to /172.13.0.5 [MessagingService-Outgoing-/172.13.0.5-Small] | 2017-05-26 15:35:32.296000 | 172.13.0.2 | 1583 | 127.0.0.1
MUTATION message received from /172.13.0.2 [MessagingService-Incoming-/172.13.0.2] | 2017-05-26 15:35:32.297000 | 172.13.0.4 | 27 | 127.0.0.1
MUTATION message received from /172.13.0.2 [MessagingService-Incoming-/172.13.0.2] | 2017-05-26 15:35:32.297000 | 172.13.0.5 | 23 | 127.0.0.1
Appending to commitlog [MutationStage-1] | 2017-05-26 15:35:32.297000 | 172.13.0.4 | 332 | 127.0.0.1
Adding to movies_by_actor memtable [MutationStage-1] | 2017-05-26 15:35:32.297000 | 172.13.0.4 | 577 | 127.0.0.1
Enqueuing response to /172.13.0.2 [MutationStage-1] | 2017-05-26 15:35:32.298000 | 172.13.0.4 | 884 | 127.0.0.1
Appending to commitlog [MutationStage-2] | 2017-05-26 15:35:32.298000 | 172.13.0.5 | 1526 | 127.0.0.1
Sending REQUEST_RESPONSE message to /172.13.0.2 [MessagingService-Outgoing-/172.13.0.2-Small] | 2017-05-26 15:35:32.298000 | 172.13.0.4 | 1122 | 127.0.0.1
Adding to movies_by_actor memtable [MutationStage-2] | 2017-05-26 15:35:32.299000 | 172.13.0.5 | 1854 | 127.0.0.1
Enqueuing response to /172.13.0.2 [MutationStage-2] | 2017-05-26 15:35:32.299000 | 172.13.0.5 | 2187 | 127.0.0.1
Sending REQUEST_RESPONSE message to /172.13.0.2 [MessagingService-Outgoing-/172.13.0.2-Small] | 2017-05-26 15:35:32.299000 | 172.13.0.5 | 2423 | 127.0.0.1
REQUEST_RESPONSE message received from /172.13.0.4 [MessagingService-Incoming-/172.13.0.4] | 2017-05-26 15:35:32.300000 | 172.13.0.2 | 56 | 127.0.0.1
REQUEST_RESPONSE message received from /172.13.0.5 [MessagingService-Incoming-/172.13.0.5] | 2017-05-26 15:35:32.300000 | 172.13.0.2 | 15 | 127.0.0.1
Processing response from /172.13.0.5 [RequestResponseStage-5] | 2017-05-26 15:35:32.300000 | 172.13.0.2 | 273 | 127.0.0.1
Processing response from /172.13.0.4 [RequestResponseStage-4] | 2017-05-26 15:35:32.300000 | 172.13.0.2 | 774 | 127.0.0.1
Request complete | 2017-05-26 15:35:32.296887 | 172.13.0.2 | 1887 | 127.0.0.1
Selecting with consistency level ALL also works although 172.13.0.6 is down. Could someone explain it please?
The nodetool getendpoints command takes the partition key value as its parameter, not the token.
However, there is an issue with nodetool getendpoints when the parameter value contains a space.
You could use the script from this answer: https://stackoverflow.com/a/43155224/2320144
Or
you could run nodetool ring to list the token ranges for each node, and see which nodes are responsible for that range.
Source: https://stackoverflow.com/a/30515201/2320144
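The manual lookup against the nodetool ring output can be sketched as follows. This is a simplified illustration only: the ring contents and node names are hypothetical, one token per node is assumed, and the function finds just the primary owner of a token. The remaining replicas depend on the replication strategy (for NetworkTopologyStrategy, on datacenters and racks) and are not computed here.

```python
from bisect import bisect_left


def primary_owner(ring, token):
    """Return the node that owns `token` as its primary replica.

    `ring` maps each node's token to a node name. A node with token T owns
    the range (previous token, T], so the owner is the node with the smallest
    token >= `token`, wrapping around to the lowest token in the ring.
    """
    tokens = sorted(ring)
    index = bisect_left(tokens, token)
    if index == len(tokens):  # past the highest token: wrap around
        index = 0
    return ring[tokens[index]]


# Hypothetical ring; real tokens would be taken from `nodetool ring`.
ring = {
    -6000000000000000000: "node-a",
    -3000000000000000000: "node-b",
    0: "node-c",
    3000000000000000000: "node-d",
    6000000000000000000: "node-e",
}
print(primary_owner(ring, -4258050846863339499))  # node-b
```

The token passed in is the one returned by `select token(actor)` in the question. With virtual nodes enabled, each node holds many tokens, so the ring dictionary would contain one entry per token rather than one per node.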

Cassandra delayed / denied updates

I'm having trouble with a small Cassandra cluster that used to work well. I used to have 3 nodes. When I added the 4th, I started seeing some issues with values not updating, so I did nodetool repair (a few times now) on the entire cluster. I should mention I did the switch at the same time as the upgrade from python-cql to the new python cassandra driver.
Essentially the weirdness falls into two cases:
Denied Updates:
cqlsh:analytics> select * from metrics where id = '36122cc69a7a12e266ab40f5b7756daee75bd0d2735a707b369302acb879eedc';
(0 rows)
Tracing session: 19e5bbd0-d172-11e3-a039-67dcdc0d02de
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 20:49:34,221 | 10.128.214.245 | 0
Parsing select * from metrics where id = '36122cc69a7a12e266ab40f5b7756daee75bd0d2735a707b369302acb879eedc' LIMIT 10000; | 20:49:34,221 | 10.128.214.245 | 176
Preparing statement | 20:49:34,222 | 10.128.214.245 | 311
Sending message to /10.128.180.108 | 20:49:34,222 | 10.128.214.245 | 773
Message received from /10.128.214.245 | 20:49:34,224 | 10.128.180.108 | 67
Row cache hit | 20:49:34,225 | 10.128.180.108 | 984
Read 0 live and 0 tombstoned cells | 20:49:34,225 | 10.128.180.108 | 1079
Message received from /10.128.180.108 | 20:49:34,227 | 10.128.214.245 | 5760
Enqueuing response to /10.128.214.245 | 20:49:34,227 | 10.128.180.108 | 3045
Processing response from /10.128.180.108 | 20:49:34,227 | 10.128.214.245 | 5942
Sending message to /10.128.214.245 | 20:49:34,227 | 10.128.180.108 | 3302
Request complete | 20:49:34,227 | 10.128.214.245 | 6282
cqlsh:analytics> update metrics set n = n + 1 where id = '36122cc69a7a12e266ab40f5b7756daee75bd0d2735a707b369302acb879eedc';
Tracing session: 20845ff0-d172-11e3-a039-67dcdc0d02de
activity | timestamp | source | source_elapsed
---------------------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 20:49:45,328 | 10.128.214.245 | 0
Parsing update metrics set n = n + 1 where id = '36122cc69a7a12e266ab40f5b7756daee75bd0d2735a707b369302acb879eedc'; | 20:49:45,328 | 10.128.214.245 | 129
Preparing statement | 20:49:45,328 | 10.128.214.245 | 227
Determining replicas for mutation | 20:49:45,328 | 10.128.214.245 | 298
Enqueuing counter update to /10.128.194.70 | 20:49:45,328 | 10.128.214.245 | 425
Sending message to /10.128.194.70 | 20:49:45,329 | 10.128.214.245 | 598
Message received from /10.128.214.245 | 20:49:45,330 | 10.128.194.70 | 37
Acquiring switchLock read lock | 20:49:45,331 | 10.128.194.70 | 623
Message received from /10.128.194.70 | 20:49:45,331 | 10.128.214.245 | 3335
Appending to commitlog | 20:49:45,331 | 10.128.194.70 | 645
Processing response from /10.128.194.70 | 20:49:45,331 | 10.128.214.245 | 3431
Adding to metrics memtable | 20:49:45,331 | 10.128.194.70 | 692
Sending message to /10.128.214.245 | 20:49:45,332 | 10.128.194.70 | 1120
Row cache miss | 20:49:45,332 | 10.128.194.70 | 1611
Executing single-partition query on metrics | 20:49:45,332 | 10.128.194.70 | 1687
Acquiring sstable references | 20:49:45,332 | 10.128.194.70 | 1692
Merging memtable tombstones | 20:49:45,332 | 10.128.194.70 | 1692
Key cache hit for sstable 13958 | 20:49:45,332 | 10.128.194.70 | 1714
Seeking to partition beginning in data file | 20:49:45,332 | 10.128.194.70 | 1856
Key cache hit for sstable 14036 | 20:49:45,333 | 10.128.194.70 | 2271
Seeking to partition beginning in data file | 20:49:45,333 | 10.128.194.70 | 2271
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones | 20:49:45,333 | 10.128.194.70 | 2540
Merging data from memtables and 2 sstables | 20:49:45,333 | 10.128.194.70 | 2564
Read 0 live and 1 tombstoned cells | 20:49:45,333 | 10.128.194.70 | 2632
Sending message to /10.128.195.149 | 20:49:45,335 | 10.128.194.70 | null
Message received from /10.128.194.70 | 20:49:45,335 | 10.128.180.108 | 43
Sending message to /10.128.180.108 | 20:49:45,335 | 10.128.194.70 | null
Acquiring switchLock read lock | 20:49:45,335 | 10.128.180.108 | 297
Appending to commitlog | 20:49:45,335 | 10.128.180.108 | 312
Message received from /10.128.194.70 | 20:49:45,336 | 10.128.195.149 | 53
Adding to metrics memtable | 20:49:45,336 | 10.128.180.108 | 374
Enqueuing response to /10.128.194.70 | 20:49:45,336 | 10.128.180.108 | 445
Sending message to /10.128.194.70 | 20:49:45,336 | 10.128.180.108 | 677
Message received from /10.128.180.108 | 20:49:45,337 | 10.128.194.70 | null
Processing response from /10.128.180.108 | 20:49:45,337 | 10.128.194.70 | null
Acquiring switchLock read lock | 20:49:45,338 | 10.128.195.149 | 1874
Appending to commitlog | 20:49:45,338 | 10.128.195.149 | 1970
Adding to metrics memtable | 20:49:45,338 | 10.128.195.149 | 2027
Enqueuing response to /10.128.194.70 | 20:49:45,338 | 10.128.195.149 | 2147
Sending message to /10.128.194.70 | 20:49:45,338 | 10.128.195.149 | 2572
Message received from /10.128.195.149 | 20:49:45,339 | 10.128.194.70 | null
Processing response from /10.128.195.149 | 20:49:45,339 | 10.128.194.70 | null
Request complete | 20:49:45,331 | 10.128.214.245 | 3556
cqlsh:analytics> select * from metrics where id = '36122cc69a7a12e266ab40f5b7756daee75bd0d2735a707b369302acb879eedc';
(0 rows)
Tracing session: 28f1f7b0-d172-11e3-a039-67dcdc0d02de
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 20:49:59,468 | 10.128.214.245 | 0
Parsing select * from metrics where id = '36122cc69a7a12e266ab40f5b7756daee75bd0d2735a707b369302acb879eedc' LIMIT 10000; | 20:49:59,468 | 10.128.214.245 | 119
Preparing statement | 20:49:59,468 | 10.128.214.245 | 235
Sending message to /10.128.180.108 | 20:49:59,468 | 10.128.214.245 | 574
Message received from /10.128.214.245 | 20:49:59,469 | 10.128.180.108 | 49
Row cache miss | 20:49:59,470 | 10.128.180.108 | 817
Executing single-partition query on metrics | 20:49:59,470 | 10.128.180.108 | 877
Acquiring sstable references | 20:49:59,470 | 10.128.180.108 | 888
Merging memtable tombstones | 20:49:59,470 | 10.128.180.108 | 938
Key cache hit for sstable 5399 | 20:49:59,470 | 10.128.180.108 | 1025
Seeking to partition beginning in data file | 20:49:59,470 | 10.128.180.108 | 1033
Message received from /10.128.180.108 | 20:49:59,471 | 10.128.214.245 | 3378
Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 20:49:59,471 | 10.128.180.108 | 1495
Processing response from /10.128.180.108 | 20:49:59,471 | 10.128.214.245 | 3466
Merging data from memtables and 1 sstables | 20:49:59,471 | 10.128.180.108 | 1507
Read 0 live and 1 tombstoned cells | 20:49:59,471 | 10.128.180.108 | 1660
Read 0 live and 0 tombstoned cells | 20:49:59,471 | 10.128.180.108 | 1759
Enqueuing response to /10.128.214.245 | 20:49:59,471 | 10.128.180.108 | 1817
Sending message to /10.128.214.245 | 20:49:59,471 | 10.128.180.108 | 1977
Request complete | 20:49:59,471 | 10.128.214.245 | 3638
So this is pretty straightforward. From my reading, it might have to do with tombstone timestamps that somehow got messed up. However, it's been several days and "now" still hasn't caught up to the future timestamp of the tombstone. Is there any way to just reset every single timestamp in the entire cluster to 0 while activity is stopped? I can live with slightly inaccurate data right now.
The second issue is on some of my tables. Some tables reflect changes instantly, but for others, the changes only show up 30 minutes to an hour later. I can't figure out how timestamps might relate to this.
I've synced all nodes of my cluster using NTP. It's not the most precise, but the clocks won't be out of sync on the scale of days or anything. All the nodes have been synced like this from the beginning; at no point did I have wildly out of sync times.
Can anybody help? As I was saying, by this point I'd settle for shutting down access to the cluster and resetting all timestamps to 0, I don't care about getting some of the order wrong, I just want this thing to work.
Thanks
Timestamps are immutable. You'd have to truncate the table and rebuild it. The easiest way to rebuild is to just insert correct data, but if that's not an option, you can round trip through sstable2json -> edit timestamps -> json2sstable.
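To illustrate why a tombstone with a future timestamp keeps hiding newer writes, here is a minimal Python sketch of timestamp-based reconciliation. It models ordinary cells only (counter columns, like the one in the question, are handled differently internally), and the Cell type, function names and timestamps are all hypothetical.

```python
from typing import NamedTuple, Optional

class Cell(NamedTuple):
    timestamp: int          # write timestamp in microseconds
    value: Optional[int]    # None marks a tombstone (a delete)

def reconcile(a: Cell, b: Cell) -> Cell:
    """Pick the winning version: the higher timestamp wins, and on a tie a
    tombstone wins (ties between two live values are not modelled here)."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value is None else b

def read(cells):
    """Merge all versions of a cell; return the live value or None if deleted."""
    winner = cells[0]
    for cell in cells[1:]:
        winner = reconcile(winner, cell)
    return winner.value

NOW = 1_400_000_000_000_000                 # hypothetical current time
future_delete = Cell(NOW + 10**12, None)    # tombstone stamped far in the future
later_write = Cell(NOW + 5, 1)              # update issued after the delete

print(read([future_delete, later_write]))   # the tombstone still wins
```

Because the write's timestamp is lower than the tombstone's, the read keeps returning nothing until either the data is rewritten with correct timestamps or the tombstone is removed.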

Cassandra 1.2 merging data from memtables and sstables takes too long

Here is a trace from a 4 node cassandra cluster, running 1.2.6. I'm seeing a timeout with a simple select when the cluster is under no load and I need some help getting to the bottom of it.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 05:21:00,848 | 100.69.176.51 | 0
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 05:21:00,848 | 100.69.176.51 | 77
Peparing statement | 05:21:00,848 | 100.69.176.51 | 225
Executing single-partition query on user_scores | 05:21:00,849 | 100.69.176.51 | 589
Acquiring sstable references | 05:21:00,849 | 100.69.176.51 | 626
Merging memtable tombstones | 05:21:00,849 | 100.69.176.51 | 676
Key cache hit for sstable 34 | 05:21:00,849 | 100.69.176.51 | 817
Seeking to partition beginning in data file | 05:21:00,849 | 100.69.176.51 | 836
Key cache hit for sstable 32 | 05:21:00,849 | 100.69.176.51 | 1135
Seeking to partition beginning in data file | 05:21:00,849 | 100.69.176.51 | 1153
Merging data from memtables and 2 sstables | 05:21:00,850 | 100.69.176.51 | 1394
Request complete | 05:21:20,881 | 100.69.176.51 | 20033807
Here is the schema. You can see that it includes a few collections.
create table user_scores
(
    user_id varchar,
    post_type varchar,
    score double,
    team_to_score_map map<varchar, double>,
    affiliation_to_score_map map<varchar, double>,
    campaign_to_score_map map<varchar, double>,
    person_to_score_map map<varchar, double>,
    primary key(user_id, post_type)
)
with compaction =
{
    'class' : 'LeveledCompactionStrategy',
    'sstable_size_in_mb' : 10
};
I added the leveled compaction strategy as it was supposed to help with read latency.
I'd like to understand what could cause the cluster to time out during the merge phase. Not all queries time out. It appears to happen more frequently with rows that have maps with a larger number of entries.
Here is another trace of a failure for good measure. It is very reproducible:
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 05:51:34,557 | 100.69.176.51 | 0
Message received from /100.69.176.51 | 05:51:34,195 | 100.69.184.134 | 102
Executing single-partition query on user_scores | 05:51:34,199 | 100.69.184.134 | 3512
Acquiring sstable references | 05:51:34,199 | 100.69.184.134 | 3741
Merging memtable tombstones | 05:51:34,199 | 100.69.184.134 | 3890
Key cache hit for sstable 5 | 05:51:34,199 | 100.69.184.134 | 4040
Seeking to partition beginning in data file | 05:51:34,199 | 100.69.184.134 | 4059
Merging data from memtables and 1 sstables | 05:51:34,200 | 100.69.184.134 | 4412
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 05:51:34,558 | 100.69.176.51 | 91
Peparing statement | 05:51:34,558 | 100.69.176.51 | 238
Enqueuing data request to /100.69.184.134 | 05:51:34,558 | 100.69.176.51 | 567
Sending message to /100.69.184.134 | 05:51:34,558 | 100.69.176.51 | 979
Request complete | 05:51:54,562 | 100.69.176.51 | 20005209
And a trace from when it works:
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 05:55:07,772 | 100.69.176.51 | 0
Message received from /100.69.176.51 | 05:55:07,408 | 100.69.184.134 | 53
Executing single-partition query on user_scores | 05:55:07,409 | 100.69.184.134 | 1014
Acquiring sstable references | 05:55:07,409 | 100.69.184.134 | 1087
Merging memtable tombstones | 05:55:07,410 | 100.69.184.134 | 1209
Partition index with 0 entries found for sstable 5 | 05:55:07,410 | 100.69.184.134 | 1681
Seeking to partition beginning in data file | 05:55:07,410 | 100.69.184.134 | 1732
Merging data from memtables and 1 sstables | 05:55:07,411 | 100.69.184.134 | 2415
Read 1 live and 0 tombstoned cells | 05:55:07,412 | 100.69.184.134 | 3274
Enqueuing response to /100.69.176.51 | 05:55:07,412 | 100.69.184.134 | 3534
Sending message to /100.69.176.51 | 05:55:07,412 | 100.69.184.134 | 3936
Parsing select * from user_scores where user_id='305722020' LIMIT 10000; | 05:55:07,772 | 100.69.176.51 | 96
Peparing statement | 05:55:07,772 | 100.69.176.51 | 262
Enqueuing data request to /100.69.184.134 | 05:55:07,773 | 100.69.176.51 | 600
Sending message to /100.69.184.134 | 05:55:07,773 | 100.69.176.51 | 847
Message received from /100.69.184.134 | 05:55:07,778 | 100.69.176.51 | 6103
Processing response from /100.69.184.134 | 05:55:07,778 | 100.69.176.51 | 6341
Request complete | 05:55:07,778 | 100.69.176.51 | 6780
Looks like I was running into a performance issue with 1.2. Fortunately a patch had just been applied to the 1.2 branch, so when I built from source my problem went away.
See https://issues.apache.org/jira/browse/CASSANDRA-5677 for a detailed explanation.

Cassandra 1.2 huge read latency

I'm working on a 4 node cassandra 1.2.6 cluster with a single keyspace, replication factor of 2 (3 originally, but dropped to 2) and 10 or so column families. It is running the Oracle 1.7 jvm. It has a mix of reads and writes, with probably two to three times as many writes as reads.
Even under a small amount of load, I am seeing very large read latencies, and I get quite a few read timeouts (using the datastax java driver). Here is an example output of nodetool cfstats for one of the column families:
Column Family: user_scores
SSTable count: 1
SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 7539098
Space used (total): 7549091
Number of Keys (estimate): 42112
Memtable Columns Count: 2267
Memtable Data Size: 1048576
Memtable Switch Count: 2
Read Count: 2101
**Read Latency: 272334.202 ms.**
Write Count: 24947
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 55376
Compacted row minimum size: 447
Compacted row maximum size: 219342
Compacted row mean size: 1051
As you can see, I tried using the leveled compaction strategy to improve read latency, but as you can also see the latency is huge. I'm a bit stumped. I had a Cassandra 1.1.6 cluster working beautifully, but no luck so far with 1.2.
The cluster is running on VMs with 4 CPUs and 7 GB of RAM. The data drive is set up as a striped RAID across 4 disks. The machine doesn't seem to be IO bound.
I'm running a pretty vanilla configuration, with all the defaults.
I do see strange CPU behavior where the CPU is spiking even under smaller load. Sometimes I see compactions running, but they are niced so I don't think they are the culprit.
I'm trying to figure out where to go next. Any help appreciated!
[update with rpc_timeout trace]
Still playing with this. Here is an example trace. It looks like the merge step is taking way too long.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 04:57:18,882 | 100.69.176.51 | 0
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 04:57:18,884 | 100.69.176.51 | 1981
Peparing statement | 04:57:18,885 | 100.69.176.51 | 2997
Executing single-partition query on user_scores | 04:57:18,885 | 100.69.176.51 | 3657
Acquiring sstable references | 04:57:18,885 | 100.69.176.51 | 3724
Merging memtable tombstones | 04:57:18,885 | 100.69.176.51 | 3779
Key cache hit for sstable 32 | 04:57:18,886 | 100.69.176.51 | 3910
Seeking to partition beginning in data file | 04:57:18,886 | 100.69.176.51 | 3930
Merging data from memtables and 1 sstables | 04:57:18,886 | 100.69.176.51 | 4211
Request complete | 04:57:38,891 | 100.69.176.51 | 20009870
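One way to confirm where the time goes is to look for the largest jump in source_elapsed between consecutive trace rows. The Python sketch below does that for cqlsh-style trace text; it assumes all rows come from the same source node (source_elapsed is measured per node), and the function name largest_gap is hypothetical.

```python
def largest_gap(trace_text):
    """Return (gap_in_micros, activity_before, activity_after) for the biggest
    jump in source_elapsed between consecutive rows of a single-node trace."""
    rows = []
    for line in trace_text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Keep only data rows: four columns with a numeric source_elapsed.
        if len(parts) == 4 and parts[3].isdigit():
            rows.append((parts[0], int(parts[3])))
    gaps = [(b[1] - a[1], a[0], b[0]) for a, b in zip(rows, rows[1:])]
    return max(gaps)

# The last three rows of the trace above.
TRACE = """
Seeking to partition beginning in data file | 04:57:18,886 | 100.69.176.51 | 3930
Merging data from memtables and 1 sstables | 04:57:18,886 | 100.69.176.51 | 4211
Request complete | 04:57:38,891 | 100.69.176.51 | 20009870
"""

print(largest_gap(TRACE))
```

For this trace the largest gap sits between "Merging data from memtables and 1 sstables" and "Request complete", which matches the observation that the merge step never finishes before the timeout.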
Older traces below:
[newer trace]
After addressing the problem noted in the logs by completely rebuilding the cluster data repository, I still ran into the problem, although it took quite a bit longer. Here is a trace I grabbed when in the bad state:
Tracing session: a6dbefc0-ea49-11e2-84bb-ef447a7d9a48
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 16:48:02,755 | 100.69.196.124 | 0
Parsing select * from user_scores limit 1; | 16:48:02,756 | 100.69.196.124 | 1774
Peparing statement | 16:48:02,759 | 100.69.196.124 | 4006
Determining replicas to query | 16:48:02,759 | 100.69.196.124 | 4286
Enqueuing request to /100.69.176.51 | 16:48:02,763 | 100.69.196.124 | 8849
Sending message to cdb002/100.69.176.51 | 16:48:02,764 | 100.69.196.124 | 9456
Message received from /100.69.196.124 | 16:48:03,449 | 100.69.176.51 | 160
Message received from /100.69.176.51 | 16:48:09,646 | 100.69.196.124 | 6891860
Processing response from /100.69.176.51 | 16:48:09,647 | 100.69.196.124 | 6892426
Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 16:48:10,288 | 100.69.176.51 | 6838754
Seeking to partition beginning in data file | 16:48:10,289 | 100.69.176.51 | 6839689
Read 1 live and 0 tombstoned cells | 16:48:10,289 | 100.69.176.51 | 6839927
Seeking to partition beginning in data file | 16:48:10,289 | 100.69.176.51 | 6839998
Read 1 live and 0 tombstoned cells | 16:48:10,289 | 100.69.176.51 | 6840082
Scanned 1 rows and matched 1 | 16:48:10,289 | 100.69.176.51 | 6840162
Enqueuing response to /100.69.196.124 | 16:48:10,289 | 100.69.176.51 | 6840229
Sending message to /100.69.196.124 | 16:48:10,299 | 100.69.176.51 | 6850072
Request complete | 16:48:09,648 | 100.69.196.124 | 6893029
[update]
I should add that things work just dandy with a solo cassandra instance on my macbook pro. AKA Works on my machine...:)
[update with trace data]
Here is some trace data. This is from the Java driver. The downside is I can only trace the queries that succeed. I make it through a total of 67 queries before every query starts timing out. What is weird is that it doesn't look that bad. Then at query 68, I no longer get a response, and two of the servers are running hot.
2013-07-11 02:15:45 STDIO [INFO] ***************************************
66:Host (queried): cdb003/100.69.198.47
66:Host (tried): cdb003/100.69.198.47
66:Trace id: c95e51c0-e9cf-11e2-b9a9-5b3c0946787b
66:-----------------------------------------------------+--------------+-----------------+--------------
66: Enqueuing data request to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 200
66: Enqueuing digest request to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 265
66: Sending message to /100.69.196.124 | 02:15:42.045 | /100.69.198.47 | 570
66: Sending message to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 574
66: Message received from /100.69.176.51 | 02:15:42.107 | /100.69.198.47 | 62492
66: Processing response from /100.69.176.51 | 02:15:42.107 | /100.69.198.47 | 62780
66: Message received from /100.69.198.47 | 02:15:42.508 | /100.69.196.124 | 31
66: Executing single-partition query on user_scores | 02:15:42.508 | /100.69.196.124 | 406
66: Acquiring sstable references | 02:15:42.508 | /100.69.196.124 | 473
66: Merging memtable tombstones | 02:15:42.508 | /100.69.196.124 | 577
66: Key cache hit for sstable 11 | 02:15:42.508 | /100.69.196.124 | 807
66: Seeking to partition beginning in data file | 02:15:42.508 | /100.69.196.124 | 849
66: Merging data from memtables and 1 sstables | 02:15:42.509 | /100.69.196.124 | 1500
66: Message received from /100.69.198.47 | 02:15:43.379 | /100.69.176.51 | 60
66: Executing single-partition query on user_scores | 02:15:43.379 | /100.69.176.51 | 399
66: Acquiring sstable references | 02:15:43.379 | /100.69.176.51 | 490
66: Merging memtable tombstones | 02:15:43.379 | /100.69.176.51 | 593
66: Key cache hit for sstable 7 | 02:15:43.380 | /100.69.176.51 | 1098
66: Seeking to partition beginning in data file | 02:15:43.380 | /100.69.176.51 | 1141
66: Merging data from memtables and 1 sstables | 02:15:43.380 | /100.69.176.51 | 1912
66: Read 1 live and 0 tombstoned cells | 02:15:43.438 | /100.69.176.51 | 59094
66: Enqueuing response to /100.69.198.47 | 02:15:43.438 | /100.69.176.51 | 59225
66: Sending message to /100.69.198.47 | 02:15:43.438 | /100.69.176.51 | 59373
66:Started at: 02:15:42.044
66:Elapsed time in micros: 63105
2013-07-11 02:15:45 STDIO [INFO] ***************************************
67:Host (queried): cdb004/100.69.184.134
67:Host (tried): cdb004/100.69.184.134
67:Trace id: c9f365d0-e9cf-11e2-a4e5-7f3170333ff5
67:-----------------------------------------------------+--------------+-----------------+--------------
67: Message received from /100.69.184.134 | 02:15:42.536 | /100.69.198.47 | 36
67: Executing single-partition query on user_scores | 02:15:42.536 | /100.69.198.47 | 273
67: Acquiring sstable references | 02:15:42.536 | /100.69.198.47 | 311
67: Merging memtable tombstones | 02:15:42.536 | /100.69.198.47 | 353
67: Key cache hit for sstable 8 | 02:15:42.536 | /100.69.198.47 | 436
67: Seeking to partition beginning in data file | 02:15:42.536 | /100.69.198.47 | 455
67: Merging data from memtables and 1 sstables | 02:15:42.537 | /100.69.198.47 | 811
67: Read 1 live and 0 tombstoned cells | 02:15:42.550 | /100.69.198.47 | 14242
67: Enqueuing response to /100.69.184.134 | 02:15:42.550 | /100.69.198.47 | 14456
67: Sending message to /100.69.184.134 | 02:15:42.551 | /100.69.198.47 | 14694
67: Enqueuing data request to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 | 323
67: Sending message to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 | 565
67: Message received from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 | 17029
67: Processing response from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 | 17230
67:Started at: 02:15:43.021
67:Elapsed time in micros: 17622
And here is a trace using cqlsh:
Tracing session: d0f845d0-e9cf-11e2-8882-ef447a7d9a48
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 19:15:54,833 | 100.69.196.124 | 0
Parsing select * from user_scores where user_id='39333433' LIMIT 10000; | 19:15:54,833 | 100.69.196.124 | 103
Peparing statement | 19:15:54,833 | 100.69.196.124 | 455
Executing single-partition query on user_scores | 19:15:54,834 | 100.69.196.124 | 1400
Acquiring sstable references | 19:15:54,834 | 100.69.196.124 | 1468
Merging memtable tombstones | 19:15:54,835 | 100.69.196.124 | 1575
Key cache hit for sstable 11 | 19:15:54,835 | 100.69.196.124 | 1780
Seeking to partition beginning in data file | 19:15:54,835 | 100.69.196.124 | 1822
Merging data from memtables and 1 sstables | 19:15:54,836 | 100.69.196.124 | 2562
Read 1 live and 0 tombstoned cells | 19:15:54,838 | 100.69.196.124 | 4808
Request complete | 19:15:54,838 | 100.69.196.124 | 5810
The trace seems to show that much of the time is doing or waiting for network operations. Perhaps your network has problems?
If only some operations fail, perhaps you have a problem with only one of your nodes. When that node is not needed, things work, but when it is needed things go badly. It might be worth looking at the log files on the other nodes.
Looks like I was running into a performance issue with 1.2. Fortunately a patch had just been applied to the 1.2 branch, so when I built from source my problem went away.
See https://issues.apache.org/jira/browse/CASSANDRA-5677 for a detailed explanation.
Thanks all!

Resources