Cassandra 1.2 huge read latency - cassandra

I'm working on a 4 node cassandra 1.2.6 cluster with a single keyspace, replication factor of 2 (3 originally, but dropped to 2) and 10 or so column families. It is running the Oracle 1.7 jvm. It has a mix of reads and writes, with probably two to three times as many writes as reads.
Even under a small amount of load, I am seeing very large read latencies, and I get quite a few read timeouts (using the datastax java driver). Here is an example output of nodetool cfstats for one of the column families:
Column Family: user_scores
SSTable count: 1
SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 7539098
Space used (total): 7549091
Number of Keys (estimate): 42112
Memtable Columns Count: 2267
Memtable Data Size: 1048576
Memtable Switch Count: 2
Read Count: 2101
**Read Latency: 272334.202 ms.**
Write Count: 24947
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 55376
Compacted row minimum size: 447
Compacted row maximum size: 219342
Compacted row mean size: 1051
as you can see, I tried using a level base compaction strategy to try and improve read latency, but as you can also see the latency is huge. I'm a bit stumped. I had a cassandra 1.1.6 cluster working beautifully, but no luck so far with 1.2.
The cluster is running on VM's with 4 CPU's and 7 Gb of ram. The data drive is setup as a striped raid across 4 disks. The machine doesn't seem to be IO bound.
I'm running a pretty vanilla configuration, with all the defaults.
I do see strange CPU behavior where the CPU is spiking even under smaller load. Sometimes I see compactions running, but they are niced so I don't think are the culprit.
I'm trying to figure out where to go next. Any help appreciated!
[update with rpc_timeout trace]
Still playing with this. Here is an example trace. It looks like the merge step is taking way too long.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 04:57:18,882 | 100.69.176.51 | 0
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 04:57:18,884 | 100.69.176.51 | 1981
Peparing statement | 04:57:18,885 | 100.69.176.51 | 2997
Executing single-partition query on user_scores | 04:57:18,885 | 100.69.176.51 | 3657
Acquiring sstable references | 04:57:18,885 | 100.69.176.51 | 3724
Merging memtable tombstones | 04:57:18,885 | 100.69.176.51 | 3779
Key cache hit for sstable 32 | 04:57:18,886 | 100.69.176.51 | 3910
Seeking to partition beginning in data file | 04:57:18,886 | 100.69.176.51 | 3930
Merging data from memtables and 1 sstables | 04:57:18,886 | 100.69.176.51 | 4211
Request complete | 04:57:38,891 | 100.69.176.51 | 20009870
Older traces below:
[newer trace]
After addressing the problem noted in the logs by completely rebuilding the cluster data repository, I still ran into the problem, although it took quite a bit longer. Here is a trace I grabbed when in the bad state:
Tracing session: a6dbefc0-ea49-11e2-84bb-ef447a7d9a48
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 16:48:02,755 | 100.69.196.124 | 0
Parsing select * from user_scores limit 1; | 16:48:02,756 | 100.69.196.124 | 1774
Peparing statement | 16:48:02,759 | 100.69.196.124 | 4006
Determining replicas to query | 16:48:02,759 | 100.69.196.124 | 4286
Enqueuing request to /100.69.176.51 | 16:48:02,763 | 100.69.196.124 | 8849
Sending message to cdb002/100.69.176.51 | 16:48:02,764 | 100.69.196.124 | 9456
Message received from /100.69.196.124 | 16:48:03,449 | 100.69.176.51 | 160
Message received from /100.69.176.51 | 16:48:09,646 | 100.69.196.124 | 6891860
Processing response from /100.69.176.51 | 16:48:09,647 | 100.69.196.124 | 6892426
Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 16:48:10,288 | 100.69.176.51 | 6838754
Seeking to partition beginning in data file | 16:48:10,289 | 100.69.176.51 | 6839689
Read 1 live and 0 tombstoned cells | 16:48:10,289 | 100.69.176.51 | 6839927
Seeking to partition beginning in data file | 16:48:10,289 | 100.69.176.51 | 6839998
Read 1 live and 0 tombstoned cells | 16:48:10,289 | 100.69.176.51 | 6840082
Scanned 1 rows and matched 1 | 16:48:10,289 | 100.69.176.51 | 6840162
Enqueuing response to /100.69.196.124 | 16:48:10,289 | 100.69.176.51 | 6840229
Sending message to /100.69.196.124 | 16:48:10,299 | 100.69.176.51 | 6850072
Request complete | 16:48:09,648 | 100.69.196.124 | 6893029
[update]
I should add that things work just dandy with a solo cassandra instance on my macbook pro. AKA Works on my machine...:)
[update with trace data]
Here is some trace data. This is from the java driver. The downside is I can only trace the queries that succeed. I make it a total of 67 queries before every query starts timing out. What is weird is that it doesn't look that bad. The at query 68, I no longer get a response, and two of the servers are running hot.
2013-07-11 02:15:45 STDIO [INFO] ***************************************
66:Host (queried): cdb003/100.69.198.47
66:Host (tried): cdb003/100.69.198.47
66:Trace id: c95e51c0-e9cf-11e2-b9a9-5b3c0946787b
66:-----------------------------------------------------+--------------+-----------------+--------------
66: Enqueuing data request to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 200
66: Enqueuing digest request to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 265
66: Sending message to /100.69.196.124 | 02:15:42.045 | /100.69.198.47 | 570
66: Sending message to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 574
66: Message received from /100.69.176.51 | 02:15:42.107 | /100.69.198.47 | 62492
66: Processing response from /100.69.176.51 | 02:15:42.107 | /100.69.198.47 | 62780
66: Message received from /100.69.198.47 | 02:15:42.508 | /100.69.196.124 | 31
66: Executing single-partition query on user_scores | 02:15:42.508 | /100.69.196.124 | 406
66: Acquiring sstable references | 02:15:42.508 | /100.69.196.124 | 473
66: Merging memtable tombstones | 02:15:42.508 | /100.69.196.124 | 577
66: Key cache hit for sstable 11 | 02:15:42.508 | /100.69.196.124 | 807
66: Seeking to partition beginning in data file | 02:15:42.508 | /100.69.196.124 | 849
66: Merging data from memtables and 1 sstables | 02:15:42.509 | /100.69.196.124 | 1500
66: Message received from /100.69.198.47 | 02:15:43.379 | /100.69.176.51 | 60
66: Executing single-partition query on user_scores | 02:15:43.379 | /100.69.176.51 | 399
66: Acquiring sstable references | 02:15:43.379 | /100.69.176.51 | 490
66: Merging memtable tombstones | 02:15:43.379 | /100.69.176.51 | 593
66: Key cache hit for sstable 7 | 02:15:43.380 | /100.69.176.51 | 1098
66: Seeking to partition beginning in data file | 02:15:43.380 | /100.69.176.51 | 1141
66: Merging data from memtables and 1 sstables | 02:15:43.380 | /100.69.176.51 | 1912
66: Read 1 live and 0 tombstoned cells | 02:15:43.438 | /100.69.176.51 | 59094
66: Enqueuing response to /100.69.198.47 | 02:15:43.438 | /100.69.176.51 | 59225
66: Sending message to /100.69.198.47 | 02:15:43.438 | /100.69.176.51 | 59373
66:Started at: 02:15:42.04466:Elapsed time in micros: 63105
2013-07-11 02:15:45 STDIO [INFO] ***************************************
67:Host (queried): cdb004/100.69.184.134
67:Host (tried): cdb004/100.69.184.134
67:Trace id: c9f365d0-e9cf-11e2-a4e5-7f3170333ff5
67:-----------------------------------------------------+--------------+-----------------+--------------
67: Message received from /100.69.184.134 | 02:15:42.536 | /100.69.198.47 | 36
67: Executing single-partition query on user_scores | 02:15:42.536 | /100.69.198.47 | 273
67: Acquiring sstable references | 02:15:42.536 | /100.69.198.47 | 311
67: Merging memtable tombstones | 02:15:42.536 | /100.69.198.47 | 353
67: Key cache hit for sstable 8 | 02:15:42.536 | /100.69.198.47 | 436
67: Seeking to partition beginning in data file | 02:15:42.536 | /100.69.198.47 | 455
67: Merging data from memtables and 1 sstables | 02:15:42.537 | /100.69.198.47 | 811
67: Read 1 live and 0 tombstoned cells | 02:15:42.550 | /100.69.198.47 | 14242
67: Enqueuing response to /100.69.184.134 | 02:15:42.550 | /100.69.198.47 | 14456
67: Sending message to /100.69.184.134 | 02:15:42.551 | /100.69.198.47 | 14694
67: Enqueuing data request to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 | 323
67: Sending message to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 | 565
67: Message received from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 | 17029
67: Processing response from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 | 17230
67:Started at: 02:15:43.021
67:Elapsed time in micros: 17622
And here is a trace using cqlsh:
Tracing session: d0f845d0-e9cf-11e2-8882-ef447a7d9a48
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 19:15:54,833 | 100.69.196.124 | 0
Parsing select * from user_scores where user_id='39333433' LIMIT 10000; | 19:15:54,833 | 100.69.196.124 | 103
Peparing statement | 19:15:54,833 | 100.69.196.124 | 455
Executing single-partition query on user_scores | 19:15:54,834 | 100.69.196.124 | 1400
Acquiring sstable references | 19:15:54,834 | 100.69.196.124 | 1468
Merging memtable tombstones | 19:15:54,835 | 100.69.196.124 | 1575
Key cache hit for sstable 11 | 19:15:54,835 | 100.69.196.124 | 1780
Seeking to partition beginning in data file | 19:15:54,835 | 100.69.196.124 | 1822
Merging data from memtables and 1 sstables | 19:15:54,836 | 100.69.196.124 | 2562
Read 1 live and 0 tombstoned cells | 19:15:54,838 | 100.69.196.124 | 4808
Request complete | 19:15:54,838 | 100.69.196.124 | 5810

The trace seems to show that much of the time is doing or waiting for network operations. Perhaps your network has problems?
If only some operations fail, perhaps you have a problem with only one of your nodes. When that node is not needed, things work, but when it is needed things go badly. It might be worth looking at the log files on the other nodes.

Looks like I was running into a performance issue with 1.2. Fortunately a patch had just been applied to the 1.2 branch, so when I built from source my problem went away.
see https://issues.apache.org/jira/browse/CASSANDRA-5677 for a detailed explanation.
Thanks all!

Related

Cassandra select not stable using datastax driver

Versions:
com.datastax.oss
-java-driver-core:4.5.0
-java-driver-query-builder:4.5.0
-java-driver-mapper-runtime:4.5.0
cassandra:3.11.5 docker image
jdk 11.1
I'm running a deployment of feast that I've modified to use cassandra as a backend low latency serving db for machine learning features. I'm sucessfully writing and reading rows, but the read is inconsistent with respect to results returned. Sometimes the payloads are empty and I don't know why. I have already tried updating to the latest datastax driver and coordinating time using ntp/time.google.com. I've also tried to change the consistency of write to ALL and read to LOCAL_ONE/LOCAL_QUOROM, without success. I'm really struggling to figure out why select isn't consistent. Any insight would be great! :) Here is the process:
I write the rows into cassandra using CassandraIO
#Override
public Future<Void> saveAsync(CassandraMutation entityClass) {
return mapper.saveAsync(
entityClass,
Option.timestamp(entityClass.getWriteTime()),
Option.ttl(entityClass.getTtl()),
Option.consistencyLevel(ConsistencyLevel.LOCAL_QUORUM),
Option.tracing(true));
}
This seems to successfully map rows into my cassandra cluster, which I then query in my application as follows
List<InetSocketAddress> contactPoints =
Arrays.stream(cassandraConfig.getBootstrapHosts().split(","))
.map(h -> new InetSocketAddress(h, cassandraConfig.getPort()))
.collect(Collectors.toList());
CqlSession session =
CqlSession.builder()
.addContactPoints(contactPoints)
.withLocalDatacenter(storeProperties.getCassandraDcName())
.build();
....
PreparedStatement query =
session.prepare(
String.format(
"SELECT entities, feature, value, WRITETIME(value) as writetime FROM %s.%s WHERE entities = ?",
keyspace, tableName));
session.execute(
query
.bind(key)
.setTracing(true)
.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)));
My issue is that there doesn't seem to be consistent selects happening. I have been recording various bits for a while now and for example here are two select queries, with the same coordinator, one succeeded, then after that the subsequent select fails to return results.
cqlsh> select * from system_traces.sessions where session_id=be023400-6a1e-11ea-97ca-6b8bbe3a2a36;
session_id | client | command | coordinator | duration | parameters | request | started_at
--------------------------------------+---------------+---------+---------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------+---------------------------------
be023400-6a1e-11ea-97ca-6b8bbe3a2a36 | xx.xx.xxx.189 | QUERY | xx.xx.xxx.158 | 41313 | {'bound_var_0_entities': '''ml_project/test_test_entity:1:entity2_uuid=TenderGreens_8755fff7|entity1_uuid=Zach_Yang_fe7fea92''', 'consistency_level': 'LOCAL_QUORUM', 'page_size': '5000', 'query': 'SELECT entities, feature, value, WRITETIME(value) as writetime FROM feast.feature_store WHERE entities = ?', 'serial_consistency_level': 'SERIAL'} | Execute CQL3 prepared query | 2020-03-19 20:18:05.760000+0000
select event_id, activity, source_elapsed, thread from system_traces.events where session_id=be023400-6a1e-11ea-97ca-6b8bbe3a2a36;
event_id | activity | source_elapsed | thread
--------------------------------------+---------------------------------------------------------------------------------------------------------------+----------------+-----------------------------------------------------------------------------------------------------------
be0652b0-6a1e-11ea-97ca-6b8bbe3a2a36 | Read-repair DC_LOCAL | 27087 | Native-Transport-Requests-1
be0679c0-6a1e-11ea-97ca-6b8bbe3a2a36 | reading data from /xx.xx.xxx.161 | 28034 | Native-Transport-Requests-1
be06a0d0-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to /xx.xx.xxx.161 | 28552 | MessagingService-Outgoing-/xx.xx.xxx.161-Small
be06a0d1-6a1e-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.162 | 28595 | Native-Transport-Requests-1
be06a0d2-6a1e-11ea-97ca-6b8bbe3a2a36 | Executing single-partition query on feature_store | 28598 | ReadStage-3
be06a0d3-6a1e-11ea-97ca-6b8bbe3a2a36 | Acquiring sstable references | 28689 | ReadStage-3
be06a0d4-6a1e-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xx.138 | 28852 | Native-Transport-Requests-1
be06a0d5-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to /xx.xx.xxx.162 | 28904 | MessagingService-Outgoing-/xx.xx.xxx.162-Small
be06a0d6-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 56 | 28937 | ReadStage-3
be06a0d7-6a1e-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.171 | 28983 | Native-Transport-Requests-1
be06a0d8-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 55 | 29020 | ReadStage-3
be06a0d9-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xx.138 | 29071 | MessagingService-Outgoing-cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xx.138-Small
be06a0da-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 54 | 29181 | ReadStage-3
be06a0db-6a1e-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xxx.171 | 29201 | MessagingService-Outgoing-cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xx.171-Small
be06c7e0-6a1e-11ea-80ad-dffaf3fb56b4 | READ message received from /xx.xx.xxx.158 | 33 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-8693-577fec389856 | READ message received from /xx.xx.xxx.158 | 34 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-8b1a-e5aa876f7d0d | READ message received from /xx.xx.xxx.158 | 29 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-8d2e-c5837edad3d1 | READ message received from /xx.xx.xxx.158 | 44 | MessagingService-Incoming-/xx.xx.xxx.158
be06c7e0-6a1e-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 41 | 29273 | ReadStage-3
be06c7e1-6a1e-11ea-80ad-dffaf3fb56b4 | Executing single-partition query on feature_store | 389 | ReadStage-1
be06c7e1-6a1e-11ea-8b1a-e5aa876f7d0d | Executing single-partition query on feature_store | 513 | ReadStage-1
be06c7e1-6a1e-11ea-97ca-6b8bbe3a2a36 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 29342 | ReadStage-3
be06c7e2-6a1e-11ea-80ad-dffaf3fb56b4 | Acquiring sstable references | 457 | ReadStage-1
be06c7e3-6a1e-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 55 | 620 | ReadStage-1
be06c7e4-6a1e-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 54 | 659 | ReadStage-1
be06c7e5-6a1e-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 41 | 677 | ReadStage-1
be06c7e6-6a1e-11ea-80ad-dffaf3fb56b4 | Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 695 | ReadStage-1
be06eef0-6a1e-11ea-80ad-dffaf3fb56b4 | Merged data from memtables and 0 sstables | 1039 | ReadStage-1
be06eef0-6a1e-11ea-8693-577fec389856 | Executing single-partition query on feature_store | 372 | ReadStage-1
be06eef0-6a1e-11ea-8b1a-e5aa876f7d0d | Acquiring sstable references | 583 | ReadStage-1
be06eef0-6a1e-11ea-8d2e-c5837edad3d1 | Executing single-partition query on feature_store | 454 | ReadStage-1
be06eef0-6a1e-11ea-97ca-6b8bbe3a2a36 | Merged data from memtables and 0 sstables | 30372 | ReadStage-3
be06eef1-6a1e-11ea-80ad-dffaf3fb56b4 | Read 16 live rows and 0 tombstone cells | 1125 | ReadStage-1
be06eef1-6a1e-11ea-8693-577fec389856 | Acquiring sstable references | 493 | ReadStage-1
be06eef1-6a1e-11ea-8b1a-e5aa876f7d0d | Bloom filter allows skipping sstable 54 | 703 | ReadStage-1
be06eef1-6a1e-11ea-8d2e-c5837edad3d1 | Acquiring sstable references | 530 | ReadStage-1
be06eef1-6a1e-11ea-97ca-6b8bbe3a2a36 | Read 16 live rows and 0 tombstone cells | 30484 | ReadStage-3
be06eef2-6a1e-11ea-80ad-dffaf3fb56b4 | Enqueuing response to /xx.xx.xxx.158 | 1155 | ReadStage-1
be06eef2-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 56 | 721 | ReadStage-1
be06eef2-6a1e-11ea-8b1a-e5aa876f7d0d | Bloom filter allows skipping sstable 41 | 740 | ReadStage-1
be06eef2-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 56 | 655 | ReadStage-1
be06eef3-6a1e-11ea-80ad-dffaf3fb56b4 | Sending REQUEST_RESPONSE message to cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158 | 1492 | MessagingService-Outgoing-cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158-Small
be06eef3-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 55 | 780 | ReadStage-1
be06eef3-6a1e-11ea-8b1a-e5aa876f7d0d | Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones | 761 | ReadStage-1
be06eef3-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 55 | 686 | ReadStage-1
be06eef4-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 54 | 815 | ReadStage-1
be06eef4-6a1e-11ea-8b1a-e5aa876f7d0d | Merged data from memtables and 0 sstables | 1320 | ReadStage-1
be06eef4-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 54 | 705 | ReadStage-1
be06eef5-6a1e-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 41 | 839 | ReadStage-1
be06eef5-6a1e-11ea-8b1a-e5aa876f7d0d | Read 16 live rows and 0 tombstone cells | 1495 | ReadStage-1
be06eef5-6a1e-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 41 | 720 | ReadStage-1
be06eef6-6a1e-11ea-8693-577fec389856 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 871 | ReadStage-1
be06eef6-6a1e-11ea-8b1a-e5aa876f7d0d | Enqueuing response to /xx.xx.xxx.158 | 1554 | ReadStage-1
be06eef6-6a1e-11ea-8d2e-c5837edad3d1 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 738 | ReadStage-1
be06eef7-6a1e-11ea-8d2e-c5837edad3d1 | Merged data from memtables and 0 sstables | 1157 | ReadStage-1
be06eef8-6a1e-11ea-8d2e-c5837edad3d1 | Read 16 live rows and 0 tombstone cells | 1296 | ReadStage-1
be06eef9-6a1e-11ea-8d2e-c5837edad3d1 | Enqueuing response to /xx.xx.xxx.158 | 1325 | ReadStage-1
be071600-6a1e-11ea-8693-577fec389856 | Merged data from memtables and 0 sstables | 1592 | ReadStage-1
be071600-6a1e-11ea-8b1a-e5aa876f7d0d | Sending REQUEST_RESPONSE message to cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158 | 1783 | MessagingService-Outgoing-cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158-Small
be071600-6a1e-11ea-8d2e-c5837edad3d1 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 1484 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
be071600-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.161 | 31525 | MessagingService-Incoming-/xx.xx.xxx.161
be071601-6a1e-11ea-8693-577fec389856 | Read 16 live rows and 0 tombstone cells | 1754 | ReadStage-1
be071601-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.161 | 31650 | RequestResponseStage-4
be071602-6a1e-11ea-8693-577fec389856 | Enqueuing response to /xx.xx.xxx.158 | 1796 | ReadStage-1
be071602-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xx.138 | 31795 | MessagingService-Incoming-/xx.xx.xx.138
be071603-6a1e-11ea-8693-577fec389856 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 1973 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
be071603-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xx.138 | 31872 | RequestResponseStage-4
be071604-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.162 | 31918 | MessagingService-Incoming-/xx.xx.xxx.162
be071605-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.162 | 32047 | RequestResponseStage-4
be073d10-6a1e-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.171 | 32688 | MessagingService-Incoming-/xx.xx.xx.171
be073d11-6a1e-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.171 | 32827 | RequestResponseStage-2
be073d12-6a1e-11ea-97ca-6b8bbe3a2a36 | Initiating read-repair | 32985 | RequestResponseStage-2
Failure:
cqlsh> select * from system_traces.sessions where session_id=472551e0-6a1f-11ea-97ca-6b8bbe3a2a36;
session_id | client | command | coordinator | duration | parameters | request | started_at
--------------------------------------+---------------+---------+---------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------+---------------------------------
472551e0-6a1f-11ea-97ca-6b8bbe3a2a36 | xx.xx.xxx.189 | QUERY | xx.xx.xxx.158 | 3044 | {'bound_var_0_entities': '''ml_project/test_test_entity:1:entity1_uuid=Zach_Yang_fe7fea92|entity2_uuid=TenderGreens_8755fff7''', 'consistency_level': 'LOCAL_QUORUM', 'page_size': '5000', 'query': 'SELECT entities, feature, value, WRITETIME(value) as writetime FROM feast.feature_store WHERE entities = ?', 'serial_consistency_level': 'SERIAL'} | Execute CQL3 prepared query | 2020-03-19 20:21:55.838000+0000
cqlsh> select event_id, activity, source_elapsed, thread from system_traces.events where session_id=472551e0-6a1f-11ea-97ca-6b8bbe3a2a36;
event_id | activity | source_elapsed | thread
--------------------------------------+---------------------------------------------------------------------------------------------------------------+----------------+-----------------------------------------------------------------------------------------------------------
472578f0-6a1f-11ea-80ad-dffaf3fb56b4 | READ message received from /xx.xx.xxx.158 | 18 | MessagingService-Incoming-/xx.xx.xxx.158
472578f0-6a1f-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.138 | 619 | Native-Transport-Requests-1
472578f1-6a1f-11ea-97ca-6b8bbe3a2a36 | Executing single-partition query on feature_store | 708 | ReadStage-2
472578f2-6a1f-11ea-97ca-6b8bbe3a2a36 | reading digest from /xx.xx.xxx.161 | 755 | Native-Transport-Requests-1
472578f3-6a1f-11ea-97ca-6b8bbe3a2a36 | Acquiring sstable references | 768 | ReadStage-2
472578f4-6a1f-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xxx.138 | 836 | MessagingService-Outgoing-cassandra-feature-store-1.cassandra-feature-store.team-data/xx.xx.xxx.138-Small
472578f5-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 56 | 859 | ReadStage-2
472578f6-6a1f-11ea-97ca-6b8bbe3a2a36 | speculating read retry on /xx.xx.xx.171 | 862 | Native-Transport-Requests-1
472578f7-6a1f-11ea-97ca-6b8bbe3a2a36 | Sending READ message to /xx.xx.xxx.161 | 893 | MessagingService-Outgoing-/xx.xx.xxx.161-Small
472578f8-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 55 | 903 | ReadStage-2
472578f9-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 54 | 929 | ReadStage-2
472578fa-6a1f-11ea-97ca-6b8bbe3a2a36 | Sending READ message to cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xx.171 | 982 | MessagingService-Outgoing-cassandra-feature-store-0.cassandra-feature-store.team-data/xx.xx.xx.171-Small
472578fb-6a1f-11ea-97ca-6b8bbe3a2a36 | Bloom filter allows skipping sstable 41 | 996 | ReadStage-2
472578fc-6a1f-11ea-97ca-6b8bbe3a2a36 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 1039 | ReadStage-2
472578fd-6a1f-11ea-97ca-6b8bbe3a2a36 | Merged data from memtables and 0 sstables | 1227 | ReadStage-2
472578fe-6a1f-11ea-97ca-6b8bbe3a2a36 | Read 0 live rows and 0 tombstone cells | 1282 | ReadStage-2
4725a000-6a1f-11ea-80ad-dffaf3fb56b4 | Executing single-partition query on feature_store | 226 | ReadStage-2
4725a000-6a1f-11ea-8693-577fec389856 | READ message received from /xx.xx.xxx.158 | 12 | MessagingService-Incoming-/xx.xx.xxx.158
4725a000-6a1f-11ea-8d2e-c5837edad3d1 | READ message received from /xx.xx.xxx.158 | 15 | MessagingService-Incoming-/xx.xx.xxx.158
4725a001-6a1f-11ea-80ad-dffaf3fb56b4 | Acquiring sstable references | 297 | ReadStage-2
4725a001-6a1f-11ea-8693-577fec389856 | Executing single-partition query on feature_store | 258 | ReadStage-1
4725a001-6a1f-11ea-8d2e-c5837edad3d1 | Executing single-partition query on feature_store | 230 | ReadStage-1
4725a002-6a1f-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 55 | 397 | ReadStage-2
4725a002-6a1f-11ea-8693-577fec389856 | Acquiring sstable references | 327 | ReadStage-1
4725a002-6a1f-11ea-8d2e-c5837edad3d1 | Acquiring sstable references | 297 | ReadStage-1
4725a003-6a1f-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 54 | 433 | ReadStage-2
4725a003-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 56 | 451 | ReadStage-1
4725a003-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 56 | 439 | ReadStage-1
4725a004-6a1f-11ea-80ad-dffaf3fb56b4 | Bloom filter allows skipping sstable 41 | 450 | ReadStage-2
4725a004-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 55 | 512 | ReadStage-1
4725a004-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 55 | 492 | ReadStage-1
4725a005-6a1f-11ea-80ad-dffaf3fb56b4 | Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 466 | ReadStage-2
4725a005-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 54 | 570 | ReadStage-1
4725a005-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 54 | 513 | ReadStage-1
4725a006-6a1f-11ea-80ad-dffaf3fb56b4 | Merged data from memtables and 0 sstables | 648 | ReadStage-2
4725a006-6a1f-11ea-8693-577fec389856 | Bloom filter allows skipping sstable 41 | 606 | ReadStage-1
4725a006-6a1f-11ea-8d2e-c5837edad3d1 | Bloom filter allows skipping sstable 41 | 526 | ReadStage-1
4725a007-6a1f-11ea-80ad-dffaf3fb56b4 | Read 0 live rows and 0 tombstone cells | 708 | ReadStage-2
4725a007-6a1f-11ea-8693-577fec389856 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 631 | ReadStage-1
4725a007-6a1f-11ea-8d2e-c5837edad3d1 | Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones | 542 | ReadStage-1
4725a008-6a1f-11ea-80ad-dffaf3fb56b4 | Enqueuing response to /xx.xx.xxx.158 | 727 | ReadStage-2
4725a008-6a1f-11ea-8d2e-c5837edad3d1 | Merged data from memtables and 0 sstables | 700 | ReadStage-1
4725a009-6a1f-11ea-80ad-dffaf3fb56b4 | Sending REQUEST_RESPONSE message to cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158 | 838 | MessagingService-Outgoing-cassandra-feature-store-2.cassandra-feature-store.team-data/xx.xx.xxx.158-Small
4725a009-6a1f-11ea-8d2e-c5837edad3d1 | Read 0 live rows and 0 tombstone cells | 756 | ReadStage-1
4725a00a-6a1f-11ea-8d2e-c5837edad3d1 | Enqueuing response to /xx.xx.xxx.158 | 772 | ReadStage-1
4725a00b-6a1f-11ea-8d2e-c5837edad3d1 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 914 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
4725c710-6a1f-11ea-8693-577fec389856 | Merged data from memtables and 0 sstables | 845 | ReadStage-1
4725c710-6a1f-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.161 | 2327 | MessagingService-Incoming-/xx.xx.xxx.161
4725c711-6a1f-11ea-8693-577fec389856 | Read 0 live rows and 0 tombstone cells | 905 | ReadStage-1
4725c711-6a1f-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.161 | 2443 | RequestResponseStage-2
4725c712-6a1f-11ea-8693-577fec389856 | Enqueuing response to /xx.xx.xxx.158 | 929 | ReadStage-1
4725c712-6a1f-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xxx.138 | 2571 | MessagingService-Incoming-/xx.xx.xxx.138
4725c713-6a1f-11ea-8693-577fec389856 | Sending REQUEST_RESPONSE message to /xx.xx.xxx.158 | 1023 | MessagingService-Outgoing-/xx.xx.xxx.158-Small
4725c713-6a1f-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xxx.138 | 2712 | RequestResponseStage-2
4725c714-6a1f-11ea-97ca-6b8bbe3a2a36 | REQUEST_RESPONSE message received from /xx.xx.xx.171 | 2725 | MessagingService-Incoming-/xx.xx.xx.171
4725c715-6a1f-11ea-97ca-6b8bbe3a2a36 | Processing response from /xx.xx.xx.171 | 2797 | RequestResponseStage-2
4725c716-6a1f-11ea-97ca-6b8bbe3a2a36 | Initiating read-repair | 2855 | RequestResponseStage-2
Keyspace info
cqlsh> describe keyspace feast;
CREATE KEYSPACE feast WITH replication = {'class': 'NetworkTopologyStrategy', 'stage-us-west1': '5'} AND durable_writes = true;
CREATE TABLE feast.feature_store (
entities text,
feature text,
value blob,
PRIMARY KEY (entities, feature)
) WITH CLUSTERING ORDER BY (feature ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';

Cassandra OOM with large RangeTombstoneList objects in memory

We have had problems with our cassandra nodes going OOM for some time, so finally we got them configured so that we could get a heap dump to try and see what was causing the OOM.
In the dump there where 16 threads (named SharedPool-Worker-XX) each executing a SliceFromReadCommand from the same table. Each of the 16 threads had a RangeTombstoneList object retaining between 200 and 240mb of memory.
The table in question are used as a queue (I know, far from the ideal use case for cassandra, but that is how it is) between two applications, where one writes and the other reads. So it is not unlikely that there is a large number of tombstones in the table...how ever I have been unable to find them.
I did a trace on the query issued against the table, and it resulted in the following:
cqlsh:ddp> select json_data
... from file_download
... where ti = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
ti | uuid | json_data
----+------+-----------
(0 rows)
Tracing session: b2f2be60-01c8-11e8-ae90-fd18acdea80d
activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------
Execute CQL3 query | 2018-01-25 12:10:14.214000 | 10.60.73.232 | 0
Parsing select json_data\nfrom file_download\nwhere ti = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'; [SharedPool-Worker-1] | 2018-01-25 12:10:14.260000 | 10.60.73.232 | 105
Preparing statement [SharedPool-Worker-1] | 2018-01-25 12:10:14.262000 | 10.60.73.232 | 197
Executing single-partition query on file_download [SharedPool-Worker-4] | 2018-01-25 12:10:14.263000 | 10.60.73.232 | 442
Acquiring sstable references [SharedPool-Worker-4] | 2018-01-25 12:10:14.264000 | 10.60.73.232 | 491
Merging memtable tombstones [SharedPool-Worker-4] | 2018-01-25 12:10:14.265000 | 10.60.73.232 | 517
Bloom filter allows skipping sstable 2444 [SharedPool-Worker-4] | 2018-01-25 12:10:14.270000 | 10.60.73.232 | 608
Bloom filter allows skipping sstable 8 [SharedPool-Worker-4] | 2018-01-25 12:10:14.271000 | 10.60.73.232 | 665
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-4] | 2018-01-25 12:10:14.273000 | 10.60.73.232 | 700
Merging data from memtables and 0 sstables [SharedPool-Worker-4] | 2018-01-25 12:10:14.274000 | 10.60.73.232 | 707
Read 0 live and 0 tombstone cells [SharedPool-Worker-4] | 2018-01-25 12:10:14.274000 | 10.60.73.232 | 754
Request complete | 2018-01-25 12:10:14.215148 | 10.60.73.232 | 1148
cfstats also shows avg./max tombstones pr. slice to be 0.
This makes me question whether or not the OOM was actually due to large amount of tombstones or something else?
We run cassandra v2.1.17

Fetching data from Cassandra is too slow

We have a cluster of 3 cassandra nodes. All nodes are working fine, but fetching results is EXTREMELY slow. I run a SELECT-query in cql-shell to fetch ~100k of rows and before starting to show me first results it may take up to 30 seconds to warm-up.
Why it may happen? Is there any way to speed it up?
Here is a trace log:
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 17:41:34,923 | *.*.*.211 | 0
Parsing SELECT * FROM ga_page_visits WHERE site_global_key='GLOBAL_KEY' AND reported_at < '2014-05-17' AND reported_at > '2014-05-01' LIMIT 100000; | 17:41:34,923 | *.*.*.211 | 87
Preparing statement | 17:41:34,923 | *.*.*.211 | 290
Sending message to /*.*.*.213 | 17:41:34,924 | *.*.*.211 | 1579
Sending message to /*.*.*.212 | 17:41:34,924 | *.*.*.211 | 1617
Executing single-partition query on users | 17:41:34,924 | *.*.*.211 | 1666
Message received from /*.*.*.211 | 17:41:34,925 | *.*.*.213 | 48
Message received from /*.*.*.211 | 17:41:34,925 | *.*.*.212 | 40
Acquiring sstable references | 17:41:34,925 | *.*.*.211 | 1704
Merging memtable tombstones | 17:41:34,925 | *.*.*.211 | 1767
Key cache hit for sstable 2 | 17:41:34,925 | *.*.*.211 | 1886
Seeking to partition beginning in data file | 17:41:34,925 | *.*.*.211 | 1908
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones | 17:41:34,925 | *.*.*.211 | 2337
Merging data from memtables and 1 sstables | 17:41:34,925 | *.*.*.211 | 2366
Read 1 live and 0 tombstoned cells | 17:41:34,925 | *.*.*.211 | 2410
Executing single-partition query on users | 17:41:34,926 | *.*.*.213 | 702
Executing single-partition query on users | 17:41:34,926 | *.*.*.212 | 813
Acquiring sstable references | 17:41:34,926 | *.*.*.213 | 747
Acquiring sstable references | 17:41:34,926 | *.*.*.212 | 848
Merging memtable tombstones | 17:41:34,926 | *.*.*.213 | 838
Merging memtable tombstones | 17:41:34,926 | *.*.*.212 | 922
Key cache hit for sstable 5 | 17:41:34,926 | *.*.*.213 | 1006
Key cache hit for sstable 1 | 17:41:34,926 | *.*.*.212 | 1044
Seeking to partition beginning in data file | 17:41:34,926 | *.*.*.213 | 1034
Seeking to partition beginning in data file | 17:41:34,926 | *.*.*.212 | 1066
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones | 17:41:34,927 | *.*.*.213 | 1635
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones | 17:41:34,927 | *.*.*.212 | 1543
Merging data from memtables and 1 sstables | 17:41:34,927 | *.*.*.213 | 1676
Merging data from memtables and 1 sstables | 17:41:34,927 | *.*.*.212 | 1571
Read 1 live and 0 tombstoned cells | 17:41:34,927 | *.*.*.213 | 1760
Read 1 live and 0 tombstoned cells | 17:41:34,927 | *.*.*.212 | 1634
Enqueuing response to /*.*.*.211 | 17:41:34,927 | *.*.*.213 | 2014
Enqueuing response to /*.*.*.211 | 17:41:34,927 | *.*.*.212 | 1846
Sending message to /*.*.*.211 | 17:41:34,927 | *.*.*.213 | 2173
Sending message to /*.*.*.211 | 17:41:34,927 | *.*.*.212 | 1960
Message received from /*.*.*.213 | 17:41:34,928 | *.*.*.211 | 4759
Message received from /*.*.*.212 | 17:41:34,928 | *.*.*.211 | 4775
Processing response from /*.*.*.213 | 17:41:34,928 | *.*.*.211 | 5439
Processing response from /*.*.*.212 | 17:41:34,928 | *.*.*.211 | 5668
Sending message to /*.*.*.40 | 17:41:34,929 | *.*.*.211 | 5954
Executing single-partition query on ga_page_visits | 17:41:34,930 | *.*.*.211 | 6792
Acquiring sstable references | 17:41:34,930 | *.*.*.211 | 6868
Merging memtable tombstones | 17:41:34,930 | *.*.*.211 | 6936
Key cache hit for sstable 2 | 17:41:34,930 | *.*.*.211 | 7075
Seeking to partition indexed section in data file | 17:41:34,930 | *.*.*.211 | 7102
Key cache hit for sstable 1 | 17:41:34,930 | *.*.*.211 | 7211
Seeking to partition indexed section in data file | 17:41:34,930 | *.*.*.211 | 7232
Skipped 1/3 non-slice-intersecting sstables, included 0 due to tombstones | 17:41:34,930 | *.*.*.211 | 7290
Merging data from memtables and 2 sstables | 17:41:34,930 | *.*.*.211 | 7311
Message received from /*.*.*.211 | 17:41:34,964 | *.*.*.40 | 45
Executing single-partition query on users | 17:41:34,965 | *.*.*.40 | 658
Acquiring sstable references | 17:41:34,965 | *.*.*.40 | 695
Merging memtable tombstones | 17:41:34,965 | *.*.*.40 | 762
Key cache hit for sstable 14 | 17:41:34,965 | *.*.*.40 | 852
Seeking to partition beginning in data file | 17:41:34,965 | *.*.*.40 | 876
Key cache hit for sstable 16 | 17:41:34,965 | *.*.*.40 | 1275
Seeking to partition beginning in data file | 17:41:34,965 | *.*.*.40 | 1293
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones | 17:41:34,966 | *.*.*.40 | 1504
Merging data from memtables and 2 sstables | 17:41:34,966 | *.*.*.40 | 1526
Read 1 live and 0 tombstoned cells | 17:41:34,966 | *.*.*.40 | 1592
Enqueuing response to /*.*.*.211 | 17:41:34,966 | *.*.*.40 | 1755
Sending message to /*.*.*.211 | 17:41:34,966 | *.*.*.40 | 1834
Message received from /*.*.*.40 | 17:41:34,980 | *.*.*.211 | 57679
Processing response from /*.*.*.40 | 17:41:34,981 | *.*.*.211 | 57785
Read 99880 live and 0 tombstoned cells | 17:41:36,315 | *.*.*.211 | 1392676
Request complete | 17:41:37,541 | *.*.*.211 | 2618390
The table schema is:
CREATE TABLE ga_page_visits (
site_global_key ascii,
reported_at timestamp,
timeuuid ascii,
bounces int,
campaign text,
channel ascii,
conversions int,
device ascii,
keyword text,
medium text,
new_visits int,
page_id int,
page_views int,
referral_path text,
site_search_engine_id int,
social_network text,
source text,
value decimal,
visits int,
PRIMARY KEY (site_global_key, reported_at, timeuuid)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
I suspect you've got an issue with your data model. It looks like you're pushing in an absolute ton of data into a single partition. That will become a problem as your data set gets bigger. I recommend this type of structure instead:
CREATE TABLE ga_page_visits (
site_global_key ascii,
day timestamp,
timeuuid ascii,
bounces int,
campaign text,
channel ascii,
conversions int,
device ascii,
keyword text,
medium text,
new_visits int,
page_id int,
page_views int,
referral_path text,
site_search_engine_id int,
social_network text,
source text,
value decimal,
visits int,
PRIMARY KEY ((site_global_key, day), timeuuid)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
You already have a timeuuid, that has a time embedded in it, so you don't need a reported_at timestamp.
You are going to have a ridiculous number of rows in this partition. You should use a composite primary key of site_global_key & the date (do not add the time, let it be 00:00:00.)
This way, each day will live on a separate partition and is more easily queried. Do 1 query for each day. This will perform much better.
As I'm looking at your model requirements, this is really looking like a data warehouse type design. Jon nailed it with the rollups. If you need low latency, high frequency creating those tighter views of your data is done often in production with high scaling requirements. I don't want to spray out links but there are examples of this if you google.

Cassandra query very slow when database is large

I have a table with 45 million keys.
Compaction strategy is LCS.
Single node Cassandra version 2.0.5.
No key or row caching.
Queries are very slow. They were several times faster when the database was smaller.
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------+--------------+-------------+----------------
execute_cql3_query | 17:47:07,891 | 10.72.9.151 | 0
Parsing select * from object_version_info where key = 'test10-Client9_99900' LIMIT 10000; | 17:47:07,891 | 10.72.9.151 | 80
Preparing statement | 17:47:07,891 | 10.72.9.151 | 178
Executing single-partition query on object_version_info | 17:47:07,893 | 10.72.9.151 | 2513
Acquiring sstable references | 17:47:07,893 | 10.72.9.151 | 2539
Merging memtable tombstones | 17:47:07,893 | 10.72.9.151 | 2597
Bloom filter allows skipping sstable 1517 | 17:47:07,893 | 10.72.9.151 | 2652
Bloom filter allows skipping sstable 1482 | 17:47:07,893 | 10.72.9.151 | 2677
Partition index with 0 entries found for sstable 1268 | 17:47:08,560 | 10.72.9.151 | 669935
Seeking to partition beginning in data file | 17:47:08,560 | 10.72.9.151 | 669956
Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 17:47:09,411 | 10.72.9.151 | 1520279
Merging data from memtables and 1 sstables | 17:47:09,411 | 10.72.9.151 | 1520302
Read 1 live and 0 tombstoned cells | 17:47:09,411 | 10.72.9.151 | 1520351
Request complete | 17:47:09,411 | 10.72.9.151 | 1520615

Cassandra 1.2 merging data from memtables and sstables takes too long

Here is a trace from a 4 node cassandra cluster, running 1.2.6. I'm seeing a timeout with a simple select when the cluster is under no load and I need some help getting to the bottom of it.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 05:21:00,848 | 100.69.176.51 | 0
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 05:21:00,848 | 100.69.176.51 | 77
Peparing statement | 05:21:00,848 | 100.69.176.51 | 225
Executing single-partition query on user_scores | 05:21:00,849 | 100.69.176.51 | 589
Acquiring sstable references | 05:21:00,849 | 100.69.176.51 | 626
Merging memtable tombstones | 05:21:00,849 | 100.69.176.51 | 676
Key cache hit for sstable 34 | 05:21:00,849 | 100.69.176.51 | 817
Seeking to partition beginning in data file | 05:21:00,849 | 100.69.176.51 | 836
Key cache hit for sstable 32 | 05:21:00,849 | 100.69.176.51 | 1135
Seeking to partition beginning in data file | 05:21:00,849 | 100.69.176.51 | 1153
Merging data from memtables and 2 sstables | 05:21:00,850 | 100.69.176.51 | 1394
Request complete | 05:21:20,881 | 100.69.176.51 | 20033807
Here is the schema. You can see that is includes a few collections.
create table user_scores
(
user_id varchar,
post_type varchar,
score double,
team_to_score_map map<varchar, double>,
affiliation_to_score_map map<varchar, double>,
campaign_to_score_map map<varchar, double>,
person_to_score_map map<varchar, double>,
primary key(user_id, post_type)
)
with compaction =
{
'class' : 'LeveledCompactionStrategy',
'sstable_size_in_mb' : 10
};
I added the leveled compaction strategy as it was supposed to help with read latency.
I'd like to understand what could cause the cluster to timeout during the merge phase. Not all queries timeout. It appears to happen more frequently with rows that have maps with a larger number of entries.
Here is another trace of a failure for good measure. It is very reproducable:
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 05:51:34,557 | 100.69.176.51 | 0
Message received from /100.69.176.51 | 05:51:34,195 | 100.69.184.134 | 102
Executing single-partition query on user_scores | 05:51:34,199 | 100.69.184.134 | 3512
Acquiring sstable references | 05:51:34,199 | 100.69.184.134 | 3741
Merging memtable tombstones | 05:51:34,199 | 100.69.184.134 | 3890
Key cache hit for sstable 5 | 05:51:34,199 | 100.69.184.134 | 4040
Seeking to partition beginning in data file | 05:51:34,199 | 100.69.184.134 | 4059
Merging data from memtables and 1 sstables | 05:51:34,200 | 100.69.184.134 | 4412
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 05:51:34,558 | 100.69.176.51 | 91
Peparing statement | 05:51:34,558 | 100.69.176.51 | 238
Enqueuing data request to /100.69.184.134 | 05:51:34,558 | 100.69.176.51 | 567
Sending message to /100.69.184.134 | 05:51:34,558 | 100.69.176.51 | 979
Request complete | 05:51:54,562 | 100.69.176.51 | 20005209
And a trace from when it works:
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 05:55:07,772 | 100.69.176.51 | 0
Message received from /100.69.176.51 | 05:55:07,408 | 100.69.184.134 | 53
Executing single-partition query on user_scores | 05:55:07,409 | 100.69.184.134 | 1014
Acquiring sstable references | 05:55:07,409 | 100.69.184.134 | 1087
Merging memtable tombstones | 05:55:07,410 | 100.69.184.134 | 1209
Partition index with 0 entries found for sstable 5 | 05:55:07,410 | 100.69.184.134 | 1681
Seeking to partition beginning in data file | 05:55:07,410 | 100.69.184.134 | 1732
Merging data from memtables and 1 sstables | 05:55:07,411 | 100.69.184.134 | 2415
Read 1 live and 0 tombstoned cells | 05:55:07,412 | 100.69.184.134 | 3274
Enqueuing response to /100.69.176.51 | 05:55:07,412 | 100.69.184.134 | 3534
Sending message to /100.69.176.51 | 05:55:07,412 | 100.69.184.134 | 3936
Parsing select * from user_scores where user_id='305722020' LIMIT 10000; | 05:55:07,772 | 100.69.176.51 | 96
Peparing statement | 05:55:07,772 | 100.69.176.51 | 262
Enqueuing data request to /100.69.184.134 | 05:55:07,773 | 100.69.176.51 | 600
Sending message to /100.69.184.134 | 05:55:07,773 | 100.69.176.51 | 847
Message received from /100.69.184.134 | 05:55:07,778 | 100.69.176.51 | 6103
Processing response from /100.69.184.134 | 05:55:07,778 | 100.69.176.51 | 6341
Request complete | 05:55:07,778 | 100.69.176.51 | 6780
Looks like I was running into a performance issue with 1.2. Fortunately a patch had just been applied to the 1.2 branch, so when I built from source my problem went away.
see https://issues.apache.org/jira/browse/CASSANDRA-5677 for a detailed explanation.

Resources