Counting partition size in Cassandra

I'm implementing a unique-entry counter in Cassandra. The counter can be represented simply as a set of tuples:
counter_id = broadcast:12345, token = user:123
counter_id = broadcast:12345, token = user:321
where the value of counter broadcast:12345 is the size of the corresponding set of entries. Such a counter can be stored efficiently as a table with counter_id as the partition key. My first thought was that, since a single counter value is basically the size of a partition, I could run a count(1) WHERE counter_id = ? query, which wouldn't need to read the data and would be very fast. However, I see the following trace output:
cqlsh > select count(1) from token_counter_storage where id = '1';
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+------------+----------------
Execute CQL3 query | 2016-06-10 11:22:42.809000 | 172.17.0.2 | 0
Parsing select count(1) from token_counter_storage where id = '1'; [SharedPool-Worker-1] | 2016-06-10 11:22:42.809000 | 172.17.0.2 | 260
Preparing statement [SharedPool-Worker-1] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 565
Executing single-partition query on token_counter_storage [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1256
Acquiring sstable references [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1350
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1465
Merging data from memtables and 0 sstables [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1546
Read 10 live and 0 tombstone cells [SharedPool-Worker-2] | 2016-06-10 11:22:42.811000 | 172.17.0.2 | 1826
Request complete | 2016-06-10 11:22:42.811410 | 172.17.0.2 | 2410
I guess this trace confirms that the data is being read from disk. Am I right in this conclusion, and if so, is there any way to simply fetch the partition size using the index, without any excessive disk hits?

Related

Cassandra OOM with large RangeTombstoneList objects in memory

We have had problems with our Cassandra nodes going OOM for some time, so we finally configured them to produce a heap dump so that we could try to see what was causing the OOM.
In the dump there were 16 threads (named SharedPool-Worker-XX), each executing a SliceFromReadCommand against the same table. Each of the 16 threads had a RangeTombstoneList object retaining between 200 and 240 MB of memory.
The table in question is used as a queue (I know, far from the ideal use case for Cassandra, but that is how it is) between two applications, where one writes and the other reads. So it is not unlikely that there is a large number of tombstones in the table. However, I have been unable to find them.
I ran a trace on the query issued against the table, and it resulted in the following:
cqlsh:ddp> select json_data
... from file_download
... where ti = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
ti | uuid | json_data
----+------+-----------
(0 rows)
Tracing session: b2f2be60-01c8-11e8-ae90-fd18acdea80d
activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------
Execute CQL3 query | 2018-01-25 12:10:14.214000 | 10.60.73.232 | 0
Parsing select json_data\nfrom file_download\nwhere ti = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'; [SharedPool-Worker-1] | 2018-01-25 12:10:14.260000 | 10.60.73.232 | 105
Preparing statement [SharedPool-Worker-1] | 2018-01-25 12:10:14.262000 | 10.60.73.232 | 197
Executing single-partition query on file_download [SharedPool-Worker-4] | 2018-01-25 12:10:14.263000 | 10.60.73.232 | 442
Acquiring sstable references [SharedPool-Worker-4] | 2018-01-25 12:10:14.264000 | 10.60.73.232 | 491
Merging memtable tombstones [SharedPool-Worker-4] | 2018-01-25 12:10:14.265000 | 10.60.73.232 | 517
Bloom filter allows skipping sstable 2444 [SharedPool-Worker-4] | 2018-01-25 12:10:14.270000 | 10.60.73.232 | 608
Bloom filter allows skipping sstable 8 [SharedPool-Worker-4] | 2018-01-25 12:10:14.271000 | 10.60.73.232 | 665
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-4] | 2018-01-25 12:10:14.273000 | 10.60.73.232 | 700
Merging data from memtables and 0 sstables [SharedPool-Worker-4] | 2018-01-25 12:10:14.274000 | 10.60.73.232 | 707
Read 0 live and 0 tombstone cells [SharedPool-Worker-4] | 2018-01-25 12:10:14.274000 | 10.60.73.232 | 754
Request complete | 2018-01-25 12:10:14.215148 | 10.60.73.232 | 1148
cfstats also shows the average and maximum tombstones per slice to be 0.
This makes me question whether the OOM was actually caused by a large number of tombstones or by something else.
We run Cassandra v2.1.17.

Using partition key along with secondary index

Following are the two queries that I need to perform.
select * from where dept = 100 and emp_id = 1;
select * from where dept = 100 and name = 'One';
Which of the options below is better?
Option 1: Use a secondary index along with the partition key. I assume the query will execute faster this way, as there is no need to go to different nodes and the index only needs to be searched locally.
cqlsh:d2> desc table emp_by_dept;
CREATE TABLE d2.emp_by_dept (
dept int,
emp_id int,
name text,
PRIMARY KEY (dept, emp_id)
) WITH CLUSTERING ORDER BY (emp_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX emp_by_dept_name_idx ON d2.emp_by_dept (name);
cqlsh:d2> select * from emp_by_dept where dept = 100;
dept | emp_id | name
------+--------+------
100 | 1 | One
100 | 2 | Two
100 | 10 | Ten
(3 rows)
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2015-06-15 17:36:55.860000 | 10.0.2.16 | 0
Parsing select * from emp_by_dept where dept = 100; [SharedPool-Worker-1] | 2015-06-15 17:36:55.861000 | 10.0.2.16 | 202
Preparing statement [SharedPool-Worker-1] | 2015-06-15 17:36:55.861000 | 10.0.2.16 | 418
Executing single-partition query on emp_by_dept [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10525
Acquiring sstable references [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10564
Merging memtable tombstones [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10635
Key cache hit for sstable 1 [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10748
Seeking to partition beginning in data file [SharedPool-Worker-3] | 2015-06-15 17:36:55.871000 | 10.0.2.16 | 10757
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-3] | 2015-06-15 17:36:55.879000 | 10.0.2.16 | 18141
Merging data from memtables and 1 sstables [SharedPool-Worker-3] | 2015-06-15 17:36:55.879000 | 10.0.2.16 | 18166
Read 3 live and 0 tombstoned cells [SharedPool-Worker-3] | 2015-06-15 17:36:55.879000 | 10.0.2.16 | 18335
Request complete | 2015-06-15 17:36:55.928174 | 10.0.2.16 | 68174
cqlsh:d2> select * from emp_by_dept where dept = 100 and name = 'One';
dept | emp_id | name
------+--------+------
100 | 1 | One
(1 rows)
Tracing session: c56e70a0-1357-11e5-ab8b-fb5400f1b4af
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2015-06-15 17:42:20.010000 | 10.0.2.16 | 0
Parsing select * from emp_by_dept where dept = 100 and name = 'One'; [SharedPool-Worker-1] | 2015-06-15 17:42:20.010000 | 10.0.2.16 | 12
Preparing statement [SharedPool-Worker-1] | 2015-06-15 17:42:20.010000 | 10.0.2.16 | 19
Computing ranges to query [SharedPool-Worker-1] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 881
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=name, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=emp_by_dept_name_idx, indexType=COMPOSITES}]}:1. Scanning with emp_by_dept.emp_by_dept_name_idx. [SharedPool-Worker-1] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 1144
Submitting range requests on 1 ranges with a concurrency of 1 (0.003515625 rows per range expected) [SharedPool-Worker-1] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 1238
Executing indexed scan for [100, 100] [SharedPool-Worker-2] | 2015-06-15 17:42:20.011000 | 10.0.2.16 | 1703
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=name, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=emp_by_dept_name_idx, indexType=COMPOSITES}]}:1. Scanning with emp_by_dept.emp_by_dept_name_idx. [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 1827
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=name, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=emp_by_dept_name_idx, indexType=COMPOSITES}]}:1. Scanning with emp_by_dept.emp_by_dept_name_idx. [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 1929
Executing single-partition query on emp_by_dept.emp_by_dept_name_idx [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2058
Acquiring sstable references [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2087
Merging memtable tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2173
Key cache hit for sstable 1 [SharedPool-Worker-2] | 2015-06-15 17:42:20.012000 | 10.0.2.16 | 2352
Seeking to partition indexed section in data file [SharedPool-Worker-2] | 2015-06-15 17:42:20.012001 | 10.0.2.16 | 2377
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.014000 | 10.0.2.16 | 4300
Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-06-15 17:42:20.014000 | 10.0.2.16 | 4322
Submitted 1 concurrent range requests covering 1 ranges [SharedPool-Worker-1] | 2015-06-15 17:42:20.031000 | 10.0.2.16 | 21798
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 21989
Executing single-partition query on emp_by_dept [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22374
Acquiring sstable references [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22385
Merging memtable tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22433
Key cache hit for sstable 1 [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22514
Seeking to partition indexed section in data file [SharedPool-Worker-2] | 2015-06-15 17:42:20.032000 | 10.0.2.16 | 22523
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 22963
Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 22972
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 22991
Scanned 1 rows and matched 1 [SharedPool-Worker-2] | 2015-06-15 17:42:20.033000 | 10.0.2.16 | 23096
Request complete | 2015-06-15 17:42:20.033227 | 10.0.2.16 | 23227
Option 2: Create 2 tables as below.
CREATE TABLE d2.emp_by_dept (
dept int,
emp_id int,
name text,
PRIMARY KEY (dept, emp_id)
) WITH CLUSTERING ORDER BY (emp_id ASC);
select * from emp_by_dept where dept = 100 and emp_id = 1;
CREATE TABLE d2.emp_by_dept_name (
dept int,
emp_id int,
name text,
PRIMARY KEY (dept, name)
) WITH CLUSTERING ORDER BY (name ASC);
select * from emp_by_dept_name where dept = 100 and name = 'One';
Normally it is a good approach to use secondary indexes together with the partition key because, as you say, the secondary index lookup can be performed on a single machine.
The other concept that needs to be taken into account is the cardinality of the secondary index. In your case emp_id is probably unique and name is almost unique, so the index will most probably return a single row, and therefore it is not very efficient. For a good explanation I recommend this article: http://www.wentnet.com/blog/?p=77.
As a consequence, if query time is critical and you can update both tables at the same time, I recommend using your option 2.
It would also be interesting to measure the two options with some generated data.
Option one won't be possible, as Cassandra does not support queries using both primary keys and secondary keys. Your best bet would be to go with option two.
Although the similarities are many, don't think of it as a 'relational table'. Instead think of it as a nested, sorted map data structure.
Cassandra believes in de-normalization and duplication of data for better read performance. Therefore, option 2 is completely normal and within the best practices of Cassandra.
A few links which you might find useful: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
How do secondary indexes work in Cassandra?
Hope this helps.
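The "nested, sorted map" view and the two-table layout of option 2 can be illustrated with a small in-memory model. This is only a sketch, not how Cassandra stores data internally: the Table class and the add_employee helper are hypothetical names made up for illustration, and the model ignores replication, timestamps and tombstones.

```python
class Table:
    """Toy model of a CQL table: partition key -> {clustering key -> row}."""

    def __init__(self):
        self.partitions = {}

    def upsert(self, partition_key, clustering_key, row):
        # Writing the same primary key again overwrites the previous row.
        self.partitions.setdefault(partition_key, {})[clustering_key] = row

    def select(self, partition_key, clustering_key=None):
        rows = self.partitions.get(partition_key, {})
        if clustering_key is not None:
            # Direct lookup inside a single partition.
            return [rows[clustering_key]] if clustering_key in rows else []
        # Whole partition, returned in clustering order.
        return [rows[key] for key in sorted(rows)]


# Option 2: the same data is written twice, once per query pattern.
emp_by_dept = Table()        # PRIMARY KEY (dept, emp_id)
emp_by_dept_name = Table()   # PRIMARY KEY (dept, name)


def add_employee(dept, emp_id, name):
    """Hypothetical helper: denormalized write to both tables."""
    row = {"dept": dept, "emp_id": emp_id, "name": name}
    emp_by_dept.upsert(dept, emp_id, row)
    emp_by_dept_name.upsert(dept, name, row)


for emp_id, name in [(1, "One"), (2, "Two"), (10, "Ten")]:
    add_employee(100, emp_id, name)

by_id = emp_by_dept.select(100, 1)             # dept = 100 and emp_id = 1
by_name = emp_by_dept_name.select(100, "One")  # dept = 100 and name = 'One'
```

One consequence of this layout is worth noting: with PRIMARY KEY (dept, name), two employees with the same name in the same department would overwrite each other in emp_by_dept_name.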
Since maintaining two tables is harder than maintaining a single one, the first option would be preferable.
Query1 = select * from <> where dept = 100 and emp_id = 1;
Query2 = select * from <> where dept = 100 and name = 'One';
Option 1:
Write : time to write to emp_by_dept + time to update index
Read : Query1 will be a direct read from emp_by_dept, Query2 will be a read from emp_by_dept + get the location from index table + read the value from emp_by_dept
Option 2:
Write : time to write to emp_by_dept + time to write to emp_by_dept_name
Read : Query1 will be a direct read from emp_by_dept, Query2 will be a direct read from emp_by_dept_name (the required data is already stored in sorted order)
So I assume write time should be almost the same in both cases (I have not tested this).
If your read response time is more important, then go for option 2.
If you are worried about maintaining 2 tables, go for option 1.
Thanks everyone for your inputs.

How to get tombstone count for a cql query?

I am trying to evaluate the number of tombstones being created in one of the tables in our application. For that I am trying to use nodetool cfstats. Here is how I am doing it:
create table demo.test(a int, b int, c int, primary key (a));
insert into demo.test(a, b, c) values(1,2,3);
Now I run the same insert again, so I expect 3 tombstones to be created. But when I run cfstats for this column family, I still see that no tombstones have been created.
nodetool cfstats demo.test
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Then I tried deleting the record, but I still don't see any tombstones being created. Is there anything that I am missing here? Please suggest.
BTW a few other details,
* We are using version 2.1.1 of the Java driver
* We are running against Cassandra 2.1.0
For tombstone counts on a query, your best bet is to enable tracing. This gives you the in-depth history of a query, including how many tombstones had to be read to complete it. It won't give you the total tombstone count, but it is most likely more relevant for performance tuning.
In cqlsh you can enable this with:
cqlsh> tracing on;
Now tracing requests.
cqlsh> SELECT * FROM ascii_ks.ascii_cs where pkey = 'One';
pkey | ckey1 | data1
------+-------+-------
One | One | One
(1 rows)
Tracing session: 2569d580-719b-11e4-9dd6-557d7f833b69
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------+--------------+-----------+----------------
execute_cql3_query | 08:26:28,953 | 127.0.0.1 | 0
Parsing SELECT * FROM ascii_ks.ascii_cs where pkey = 'One' LIMIT 10000; | 08:26:28,956 | 127.0.0.1 | 2635
Preparing statement | 08:26:28,960 | 127.0.0.1 | 6951
Executing single-partition query on ascii_cs | 08:26:28,962 | 127.0.0.1 | 9097
Acquiring sstable references | 08:26:28,963 | 127.0.0.1 | 10576
Merging memtable contents | 08:26:28,963 | 127.0.0.1 | 10618
Merging data from sstable 1 | 08:26:28,965 | 127.0.0.1 | 12146
Key cache hit for sstable 1 | 08:26:28,965 | 127.0.0.1 | 12257
Collating all results | 08:26:28,965 | 127.0.0.1 | 12402
Request complete | 08:26:28,965 | 127.0.0.1 | 12638
http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2
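When a trace contains many activities, the per-read tombstone counts can be pulled out with a few lines of scripting. The sketch below assumes the activity text looks like the "Read N live and M tombstone cells" / "Read N live and M tombstoned cells" lines that appear in other traces on this page; the wording may differ in other Cassandra versions, and tombstone_summary is a hypothetical helper name, not part of any driver.

```python
import re

# Matches both spellings seen in the traces on this page:
# "tombstone cells" and "tombstoned cells".
READ_LINE = re.compile(r"Read (\d+) live and (\d+) tombstoned? cells")


def tombstone_summary(trace_activities):
    """Sum live and tombstone cell counts over all 'Read ...' trace activities."""
    live = 0
    tombstones = 0
    for activity in trace_activities:
        match = READ_LINE.search(activity)
        if match:
            live += int(match.group(1))
            tombstones += int(match.group(2))
    return live, tombstones


# Activity strings copied from traces shown elsewhere on this page.
sample = [
    "Executing single-partition query on token_counter_storage [SharedPool-Worker-2]",
    "Read 10 live and 0 tombstone cells [SharedPool-Worker-2]",
    "Read 1 live and 0 tombstoned cells",
]
totals = tombstone_summary(sample)
```

The activity strings could come from cqlsh output pasted into a file or from a driver's trace events; either way the function only looks at the text of each activity.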

Cassandra 1.2 merging data from memtables and sstables takes too long

Here is a trace from a 4 node cassandra cluster, running 1.2.6. I'm seeing a timeout with a simple select when the cluster is under no load and I need some help getting to the bottom of it.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 05:21:00,848 | 100.69.176.51 | 0
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 05:21:00,848 | 100.69.176.51 | 77
Peparing statement | 05:21:00,848 | 100.69.176.51 | 225
Executing single-partition query on user_scores | 05:21:00,849 | 100.69.176.51 | 589
Acquiring sstable references | 05:21:00,849 | 100.69.176.51 | 626
Merging memtable tombstones | 05:21:00,849 | 100.69.176.51 | 676
Key cache hit for sstable 34 | 05:21:00,849 | 100.69.176.51 | 817
Seeking to partition beginning in data file | 05:21:00,849 | 100.69.176.51 | 836
Key cache hit for sstable 32 | 05:21:00,849 | 100.69.176.51 | 1135
Seeking to partition beginning in data file | 05:21:00,849 | 100.69.176.51 | 1153
Merging data from memtables and 2 sstables | 05:21:00,850 | 100.69.176.51 | 1394
Request complete | 05:21:20,881 | 100.69.176.51 | 20033807
Here is the schema. You can see that it includes a few collections.
create table user_scores
(
user_id varchar,
post_type varchar,
score double,
team_to_score_map map<varchar, double>,
affiliation_to_score_map map<varchar, double>,
campaign_to_score_map map<varchar, double>,
person_to_score_map map<varchar, double>,
primary key(user_id, post_type)
)
with compaction =
{
'class' : 'LeveledCompactionStrategy',
'sstable_size_in_mb' : 10
};
I added the leveled compaction strategy as it was supposed to help with read latency.
I'd like to understand what could cause the cluster to timeout during the merge phase. Not all queries timeout. It appears to happen more frequently with rows that have maps with a larger number of entries.
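One way to see where the time goes is to compute the gap between consecutive source_elapsed values and report the step that precedes the largest gap. The sketch below uses the values from the first trace in this question and assumes source_elapsed is in microseconds (which matches the roughly 20 second difference between the timestamps). It also assumes all events come from a single source node, since source_elapsed is measured per node; slowest_step is a hypothetical helper name.

```python
def slowest_step(events):
    """Return (activity, gap_us) for the step followed by the largest gap.

    events: list of (activity, source_elapsed_us) tuples from ONE source node,
    in the order they appear in the trace.
    """
    worst = None
    for (activity, elapsed), (_, next_elapsed) in zip(events, events[1:]):
        gap = next_elapsed - elapsed
        if worst is None or gap > worst[1]:
            worst = (activity, gap)
    return worst


# (activity, source_elapsed) pairs copied from the first trace in this question.
trace = [
    ("Executing single-partition query on user_scores", 589),
    ("Acquiring sstable references", 626),
    ("Merging memtable tombstones", 676),
    ("Key cache hit for sstable 34", 817),
    ("Seeking to partition beginning in data file", 836),
    ("Key cache hit for sstable 32", 1135),
    ("Seeking to partition beginning in data file", 1153),
    ("Merging data from memtables and 2 sstables", 1394),
    ("Request complete", 20033807),
]
step, gap_us = slowest_step(trace)
print(f"{step}: {gap_us / 1_000_000:.1f} s until the next event")
```

For this trace the largest gap follows the "Merging data from memtables and 2 sstables" event, which is what points at the merge phase.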
Here is another trace of a failure for good measure. It is very reproducible:
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 05:51:34,557 | 100.69.176.51 | 0
Message received from /100.69.176.51 | 05:51:34,195 | 100.69.184.134 | 102
Executing single-partition query on user_scores | 05:51:34,199 | 100.69.184.134 | 3512
Acquiring sstable references | 05:51:34,199 | 100.69.184.134 | 3741
Merging memtable tombstones | 05:51:34,199 | 100.69.184.134 | 3890
Key cache hit for sstable 5 | 05:51:34,199 | 100.69.184.134 | 4040
Seeking to partition beginning in data file | 05:51:34,199 | 100.69.184.134 | 4059
Merging data from memtables and 1 sstables | 05:51:34,200 | 100.69.184.134 | 4412
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 05:51:34,558 | 100.69.176.51 | 91
Peparing statement | 05:51:34,558 | 100.69.176.51 | 238
Enqueuing data request to /100.69.184.134 | 05:51:34,558 | 100.69.176.51 | 567
Sending message to /100.69.184.134 | 05:51:34,558 | 100.69.176.51 | 979
Request complete | 05:51:54,562 | 100.69.176.51 | 20005209
And a trace from when it works:
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 05:55:07,772 | 100.69.176.51 | 0
Message received from /100.69.176.51 | 05:55:07,408 | 100.69.184.134 | 53
Executing single-partition query on user_scores | 05:55:07,409 | 100.69.184.134 | 1014
Acquiring sstable references | 05:55:07,409 | 100.69.184.134 | 1087
Merging memtable tombstones | 05:55:07,410 | 100.69.184.134 | 1209
Partition index with 0 entries found for sstable 5 | 05:55:07,410 | 100.69.184.134 | 1681
Seeking to partition beginning in data file | 05:55:07,410 | 100.69.184.134 | 1732
Merging data from memtables and 1 sstables | 05:55:07,411 | 100.69.184.134 | 2415
Read 1 live and 0 tombstoned cells | 05:55:07,412 | 100.69.184.134 | 3274
Enqueuing response to /100.69.176.51 | 05:55:07,412 | 100.69.184.134 | 3534
Sending message to /100.69.176.51 | 05:55:07,412 | 100.69.184.134 | 3936
Parsing select * from user_scores where user_id='305722020' LIMIT 10000; | 05:55:07,772 | 100.69.176.51 | 96
Peparing statement | 05:55:07,772 | 100.69.176.51 | 262
Enqueuing data request to /100.69.184.134 | 05:55:07,773 | 100.69.176.51 | 600
Sending message to /100.69.184.134 | 05:55:07,773 | 100.69.176.51 | 847
Message received from /100.69.184.134 | 05:55:07,778 | 100.69.176.51 | 6103
Processing response from /100.69.184.134 | 05:55:07,778 | 100.69.176.51 | 6341
Request complete | 05:55:07,778 | 100.69.176.51 | 6780
Looks like I was running into a performance issue with 1.2. Fortunately a patch had just been applied to the 1.2 branch, so when I built from source my problem went away.
see https://issues.apache.org/jira/browse/CASSANDRA-5677 for a detailed explanation.

Cassandra 1.2 huge read latency

I'm working on a 4-node Cassandra 1.2.6 cluster with a single keyspace, a replication factor of 2 (3 originally, but dropped to 2), and 10 or so column families. It is running the Oracle 1.7 JVM. It has a mix of reads and writes, with probably two to three times as many writes as reads.
Even under a small amount of load, I am seeing very large read latencies, and I get quite a few read timeouts (using the DataStax Java driver). Here is an example output of nodetool cfstats for one of the column families:
Column Family: user_scores
SSTable count: 1
SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 7539098
Space used (total): 7549091
Number of Keys (estimate): 42112
Memtable Columns Count: 2267
Memtable Data Size: 1048576
Memtable Switch Count: 2
Read Count: 2101
**Read Latency: 272334.202 ms.**
Write Count: 24947
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 55376
Compacted row minimum size: 447
Compacted row maximum size: 219342
Compacted row mean size: 1051
As you can see, I tried using the leveled compaction strategy to improve read latency, but as you can also see, the latency is huge. I'm a bit stumped. I had a Cassandra 1.1.6 cluster working beautifully, but no luck so far with 1.2.
The cluster is running on VMs with 4 CPUs and 7 GB of RAM. The data drive is set up as a striped RAID across 4 disks. The machine doesn't seem to be IO bound.
I'm running a pretty vanilla configuration, with all the defaults.
I do see strange CPU behavior where the CPU is spiking even under smaller load. Sometimes I see compactions running, but they are niced, so I don't think they are the culprit.
I'm trying to figure out where to go next. Any help appreciated!
[update with rpc_timeout trace]
Still playing with this. Here is an example trace. It looks like the merge step is taking way too long.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 04:57:18,882 | 100.69.176.51 | 0
Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 04:57:18,884 | 100.69.176.51 | 1981
Peparing statement | 04:57:18,885 | 100.69.176.51 | 2997
Executing single-partition query on user_scores | 04:57:18,885 | 100.69.176.51 | 3657
Acquiring sstable references | 04:57:18,885 | 100.69.176.51 | 3724
Merging memtable tombstones | 04:57:18,885 | 100.69.176.51 | 3779
Key cache hit for sstable 32 | 04:57:18,886 | 100.69.176.51 | 3910
Seeking to partition beginning in data file | 04:57:18,886 | 100.69.176.51 | 3930
Merging data from memtables and 1 sstables | 04:57:18,886 | 100.69.176.51 | 4211
Request complete | 04:57:38,891 | 100.69.176.51 | 20009870
Older traces below:
[newer trace]
After addressing the problem noted in the logs by completely rebuilding the cluster data repository, I still ran into the problem, although it took quite a bit longer. Here is a trace I grabbed when in the bad state:
Tracing session: a6dbefc0-ea49-11e2-84bb-ef447a7d9a48
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 16:48:02,755 | 100.69.196.124 | 0
Parsing select * from user_scores limit 1; | 16:48:02,756 | 100.69.196.124 | 1774
Peparing statement | 16:48:02,759 | 100.69.196.124 | 4006
Determining replicas to query | 16:48:02,759 | 100.69.196.124 | 4286
Enqueuing request to /100.69.176.51 | 16:48:02,763 | 100.69.196.124 | 8849
Sending message to cdb002/100.69.176.51 | 16:48:02,764 | 100.69.196.124 | 9456
Message received from /100.69.196.124 | 16:48:03,449 | 100.69.176.51 | 160
Message received from /100.69.176.51 | 16:48:09,646 | 100.69.196.124 | 6891860
Processing response from /100.69.176.51 | 16:48:09,647 | 100.69.196.124 | 6892426
Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 16:48:10,288 | 100.69.176.51 | 6838754
Seeking to partition beginning in data file | 16:48:10,289 | 100.69.176.51 | 6839689
Read 1 live and 0 tombstoned cells | 16:48:10,289 | 100.69.176.51 | 6839927
Seeking to partition beginning in data file | 16:48:10,289 | 100.69.176.51 | 6839998
Read 1 live and 0 tombstoned cells | 16:48:10,289 | 100.69.176.51 | 6840082
Scanned 1 rows and matched 1 | 16:48:10,289 | 100.69.176.51 | 6840162
Enqueuing response to /100.69.196.124 | 16:48:10,289 | 100.69.176.51 | 6840229
Sending message to /100.69.196.124 | 16:48:10,299 | 100.69.176.51 | 6850072
Request complete | 16:48:09,648 | 100.69.196.124 | 6893029
[update]
I should add that things work just dandy with a solo cassandra instance on my macbook pro. AKA Works on my machine...:)
[update with trace data]
Here is some trace data. This is from the Java driver. The downside is that I can only trace the queries that succeed. I make it through a total of 67 queries before every query starts timing out. What is weird is that it doesn't look that bad. Then, at query 68, I no longer get a response, and two of the servers are running hot.
2013-07-11 02:15:45 STDIO [INFO] ***************************************
66:Host (queried): cdb003/100.69.198.47
66:Host (tried): cdb003/100.69.198.47
66:Trace id: c95e51c0-e9cf-11e2-b9a9-5b3c0946787b
66:-----------------------------------------------------+--------------+-----------------+--------------
66: Enqueuing data request to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 200
66: Enqueuing digest request to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 265
66: Sending message to /100.69.196.124 | 02:15:42.045 | /100.69.198.47 | 570
66: Sending message to /100.69.176.51 | 02:15:42.045 | /100.69.198.47 | 574
66: Message received from /100.69.176.51 | 02:15:42.107 | /100.69.198.47 | 62492
66: Processing response from /100.69.176.51 | 02:15:42.107 | /100.69.198.47 | 62780
66: Message received from /100.69.198.47 | 02:15:42.508 | /100.69.196.124 | 31
66: Executing single-partition query on user_scores | 02:15:42.508 | /100.69.196.124 | 406
66: Acquiring sstable references | 02:15:42.508 | /100.69.196.124 | 473
66: Merging memtable tombstones | 02:15:42.508 | /100.69.196.124 | 577
66: Key cache hit for sstable 11 | 02:15:42.508 | /100.69.196.124 | 807
66: Seeking to partition beginning in data file | 02:15:42.508 | /100.69.196.124 | 849
66: Merging data from memtables and 1 sstables | 02:15:42.509 | /100.69.196.124 | 1500
66: Message received from /100.69.198.47 | 02:15:43.379 | /100.69.176.51 | 60
66: Executing single-partition query on user_scores | 02:15:43.379 | /100.69.176.51 | 399
66: Acquiring sstable references | 02:15:43.379 | /100.69.176.51 | 490
66: Merging memtable tombstones | 02:15:43.379 | /100.69.176.51 | 593
66: Key cache hit for sstable 7 | 02:15:43.380 | /100.69.176.51 | 1098
66: Seeking to partition beginning in data file | 02:15:43.380 | /100.69.176.51 | 1141
66: Merging data from memtables and 1 sstables | 02:15:43.380 | /100.69.176.51 | 1912
66: Read 1 live and 0 tombstoned cells | 02:15:43.438 | /100.69.176.51 | 59094
66: Enqueuing response to /100.69.198.47 | 02:15:43.438 | /100.69.176.51 | 59225
66: Sending message to /100.69.198.47 | 02:15:43.438 | /100.69.176.51 | 59373
66:Started at: 02:15:42.044
66:Elapsed time in micros: 63105
2013-07-11 02:15:45 STDIO [INFO] ***************************************
67:Host (queried): cdb004/100.69.184.134
67:Host (tried): cdb004/100.69.184.134
67:Trace id: c9f365d0-e9cf-11e2-a4e5-7f3170333ff5
67:-----------------------------------------------------+--------------+-----------------+--------------
67: Message received from /100.69.184.134 | 02:15:42.536 | /100.69.198.47 | 36
67: Executing single-partition query on user_scores | 02:15:42.536 | /100.69.198.47 | 273
67: Acquiring sstable references | 02:15:42.536 | /100.69.198.47 | 311
67: Merging memtable tombstones | 02:15:42.536 | /100.69.198.47 | 353
67: Key cache hit for sstable 8 | 02:15:42.536 | /100.69.198.47 | 436
67: Seeking to partition beginning in data file | 02:15:42.536 | /100.69.198.47 | 455
67: Merging data from memtables and 1 sstables | 02:15:42.537 | /100.69.198.47 | 811
67: Read 1 live and 0 tombstoned cells | 02:15:42.550 | /100.69.198.47 | 14242
67: Enqueuing response to /100.69.184.134 | 02:15:42.550 | /100.69.198.47 | 14456
67: Sending message to /100.69.184.134 | 02:15:42.551 | /100.69.198.47 | 14694
67: Enqueuing data request to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 | 323
67: Sending message to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 | 565
67: Message received from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 | 17029
67: Processing response from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 | 17230
67:Started at: 02:15:43.021
67:Elapsed time in micros: 17622
And here is a trace using cqlsh:
Tracing session: d0f845d0-e9cf-11e2-8882-ef447a7d9a48
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------+--------------+----------------+----------------
execute_cql3_query | 19:15:54,833 | 100.69.196.124 | 0
Parsing select * from user_scores where user_id='39333433' LIMIT 10000; | 19:15:54,833 | 100.69.196.124 | 103
Peparing statement | 19:15:54,833 | 100.69.196.124 | 455
Executing single-partition query on user_scores | 19:15:54,834 | 100.69.196.124 | 1400
Acquiring sstable references | 19:15:54,834 | 100.69.196.124 | 1468
Merging memtable tombstones | 19:15:54,835 | 100.69.196.124 | 1575
Key cache hit for sstable 11 | 19:15:54,835 | 100.69.196.124 | 1780
Seeking to partition beginning in data file | 19:15:54,835 | 100.69.196.124 | 1822
Merging data from memtables and 1 sstables | 19:15:54,836 | 100.69.196.124 | 2562
Read 1 live and 0 tombstoned cells | 19:15:54,838 | 100.69.196.124 | 4808
Request complete | 19:15:54,838 | 100.69.196.124 | 5810
The trace seems to show that much of the time is doing or waiting for network operations. Perhaps your network has problems?
If only some operations fail, perhaps you have a problem with only one of your nodes. When that node is not needed, things work, but when it is needed things go badly. It might be worth looking at the log files on the other nodes.
Looks like I was running into a performance issue with 1.2. Fortunately a patch had just been applied to the 1.2 branch, so when I built from source my problem went away.
see https://issues.apache.org/jira/browse/CASSANDRA-5677 for a detailed explanation.
Thanks all!
