Cassandra: bad performance of a time series table

I have a 3-node cluster (all on the same 16-core box, each node in its own LXC container with its own 3 TB disk).
My table is this:
CREATE TABLE history (
    id text,
    idx bigint,
    data bigint,
    PRIMARY KEY (id, idx)
) WITH CLUSTERING ORDER BY (idx DESC);
id stores a string identifier, idx is a timestamp in milliseconds, and data holds my values. According to all the examples I found, this seems to be a correct schema for time series data.
My query is:
select idx,data from history where id=? limit 2
This returns the 2 most recent (based on idx) rows.
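For illustration (the id and values below are hypothetical), rows are written like this, and because of the DESC clustering order the LIMIT 2 picks the two newest idx values for that id:
INSERT INTO history (id, idx, data) VALUES ('sensor-1', 1489490182000, 40);
INSERT INTO history (id, idx, data) VALUES ('sensor-1', 1489490183000, 41);
INSERT INTO history (id, idx, data) VALUES ('sensor-1', 1489490184000, 42);
-- returns the rows with idx 1489490184000 and 1489490183000
SELECT idx, data FROM history WHERE id = 'sensor-1' LIMIT 2;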
Since id is the partition key and idx the clustering key, the docs I found claim that this should be very performant in Cassandra. But my benchmarks say otherwise.
I've loaded 400 GB in total (split across those 3 nodes) and am now running queries from a secondary box. Using 16 or 32 threads, I run the query above, but performance is really low for 3 nodes on 3 separate disks:
throughput: 61 avg time: 614,808 μs
throughput: 57 avg time: 519,651 μs
throughput: 52 avg time: 569,245 μs
So, ~55 queries per second, with each query taking about half a second (sometimes they do come back in 200 ms).
I find this really low.
Can someone please tell me if my schema is correct and if not suggest a schema? If my schema is correct, how can I find what is going wrong?
Disk I/O on the 16-core box:
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sda 0.00 0.00 0.00 0 0
sdb 135.00 6.76 0.00 6 0
sdc 149.00 6.99 0.00 6 0
sdd 124.00 7.21 0.00 7 0
The Cassandra processes don't use more than 1 CPU core each.
EDIT: With tracing on, I get a lot of lines like the following when I run a simple query for a single id:
Key cache hit for sstable 33259 | 20:16:26,699 | 127.0.0.1 | 5830
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 5833
Bloom filter allows skipping sstable 33256 | 20:16:26,699 | 127.0.0.1 | 5923
Bloom filter allows skipping sstable 33255 | 20:16:26,699 | 127.0.0.1 | 5932
Bloom filter allows skipping sstable 33252 | 20:16:26,699 | 127.0.0.1 | 5938
Key cache hit for sstable 33247 | 20:16:26,699 | 127.0.0.1 | 5948
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 5951
Bloom filter allows skipping sstable 33246 | 20:16:26,699 | 127.0.0.1 | 6072
Bloom filter allows skipping sstable 33243 | 20:16:26,699 | 127.0.0.1 | 6081
Key cache hit for sstable 33242 | 20:16:26,699 | 127.0.0.1 | 6092
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 6095
Bloom filter allows skipping sstable 33240 | 20:16:26,699 | 127.0.0.1 | 6187
Key cache hit for sstable 33237 | 20:16:26,699 | 127.0.0.1 | 6198
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 6201
Key cache hit for sstable 33235 | 20:16:26,699 | 127.0.0.1 | 6297
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 6301
Bloom filter allows skipping sstable 33234 | 20:16:26,699 | 127.0.0.1 | 6393
Key cache hit for sstable 33229 | 20:16:26,699 | 127.0.0.1 | 6404
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 6408
Bloom filter allows skipping sstable 33228 | 20:16:26,699 | 127.0.0.1 | 6496
Key cache hit for sstable 33227 | 20:16:26,699 | 127.0.0.1 | 6508
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 6511
Key cache hit for sstable 33226 | 20:16:26,699 | 127.0.0.1 | 6601
Seeking to partition beginning in data file | 20:16:26,699 | 127.0.0.1 | 6605
Key cache hit for sstable 33225 | 20:16:26,700 | 127.0.0.1 | 6692
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 6696
Key cache hit for sstable 33223 | 20:16:26,700 | 127.0.0.1 | 6785
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 6789
Key cache hit for sstable 33221 | 20:16:26,700 | 127.0.0.1 | 6876
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 6880
Bloom filter allows skipping sstable 33219 | 20:16:26,700 | 127.0.0.1 | 6967
Key cache hit for sstable 33377 | 20:16:26,700 | 127.0.0.1 | 6978
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 6981
Key cache hit for sstable 33208 | 20:16:26,700 | 127.0.0.1 | 7071
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 7075
Key cache hit for sstable 33205 | 20:16:26,700 | 127.0.0.1 | 7161
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 7166
Bloom filter allows skipping sstable 33201 | 20:16:26,700 | 127.0.0.1 | 7251
Bloom filter allows skipping sstable 33200 | 20:16:26,700 | 127.0.0.1 | 7260
Key cache hit for sstable 33195 | 20:16:26,700 | 127.0.0.1 | 7276
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 7279
Bloom filter allows skipping sstable 33191 | 20:16:26,700 | 127.0.0.1 | 7363
Key cache hit for sstable 33190 | 20:16:26,700 | 127.0.0.1 | 7374
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 7377
Bloom filter allows skipping sstable 33189 | 20:16:26,700 | 127.0.0.1 | 7463
Key cache hit for sstable 33186 | 20:16:26,700 | 127.0.0.1 | 7474
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 7477
Key cache hit for sstable 33183 | 20:16:26,700 | 127.0.0.1 | 7563
Seeking to partition beginning in data file | 20:16:26,700 | 127.0.0.1 | 7567
Bloom filter allows skipping sstable 33182 | 20:16:26,701 | 127.0.0.1 | 7663
Bloom filter allows skipping sstable 33180 | 20:16:26,701 | 127.0.0.1 | 7672
Bloom filter allows skipping sstable 33178 | 20:16:26,701 | 127.0.0.1 | 7679
Bloom filter allows skipping sstable 33177 | 20:16:26,701 | 127.0.0.1 | 7686
Perhaps most important is the end of the trace:
Merging data from memtables and 277 sstables | 20:21:29,186 | 127.0.0.1 | 607001
Read 3 live and 0 tombstoned cells | 20:21:29,186 | 127.0.0.1 | 607205
Request complete | 20:21:29,186 | 127.0.0.1 | 607714

Do look at tracing to confirm, but if sdb, sdc, and sdd are spinning disks, you are seeing the right order of magnitude of tps, and you are very likely bound by random disk I/O on the read side.
If that is the case, then you only have two options (with any system, not specific to Cassandra):
Switch to SSDs. My personal testing has demonstrated up to 3 orders of magnitude increased random read performance when the workload was entirely bound by the tps of the disks.
Ensure that a very large percentage of your reads are cached. If you are doing random reads across 400 GB of data, that is probably not feasible.
Cassandra can do roughly 3,000-5,000 operations (read or write) per CPU core, but only if the disk subsystem isn't the limiting factor.
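Beyond tracing, a couple of nodetool checks can help confirm whether reads are touching many sstables or missing the key cache (ks below is a placeholder for your actual keyspace name):
nodetool cfstats ks.history        # sstable count, read latency, bloom filter false positives
nodetool cfhistograms ks history   # distribution of sstables touched per read
nodetool info                      # key cache size and recent hit rate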

Related

Cassandra OOM with large RangeTombstoneList objects in memory

We have had problems with our Cassandra nodes going OOM for some time, so we finally configured them to produce a heap dump so we could try to see what was causing the OOM.
In the dump there were 16 threads (named SharedPool-Worker-XX), each executing a SliceFromReadCommand against the same table. Each of the 16 threads had a RangeTombstoneList object retaining between 200 and 240 MB of memory.
The table in question is used as a queue (I know, far from the ideal use case for Cassandra, but that is how it is) between two applications, where one writes and the other reads. So it is not unlikely that there is a large number of tombstones in the table; however, I have been unable to find them.
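(The table definition isn't shown; judging from the query and result columns in the trace below, it is presumably something along these lines, with the column types being my guess:)
CREATE TABLE ddp.file_download (
    ti text,
    uuid timeuuid,
    json_data text,
    PRIMARY KEY (ti, uuid)
);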
I did a trace on the query issued against the table, and it resulted in the following:
cqlsh:ddp> select json_data
... from file_download
... where ti = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
ti | uuid | json_data
----+------+-----------
(0 rows)
Tracing session: b2f2be60-01c8-11e8-ae90-fd18acdea80d
activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------
Execute CQL3 query | 2018-01-25 12:10:14.214000 | 10.60.73.232 | 0
Parsing select json_data\nfrom file_download\nwhere ti = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'; [SharedPool-Worker-1] | 2018-01-25 12:10:14.260000 | 10.60.73.232 | 105
Preparing statement [SharedPool-Worker-1] | 2018-01-25 12:10:14.262000 | 10.60.73.232 | 197
Executing single-partition query on file_download [SharedPool-Worker-4] | 2018-01-25 12:10:14.263000 | 10.60.73.232 | 442
Acquiring sstable references [SharedPool-Worker-4] | 2018-01-25 12:10:14.264000 | 10.60.73.232 | 491
Merging memtable tombstones [SharedPool-Worker-4] | 2018-01-25 12:10:14.265000 | 10.60.73.232 | 517
Bloom filter allows skipping sstable 2444 [SharedPool-Worker-4] | 2018-01-25 12:10:14.270000 | 10.60.73.232 | 608
Bloom filter allows skipping sstable 8 [SharedPool-Worker-4] | 2018-01-25 12:10:14.271000 | 10.60.73.232 | 665
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-4] | 2018-01-25 12:10:14.273000 | 10.60.73.232 | 700
Merging data from memtables and 0 sstables [SharedPool-Worker-4] | 2018-01-25 12:10:14.274000 | 10.60.73.232 | 707
Read 0 live and 0 tombstone cells [SharedPool-Worker-4] | 2018-01-25 12:10:14.274000 | 10.60.73.232 | 754
Request complete | 2018-01-25 12:10:14.215148 | 10.60.73.232 | 1148
cfstats also shows avg./max tombstones per slice to be 0.
This makes me question whether the OOM was actually caused by a large number of tombstones, or by something else.
We are running Cassandra v2.1.17.

Counting partition size in cassandra

I'm implementing a unique-entry counter in Cassandra. The counter may be represented simply as a set of tuples:
counter_id = broadcast:12345, token = user:123
counter_id = broadcast:12345, token = user:321
where the value of counter broadcast:12345 is the size of the corresponding set of entries. Such a counter can be stored effectively as a table with counter_id as the partition key. My first thought was that, since a single counter value is basically the size of a partition, I could run a count(1) WHERE counter_id = ? query, which wouldn't need to read the data and would be super-duper fast. However, I see the following trace output:
cqlsh > select count(1) from token_counter_storage where id = '1';
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+------------+----------------
Execute CQL3 query | 2016-06-10 11:22:42.809000 | 172.17.0.2 | 0
Parsing select count(1) from token_counter_storage where id = '1'; [SharedPool-Worker-1] | 2016-06-10 11:22:42.809000 | 172.17.0.2 | 260
Preparing statement [SharedPool-Worker-1] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 565
Executing single-partition query on token_counter_storage [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1256
Acquiring sstable references [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1350
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1465
Merging data from memtables and 0 sstables [SharedPool-Worker-2] | 2016-06-10 11:22:42.810000 | 172.17.0.2 | 1546
Read 10 live and 0 tombstone cells [SharedPool-Worker-2] | 2016-06-10 11:22:42.811000 | 172.17.0.2 | 1826
Request complete | 2016-06-10 11:22:42.811410 | 172.17.0.2 | 2410
I guess this trace confirms that data is being read from disk. Am I right in this conclusion, and if so, is there any way to simply fetch the partition size from an index without any extra disk hits?
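(For reference, the table behind this trace would be something along the lines below; the token column name and the types are assumptions based on the description above. Counting unique tokens for one counter then means counting the rows of a single partition.)
CREATE TABLE token_counter_storage (
    id text,
    token text,
    PRIMARY KEY (id, token)
);
SELECT count(1) FROM token_counter_storage WHERE id = 'broadcast:12345';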

How to get tombstone count for a cql query?

I am trying to evaluate the number of tombstones created in one of the tables in our application. For that I am trying to use nodetool cfstats. Here is how I am doing it:
create table demo.test(a int, b int, c int, primary key (a));
insert into demo.test(a, b, c) values(1,2,3);
Now I make the same insert as above, so I expect 3 tombstones to be created. But on running cfstats for this column family, I still see that no tombstones have been created.
nodetool cfstats demo.test
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
I then tried deleting the record, but I still don't see any tombstones being created. Is there anything I am missing here? Please suggest.
BTW a few other details,
* We are using version 2.1.1 of the Java driver
* We are running against Cassandra 2.1.0
For tombstone counts on a query, your best bet is to enable tracing. This will give you the in-depth history of a query, including how many tombstones had to be read to complete it. It won't give you the total tombstone count, but it is most likely more relevant for performance tuning.
In cqlsh you can enable this with
cqlsh> tracing on;
Now tracing requests.
cqlsh> SELECT * FROM ascii_ks.ascii_cs where pkey = 'One';
pkey | ckey1 | data1
------+-------+-------
One | One | One
(1 rows)
Tracing session: 2569d580-719b-11e4-9dd6-557d7f833b69
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------+--------------+-----------+----------------
execute_cql3_query | 08:26:28,953 | 127.0.0.1 | 0
Parsing SELECT * FROM ascii_ks.ascii_cs where pkey = 'One' LIMIT 10000; | 08:26:28,956 | 127.0.0.1 | 2635
Preparing statement | 08:26:28,960 | 127.0.0.1 | 6951
Executing single-partition query on ascii_cs | 08:26:28,962 | 127.0.0.1 | 9097
Acquiring sstable references | 08:26:28,963 | 127.0.0.1 | 10576
Merging memtable contents | 08:26:28,963 | 127.0.0.1 | 10618
Merging data from sstable 1 | 08:26:28,965 | 127.0.0.1 | 12146
Key cache hit for sstable 1 | 08:26:28,965 | 127.0.0.1 | 12257
Collating all results | 08:26:28,965 | 127.0.0.1 | 12402
Request complete | 08:26:28,965 | 127.0.0.1 | 12638
http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2
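One note on the setup above: re-running the same INSERT only overwrites the existing cells and does not create tombstones. Tombstones come from deletes, explicit null writes, and expired TTLs, for example:
DELETE FROM demo.test WHERE a = 1;                    -- partition-level tombstone
INSERT INTO demo.test (a, b, c) VALUES (1, null, 3);  -- writing null creates a cell tombstone
UPDATE demo.test USING TTL 60 SET b = 2 WHERE a = 1;  -- the cell becomes a tombstone once the TTL expires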

Cassandra query very slow when database is large

I have a table with 45 million keys.
Compaction strategy is LCS.
Single node Cassandra version 2.0.5.
No key or row caching.
Queries are very slow. They were several times faster when the database was smaller.
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------+--------------+-------------+----------------
execute_cql3_query | 17:47:07,891 | 10.72.9.151 | 0
Parsing select * from object_version_info where key = 'test10-Client9_99900' LIMIT 10000; | 17:47:07,891 | 10.72.9.151 | 80
Preparing statement | 17:47:07,891 | 10.72.9.151 | 178
Executing single-partition query on object_version_info | 17:47:07,893 | 10.72.9.151 | 2513
Acquiring sstable references | 17:47:07,893 | 10.72.9.151 | 2539
Merging memtable tombstones | 17:47:07,893 | 10.72.9.151 | 2597
Bloom filter allows skipping sstable 1517 | 17:47:07,893 | 10.72.9.151 | 2652
Bloom filter allows skipping sstable 1482 | 17:47:07,893 | 10.72.9.151 | 2677
Partition index with 0 entries found for sstable 1268 | 17:47:08,560 | 10.72.9.151 | 669935
Seeking to partition beginning in data file | 17:47:08,560 | 10.72.9.151 | 669956
Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 17:47:09,411 | 10.72.9.151 | 1520279
Merging data from memtables and 1 sstables | 17:47:09,411 | 10.72.9.151 | 1520302
Read 1 live and 0 tombstoned cells | 17:47:09,411 | 10.72.9.151 | 1520351
Request complete | 17:47:09,411 | 10.72.9.151 | 1520615

What do the cqlsh tracing entries mean?

activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 06:30:52,479 | 192.168.11.23 | 0
Parsing select adid from userlastadevents where userid = '90000012' and type in (1,2,3) LIMIT 10000; | 06:30:52,479 | 192.168.11.23 | 44
Preparing statement | 06:30:52,479 | 192.168.11.23 | 146
Executing single-partition query on userlastadevents | 06:30:52,480 | 192.168.11.23 | 665
Acquiring sstable references | 06:30:52,480 | 192.168.11.23 | 680
Executing single-partition query on userlastadevents | 06:30:52,480 | 192.168.11.23 | 696
Acquiring sstable references | 06:30:52,480 | 192.168.11.23 | 704
Merging memtable tombstones | 06:30:52,480 | 192.168.11.23 | 706
Merging memtable tombstones | 06:30:52,480 | 192.168.11.23 | 721
Bloom filter allows skipping sstable 37398 | 06:30:52,480 | 192.168.11.23 | 758
Bloom filter allows skipping sstable 37426 | 06:30:52,480 | 192.168.11.23 | 762
Bloom filter allows skipping sstable 35504 | 06:30:52,480 | 192.168.11.23 | 768
Bloom filter allows skipping sstable 36671 | 06:30:52,480 | 192.168.11.23 | 771
Merging data from memtables and 0 sstables | 06:30:52,480 | 192.168.11.23 | 777
Merging data from memtables and 0 sstables | 06:30:52,480 | 192.168.11.23 | 780
Executing single-partition query on userlastadevents | 06:30:52,480 | 192.168.11.23 | 782
Acquiring sstable references | 06:30:52,480 | 192.168.11.23 | 791
Read 0 live and 0 tombstoned cells | 06:30:52,480 | 192.168.11.23 | 797
Read 0 live and 0 tombstoned cells | 06:30:52,480 | 192.168.11.23 | 800
Merging memtable tombstones | 06:30:52,480 | 192.168.11.23 | 815
Bloom filter allows skipping sstable 37432 | 06:30:52,480 | 192.168.11.23 | 857
Bloom filter allows skipping sstable 36918 | 06:30:52,480 | 192.168.11.23 | 866
Merging data from memtables and 0 sstables | 06:30:52,480 | 192.168.11.23 | 874
Read 0 live and 0 tombstoned cells | 06:30:52,480 | 192.168.11.23 | 898
Request complete | 06:30:52,479 | 192.168.11.23 | 990
Above is the tracing output from cqlsh for a single query, but I couldn't understand some of the entries. First, the source_elapsed column: does it mean the time taken by that particular task, or the cumulative time elapsed up to that task? Second, the timestamp column doesn't maintain chronology: the "Request complete" timestamp is 06:30:52,479, but "Merging data from memtables and 0 sstables", which is supposed to happen earlier, has the later timestamp 06:30:52,480.
I also couldn't understand some of the activities:
Executing single-partition query -- does this cover the whole task, or is it just a starting point? What work does it include? And why does it appear three times? Is that linked to the replication factor?
Acquiring sstable references -- what does this mean? Does it check every sstable's bloom filter to see whether it might contain the key being searched for, and then locate the position in the data file with the help of the partition index?
Bloom filter allows skipping sstable -- when and how does this happen? It seems to take about as long as acquiring the sstable references.
Request complete -- what does this mean? Is it just the finish line, or is it a step that itself takes most of the time?
Have you seen the Request tracing in Cassandra link that explains different tracing scenarios?
source_elapsed: the cumulative execution time on a specific node (if you check the above link it will be clearer)
Executing single-partition query: (seems to represent) the start time
Request complete: all work has been done for this request
For the rest you'd be better off reading the Reads in Cassandra docs as that would be much more detailed than I could summarize it here.
