Low read throughput with Cassandra (time series data) - cassandra

We are in the process of researching a move to Cassandra (2.0.10) and we are testing the write and read performance.
When reading, we are seeing what looks like low read throughput: about 14 MB/s on average.
Our current test environment is a single node, Xeon E5-1620 @ 3.7 GHz with 32 GB of RAM, Windows 7.
The Cassandra heap is set to 8 GB with the default concurrent reads and writes, the key cache size is set to 400 MB, and the data sits on a local RAID10 array that sustains an average of 300 MB/s of sequential reads at 64 KB and larger block sizes.
We are storing hourly sensor data with the current model:
CREATE TABLE IF NOT EXISTS sensor_data_by_day (
    sensor_id int,
    date text,
    event_time timestamp,
    load float,
    PRIMARY KEY ((sensor_id, date), event_time))
Reading is done by sensor, date, and a range of event times.
The current data set is 2 years' worth of data for 100K sensors, about 30 GB on disk.
Data is inserted by numerous threads (so the inserts are not sorted by event time, if that matters).
Reading back a day's worth of data takes about 2 minutes at a throughput of 14 MB/s.
Reading is done using the java-cassandra-connector with a prepared statement:
SELECT event_time, load FROM sensor_data_by_day WHERE sensor_id = ? AND date IN ('2014-02-02') AND event_time >= ? AND event_time < ?
We create one connection and submit tasks (100K queries, one per sensor) to an executor service with a pool of 100 threads.
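Purely for illustration (the question uses the Java connector), the read pattern described above might look roughly like this with the DataStax Python driver; the contact point, keyspace name, and time bounds are placeholders, not values from the question:
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sensors')   # placeholder contact point and keyspace

# A single-value IN ('2014-02-02') is equivalent to an equality test on date.
query = session.prepare(
    "SELECT event_time, load FROM sensor_data_by_day "
    "WHERE sensor_id = ? AND date = ? AND event_time >= ? AND event_time < ?")

def read_day(sensor_id, day, start, end):
    # One bound statement per sensor; returns that sensor's rows for the whole day.
    return list(session.execute(query, (sensor_id, day, start, end)))

start, end = datetime(2014, 2, 2), datetime(2014, 2, 3)
with ThreadPoolExecutor(max_workers=100) as pool:        # pool of 100 threads
    futures = [pool.submit(read_day, s, '2014-02-02', start, end)
               for s in range(100000)]                   # one task per sensor
    results = [f.result() for f in futures]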
Reading when the data is in the cache takes about 7s.
It's probably not a client problem: we tested with the data on an SSD and the total time went down from 2 minutes to 10 s (~170 MB/s), which is understandably better given it's an SSD.
The read performance looks like a block-size issue: reads this slow would make sense if Cassandra were reading in 4 KB blocks. I read that the default is 256 KB, but I didn't find the setting anywhere to confirm it. Or is it perhaps a random I/O issue?
Is this the kind of read performance you should expect from Cassandra when using mechanical disks? Or is it perhaps a modeling problem?
Output of cfhistograms:
SSTables per Read
1 sstables: 844726
2 sstables: 90
Write Latency (microseconds)
No Data
Read Latency (microseconds)
5 us: 418
6 us: 15252
7 us: 12884
8 us: 15447
10 us: 34211
12 us: 48972
14 us: 48421
17 us: 56641
20 us: 12484
24 us: 8325
29 us: 6602
35 us: 4953
42 us: 5427
50 us: 3610
60 us: 1784
72 us: 2414
86 us: 11208
103 us: 38395
124 us: 82050
149 us: 64840
179 us: 40161
215 us: 30891
258 us: 17691
310 us: 8787
372 us: 4171
446 us: 2305
535 us: 1588
642 us: 1187
770 us: 913
924 us: 811
1109 us: 716
1331 us: 602
1597 us: 513
1916 us: 513
2299 us: 516
2759 us: 595
3311 us: 776
3973 us: 1086
4768 us: 1502
5722 us: 2212
6866 us: 3264
8239 us: 4852
9887 us: 7586
11864 us: 11429
14237 us: 17236
17084 us: 22285
20501 us: 26163
24601 us: 26799
29521 us: 24311
35425 us: 22101
42510 us: 19420
51012 us: 16497
61214 us: 13830
73457 us: 11356
88148 us: 8749
105778 us: 6243
126934 us: 4406
152321 us: 2751
182785 us: 1754
219342 us: 977
263210 us: 497
315852 us: 233
379022 us: 109
454826 us: 60
545791 us: 21
654949 us: 10
785939 us: 2
943127 us: 0
1131752 us: 1
Partition Size (bytes)
179 bytes: 151874
215 bytes: 0
258 bytes: 0
310 bytes: 0
372 bytes: 5071
446 bytes: 0
535 bytes: 4170
642 bytes: 3724
770 bytes: 3454
924 bytes: 3416
1109 bytes: 3489
1331 bytes: 9179
1597 bytes: 11616
1916 bytes: 12435
2299 bytes: 19038
2759 bytes: 20653
3311 bytes: 10245454
3973 bytes: 25121333
Cell Count per Partition
4 cells: 151874
5 cells: 0
6 cells: 0
7 cells: 0
8 cells: 5071
10 cells: 0
12 cells: 4170
14 cells: 0
17 cells: 3724
20 cells: 3454
24 cells: 3416
29 cells: 3489
35 cells: 3870
42 cells: 9982
50 cells: 13521
60 cells: 20108
72 cells: 16678
86 cells: 51646
103 cells: 35323903

What kind of compaction do you use? If you are seeing bad read latency from disk, it is mostly because of the number of SSTables.
My suggestions:
If you are looking for better read latency, I would suggest using leveled compaction. Configure the SSTable size to avoid too many compactions.
With leveled compaction, the maximum number of SSTables read per query is bounded by the number of levels, so read performance should be much better.
This comes at the cost of more compactions (if the SSTable size is smaller) and higher disk I/O.
What is your current bloom filter size? Increasing it will decrease the probability of false positives, again improving reads.
You seem to have a pretty good key cache set up. If you have specific rows that are read frequently, you can turn on the row cache, though this is generally not recommended, as the advantage is minimal for most applications.
If the data is always going to be time series, maybe use date-tiered compaction?
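As a sketch only (not from the question), the two compaction suggestions above could be applied to the question's table roughly like this via the DataStax Python driver; the contact point, keyspace, and the sstable_size_in_mb value are placeholders, and date-tiered compaction only exists in releases newer than the 2.0.10 used in the question, if I remember right:
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sensors')   # placeholder keyspace

# Leveled compaction bounds how many SSTables a single read can touch.
session.execute("""
    ALTER TABLE sensor_data_by_day
    WITH compaction = {'class': 'LeveledCompactionStrategy',
                       'sstable_size_in_mb': 160}
""")

# Alternative for append-only time series (requires a newer Cassandra release):
# session.execute("""
#     ALTER TABLE sensor_data_by_day
#     WITH compaction = {'class': 'DateTieredCompactionStrategy'}
# """)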

Related

How do I create a loop for reading lines and then printing them out with a format (Python 3.x)

Hi, I'm trying to create a readline loop and then print each line out individually with a format, but every time I do, it just repeats itself.
Here is my code:
LF = open('fees.txt', 'r')
print('Now the final table\n')
print("Airline", format("1st bag",">15"),format("2nd bag",">15"), \
      format("Change Fee",">15"),format("Other Fee",">15"), \
      format("Feel Like",">15"),'\n')
line = LF
while line != '':
    line = str(line)
    line = LF.readline()
    line = line.rstrip('\n')
    print(line, format(line,'>10'),format(line,'>15'), format(line,'>15'), \
          format(line,'>15'), format(line,'>15'),'\n')
LF.close()
print('===================================================\n')
and the result always turns out like this:
Now the final table
Airline 1st bag 2nd bag Change Fee Other Fee Feel Like
Southwest Southwest Southwest Southwest Southwest Southwest
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
Yes! Yes! Yes! Yes! Yes! Yes!
JetBlue JetBlue JetBlue JetBlue JetBlue JetBlue
20 20 20 20 20 20
35 35 35 35 35 35
75 75 75 75 75 75
125 125 125 125 125 125
Yikes Yikes Yikes Yikes Yikes Yikes
Alaska Airlines Alaska Airlines Alaska Airlines Alaska Airlines Alaska Airlines Alaska Airlines
25 25 25 25 25 25
25 25 25 25 25 25
125 125 125 125 125 125
155 155 155 155 155 155
Ooof Ooof Ooof Ooof Ooof Ooof
Delta Delta Delta Delta Delta Delta
25 25 25 25 25 25
35 35 35 35 35 35
200 200 200 200 200 200
150 150 150 150 150 150
We Lost Track We Lost Track We Lost Track We Lost Track We Lost Track We Lost Track
United United United United United United
25 25 25 25 25 25
35 35 35 35 35 35
200 200 200 200 200 200
250 250 250 250 250 250
Whaaaaat? Whaaaaat? Whaaaaat? Whaaaaat? Whaaaaat? Whaaaaat?
Am. Airlines Am. Airlines Am. Airlines Am. Airlines Am. Airlines Am. Airlines
25 25 25 25 25 25
35 35 35 35 35 35
200 200 200 200 200 200
205 205 205 205 205 205
Arrrgh! Arrrgh! Arrrgh! Arrrgh! Arrrgh! Arrrgh!
Spirit Spirit Spirit Spirit Spirit Spirit
30 30 30 30 30 30
40 40 40 40 40 40
100 100 100 100 100 100
292 292 292 292 292 292
Really?? Really?? Really?? Really?? Really?? Really??
===================================================
How do I fix it so that it turns out like this:
Southwest 0 0 0 0 Yes !
Jetblue 20 35 75 125 Yikes!
and so on and so forth.
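One possible fix, assuming fees.txt stores six lines per airline (name, 1st bag, 2nd bag, change fee, other fee, and the "feel like" comment); that layout is a guess, not something stated in the question:
with open('fees.txt', 'r') as LF:
    print('Now the final table\n')
    print("Airline", format("1st bag", ">15"), format("2nd bag", ">15"),
          format("Change Fee", ">15"), format("Other Fee", ">15"),
          format("Feel Like", ">15"), '\n')
    while True:
        airline = LF.readline().rstrip('\n')
        if airline == '':                     # readline() returns '' at end of file
            break
        # Read the five fee/comment lines that belong to this airline.
        fields = [LF.readline().rstrip('\n') for _ in range(5)]
        print(airline, *(format(f, '>15') for f in fields))
print('===================================================\n')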

How to read the cassandra nodetool histograms percentile and other columns?

How to read the Cassandra nodetool histograms percentile and other columns?
Percentile   SSTables   Write Latency   Read Latency   Partition Size   Cell Count
                              (micros)       (micros)          (bytes)
50%              1.00          14.24         4055.27             149             2
75%             35.00          17.08        17436.92             149             2
95%             35.00          24.60        74975.55             642             2
98%             86.00          35.43       129557.75             770             2
99%            103.00          51.01       186563.16             770             2
Min              0.00           2.76           51.01             104             2
Max            124.00    36904729.27     12359319.16             924             2
They show the distribution of the metrics. For example, in your data, the write latency for 95% of the requests was 24.60 microseconds or less. 95% of the partitions are 642 bytes or less with 2 cells. The SSTables column is how many SSTables are touched by a read, so 95% of read requests look at 35 SSTables or fewer (this is fairly high).

Issues with Scaling horizontally with Cassandra NoSQL

I am trying to configure and benchmark my AWS EC2 instances for a Cassandra deployment with the DataStax Community Edition. I'm working with one cluster so far, and I'm having issues with the horizontal scaling.
I'm running the cassandra-stress tool to stress the nodes and I'm not seeing horizontal scaling. The command is run from an EC2 instance that is on the same network as the nodes but is not itself a node (i.e. I'm not using one of the nodes to launch the command).
I have inputted the following:
cassandra-stress write n=1000000 cl=one -mode native cql3 -schema keyspace="keyspace1" -pop seq=1..1000000 -node ip1,ip2
I started with 2 nodes, then 3, and then 6. But the numbers don't show what Cassandra is supposed to do: adding more nodes to a cluster should speed up reads/writes.
Results                    2 Nodes   3 Nodes   3 Nodes   6 Nodes   6 Nodes   6 Nodes   6 Nodes
                                1M        1M        2M        1M        2M        6M       10M
op rate                       6858      6049      6804      7711      7257      7531      8081
partition rate                6858      6049      6804      7711      7257      7531      8081
row rate                      6858      6049      6804      7711      7257      7531      8081
latency mean                  29.1      33        29.3      25.9      27.5      26.5      24.7
latency median                24.9      32.1      24        22.6      23.1      21.8      21.5
latency 95th percentile       57.9      73.3      62        50        56.2      52.1      40.2
latency 99th percentile       76        92.2      77.4      65.3      69.1      61.8      46.4
latency 99.9th percentile     87        103.4     83.5      76.2      75.7      64.9      48.1
latency max                   561.1     587.1     1075      503.1     521.7     1662.3    590.3
total gc count                0         0         0         0         0         0         0
total gc mb                   0         0         0         0         0         0         0
total gc time (s)             0         0         0         0         0         0         0
avg gc time (ms)              NaN       NaN       NaN       NaN       NaN       NaN       NaN
stdev gc time (ms)            0         0         0         0         0         0         0
Total operation time       0:02:25   0:02:45   0:04:53   0:02:09   0:04:35   0:13:16   0:20:37
Each with the default keyspace1 that was provided.
I've tested 3 nodes with 1M and 2M iterations. With 6 nodes I've tried 1M, 2M, 6M, and 10M. As I increase the iterations, the op rate only increases marginally.
Am I doing something wrong, or do I have Cassandra backward? Right now RF = 1, as I don't want to add latency for replication. I just want to see the long-term horizontal scaling, and I'm not seeing it.
Help?

Does Cassandra read the whole row when limiting the number of requested results?

I am using Cassandra 2.0.6 and have this table:
CREATE TABLE t (
    id text,
    idx bigint,
    data bigint,
    PRIMARY KEY (id, idx)
)
So say I got these rows:
id / idx / data
x 1 data1
x 2 data2
x 3 data3
.... goes on say 1000 rows for x
If I query :
select * from t where id='x' order by idx limit 1
Will Cassandra fetch all 1000 rows, or only a small part of them?
Reading articles like http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/#.UzrvLKZx2PI, it seems it will fetch only a small part. But running some stress tests, the more data I have in the table, the more MB/sec of disk I/O (reads) I get.
For 8GB of data I was getting 3MB/sec IO (reads)
For 12GB of data I was getting 15MB/sec IO (reads)
For 20GB of data, I am currently getting 35MB/sec IO (reads)
I don't see anything weird in cfhistograms:
SSTables per Read
1 sstables: 421010
2 sstables: 552
3 sstables: 9
4 sstables: 0
5 sstables: 254
6 sstables: 3221
7 sstables: 3063
8 sstables: 1029
10 sstables: 143
Read Latency (microseconds)
12 us: 6
14 us: 36
17 us: 471
20 us: 2795
24 us: 10799
29 us: 18594
35 us: 24693
42 us: 43078
50 us: 67438
60 us: 68872
72 us: 70718
86 us: 47300
103 us: 23471
124 us: 11752
149 us: 4509
179 us: 1437
215 us: 832
258 us: 3444
310 us: 7883
372 us: 2374
446 us: 736
535 us: 624
642 us: 581
770 us: 1875
924 us: 1715
1109 us: 2889
1331 us: 3705
1597 us: 2197
1916 us: 1320
2299 us: 826
2759 us: 639
3311 us: 431
3973 us: 312
4768 us: 213
5722 us: 106
6866 us: 72
8239 us: 44
9887 us: 36
11864 us: 25
14237 us: 16
17084 us: 23
20501 us: 20
24601 us: 15
29521 us: 28
35425 us: 21
42510 us: 20
51012 us: 49
61214 us: 49
73457 us: 29
88148 us: 23
105778 us: 35
126934 us: 23
152321 us: 17
182785 us: 13
219342 us: 10
263210 us: 8
315852 us: 3
379022 us: 8
454826 us: 10
You get more I/O because you are ordering and limiting on the fly. If you are sure about the order in which you want to fetch the data, use a clustering order on the column family at creation time:
CREATE TABLE tablename (.......) WITH CLUSTERING ORDER BY (idx DESC)
This way, all your rows are stored ordered by idx in descending order by default. Hence, when you apply a limit, you reduce the disk I/O.
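As a rough sketch of that suggestion, the question's table could be recreated with a descending clustering order, shown here through the DataStax Python driver; the contact point and keyspace are placeholders:
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')   # placeholder keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS t (
        id   text,
        idx  bigint,
        data bigint,
        PRIMARY KEY (id, idx)
    ) WITH CLUSTERING ORDER BY (idx DESC)
""")
# A query like SELECT * FROM t WHERE id = 'x' ORDER BY idx DESC LIMIT 1
# now matches the on-disk order, so the read can stop after the first row.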
Once you have set the clustering order, the ordering cost is gone. If you are still facing problems with large amounts of data, it is likely due to the compaction strategy used. I suspect you are using a size-tiered compaction strategy on a read-heavy column family. Try the same scenario with the leveled compaction strategy.
When you use size-tiered compaction, your data is spread across multiple SSTables and you are bound to read from all of them each time, so a read-heavy column family doesn't do well with it.
I found out that I was actually accidentally exhausting the result set iterator; I fixed that and now I/O is normal.

Most efficient compression for an extremely large data set

I'm currently generating an extremely large data set on a remote HPC (high performance computer). We are talking about 3 TB at the moment, and it could reach up to 10 TB once I'm done.
Each of the 450,000 files ranges from a few KB to about 100 MB and contains lines of integers with no repetitive/predictable patterns. Moreover, they are split among 150 folders (I use the path to classify them according to the input parameters). That could be fine, but my research group is technically limited to 1 TB of disk space on the remote server, although the admins are willing to look the other way until the situation gets sorted out.
What would you recommend to compress such a dataset?
A limitation is that tasks can't run for more than 48 hours at a time on this computer, so long but efficient compression methods are possible only if 48 hours is enough... I really have no other option, as neither I nor my group own enough disk space on other machines.
EDIT: Just to clarify, this is a remote computer that runs some variant of Linux. All standard compression tools are available. I don't have superuser rights.
EDIT2: As requested by Sergio, here is a sample output (the first 10 lines of a file):
27 42 46 63 95 110 205 227 230 288 330 345 364 367 373 390 448 471 472 482 509 514 531 533 553 617 636 648 667 682 703 704 735 740 762 775 803 813 882 915 920 936 939 942 943 979 1018 1048 1065 1198 1219 1228 1513 1725 1888 1944 2085 2190 2480 5371 5510 5899 6788 7728 9514 10382 11946 13063 13808 16070 23301 23511 24538
93 94 106 143 157 164 168 181 196 293 299 334 369 372 439 457 508 527 547 557 568 570 573 592 601 668 701 704 799 838 848 870 875 882 890 913 953 959 1022 1024 1037 1046 1169 1201 1288 1615 1684 1771 2043 2204 2348 2387 2735 3149 4319 4890 4989 5321 5588 6453 7475 9277 9649 9654 11433 16966
1463
183 469 514 597 792
25 50 143 152 205 244 253 424 433 446 461 476 486 545 552 570 632 642 647 665 681 682 718 735 746 772 792 811 830 851 891 903 925 1037 1115 1147 1171 1612 1979 2749 3074 3158 6042 12709 20571 20859
24 30 86 312 726 875 1023 1683 1799
33 36 42 65 110 112 122 227 241 262 274 284 305 328 353 366 393 414 419 449 462 488 489 514 635 690 732 744 767 772 812 820 843 844 855 889 893 925 936 939 981 1015 1020 1060 1064 1130 1174 1304 1393 1477 1939 2004 2200 2205 2208 2216 2234 3284 4456 5209 6810 6834 8067 10811 10895 12771 15291
157 761 834 875 1001 2492
21 141 146 169 181 256 266 337 343 367 397 402 405 433 454 466 513 527 656 684 708 709 732 743 811 883 913 938 947 986 987 1013 1053 1190 1215 1288 1289 1333 1513 1524 1683 1758 2033 2684 3714 4129 6015 7395 8273 8348 9483 23630
1253
All integers are separated by a single space, and each line corresponds to a given element. I use implicit line numbers to store this information, because my data is associative, i.e. the 0th element is associated with elements 27 42 46 63 110.. etc. I believe that there is no extra information whatsoever.
A few points that may help:
It looks like your numbers are sorted. If this is always the case, then it will be more efficient to compress the differences between adjacent numbers rather than the numbers themselves (since the differences will be somewhat smaller on average).
There are good ways of encoding small integer values in binary format that are probably better than encoding them in text format. See the technique used by Google in their protocol buffers: https://developers.google.com/protocol-buffers/docs/encoding
Once you have applied the above techniques, zipping / some standard form of compression should improve everything even further (a sketch of the first two points follows below).
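A minimal sketch of those first two points: delta-encode each sorted line, then write the values as protobuf-style varints. Nothing here is tied to the actual files; it only illustrates the idea.
def encode_varint(n):
    # Protobuf-style base-128 varint: 7 payload bits per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_line(numbers):
    # Store the first value as-is, then only the gaps between sorted values.
    deltas = [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]
    return b''.join(encode_varint(d) for d in deltas)

blob = encode_line([27, 42, 46, 63, 95, 110, 205, 227])   # 8 numbers -> 8 bytes here
# Running gzip/bzip2/lzma over the concatenated blobs should shrink them further.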
There is some research done at this LINK that breaks down the pros and cons of using gzip, bzip2, and lzma. Hopefully this can let you make an informed decision on your best approach.
All your numbers seem to be increasing within each line. A rather common approach in database technology would be to store only the differences, turning a line like
24 30 86 312 726 875 1023 1683 1799
to something like
6 56 226 414 149 148 660 116
Other lines of your example would show even more benefit, as the differences are smaller. This also works when the numbers decrease in between, but then you have to be able to deal with negative differences.
The second thing to do would be to change the encoding. While compression will reduce this overhead, you're currently using 8 bits per digit, whereas you only need 4 bits (digits 0-9 plus the space as a divider). Implementing your own "4-bit character set" would already cut your storage requirements to half the current size! In the end, this would be some kind of binary encoding of numbers of arbitrary length.
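A rough sketch of that 4-bit idea, purely illustrative: digits 0-9 plus the space separator give 11 symbols, so two characters fit in one byte.
CODES = {c: i for i, c in enumerate('0123456789 ')}   # 11 symbols fit in 4 bits

def pack_line(text):
    nibbles = [CODES[c] for c in text]
    if len(nibbles) % 2:
        nibbles.append(0xF)                            # unused code as padding
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

packed = pack_line('24 30 86 312 726 875 1023 1683 1799')
# 35 characters -> 18 bytes, i.e. roughly half the original size.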
