The table has 24 SSTables with size-tiered compaction. When I run nodetool tablehistograms, the 99th percentile of queries shows 24 as the number of SSTables, yet the read latency is very low. My understanding of the SSTables column in tablehistograms is that it shows how many SSTables were read to complete the query. If that's the case, reading 24 SSTables should take some time, maybe at least a couple of seconds. Am I missing something here? Does checking against the index/bloom filters count towards the SSTable counter as well?
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%            24.00             17.08          17436.92               310                 6
75%            24.00             24.60          20924.30               446                 6
95%            24.00             42.51          62479.63               770                10
98%            24.00             51.01          74975.55              1597                17
99%            24.00             61.21          74975.55              3311                24
Min            18.00              2.30           4866.32                87                 0
Max            24.00            943.13          89970.66            545791             17084
This value keeps changing with the set of queries running against the particular table.
Here, the records are very small (the 99th-percentile partition size is only 3311 bytes) and the table consists of very few columns.
Are you selecting all records when running the query, or do you have a single partition only?
That would be why your query reads all SSTables.
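If you want to see what a single read is actually doing, one option (a sketch; the keyspace/table name and query below are just placeholders) is to turn on tracing in cqlsh and run a representative query:
TRACING ON;
SELECT * FROM mykeyspace.mytable WHERE id = 'some-key';  -- placeholder query
TRACING OFF;
The trace printed after the SELECT typically includes per-SSTable entries such as bloom filter checks, plus how many SSTables were merged for the read, so you can tell whether all 24 SSTables are really being read or whether most are being skipped.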
Can I get more info about the columns of a certain row in Cassandra, like the size of a row or a column?
Or some information about the size of the row, perhaps using a tool like nodetool?
Maybe something based on the primary key / clustering key?
Because in Astra Cassandra I only have access to the CQL Console ...
root@cs01:~# nodetool cfstats mykeyspace.series;
Total number of tables: 249
----------------
Keyspace : mykeyspace
Read Count: 18547
Read Latency: 0.36771666576804873 ms
Write Count: 3147
Write Latency: 0.11854496345726087 ms
Pending Flushes: 0
Table: series
SSTable count: 11
Space used (live): 17919747207
Space used (total): 17919747207
Space used by snapshots (total): 0
Off heap memory used (total): 16091840
SSTable Compression Ratio: 0.13888102122935306
Number of partitions (estimate): 177144
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 7
Local read count: 11399
Local read latency: NaN ms
Local write count: 1753
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 1
Bloom filter false ratio: 0.00000
Bloom filter space used: 282544
Bloom filter off heap memory used: 282456
Index summary off heap memory used: 81176
Compression metadata off heap memory used: 15728208
Compacted partition minimum bytes: 36
Compacted partition maximum bytes: 14530764
Compacted partition mean bytes: 622218
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
----------------
Many thanks!
Given that Astra DB is a fully-managed Cassandra instance, you will not be able to run operator commands such as nodetool against it. That's just the nature of fully-managed services.
The best you can do is to look at the metrics of your DB's health dashboard. However, this won't really help you if you're not able to write the data to your DB.
As a friendly note, you shouldn't ask multiple questions in the same post. I would suggest you create a new post for your first question that includes (1) sample data, (2) sample schema, and (3) minimal code which replicates the problem. Cheers!
I'm learning Cassandra, and as a practice data set, I'm grabbing historical stock data from Yahoo. There is going to be one record for each trading day.
Obviously, I need to make the stock symbol part of the partitioning key. I'm seeing conflicting information on whether I should make the date part of the partitioning key, or make it a clustering column.
Realistically, the stock market is open ~253 days per year, so a single stock will have ~253 records per year. I'm not building a full-scale database, but I would like to design it correctly.
If I make the date part of the partition key, won't that possibly be spread across nodes? Won't that make a date range query slow?
If I make the date part of the partition key, won't that possibly be spread across nodes? Won't that make a date range query slow?
Yes, correct on both counts. That modeling approach is called "time bucketing," and its primary use case is for time/event data that grows over time. The good news is that you wouldn't need to do that unless your partitions were projected to get big. With your current projection of 253 rows written per partition per year, that's only going to be < 40 KB each year (see the calculation with nodetool tablehistograms below).
For your purposes I think partitioning by symbol and clustering by day should suffice.
CREATE TABLE stockquotes (
symbol text,
day date,
price decimal,
PRIMARY KEY(symbol, day))
WITH CLUSTERING ORDER BY (day DESC);
With most time-based use cases, we tend to care about recent data more (which may or may not be true with your case). If so, then writing the data in descending order by day will improve the performance of those queries.
Then (after writing some data), date range queries like this will work:
SELECT * FROM stockquotes
WHERE symbol='AAPL'
AND day >= '2020-08-01' AND day < '2020-08-08';
symbol | day | price
--------+------------+--------
AAPL | 2020-08-07 | 444.45
AAPL | 2020-08-06 | 455.61
AAPL | 2020-08-05 | 440.25
AAPL | 2020-08-04 | 438.66
AAPL | 2020-08-03 | 435.75
(5 rows)
To verify the partition sizes, you can use nodetool tablehistograms (once the data is flushed to disk).
bin/nodetool tablehistograms stackoverflow.stockquotes
stackoverflow/stockquotes histograms
Percentile  Read Latency   Write Latency   SSTables   Partition Size   Cell Count
               (micros)        (micros)                     (bytes)
50%                0.00            0.00        0.00             124             5
75%                0.00            0.00        0.00             124             5
95%                0.00            0.00        0.00             124             5
98%                0.00            0.00        0.00             124             5
99%                0.00            0.00        0.00             124             5
Min                0.00            0.00        0.00             104             5
Max                0.00            0.00        0.00             124             5
Partition size each year = 124 bytes x 253 = 31kb
Given the tiny partition size, this model would probably be good for at least 30 years of data before any slow-down (I recommend keeping partitions <= 1 MB). Perhaps bucketing on something like a quarter-century might suffice? Regardless, in the short term, it'll be fine.
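As a quick sanity check on that 30-year figure: 124 bytes x 253 trading days x 30 years ≈ 941 KB, which still sits just under the 1 MB guideline.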
Edit:
Seems like any date portion used in the PK would spread the data across nodes, no?
Yes, a date portion used in the partition key would spread the data across nodes. That's actually the point of doing it. You don't want to end up with the anti-pattern of unbounded row growth, because the partitions will eventually get so large that they'll be unusable. This idea is all about ensuring adequate data distribution.
Let's say 1/sec and I need to query across years, etc. How would that bucketing work?
So the trick with time bucketing, is to find a "happy medium" between data distribution and query flexibility. Unfortunately, there will likely be edge cases where queries will hit more than one partition (node). But the idea is to build a model to handle most of them well.
The example question here of 1/sec for a year is a bit extreme, but the idea to solve it is the same. There are 86400 seconds in a day. Depending on row size, that may even be too much to bucket by day, but for the sake of argument, say we can. If we bucket on day, the PK looks like this:
PRIMARY KEY ((symbol,day),timestamp)
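For concreteness, a full day-bucketed table for that case might look something like this (just a sketch; the stockquotes_by_day name and the timestamp/price columns are assumptions for illustration):
CREATE TABLE stockquotes_by_day (
    symbol text,
    day date,
    timestamp timestamp,    -- time of the individual quote
    price decimal,
    PRIMARY KEY ((symbol, day), timestamp))
WITH CLUSTERING ORDER BY (timestamp DESC);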
And the WHERE clause starts to look like this:
WHERE symbol='AAPL' AND day IN ('2020-08-06','2020-08-07');
On the flip side of that, a few days is fine, but querying for an entire year would be cumbersome. Additionally, we wouldn't want to build an IN clause of 253 days. In fact, I don't recommend exceeding single digits in an IN clause.
A possible approach here would be to fire 253 asynchronous queries (one for each day) from the application, and then assemble and sort the result set there. Using Spark (to do everything in an RDD) is a good option here, too. In reality, Cassandra isn't a great DB for a reporting API, so there is value in exploring some additional tools.
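Just to make that fan-out idea concrete, here's a rough shell-level sketch that issues one query per day bucket in parallel with cqlsh and stitches the results together (a real application would use the driver's asynchronous API instead; stockquotes_by_day is the hypothetical bucketed table sketched above):
# fire one query per day bucket in parallel
for day in 2020-08-05 2020-08-06 2020-08-07; do
  cqlsh -e "SELECT * FROM stackoverflow.stockquotes_by_day WHERE symbol='AAPL' AND day='$day';" > "quotes_$day.txt" &
done
wait                # wait for all background queries to finish
cat quotes_*.txt    # assemble (and sort, if needed) client-side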
I've seen references to a 'Number of keys (estimate)' metric from running nodetool cfstats, but at least on my system (Cassandra version 3.11.3), I don't see it:
Table: XXXXXX
SSTable count: 4
Space used (live): 2393755943
Space used (total): 2393755943
Space used by snapshots (total): 0
Off heap memory used (total): 2529880
SSTable Compression Ratio: 0.11501749368144083
Number of partitions (estimate): 1146
Memtable cell count: 296777
Memtable data size: 147223380
Memtable off heap memory used: 0
Memtable switch count: 127
Local read count: 9
Local read latency: NaN ms
Local write count: 44951572
Local write latency: 0.043 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 2144
Bloom filter off heap memory used: 2112
Index summary off heap memory used: 240
Compression metadata off heap memory used: 2527528
Compacted partition minimum bytes: 447
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 13547448
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
Is there some way to approximate select count(*) from XXXXXX with this version of Cassandra?
The "number of keys" is the same as "the number of partitions" - again an estimate. If your partition key is the primary key (no clustering columns), then you'll have an estimate for the number of rows on that node. Otherwise, it's simply that, the estimate of number of partition key values.
This was changed with CASSANDRA-13722. The "number of keys" estimate always meant "number of partitions" anyway; this change just makes it apparent.
To approximate the number of rows in a large table, you could take that value (number of partitions) as a starting point. Then estimate the average number of clustering key combinations (rows per partition), and you should be able to make an educated guess at it.
Another thought would be to figure out the size (in bytes) of one row. Then look at the P50 of the output of nodetool tablehistograms keyspacename.tablename:
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             2.00             35.43           4866.32               124                 1
Divide the P50 (50th percentile) of Partition Size by the size of one row. That should give you the typical number of rows per partition for that table. Then multiply that by the "number of partitions" and you should have your number for that node.
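For illustration (the row size here is an assumption): if one row were roughly 60 bytes and the P50 Partition Size were 124 bytes, as in the sample output above, that would be about 2 rows per partition; multiplied by the 1146 estimated partitions from the cfstats output, that comes to roughly 2,300 rows on that node.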
How does one get the size of one row in Cassandra?
$ bin/cqlsh 127.0.0.1 -u aaron -p yourPasswordSucks -e "SELECT * FROM system.local WHERE key='local';" > local.txt
$ ls -al local.txt
-rw-r--r-- 1 z001mj8 DHC\Domain Users 2321 Sep 16 15:08 local.txt
Obviously, you'll want to take things out like pipe delimiters and the row header (not to mention accounting for the size difference in strings vs. numerics), but the final byte size of the file should put you in the ballpark.
I am worried about the "Compacted partition maximum bytes" value, as it seems pretty high at 89 MB.
Does this indicate a broken model or some other issue?
On the application side, there are no issues observed.
Data stored in the table is packed into weekly buckets for each device, using the (week_first_day, device_id) partition key.
The data model for the table:
CREATE TABLE device_data (
week_first_day timestamp,
device_id uuid,
nano_since_epoch bigint,
sensor_id uuid,
source text,
unit text,
username text,
value double,
PRIMARY KEY ((week_first_day, device_id), nano_since_epoch, sensor_id)
)
nodetool cfstats
Table: device_data
SSTable count: 5
Space used (live): 447558297
Space used (total): 447558297
Space used by snapshots (total): 0
Off heap memory used (total): 211264
SSTable Compression Ratio: 0.2610509614736755
Number of partitions (estimate): 939
Memtable cell count: 458
Memtable data size: 63785
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 458
Local write latency: 0.058 ms
Pending flushes: 0
Percent repaired: 99.83
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 2216
Bloom filter off heap memory used: 2176
Index summary off heap memory used: 672
Compression metadata off heap memory used: 208416
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 89970660
Compacted partition mean bytes: 1100241
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
This really depends on the access patterns for the data in that partition: if you're reading the whole partition often, then this could cause a problem, but if you're reading only pieces of it, then it shouldn't be a problem. You may break up partitions by using the day as a bucket, for example.
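A day-bucketed variant of the table above might look something like this (just a sketch; the device_data_by_day name and the day column are assumptions, the rest mirrors the original schema):
CREATE TABLE device_data_by_day (
    day date,                      -- daily bucket replacing week_first_day
    device_id uuid,
    nano_since_epoch bigint,
    sensor_id uuid,
    source text,
    unit text,
    username text,
    value double,
    PRIMARY KEY ((day, device_id), nano_since_epoch, sensor_id)
);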
Have a look at the talk "Myths of Big Partitions" from the Cassandra Summit a couple of years ago; it has more details on how this is handled in Cassandra 3.x.
On DSE restart, I can see that the Read and Write counts/latency values reset to zero.
Do the following values also get reset?
Compression metadata off heap memory used: 1123544
Compacted partition minimum bytes: 87
Compacted partition maximum bytes: 129557750
Compacted partition mean bytes: 48702
Average live cells per slice (last five minutes): 238.6153846153846
Maximum live cells per slice (last five minutes): 888.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
After a DSE restart, I am seeing that the values changed from what they were showing earlier. How does cfstats work?
The bottom 4
Average live cells per slice (last five minutes): 238.6153846153846
Maximum live cells per slice (last five minutes): 888.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
are stats on the number of cells and tombstones touched during reads. These will reset on restarts. But
Compression metadata off heap memory used: 1123544
Compacted partition minimum bytes: 87
Compacted partition maximum bytes: 129557750
Compacted partition mean bytes: 48702
are stats about the SSTables for the table; they are based on what's on disk and will not reset.
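If you want to check this on your own cluster, a simple sanity check (just a sketch; substitute your own keyspace/table name and file paths) is to capture the output before and after the restart and compare:
nodetool cfstats mykeyspace.mytable > stats_before.txt
# ... restart DSE / Cassandra on this node ...
nodetool cfstats mykeyspace.mytable > stats_after.txt
# The compacted-partition and compression-metadata lines should match;
# the per-slice cell/tombstone averages will have reset.
diff stats_before.txt stats_after.txt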