KairosDB is built on top of Cassandra, but Cassandra has a row key (partition) size limit. Does that partition limit also apply to KairosDB?
Yes, it has the same limitation -- in the Getting Started section you can read the following:
Using with Cassandra [...]
The default configuration for Cassandra is to use wide rows. Each row
is set to contain 3 weeks of data. The reason behind setting it to 3
weeks is if you wrote a metric every millisecond for 3 weeks it would
be just over 1 billion columns. Cassandra has a 2 billion column
limit.
HTH,
Carlo
We already discussed this via the KairosDB discussion group, but you raise an interesting question here.
I am sure that if you go over 2^64 series you will have much bigger problems than Cassandra's indexing capacity. Imagine you are only using 1 byte per series: that is about 1.84e19 bytes. Only Google or Facebook know how to store 18 exabytes, which is a cosmic amount of data.
I am new to Cassandra, and I have a Cassandra cluster with 6 nodes. I am trying to find the partition size.
I tried to fetch it with this basic command:
nodetool tablehistograms keyspace.tablename
Now I am wondering how it is calculated, and why the result has only 5 records other than min and max while the number of nodes is 6. Do the number of nodes and the number of partitions for a table have any relation?
Fundamentally, what I know is that the partition key is used to hash and distribute the data to be persisted across the various nodes.
When exactly should we go for bucketing? I am assuming that Cassandra has a partitioner that takes care of distributed persistence across nodes.
The number of entries in this column is not related to the number of nodes. It shows the distribution of the values - you have min, max, and percentiles (50/75/95/98/99).
Most of the nodetool commands don't show anything about other nodes - they are tools for providing information about the current node only.
P.S. This document would be useful in explaining how to interpret this information.
As the name of the command suggests, tablehistograms reports the distribution of metadata for the partitions held by a node.
To add to what Alex Ott has already stated, the percentiles (not percentages) provide insight into the range of metadata values. For example:
50% of the partitions for the given table have a size of 74KB or less
95% are 263KB or less
98% are 455KB or less
These metadata don't have any correlation with the number of partitions or the number of nodes in your cluster.
You are correct in that the partition key gets hashed and the resulting value determines where the partition (and its associated rows) get stored (distributed among nodes in the cluster). If you're interested, I've explained in a bit more detail with some examples in this post -- https://community.datastax.com/questions/5944/.
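To make that concrete, here is a minimal sketch (the table and the customer value are made up, not something from your cluster): the partitioner hashes the partition key into a token, and that token determines which replicas own the partition.

CREATE TABLE store.orders_by_customer (
    cust_id  text,
    order_id timeuuid,
    amount   decimal,
    PRIMARY KEY ((cust_id), order_id)   -- cust_id is the partition key
);

-- token() shows the hash the partitioner assigns to a given partition key
SELECT token(cust_id), cust_id
FROM store.orders_by_customer
WHERE cust_id = 'c-123';

All rows that share the same cust_id hash to the same token and therefore live together on the same set of replicas.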
As far as bucketing is concerned, you would typically do that to reduce the number of rows in a partition and therefore reduce its size. The general recommendation is to keep partition sizes under 100MB for optimal performance, but it's not a hard rule -- you can have larger partitions as long as you are aware of the tradeoffs.
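For illustration only, a common way to bucket is to fold a time component into the partition key so each customer's data is split across many bounded partitions (all names and types below are assumptions, not your schema):

CREATE TABLE metrics.readings_by_customer (
    cust_id    text,
    day_bucket date,        -- one partition per customer per day
    ts         timestamp,
    value      double,
    PRIMARY KEY ((cust_id, day_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- reads then have to name the bucket(s) they want
SELECT ts, value
FROM metrics.readings_by_customer
WHERE cust_id = 'c-123' AND day_bucket = '2020-06-01';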
In your case, the largest partition is only 455KB, so size is not a concern. Cheers!
I have 1000 partitions per table; cust_id is the partition key, and bucket_id and timestamp are the clustering keys.
Every hour, one bucket_id and timestamp entry is recorded per cust_id.
Each day, 24 * 1 = 24 rows are recorded per partition.
One year is approximately 9000 records per partition.
The partition size is approximately 4MB.
The cluster is 20 Cassandra nodes in a single DC with RF=3.
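For reference, the table definition is roughly this (the column types are my assumption):

CREATE TABLE customer_data (
    cust_id   text,
    bucket_id int,
    timestamp timestamp,
    PRIMARY KEY ((cust_id), bucket_id, timestamp)
);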
I want to select five random buckets for the last 90 days of data using an IN query:
select cust_id,bucket_id,timestamp from customer_data where
cust_id='tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe' and
bucket_id IN (0,2,5,7,8)
and timestamp >='2020-03-01 00:00:00' and
timestamp <='2020-06-01 00:00:00';
Please confirm, does this approach cause any issues with coordinator pressure and query timeouts?
How much data can a coordinator handle and return without any issues?
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
If I run the same kind of query for 10 million customers, does this affect coordinator pressure? Does it increase the chance of read timeout errors?
It could be hard to get a definitive yes/no answer to these questions - there are some unknowns in them. For example: what version of Cassandra, how much memory is allocated per instance, what disks are used for data, what compaction strategy is used for the table, what consistency level is used for reading the data, etc.
Overall, on recent versions of Cassandra and when using SSDs, I wouldn't expect problems with that until you have hundreds of items in the IN list, especially if you're using consistency level LOCAL_ONE and prepared queries - all drivers use a token-aware load balancing policy by default and will route the request to a node that holds the data, so it will be both the coordinator and a data node. Using other consistency levels puts more pressure on the coordinating node, but it should still work quite well. Problems with read timeouts are more likely if you use HDDs or size the cluster incorrectly.
Regarding the 10 million customers - in your query you're selecting by partition key, so the query is usually sent to a replica directly (if you use prepared statements). To avoid problems, you shouldn't use IN on the partition key column (cust_id in your case) - if you issue queries for individual customers instead, the driver will spread the queries over the whole cluster and you'll avoid increased pressure on the coordinator nodes.
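Roughly, the difference looks like this (a sketch against your table; the customer IDs are placeholders):

-- avoid: IN on the partition key; a single coordinator has to contact replicas
-- for every listed customer
SELECT cust_id, bucket_id, timestamp
FROM customer_data
WHERE cust_id IN ('cust-1', 'cust-2', 'cust-3');

-- prefer: one prepared query per customer; a token-aware driver routes each
-- execution directly to a replica that owns that partition
SELECT cust_id, bucket_id, timestamp
FROM customer_data
WHERE cust_id = ?
  AND bucket_id IN (0, 2, 5, 7, 8)
  AND timestamp >= ? AND timestamp <= ?;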
But as usual, you need to test your table schema and cluster setup to prove this. I would recommend using NoSQLBench - a benchmark/load-testing tool that was recently open sourced by DataStax. It was built for quick load testing of clusters and for checking data models, and it incorporates a lot of knowledge in the area of performance testing.
Please try to ask one question per question.
Regarding how much a coordinator node can handle, Alex is correct in that there are several factors which contribute to that:
Size of the result set.
Heap/RAM available on the coordinator node.
Network consistency between nodes.
Storage config (spinning, SSD, NFS, etc).
Coordinator pressure will vary widely based on these parameters. My advice is to leave all timeout threshold settings at their defaults. They are there to protect your nodes from becoming overwhelmed. Timeouts are Cassandra's way of helping you figure out how much it can handle.
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
Based on your description, the primary key definition should look like this:
PRIMARY KEY ((cust_id),bucket_id,timestamp)
The data will be stored on disk by partition and sorted by the clustering keys, similar to this (assuming ascending order on bucket_id and descending order on timestamp):
cust_id                                  bucket_id   timestamp
'tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe'   0           2020-03-02 04:00:00
                                                     2020-03-01 22:00:00
                                         1           2020-03-27 16:00:00
                                         2           2020-04-22 05:00:00
                                                     2020-04-01 17:00:00
                                                     2020-03-05 22:00:00
                                         3           2020-04-27 19:00:00
                                         4           2020-03-27 17:00:00
                                         5           2020-04-12 08:00:00
                                                     2020-04-01 12:00:00
Cassandra reads through the SSTable files in that order. It's important to remember that Cassandra reads sequentially off disk. When queries force it to perform random reads, that's when things may start to get a little slow. The read path has structures like partition offsets and bloom filters which help it figure out which files (and where inside them) have the data. But within a partition, it will need to scan clustering keys and figure out what to skip and what to return.
Depending on how many updates these rows have taken, it's important to remember that the requested data may stretch across multiple files. Reading one file is faster than reading more than one.
At the very least, you're forcing it to stay on one node by specifying the partition key. But you'll have to test how much a coordinator can return before causing problems. In general, I wouldn't specify double digits of items in an IN clause.
In terms of optimizing file access, Jon Haddad (now of Apple) has a great article on this: "Apache Cassandra Performance Tuning - Compression with Mixed Workloads". It focuses mainly on the table compression settings (namely chunk_length_in_kb) and has some great tips on how to improve data access performance. Specifically, the section "How Data is Read" is of particular interest:
We pull chunks out of SSTables, decompress them, and return them to the client....During the read path, the entire chunk must be read and decompressed. We’re not able to selectively read only the bytes we need. The impact of this is that if we are using 4K chunks, we can get away with only reading 4K off disk. If we use 256KB chunks, we have to read the entire 256K.
The point of this that is relevant to your question is that by skipping around (using IN), the coordinator will likely read data that it won't be returning.
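For instance, a smaller chunk size is set through the table's compression options (a sketch against the table from your query; the option names are as in Cassandra 3.x, and 4 KB is just an example value):

ALTER TABLE customer_data
WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '4'};

Smaller chunks mean less wasted decompression work for small, scattered reads, at the cost of a slightly larger compression offset map in memory.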
If the only thing I have available is a com.datastax.driver.core.Session, is there a way to get a rough estimate of row count in a Cassandra table from a remote server? Performing a count is too expensive. I understand I can get a partition count estimate through JMX but I'd rather not assume JMX has been configured. (I think that result must be multiplied by number of nodes and divided by replication factor.) Ideally the estimate would include cluster keys too, but everything is on the table.
I also see there's a size_estimates table in the system keyspace but I don't see much documentation on it. Is it periodically refreshed or do the admins need to run something like nodetool flush?
Aside from not including cluster keys, what's wrong with using this as a very rough estimate?
select sum(partitions_count)
from system.size_estimates
where keyspace_name='keyspace' and table_name='table';
The size estimates are updated on a timer every 5 minutes (overridable with -Dcassandra.size_recorder_interval).
This is a very rough estimate, but you could take the token of the partition key, find the range it belongs to, pull from this table on each of the replicas (it is local and unique to each node, not global), and divide the size by the number of partitions for a very vague, approximate estimate of the partition size. There are a lot of assumptions and a lot of averaging on this path even before anything is written to this table. Cassandra errs on the side of efficiency at the cost of accuracy here, and the table is meant more for general uses like Spark bulk reading, so take it with a grain of salt.
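For example, the per-range rows that the earlier sum aggregates over can be inspected on each node (the column names are those of system.size_estimates; the keyspace and table names are placeholders):

-- one row per local token range, refreshed on the timer mentioned above
SELECT range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'keyspace' AND table_name = 'table';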
It's not useful now, but looking towards the future, after the 4.0 freeze there will be many new virtual tables, possibly including ones to get accurate statistics on specific partitions or ranges of partitions on demand.
I am using Cassandra 2.0.
My write load is somewhat similar to the queueing antipattern mentioned here: datastax
I am looking at pushing 30 - 40GB of data into Cassandra every 24 hours and expiring that data within 24 hours. My current approach is to set a TTL on everything that I insert.
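In other words, every write currently looks something like this (the table and column names are just placeholders):

-- TTL of 86400 seconds = 24 hours
INSERT INTO metadata (id, data_id) VALUES (?, ?) USING TTL 86400;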
I am experimenting with how I partition my data as seen here: cassandra wide vs skinny rows
I have two column families. The first family contains metadata and the second contains data. There are N metadata entries for each data entry, and a metadata entry may be rewritten M times throughout the day to point to a new data entry.
I suspect that the metadata churn is causing problems with reads in that finding the right metadata may require scanning all M items.
I suspect that the data churn is leading to excessive work compacting and garbage collecting.
It seems like creating a keyspace for each day and dropping the old keyspace after 24 hours would remove the need to do compaction entirely.
Aside from having to handle issues with what keyspace the user reads from on requests that overlap keyspaces, are there any other major flaws with this plan?
In my experience, using partitioning is a much better idea than using TTL:
It reduces CPU pressure.
It partitions your data in the manner of Oracle partitions, so searches are faster.
You can change your mind and keep the old data; with TTL that is difficult (I see one option - migrating the data before deletion).
If your rows are wide, you can make them narrower.
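A sketch of the idea (the table name and columns are invented; a keyspace per day works the same way): write each day into its own table, then drop yesterday's table instead of waiting for TTL expiry and compaction.

CREATE TABLE data_20140601 (
    id      text,
    ts      timestamp,
    payload blob,
    PRIMARY KEY ((id), ts)
);

-- a day later, once readers have switched to data_20140602
DROP TABLE data_20140601;

Dropping a whole table removes the data without generating tombstones or compaction work.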
I have a table set up in Cassandra like this:
Primary key columns:
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
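Roughly, the table definition is (a sketch; the table name and types are mine):

CREATE TABLE queue_like (
    shard     int,
    last_used timestamp,
    value     text,
    PRIMARY KEY ((shard), last_used)
);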
Example of how this table is used:
shard  last_used          | value
--------------------------------------------------------
457    5/16/2012 4:56pm   | NBJO3poisdjdsa4djmka8k   <-- Remove from front...
600    6/17/2013 5:58pm   | dndiapas09eidjs9dkakah       |
...(1 million more rows)                                 |
457    NOW                | NBJO3poisdjdsa4djmka8k   <-- ...and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
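In CQL terms, each "pop" is roughly this sequence (same sketch table as above; the literals are illustrative):

-- read the oldest row in the chosen shard (clustering order on last_used is ascending)
SELECT shard, last_used, value FROM queue_like WHERE shard = 457 LIMIT 1;

-- remove it from the front of the queue...
DELETE FROM queue_like WHERE shard = 457 AND last_used = '2012-05-16 16:56:00';

-- ...and re-insert it at the back with the current time (bound by the application)
INSERT INTO queue_like (shard, last_used, value) VALUES (457, ?, 'NBJO3poisdjdsa4djmka8k');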
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow - on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it, or in the way we have it configured, that we need to change or adjust? How might we troubleshoot this?
More Info
We are using the MurMur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is that Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something, Cassandra writes a tombstone, which is a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting, Cassandra looks at the tombstones and determines which columns are dead and which are still live; the dead ones are thrown away (but then there is also GC grace, which means that in order to avoid spurious resurrections of columns, Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue such as RabbitMQ, or possibly Kafka, would be a much better fit. They are made to handle constant churn and FIFO semantics; Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
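For example (a sketch against the same hypothetical table as in the question; the bound value is a recent lower bound you track in the application):

-- bounding last_used lets Cassandra skip most of the tombstoned range
SELECT shard, last_used, value
FROM queue_like
WHERE shard = 457 AND last_used > ?
LIMIT 1;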
The system appears to be under stress (2GB of RAM may not be enough).
Please run nodetool tpstats and report back its results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.